Usage Guide

This guide focuses on the workflows that are stable in the current codebase:

  • profile a bounded section of code

  • track memory over time

  • export telemetry and plots

  • move from runtime data to a diagnose bundle or TUI session

If you want task-oriented production recipes instead of a capability guide, use the Production Cookbook, especially Always-on Tracking, PyTorch Production Recipes, and TensorFlow Production Recipes.

Install the distribution as stormlog, then import the Python APIs from stormlog, stormlog.tensorflow, or stormlog.jax. The CLI automation commands are gpumemprof, tfmemprof, and jaxmemprof.

Choose the right tool

GPUMemoryProfiler

Use when:

  • you have a PyTorch runtime exposed through torch.cuda (CUDA or ROCm)

  • you want per-call or per-context profiling

  • you care about allocated vs reserved GPU memory during a bounded operation

TFMemoryProfiler

Use when:

  • you are profiling TensorFlow code directly

  • you want snapshots plus aggregated TensorFlow profiling results

JAXMemoryProfiler

Use when:

  • you are profiling JAX code directly

  • you want memory usage insights that take XLA compilation into account

  • you need snapshots of JAX device allocations over time

MemoryTracker

Use when:

  • you need telemetry over time rather than one profiled call

  • you want backend-aware tracking on CUDA, ROCm, MPS, or CPU fallback paths

  • you want alerts, exported event streams, or diagnose bundles

CPUMemoryProfiler / CPUMemoryTracker

Use when:

  • you are on a CPU-only machine

  • you want the same workflow shape without CUDA-specific profiling

Canonical workflow

For a new environment or a new project, use this sequence:

gpumemprof info
gpumemprof track --duration 2 --interval 0.5 --output /tmp/gpumemprof_track.json --format json
gpumemprof analyze /tmp/gpumemprof_track.json --format txt --output /tmp/gpumemprof_analysis.txt
gpumemprof diagnose --duration 0 --output /tmp/gpumemprof_diag

tfmemprof info
tfmemprof diagnose --duration 0 --output /tmp/tf_diag

jaxmemprof info
jaxmemprof track --duration 2 --interval 0.5 --output /tmp/jaxmemprof_track.json --format json
jaxmemprof analyze /tmp/jaxmemprof_track.json --format txt --output /tmp/jaxmemprof_analysis.txt
jaxmemprof diagnose --duration 0 --output /tmp/jax_diag

This gives you:

  • environment visibility

  • a short telemetry sample

  • a readable analysis artifact

  • a portable diagnose bundle you can load later
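If you prefer to drive this sequence from a script, the following is a minimal sketch. The helper names canonical_commands and run_all are illustrative and not part of Stormlog; the subcommands and flags are copied from the commands above, and the output file names are derived from the tool name as an assumption. Note that the TensorFlow sequence above only shows info and diagnose, so this sketch is aimed at gpumemprof and jaxmemprof.

```python
import shutil
import subprocess
from pathlib import Path


def canonical_commands(tool: str, out_dir: Path) -> list[list[str]]:
    """Build the info/track/analyze/diagnose sequence for one CLI."""
    track_json = out_dir / f"{tool}_track.json"
    analysis_txt = out_dir / f"{tool}_analysis.txt"
    diag_dir = out_dir / f"{tool}_diag"
    return [
        [tool, "info"],
        [tool, "track", "--duration", "2", "--interval", "0.5",
         "--output", str(track_json), "--format", "json"],
        [tool, "analyze", str(track_json), "--format", "txt",
         "--output", str(analysis_txt)],
        [tool, "diagnose", "--duration", "0", "--output", str(diag_dir)],
    ]


def run_all(tool: str, out_dir: Path) -> bool:
    """Run each step in order; return False if the CLI is not on PATH."""
    if shutil.which(tool) is None:
        return False
    for cmd in canonical_commands(tool, out_dir):
        subprocess.run(cmd, check=True)  # raises on the first failing step
    return True
```

Calling run_all("gpumemprof", Path("/tmp")) runs the four steps in order and raises as soon as one exits non-zero, which keeps a broken environment from producing a misleading analysis file.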

For long-running captures, Stormlog now treats every track run or standalone diagnose bundle as its own session. That makes it possible to reuse the same sink directory across runs and still reconstruct or audit one capture at a time.

If you are working from a source checkout, you can optionally run the maintained example smoke paths:

python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated

The examples/ package is not included in the PyPI distribution.

PyTorch profiling

GPUMemoryProfiler currently targets torch.cuda runtimes. That includes NVIDIA CUDA builds and ROCm-backed PyTorch builds surfaced through torch.cuda. If you are on Apple MPS or CPU-only hardware, move to MemoryTracker, the CLI, or the CPU-only path below.

import torch
from stormlog import GPUMemoryProfiler

profiler = GPUMemoryProfiler(track_tensors=True)
device = profiler.device
model = torch.nn.Linear(1024, 256).to(device)

def train_step() -> torch.Tensor:
    inputs = torch.randn(64, 1024, device=device)
    outputs = model(inputs)
    return outputs.sum()

profile = profiler.profile_function(train_step)
summary = profiler.get_summary()

print(profile.function_name)
print(f"Peak memory: {summary['peak_memory_usage'] / (1024**3):.2f} GB")

Context profiling

import torch
from stormlog import GPUMemoryProfiler

profiler = GPUMemoryProfiler()
device = profiler.device

with profiler.profile_context("forward_pass"):
    x = torch.randn(32, 1024, device=device)
    y = torch.nn.Linear(1024, 128).to(device)(x)

TensorFlow profiling

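import numpy as np
import tensorflow as tf
from stormlog.tensorflow import TFMemoryProfiler

# Placeholder model and data so the example runs end to end; substitute your own.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")
x_train = np.random.rand(256, 16).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

profiler = TFMemoryProfiler(enable_tensor_tracking=True)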

with profiler.profile_context("training"):
    model.fit(x_train, y_train, epochs=1, batch_size=32)

results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
print(f"Snapshots captured: {len(results.snapshots)}")

For CPU-only TensorFlow or when the GPU backend is unavailable, initialize the profiler with TFMemoryProfiler(device="/CPU:0").

JAX profiling

import jax.numpy as jnp
from stormlog.jax import JAXMemoryProfiler

profiler = JAXMemoryProfiler()

with profiler.profile_context("training"):
    x = jnp.ones((1000, 1000))
    y = jnp.dot(x, x)
    # JAX dispatches work asynchronously, so block here to keep the
    # computation inside the profiled context.
    y.block_until_ready()

results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
print(f"Snapshots captured: {len(results.snapshots)}")

For CPU-only JAX execution, the profiler will automatically adapt to the cpu backend.

JAX OOM Flight Recorder and Graph Visualization

When tracking JAX execution over time, you can enable the OOM flight recorder to capture detailed XLA device memory profiles (.prof files) upon an Out-Of-Memory crash.

Stormlog parses this dump automatically and writes a standalone, interactive WebAssembly graph viewer (jax-device-memory-graph.html) next to the .prof file in the oom_dumps directory.

You can view the resulting diagnostic graph in two ways:

  1. Dependency-free HTML: Open the jax-device-memory-graph.html file in your web browser to see an interactive directed graph of the call stack. No Go, protoc, or Graphviz installation is required.

  2. Official Go Tool: If you prefer the standard Go toolchain, you can run go tool pprof -http=:8080 <path_to_.prof>.
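To find the artifacts after a crash, a small helper can locate the newest .prof file and its sibling viewer. This is a sketch only: latest_oom_artifacts is a hypothetical helper, and the layout inside oom_dumps (a dump_0001 subdirectory and a device_memory.prof file name) is assumed purely for the demo. Only the jax-device-memory-graph.html name and the pprof command come from the text above.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_oom_artifacts(oom_dumps: Path) -> Optional[dict]:
    """Locate the newest .prof under oom_dumps plus its sibling HTML viewer."""
    profs = sorted(oom_dumps.rglob("*.prof"), key=lambda p: p.stat().st_mtime)
    if not profs:
        return None
    prof = profs[-1]
    html = prof.parent / "jax-device-memory-graph.html"
    return {
        "prof": prof,
        "html": html if html.exists() else None,
        "pprof_cmd": ["go", "tool", "pprof", "-http=:8080", str(prof)],
    }


# Demo against a throwaway directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    dump_dir = Path(tmp) / "oom_dumps" / "dump_0001"  # hypothetical layout
    dump_dir.mkdir(parents=True)
    (dump_dir / "device_memory.prof").write_bytes(b"")  # hypothetical name
    (dump_dir / "jax-device-memory-graph.html").write_text("<html></html>")
    found = latest_oom_artifacts(Path(tmp) / "oom_dumps")
    print(found["html"].name)
    print(" ".join(found["pprof_cmd"][:4]))
```

Open the html path in a browser, or pass pprof_cmd to subprocess.run if you prefer the Go toolchain.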

CPU-only workflow

Use this when PyTorch CUDA profiling is unavailable but you still want a local validation path:

from stormlog import CPUMemoryProfiler

profiler = CPUMemoryProfiler()

with profiler.profile_context("cpu_step"):
    values = [i * i for i in range(100_000)]
    values.reverse()

summary = profiler.get_summary()
print(summary["mode"])
print(summary["snapshots_collected"])

Tracking over time

PyTorch tracker

MemoryTracker is built on the PyTorch tracker stack, so install stormlog[torch] before running this example. On Apple Silicon, installing the PyTorch extra makes this the MPS-aware tracker path.

from stormlog import MemoryTracker

tracker = MemoryTracker(
    sampling_interval=0.5,
    enable_alerts=True,
)

tracker.start_tracking()
# run workload here
tracker.stop_tracking()

stats = tracker.get_statistics()
print(f"Peak memory: {stats.get('peak_memory', 0) / (1024**3):.2f} GB")
print(f"Events: {stats.get('total_events', 0)}")

Structured workload phases

All long-running trackers can emit explicit workload-phase boundaries while tracking is active:

from stormlog import CPUMemoryTracker

tracker = CPUMemoryTracker(sampling_interval=0.1)
tracker.start_tracking()

with tracker.phase("train", metadata={"epoch": 1}):
    with tracker.phase("forward", metadata={"microbatch": 4}):
        _ = [i * i for i in range(50_000)]

tracker.stop_tracking()

This writes phase_enter and phase_exit telemetry records with structured metadata["phase_scope"] payloads. The same API shape is available on MemoryTracker, CPUMemoryTracker, stormlog.tensorflow.MemoryTracker, and stormlog.jax.MemoryTracker.

When those phase boundaries are present, gpumemprof analyze and the TUI Diagnostics tab can attach hidden-memory gaps, collective spikes, and cross-rank first-cause suspects back to the active phase path.
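To illustrate how phase boundaries let later analysis attribute a sample to an active phase path, here is a minimal sketch that replays an event stream with a stack. The record shape used here (event_type, phase, and bytes keys) is an assumption for illustration; it is not the actual telemetry schema, and attribute_to_phases is a hypothetical helper.

```python
from typing import Iterable


def attribute_to_phases(events: Iterable[dict]) -> list[tuple[str, int]]:
    """Pair each sample with the phase path that was active when it arrived."""
    stack: list[str] = []
    attributed: list[tuple[str, int]] = []
    for event in events:
        kind = event["event_type"]
        if kind == "phase_enter":
            stack.append(event["phase"])
        elif kind == "phase_exit":
            if stack:
                stack.pop()  # tolerate an unmatched exit in a partial capture
        else:
            path = "/".join(stack) or "<no phase>"
            attributed.append((path, event["bytes"]))
    return attributed


# Hypothetical stream: nested train/forward phases with interleaved samples.
events = [
    {"event_type": "phase_enter", "phase": "train"},
    {"event_type": "sample", "bytes": 100},
    {"event_type": "phase_enter", "phase": "forward"},
    {"event_type": "sample", "bytes": 250},
    {"event_type": "phase_exit", "phase": "forward"},
    {"event_type": "sample", "bytes": 120},
    {"event_type": "phase_exit", "phase": "train"},
    {"event_type": "sample", "bytes": 90},
]
for path, size in attribute_to_phases(events):
    print(path, size)
```

A stack is enough because phases opened with nested with blocks always close in reverse order.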

If the underlying collector becomes unstable, MemoryTracker keeps the tracking session alive, exposes collector health through get_statistics(), and emits collector_degraded / collector_recovered events instead of exporting synthetic zero-valued samples.

The tracker session lifecycle is:

  • session begins after tracker startup succeeds and before the first persisted record

  • clean stop_tracking() finalization marks the session completed

  • recovered append-only runs are marked interrupted

  • partial artifacts without provable shutdown remain incomplete
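The lifecycle rules above can be read as a small decision function. This is a sketch of the rules as stated, not Stormlog's implementation; SessionStatus, classify_session, and the two boolean inputs are hypothetical names.

```python
from enum import Enum


class SessionStatus(Enum):
    COMPLETED = "completed"
    INTERRUPTED = "interrupted"
    INCOMPLETE = "incomplete"


def classify_session(clean_stop: bool, recovered: bool) -> SessionStatus:
    """Map the lifecycle rules onto a single status.

    clean_stop: stop_tracking() finalized the session.
    recovered:  the run was reconstructed from an append-only log.
    """
    if clean_stop:
        return SessionStatus.COMPLETED
    if recovered:
        return SessionStatus.INTERRUPTED
    # No provable shutdown and nothing recovered: leave it incomplete.
    return SessionStatus.INCOMPLETE
```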

For always-on deployments, treat the tracker as a service with an explicit operating budget:

  • keep the default sink retention enabled so artifacts stay bounded

  • watch rollover_count, pruned_segment_count, and final_retained_*

  • watch history_dropped_* to confirm bounded in-memory windows are evicting as expected

  • treat any non-zero collector_failure_event_count or non-healthy collector state as actionable
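One way to act on this checklist is to split a statistics dictionary into actionable problems and counters that are only worth watching. The sketch below works on a plain dict shaped like a get_statistics() result. The counter names come from the list above, but collector_state is an assumed key name for the collector health field, and review_tracker_stats is a hypothetical helper.

```python
def review_tracker_stats(stats: dict) -> tuple[list[str], dict]:
    """Return (actionable problems, counters worth watching)."""
    actionable: list[str] = []

    failures = stats.get("collector_failure_event_count", 0)
    if failures:
        actionable.append(f"collector_failure_event_count={failures}")

    # "collector_state" is an assumed key name, not a confirmed field.
    state = stats.get("collector_state", "healthy")
    if state != "healthy":
        actionable.append(f"collector_state={state}")

    watched_names = {"rollover_count", "pruned_segment_count"}
    watched_prefixes = ("final_retained_", "history_dropped_")
    watch = {
        key: value
        for key, value in stats.items()
        if key in watched_names or key.startswith(watched_prefixes)
    }
    return actionable, watch


# Hypothetical statistics payload for illustration.
sample_stats = {
    "total_events": 120,
    "rollover_count": 2,
    "pruned_segment_count": 1,
    "final_retained_files": 3,
    "history_dropped_events": 40,
    "collector_failure_event_count": 1,
    "collector_state": "degraded",
}
problems, counters = review_tracker_stats(sample_stats)
print(problems)
print(sorted(counters))
```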

From a source checkout, the maintained way to qualify those guarantees is the v0.4 operability harness in examples/cli/benchmark_harness.py.

For CUDA-only OOM debugging, enable native allocator-history capture and wrap the risky block with capture_oom():

from stormlog import MemoryTracker

tracker = MemoryTracker(
    enable_oom_flight_recorder=True,
    enable_native_cuda_history=True,
)

with tracker.capture_oom(context="train-step"):
    run_training_step()  # placeholder: call your own training step here

If that block raises a CUDA OOM, the OOM dump bundle is extended with native snapshot artifacts and pointer-attribution summaries. The OOM manifest and metadata also include the owning session_id, so the dump can be traced back to the exact tracking run that produced it.

CPU tracker

from stormlog import CPUMemoryTracker

tracker = CPUMemoryTracker(sampling_interval=0.5)
tracker.start_tracking()
# run workload here
tracker.stop_tracking()
stats = tracker.get_statistics()
print(stats["total_events"])
print(stats["final_retained_files"])
print(stats["history_dropped_events"])

Reusing a sink directory across runs

You can keep a long-lived sink directory and still separate captures deterministically:

gpumemprof track --telemetry-sink-dir ./live_sink --duration 30
gpumemprof track --telemetry-sink-dir ./live_sink --duration 30
gpumemprof analyze ./live_sink
gpumemprof analyze ./live_sink --session-id "<session-id-from-report>"

By default, the loader selects the first available session in this order:

  1. newest completed

  2. newest interrupted

  3. newest incomplete

Use explicit session selection when you want an older or partial run.
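The default selection order can be sketched as a sort on status priority followed by recency. The session records here (session_id, status, and a numeric started_at) are assumed shapes for illustration, and pick_default_session is a hypothetical helper rather than the loader's real code.

```python
from typing import Optional

# Lower number wins: completed beats interrupted beats incomplete.
STATUS_PRIORITY = {"completed": 0, "interrupted": 1, "incomplete": 2}


def pick_default_session(sessions: list[dict]) -> Optional[dict]:
    """Choose the newest session of the best available status."""
    if not sessions:
        return None
    return min(
        sessions,
        key=lambda s: (STATUS_PRIORITY[s["status"]], -s["started_at"]),
    )


# Hypothetical sink contents: the newest run was interrupted.
sessions = [
    {"session_id": "a", "status": "completed", "started_at": 100},
    {"session_id": "b", "status": "completed", "started_at": 200},
    {"session_id": "c", "status": "interrupted", "started_at": 300},
]
print(pick_default_session(sessions)["session_id"])
```

In this example the newest completed run is chosen even though an interrupted run started later, which is why an explicit session ID is needed to analyze the interrupted one.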

Plot exports

Use MemoryVisualizer to save plots and export data once you already have profiler results or monitoring snapshots.

import torch
from stormlog import GPUMemoryProfiler, MemoryVisualizer

profiler = GPUMemoryProfiler()
device = profiler.device


def sample_workload() -> torch.Tensor:
    x = torch.randn(32, 64, device=device)
    return x.sum()

profile = profiler.profile_function(sample_workload)
visualizer = MemoryVisualizer(profiler)
visualizer.plot_memory_timeline(interactive=False, save_path="timeline.png")

Install stormlog[viz] before relying on PNG or HTML exports.

When to switch to the TUI

Use the TUI when you want:

  • a live monitoring session without writing custom code

  • quick CSV/JSON export from an active tracker

  • PNG or HTML timeline export from the current session

  • artifact loading and distributed diagnostics in one place

  • session-aware switching between multiple captures loaded from the same artifact root

See the TUI Guide for the TUI flow and the CLI Guide for scriptable automation.