Usage Guide
This guide focuses on the workflows that are stable in the current codebase:
profile a bounded section of code
track memory over time
export telemetry and plots
move from runtime data to a diagnose bundle or TUI session
If you want task-oriented production recipes instead of a capability guide, use the Production Cookbook, especially Always-on Tracking, PyTorch Production Recipes, and TensorFlow Production Recipes.
Install the distribution as stormlog, then import the Python APIs from
stormlog, stormlog.tensorflow, or stormlog.jax. The CLI automation commands are
gpumemprof, tfmemprof, and jaxmemprof.
Choose the right tool
GPUMemoryProfiler
Use when:
you have a PyTorch runtime exposed through torch.cuda (CUDA or ROCm)
you want per-call or per-context profiling
you care about allocated vs reserved GPU memory during a bounded operation
TFMemoryProfiler
Use when:
you are profiling TensorFlow code directly
you want snapshots plus aggregated TensorFlow profiling results
JAXMemoryProfiler
Use when:
you are profiling JAX code directly
you want memory usage insights that take XLA compilation into account
you need snapshots of JAX device allocations over time
MemoryTracker
Use when:
you need telemetry over time rather than one profiled call
you want backend-aware tracking on CUDA, ROCm, MPS, or CPU fallback paths
you want alerts, exported event streams, or diagnose bundles
CPUMemoryProfiler / CPUMemoryTracker
Use when:
you are on a CPU-only machine
you want the same workflow shape without CUDA-specific profiling
Canonical workflow
For a new environment or a new project, use this sequence:
gpumemprof info
gpumemprof track --duration 2 --interval 0.5 --output /tmp/gpumemprof_track.json --format json
gpumemprof analyze /tmp/gpumemprof_track.json --format txt --output /tmp/gpumemprof_analysis.txt
gpumemprof diagnose --duration 0 --output /tmp/gpumemprof_diag
tfmemprof info
tfmemprof diagnose --duration 0 --output /tmp/tf_diag
jaxmemprof info
jaxmemprof track --duration 2 --interval 0.5 --output /tmp/jaxmemprof_track.json --format json
jaxmemprof analyze /tmp/jaxmemprof_track.json --format txt --output /tmp/jaxmemprof_analysis.txt
jaxmemprof diagnose --duration 0 --output /tmp/jax_diag
This gives you:
environment visibility
a short telemetry sample
a readable analysis artifact
a portable diagnose bundle you can load later
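If you want to script the sequence above, one option is to build the argument lists in Python and pass them to subprocess.run. The sketch below only assembles the commands and executes nothing. The helper name canonical_commands and its out_dir parameter are hypothetical; the subcommands and flags are the ones shown in this guide. Note that the guide lists all four steps for gpumemprof and jaxmemprof, but only info and diagnose for tfmemprof, and that the output file names here are illustrative.

```python
def canonical_commands(out_dir: str, cli: str = "gpumemprof") -> list[list[str]]:
    """Return info/track/analyze/diagnose argument lists (hypothetical helper).

    Only subcommands and flags shown in the canonical workflow are used.
    Nothing is executed here.
    """
    base = out_dir.rstrip("/")
    track_json = f"{base}/{cli}_track.json"
    analysis_txt = f"{base}/{cli}_analysis.txt"
    diag_dir = f"{base}/{cli}_diag"
    return [
        [cli, "info"],
        [cli, "track", "--duration", "2", "--interval", "0.5",
         "--output", track_json, "--format", "json"],
        [cli, "analyze", track_json, "--format", "txt",
         "--output", analysis_txt],
        [cli, "diagnose", "--duration", "0", "--output", diag_dir],
    ]


commands = canonical_commands("/tmp")
for argv in commands:
    # Print the command line; swap in subprocess.run(argv, check=True) to execute.
    print(" ".join(argv))
```

Keeping the commands as lists avoids shell quoting problems when the output directory contains spaces.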
For long-running captures, Stormlog now treats every track run or standalone
diagnose bundle as its own session. That makes it possible to reuse the same
sink directory across runs and still reconstruct or audit one capture at a
time.
If you are working from a source checkout, you can optionally add the maintained example smoke paths:
python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated
The examples/ package is not included in the PyPI distribution.
PyTorch profiling
GPUMemoryProfiler currently targets torch.cuda runtimes. That includes
NVIDIA CUDA builds and ROCm-backed PyTorch builds surfaced through
torch.cuda. If you are on Apple MPS or CPU-only hardware, move to
MemoryTracker, the CLI, or the CPU-only path below.
import torch
from stormlog import GPUMemoryProfiler
profiler = GPUMemoryProfiler(track_tensors=True)
device = profiler.device
model = torch.nn.Linear(1024, 256).to(device)
def train_step() -> torch.Tensor:
    inputs = torch.randn(64, 1024, device=device)
    outputs = model(inputs)
    return outputs.sum()
profile = profiler.profile_function(train_step)
summary = profiler.get_summary()
print(profile.function_name)
print(f"Peak memory: {summary['peak_memory_usage'] / (1024**3):.2f} GB")
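The last print statement divides a byte count by 1024**3, which strictly gives gibibytes (GiB) rather than decimal gigabytes. If you report memory figures in several places, a small formatter keeps the units consistent. This is a generic sketch: format_bytes is a hypothetical helper, not part of stormlog, and it assumes the value you pass is a byte count, as the division above implies.

```python
def format_bytes(num_bytes: float) -> str:
    """Render a byte count with binary (1024-based) units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units[:-1]:
        if abs(value) < 1024.0:
            return f"{value:.2f} {unit}"
        value /= 1024.0
    # Anything that reaches this point is reported in the largest unit.
    return f"{value:.2f} {units[-1]}"


print(format_bytes(1536))
print(format_bytes(3 * 1024**3))
```

With the profiler example above, this could be called as format_bytes(summary['peak_memory_usage']).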
Context profiling
import torch
from stormlog import GPUMemoryProfiler
profiler = GPUMemoryProfiler()
device = profiler.device
with profiler.profile_context("forward_pass"):
    x = torch.randn(32, 1024, device=device)
    y = torch.nn.Linear(1024, 128).to(device)(x)
TensorFlow profiling
from stormlog.tensorflow import TFMemoryProfiler
profiler = TFMemoryProfiler(enable_tensor_tracking=True)
# model, x_train, and y_train are assumed to be defined by your own training code
with profiler.profile_context("training"):
    model.fit(x_train, y_train, epochs=1, batch_size=32)
results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
print(f"Snapshots captured: {len(results.snapshots)}")
For CPU-only TensorFlow or when the GPU backend is unavailable, initialize the
profiler with TFMemoryProfiler(device="/CPU:0").
JAX profiling
import jax.numpy as jnp
from stormlog.jax import JAXMemoryProfiler
profiler = JAXMemoryProfiler()
with profiler.profile_context("training"):
    x = jnp.ones((1000, 1000))
    y = jnp.dot(x, x)
    # Block to ensure accurate profile boundaries due to async dispatch
    y.block_until_ready()
results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
print(f"Snapshots captured: {len(results.snapshots)}")
For CPU-only JAX execution, the profiler will automatically adapt to the cpu backend.
JAX OOM Flight Recorder and Graph Visualization
When tracking JAX execution over time, you can enable the OOM flight recorder to capture detailed XLA device memory profiles (.prof files) upon an Out-Of-Memory crash.
Stormlog automatically parses this dump and generates a completely standalone, interactive WebAssembly graph viewer (jax-device-memory-graph.html) right next to the .prof file in the oom_dumps directory.
You can view the resulting diagnostic graph in two ways:
Dependency-free HTML: Simply open the jax-device-memory-graph.html file in your web browser to see an interactive directed graph of the call stack (no Go, protoc, or Graphviz installation required).
Official Go Tool: If you prefer the standard Go toolchain, you can run go tool pprof -http=:8080 <path_to_.prof>.
CPU-only workflow
Use this when PyTorch CUDA profiling is unavailable but you still want a local validation path:
from stormlog import CPUMemoryProfiler
profiler = CPUMemoryProfiler()
with profiler.profile_context("cpu_step"):
    values = [i * i for i in range(100_000)]
    values.reverse()
summary = profiler.get_summary()
print(summary["mode"])
print(summary["snapshots_collected"])
Tracking over time
PyTorch tracker
MemoryTracker follows the PyTorch tracker stack. Install
stormlog[torch] before using this example. On Apple Silicon, this is the
MPS-aware tracker path once the PyTorch extra is installed.
from stormlog import MemoryTracker
tracker = MemoryTracker(
    sampling_interval=0.5,
    enable_alerts=True,
)
tracker.start_tracking()
# run workload here
tracker.stop_tracking()
stats = tracker.get_statistics()
print(f"Peak memory: {stats.get('peak_memory', 0) / (1024**3):.2f} GB")
print(f"Events: {stats.get('total_events', 0)}")
Structured workload phases
All long-running trackers can emit explicit workload-phase boundaries while tracking is active:
from stormlog import CPUMemoryTracker
tracker = CPUMemoryTracker(sampling_interval=0.1)
tracker.start_tracking()
with tracker.phase("train", metadata={"epoch": 1}):
    with tracker.phase("forward", metadata={"microbatch": 4}):
        _ = [i * i for i in range(50_000)]
tracker.stop_tracking()
This writes phase_enter and phase_exit telemetry records with structured
metadata["phase_scope"] payloads. The same API shape is available on
MemoryTracker, CPUMemoryTracker, stormlog.tensorflow.MemoryTracker, and stormlog.jax.MemoryTracker.
When those phase boundaries are present, gpumemprof analyze and the TUI
Diagnostics tab can attach hidden-memory gaps, collective spikes, and
cross-rank first-cause suspects back to the active phase path.
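To make the nesting concrete, here is a small stand-alone model of how nested phases can produce enter and exit records that carry the active phase path. It does not use stormlog: PhaseRecorder, the record layout, and the path field are assumptions for illustration only, and the real metadata["phase_scope"] payload may look different.

```python
from contextlib import contextmanager
from typing import Any, Iterator, Optional


class PhaseRecorder:
    """Toy recorder: keeps a stack of active phases and logs their boundaries."""

    def __init__(self) -> None:
        self._stack: list[str] = []
        self.records: list[dict[str, Any]] = []

    def _emit(self, event_type: str, metadata: Optional[dict[str, Any]]) -> None:
        self.records.append(
            {
                "event_type": event_type,
                "path": "/".join(self._stack),  # assumed representation of the phase path
                "metadata": dict(metadata or {}),
            }
        )

    @contextmanager
    def phase(
        self, name: str, metadata: Optional[dict[str, Any]] = None
    ) -> Iterator[None]:
        self._stack.append(name)
        self._emit("phase_enter", metadata)
        try:
            yield
        finally:
            # Emit the exit record before popping so it still carries the full path.
            self._emit("phase_exit", metadata)
            self._stack.pop()


recorder = PhaseRecorder()
with recorder.phase("train", metadata={"epoch": 1}):
    with recorder.phase("forward", metadata={"microbatch": 4}):
        pass

for record in recorder.records:
    print(record["event_type"], record["path"])
```

Because the exit record is written in a finally block, a phase is closed even when the workload inside it raises, which is what lets an analyzer attribute a later event to the innermost phase that was still open.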
If the underlying collector becomes unstable, MemoryTracker keeps the tracking
session alive, exposes collector health through get_statistics(), and emits
collector_degraded / collector_recovered events instead of exporting
synthetic zero-valued samples.
The tracker session lifecycle is:
session begins after tracker startup succeeds and before the first persisted record
clean stop_tracking() finalization marks the session completed
recovered append-only runs are marked interrupted
partial artifacts without provable shutdown remain incomplete
For always-on deployments, treat the tracker as an operating-budgeted service:
keep the default sink retention enabled so artifacts stay bounded
watch rollover_count, pruned_segment_count, and final_retained_*
watch history_dropped_* to confirm bounded in-memory windows are evicting as expected
treat any non-zero collector_failure_event_count or non-healthy collector state as actionable
The maintained source-checkout qualification path for those guarantees is the
v0.4 operability harness in examples/cli/benchmark_harness.py.
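A periodic check over the statistics dictionary might look like the sketch below. It runs on a plain dict so that it stands alone. The counter names come from this guide, but the collector_state key, its "healthy" value, and the check_tracker_health helper are assumptions; adapt them to what get_statistics() actually returns in your version.

```python
from typing import Any, Mapping


def check_tracker_health(stats: Mapping[str, Any]) -> list[str]:
    """Return human-readable findings that deserve attention (hypothetical helper)."""
    findings: list[str] = []

    failures = stats.get("collector_failure_event_count", 0)
    if failures:
        findings.append(f"collector failures: {failures}")

    # Assumed key and value: the guide only says "non-healthy collector state".
    state = stats.get("collector_state", "healthy")
    if state != "healthy":
        findings.append(f"collector state: {state}")

    return findings


sample_stats = {
    "rollover_count": 3,
    "pruned_segment_count": 1,
    "history_dropped_events": 120,
    "collector_failure_event_count": 2,
    "collector_state": "degraded",
}
print(check_tracker_health(sample_stats))
```

Rollover, pruning, and history-drop counters are deliberately not flagged here, since non-zero values are expected when retention and bounded in-memory windows are working.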
For CUDA-only OOM debugging, enable native allocator-history capture and wrap
the risky block with capture_oom():
from stormlog import MemoryTracker
tracker = MemoryTracker(
    enable_oom_flight_recorder=True,
    enable_native_cuda_history=True,
)
with tracker.capture_oom(context="train-step"):
    run_training_step()  # placeholder for your own training code
If that block raises a CUDA OOM, the OOM dump bundle is extended with native
snapshot artifacts and pointer-attribution summaries.
The OOM manifest and metadata also include the owning session_id, so the dump
can be traced back to the exact tracking run that produced it.
CPU tracker
from stormlog import CPUMemoryTracker
tracker = CPUMemoryTracker(sampling_interval=0.5)
tracker.start_tracking()
# run workload here
tracker.stop_tracking()
stats = tracker.get_statistics()
print(stats["total_events"])
print(stats["final_retained_files"])
print(stats["history_dropped_events"])
Reusing a sink directory across runs
You can keep a long-lived sink directory and still separate captures deterministically:
gpumemprof track --telemetry-sink-dir ./live_sink --duration 30
gpumemprof track --telemetry-sink-dir ./live_sink --duration 30
gpumemprof analyze ./live_sink
gpumemprof analyze ./live_sink --session-id "<session-id-from-report>"
The default loader behavior is:
newest completed
newest interrupted
newest incomplete
Use explicit session selection when you want an older or partial run.
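Read as a fallback order, the default selection could be modeled as follows. The sketch assumes each session is described by a status and a start timestamp; the SessionInfo type, its field names, and pick_default_session are hypothetical and only illustrate the ordering, not the loader's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Preference order taken from the list above.
STATUS_PRIORITY = ("completed", "interrupted", "incomplete")


@dataclass(frozen=True)
class SessionInfo:
    session_id: str
    status: str
    started_at: float  # assumed: seconds since the epoch


def pick_default_session(sessions: Sequence[SessionInfo]) -> Optional[SessionInfo]:
    """Pick the newest session of the most preferred status that is present."""
    for status in STATUS_PRIORITY:
        candidates = [s for s in sessions if s.status == status]
        if candidates:
            return max(candidates, key=lambda s: s.started_at)
    return None


sessions = [
    SessionInfo("a", "completed", 100.0),
    SessionInfo("b", "completed", 200.0),
    SessionInfo("c", "interrupted", 300.0),
]
chosen = pick_default_session(sessions)
print(chosen.session_id if chosen else None)
```

Under this reading, a newer interrupted run never shadows an older completed one, which is why an explicit --session-id is needed to inspect session "c" in the example.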
Plot exports
MemoryVisualizer is for saved plots and exported data once you already have profiler results or monitoring snapshots.
import torch
from stormlog import GPUMemoryProfiler, MemoryVisualizer
profiler = GPUMemoryProfiler()
device = profiler.device
def sample_workload() -> torch.Tensor:
    x = torch.randn(32, 64, device=device)
    return x.sum()
profile = profiler.profile_function(sample_workload)
visualizer = MemoryVisualizer(profiler)
visualizer.plot_memory_timeline(interactive=False, save_path="timeline.png")
Install stormlog[viz] before relying on PNG or HTML exports.
When to switch to the TUI
Use the TUI when you want:
a live monitoring session without writing custom code
quick CSV/JSON export from an active tracker
PNG or HTML timeline export from the current session
artifact loading and distributed diagnostics in one place
session-aware switching between multiple captures loaded from the same artifact root
See the TUI Guide for the TUI flow and the CLI Guide for scriptable automation.