Stormlog in real workflows

Stormlog is most useful when you stop thinking of it as “a profiler library” and start thinking of it as a debugging loop:

capture the smallest truthful signal
save it as telemetry or a diagnose bundle
inspect it in the CLI or TUI
decide whether the problem is model behavior, runtime behavior, or environment drift

This article focuses on how that loop fits into daily work.

Why memory issues are hard to debug

The failure mode is usually delayed:

a training loop looks fine for minutes, then OOMs
allocator reserved memory grows while allocated memory looks stable
one rank starts drifting before the rest of the job notices
a CI run only fails on one backend or one Python version

Most teams do not need a giant dashboard first. They need a way to capture a truthful sample, keep it portable, and revisit it later without recreating the whole run.

Stormlog’s value is that the same toolkit supports:

Python-level profiling for bounded operations
CLI-level telemetry capture
terminal-first artifact review in the TUI

The three daily workflows

1. ML engineer instrumenting a training step

This is the workflow for answering:

how much memory does this step cost?
what is the peak?
did this code change shift the allocator profile?

PyTorch path

GPUMemoryProfiler is the right tool when the question is local to one torch.cuda-backed operation. In practice that means NVIDIA CUDA builds and ROCm-backed PyTorch builds surfaced through torch.cuda.

import torch
from stormlog import GPUMemoryProfiler

profiler = GPUMemoryProfiler(track_tensors=True)
device = profiler.device
model = torch.nn.Linear(1024, 256).to(device)

def train_step() -> torch.Tensor:
    x = torch.randn(64, 1024, device=device)
    y = model(x)
    return y.sum()

profile = profiler.profile_function(train_step)
summary = profiler.get_summary()

print(profile.function_name)
print(f"Peak memory: {summary['peak_memory_usage'] / (1024**3):.2f} GB")

If you are on Apple MPS or a CPU-only host, switch to MemoryTracker, the CLI, or the TUI monitoring flows instead of GPUMemoryProfiler.

TensorFlow path

For TensorFlow, the context-manager flow is the clearest equivalent:

from stormlog.tensorflow import TFMemoryProfiler

profiler = TFMemoryProfiler(enable_tensor_tracking=True)

with profiler.profile_context("training"):
    model.fit(x_train, y_train, epochs=1, batch_size=32)

results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")

What to do with the result

If the issue is isolated to one call, stay in the Python API. If the issue only appears over time, move to the tracker workflow.

2. Researcher or debugger chasing a memory regression

This is the workflow for answering:

why is memory still growing even though the step profile looks fine?
is this a leak, fragmentation, or just a bigger steady-state footprint?
did this regression show up only after a certain runtime change?

Start with telemetry, not screenshots

Use the CLI tracker first:

gpumemprof track --duration 30 --interval 0.5 --output track.json --format json
gpumemprof analyze track.json --format txt --output analysis.txt
gpumemprof diagnose --duration 0 --output ./diag_bundle

For TensorFlow:

tfmemprof monitor --interval 0.5 --duration 30 --output tf_monitor.json
tfmemprof analyze --input tf_monitor.json --detect-leaks --optimize --report tf_report.txt
tfmemprof diagnose --duration 0 --output ./tf_diag

Then load the artifact into the TUI

The TUI is useful after you already have data:

Monitoring tells you whether a live session is actually producing events
Visualizations renders and exports the current timeline
Diagnostics reloads live telemetry or saved artifact paths

Current TUI overview

Current diagnostics view

What the workflow looks like in practice

capture a short run
save track.json or a diagnose bundle
open stormlog
load the artifact in Diagnostics
compare ranks or review anomaly indicators
export a PNG or HTML timeline from Visualizations if you need to share the state

This is faster than trying to narrate a failure from logs alone.

3. CI or release owner triaging regressions

This is the workflow for answering:

did the new branch actually break behavior?
is the problem tied to one backend or environment?
what artifact should we attach to the PR or incident?

Use the maintained examples and scenario runners

Source checkout only. These commands require the repository examples/ package. Pip users should start with the CLI sequence in the practical starting point below.

python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated

Those commands are valuable because they already encode the repo’s intended smoke paths. They are a better starting point than hand-written one-off commands in CI comments.

When the smoke path fails

Move to:

gpumemprof diagnose --duration 0 --output ./diag_bundle
tfmemprof diagnose --duration 0 --output ./tf_diag

Then load the artifacts in the TUI or archive them for later review.

How the TUI fits the workflow

The TUI is not a replacement for the CLI. It is the place where saved or live data becomes easier to inspect.

Monitoring tab

Use it when you want:

a live session without writing code
immediate CSV or JSON export
threshold adjustments in the same UI

Monitoring tab

Visualizations tab

Use it when you want:

a quick terminal timeline
a PNG for documents or PRs
an HTML export for interactive review

Visualizations tab

CLI & Actions tab

Use it when you want:

common commands without leaving the app
quick execution of example scenarios
a single place to keep command output attached to the same debugging session

CLI & Actions tab

Distributed diagnostics

Distributed issues are usually not visible from a single rank summary. The current Diagnostics tab can load live telemetry or merged artifacts and rebuild rank-level summaries with filters and anomaly indicators inside the shipped UI shown earlier in this article.

Choosing the right surface

Stay in Python when

the issue is local to one operation
you are iterating quickly inside a notebook or training script

Use the CLI when

you need a portable artifact
you want to automate the check in CI
you want to capture a time-based signal instead of a single profile

Use the TUI when

you already have live telemetry or saved artifacts
you need a visual read on the run without leaving the terminal
you want to compare ranks or review exports interactively

What not to do

do not use screenshots as the source of truth when telemetry is available
do not document a workflow from memory when --help or the examples can confirm it
do not treat GPUMemoryProfiler as a CPU fallback; use the CPU profiler classes or CLI for that path
do not assume gpumemprof analyze and tfmemprof analyze accept the same argument shape

A practical starting point

If you are unsure where to begin, use this order:

Start with the pip-safe CLI path:

gpumemprof info
gpumemprof track --duration 2 --interval 0.5 --output track.json --format json
gpumemprof analyze track.json --format txt --output analysis.txt
gpumemprof diagnose --duration 0 --output ./diag

tfmemprof info
tfmemprof diagnose --duration 0 --output ./tf_diag

# Optional TUI
pip install "stormlog[tui,torch]"
stormlog

If you are working from a source checkout, you can optionally add:

python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated

That sequence gives you:

environment truth
a minimal CLI smoke run
a broader release-style validation run
an interactive place to inspect the outputs

Next steps

Read the Usage Guide for the Python API path.
Read the CLI Guide for the automation path.
Read the TUI Guide for the terminal workflow.
Read the Testing and Validation Guide if you are attaching this to CI or release validation.

← Back to main docs