[← Back to main docs](index.md)

# Stormlog in real workflows

Stormlog is most useful when you stop thinking of it as "a profiler library" and start thinking of it as a debugging loop:

1. capture the smallest truthful signal
2. save it as telemetry or a diagnose bundle
3. inspect it in the CLI or TUI
4. decide whether the problem is model behavior, runtime behavior, or environment drift

This article focuses on how that loop fits into daily work.

## Why memory issues are hard to debug

The failure mode is usually delayed:

- a training loop looks fine for minutes, then OOMs
- allocator reserved memory grows while allocated memory looks stable
- one rank starts drifting before the rest of the job notices
- a CI run only fails on one backend or one Python version

Most teams do not need a giant dashboard first. They need a way to capture a truthful sample, keep it portable, and revisit it later without recreating the whole run.

Stormlog’s value is that the same toolkit supports:

- Python-level profiling for bounded operations
- CLI-level telemetry capture
- terminal-first artifact review in the TUI

## The three daily workflows

## 1. ML engineer instrumenting a training step

This is the workflow for answering:

- how much memory does this step cost?
- what is the peak?
- did this code change shift the allocator profile?

### PyTorch path

`GPUMemoryProfiler` is the right tool when the question is local to one `torch.cuda`-backed operation. In practice that means NVIDIA CUDA builds and ROCm-backed PyTorch builds surfaced through `torch.cuda`.

```python
import torch
from stormlog import GPUMemoryProfiler

profiler = GPUMemoryProfiler(track_tensors=True)
device = profiler.device
model = torch.nn.Linear(1024, 256).to(device)


def train_step() -> torch.Tensor:
    x = torch.randn(64, 1024, device=device)
    y = model(x)
    return y.sum()


profile = profiler.profile_function(train_step)
summary = profiler.get_summary()

print(profile.function_name)
print(f"Peak memory: {summary['peak_memory_usage'] / (1024**3):.2f} GB")
```

If you are on Apple MPS or a CPU-only host, switch to `MemoryTracker`, the CLI, or the TUI monitoring flows instead of `GPUMemoryProfiler`.

### TensorFlow path

For TensorFlow, the context-manager flow is the clearest equivalent:

```python
from stormlog.tensorflow import TFMemoryProfiler

profiler = TFMemoryProfiler(enable_tensor_tracking=True)

# `model`, `x_train`, and `y_train` are assumed to already exist in your script:
# a compiled Keras model and its training data.
with profiler.profile_context("training"):
    model.fit(x_train, y_train, epochs=1, batch_size=32)

results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
```

### What to do with the result

If the issue is isolated to one call, stay in the Python API. If the issue only appears over time, move to the tracker workflow.

## 2. Researcher or debugger chasing a memory regression

This is the workflow for answering:

- why is memory still growing even though the step profile looks fine?
- is this a leak, fragmentation, or just a bigger steady-state footprint?
- did this regression show up only after a certain runtime change?

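
To make the leak-versus-fragmentation question concrete, here is a minimal sketch of one way to label a run from time-ordered `(allocated_bytes, reserved_bytes)` samples. It is not Stormlog's analyzer: the function name `classify_growth`, the sample shape, and the 5% tolerance are all hypothetical choices for illustration.

```python
from statistics import mean


def classify_growth(samples, tolerance=0.05):
    """Label a run from time-ordered (allocated_bytes, reserved_bytes) samples.

    Hypothetical heuristic: compare the mean of the first quarter of the run
    with the mean of the last quarter.
    """
    if len(samples) < 8:
        raise ValueError("need at least 8 samples")

    quarter = len(samples) // 4
    head, tail = samples[:quarter], samples[-quarter:]

    def growth(index):
        # Relative change between the start and the end of the run.
        start = mean(s[index] for s in head)
        end = mean(s[index] for s in tail)
        return (end - start) / max(start, 1)

    if growth(0) > tolerance:
        # Allocated memory itself keeps climbing.
        return "possible leak"
    if growth(1) > tolerance:
        # Reserved grows while allocated stays flat.
        return "possible fragmentation"
    return "steady state"


# Synthetic example: allocated is flat, reserved climbs.
samples = [(100, 200 + 10 * i) for i in range(16)]
print(classify_growth(samples))
```

A bigger steady-state footprint would still read as "steady state" here, just at a higher level, so telling it apart from a regression requires comparing against a baseline run rather than looking at a single capture.
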
### Start with telemetry, not screenshots

Use the CLI tracker first:

```bash
gpumemprof track --duration 30 --interval 0.5 --output track.json --format json
gpumemprof analyze track.json --format txt --output analysis.txt
gpumemprof diagnose --duration 0 --output ./diag_bundle
```

For TensorFlow:

```bash
tfmemprof monitor --interval 0.5 --duration 30 --output tf_monitor.json
tfmemprof analyze --input tf_monitor.json --detect-leaks --optimize --report tf_report.txt
tfmemprof diagnose --duration 0 --output ./tf_diag
```

### Then load the artifact into the TUI

The TUI is useful after you already have data:

- `Monitoring` tells you whether a live session is actually producing events
- `Visualizations` renders and exports the current timeline
- `Diagnostics` reloads live telemetry or saved artifact paths

![Current TUI overview](tui-overview-current.png)

![Current diagnostics view](tui-diagnostics-current.png)

### What the workflow looks like in practice

1. capture a short run
2. save `track.json` or a diagnose bundle
3. open `stormlog`
4. load the artifact in `Diagnostics`
5. compare ranks or review anomaly indicators
6. export a PNG or HTML timeline from `Visualizations` if you need to share the state

This is faster than trying to narrate a failure from logs alone.

## 3. CI or release owner triaging regressions

This is the workflow for answering:

- did the new branch actually break behavior?
- is the problem tied to one backend or environment?
- what artifact should we attach to the PR or incident?

### Use the maintained examples and scenario runners

> **Source checkout only.** These commands require the repository `examples/`
> package. Pip users should start with the CLI sequence in the practical
> starting point below.

```bash
python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated
```

Those commands are valuable because they already encode the repo’s intended smoke paths.
They are a better starting point than hand-written one-off commands in CI comments.

### When the smoke path fails

Move to:

```bash
gpumemprof diagnose --duration 0 --output ./diag_bundle
tfmemprof diagnose --duration 0 --output ./tf_diag
```

Then load the artifacts in the TUI or archive them for later review.

## How the TUI fits the workflow

The TUI is not a replacement for the CLI. It is the place where saved or live data becomes easier to inspect.

### Monitoring tab

Use it when you want:

- a live session without writing code
- immediate CSV or JSON export
- threshold adjustments in the same UI

![Monitoring tab](tui-monitoring-current.png)

### Visualizations tab

Use it when you want:

- a quick terminal timeline
- a PNG for documents or PRs
- an HTML export for interactive review

![Visualizations tab](tui-visualizations-current.png)

### CLI & Actions tab

Use it when you want:

- common commands without leaving the app
- quick execution of example scenarios
- a single place to keep command output attached to the same debugging session

![CLI & Actions tab](tui-cli-actions-current.png)

## Distributed diagnostics

Distributed issues are usually not visible from a single rank summary. The current Diagnostics tab can load live telemetry or merged artifacts and rebuild rank-level summaries with filters and anomaly indicators inside the shipped UI shown earlier in this article.

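
As a rough illustration of what a rank-level comparison involves, the sketch below flags ranks whose peak memory deviates from the median across ranks. It does not reproduce how the Diagnostics tab computes its anomaly indicators: the function name `drifting_ranks`, the `{rank: peak_bytes}` input shape, and the 10% threshold are hypothetical.

```python
from statistics import median


def drifting_ranks(peak_bytes_by_rank, rel_threshold=0.10):
    """Return the ranks whose peak memory is far from the cross-rank median.

    Hypothetical helper: `peak_bytes_by_rank` maps a rank id to its peak
    memory in bytes; a rank is flagged when its relative deviation from the
    median exceeds `rel_threshold`.
    """
    if not peak_bytes_by_rank:
        return []

    mid = median(peak_bytes_by_rank.values())
    if mid == 0:
        # With a zero median, any non-zero rank stands out.
        return sorted(r for r, peak in peak_bytes_by_rank.items() if peak != 0)

    return sorted(
        rank
        for rank, peak in peak_bytes_by_rank.items()
        if abs(peak - mid) / mid > rel_threshold
    )


# Synthetic example: rank 3 uses noticeably more memory than its peers.
peaks = {0: 1000, 1: 1010, 2: 990, 3: 1400}
print(drifting_ranks(peaks))
```

The median is used instead of the mean so that a single drifting rank does not pull the reference point toward itself.
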
## Choosing the right surface

### Stay in Python when

- the issue is local to one operation
- you are iterating quickly inside a notebook or training script

### Use the CLI when

- you need a portable artifact
- you want to automate the check in CI
- you want to capture a time-based signal instead of a single profile

### Use the TUI when

- you already have live telemetry or saved artifacts
- you need a visual read on the run without leaving the terminal
- you want to compare ranks or review exports interactively

## What not to do

- do not use screenshots as the source of truth when telemetry is available
- do not document a workflow from memory when `--help` or the examples can confirm it
- do not treat `GPUMemoryProfiler` as a CPU fallback; use the CPU profiler classes or CLI for that path
- do not assume `gpumemprof analyze` and `tfmemprof analyze` accept the same argument shape

## A practical starting point

If you are unsure where to begin, use this order:

Start with the pip-safe CLI path:

```bash
gpumemprof info
gpumemprof track --duration 2 --interval 0.5 --output track.json --format json
gpumemprof analyze track.json --format txt --output analysis.txt
gpumemprof diagnose --duration 0 --output ./diag
tfmemprof info
tfmemprof diagnose --duration 0 --output ./tf_diag

# Optional TUI
pip install "stormlog[tui,torch]"
stormlog
```

If you are working from a source checkout, you can optionally add:

```bash
python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated
```

That sequence gives you:

- environment truth
- a minimal CLI smoke run
- a broader release-style validation run
- an interactive place to inspect the outputs

## Next steps

- Read the [Usage Guide](usage.md) for the Python API path.
- Read the [CLI Guide](cli.md) for the automation path.
- Read the [TUI Guide](tui.md) for the terminal workflow.
- Read the [Testing and Validation Guide](testing.md) if you are attaching this to CI or release validation.