Incident Playbooks
Use these checklists when you already have a failure mode and need the shortest path to the next useful artifact or decision.
Audience: incident responders, ML engineers. Difficulty: intermediate.
Prerequisites
install the package first with Installation
use pip install "stormlog[torch]" for gpumemprof paths
use pip install "stormlog[tf]" for tfmemprof paths
use pip install "stormlog[tui,torch]" if you want to pivot into the TUI diagnostics workflow
use Command Line Guide and Troubleshooting Guide for command or environment reference
enough disk space for new diagnose or telemetry artifacts
the relevant CLI entrypoint available in the current environment
a clear incident label before you start: OOM, hidden-memory gap, degraded collector, or retention budget
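The disk-space and entrypoint prerequisites above can be checked before any capture starts. The following is a minimal sketch using only the Python standard library; the `preflight` helper and the 1 GiB default are hypothetical and not part of stormlog, so adjust the threshold to the artifact sizes you expect.

```python
import shutil
from pathlib import Path

# Hypothetical default: 1 GiB. Pick a value that matches your artifact sizes.
MIN_FREE_BYTES = 1024**3


def preflight(entrypoint, output_dir=".", min_free_bytes=MIN_FREE_BYTES):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []

    # shutil.which returns None when the command is not on PATH.
    if shutil.which(entrypoint) is None:
        problems.append(f"{entrypoint} not found on PATH")

    # disk_usage needs an existing path, so climb to the nearest existing parent.
    target = Path(output_dir).resolve()
    while not target.exists():
        target = target.parent
    free = shutil.disk_usage(target).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free under {target}")

    return problems


if __name__ == "__main__":
    for name in ("gpumemprof", "tfmemprof"):
        print(name, preflight(name, "./diag_bundle") or "ok")
```

An empty list for the entrypoint you intend to use means both prerequisites hold in the current environment.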
Success signal:
the incident ends with a saved artifact plus a concrete follow-on recipe
OOM incident
Use this when
the process already failed with an out-of-memory error
you need a portable artifact bundle, not just console logs
Immediate actions
gpumemprof diagnose --duration 0 --output ./diag_bundle
gpumemprof diagnose --native-history --duration 0 --output ./diag_bundle_native
Next steps
Keep the standard diagnose bundle even if you also collect native CUDA history.
Check whether the bundle includes the owning session_id and a clean or incomplete session status.
If the workload is PyTorch CUDA, inspect the native-history artifacts before changing allocator settings.
If the issue is recurring, enable the OOM flight-recorder path from the PyTorch Production Recipes.
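One way to check a bundle for its owning session_id is to search every JSON file in the bundle directory for that key. This is a rough sketch under two assumptions that this page does not confirm: that the diagnose bundle contains JSON files, and that the session identifier is stored under a key literally named `session_id`. Both helper functions are hypothetical.

```python
import json
from pathlib import Path


def find_key_values(node, key, found=None):
    """Recursively collect every value stored under `key` in parsed JSON."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            find_key_values(v, key, found)
    elif isinstance(node, list):
        for item in node:
            find_key_values(item, key, found)
    return found


def scan_bundle(bundle_dir, key="session_id"):
    """Map each JSON file under bundle_dir to the values it holds for `key`."""
    hits = {}
    for path in sorted(Path(bundle_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # skip unreadable or non-JSON files
        values = find_key_values(data, key)
        if values:
            hits[str(path)] = values
    return hits


if __name__ == "__main__":
    # Assumed layout: the bundle directory written by the diagnose command.
    print(scan_bundle("./diag_bundle") or "no session_id found")
```

The same search can be pointed at whichever key carries the session status by passing it as `key`; that key name is not given on this page, so confirm it against a real bundle.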
Degraded collector incident
Use this when
collector_failure_event_count is non-zero
collector health is not healthy
status events show collector_degraded or collector_recovered
Immediate actions
gpumemprof track \
--duration 30 \
--interval 0.5 \
--output ./track.json \
--format json
tfmemprof track --interval 0.5 --threshold 4096 --device /CPU:0 --output ./tf_track.json
Next steps
Confirm that the run stayed alive and emitted status events instead of synthetic zero-valued samples.
Treat any non-zero collector failure count as actionable even if the final session completed cleanly.
If the collector only fails in long runs, move to the Always-on Tracking page and qualify the deployment under sink retention.
If the failure is reproducible in a short window, archive the bounded artifact and compare it against a clean run.
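The first two checks above can be approximated on a saved track artifact. The sketch below is illustrative only: it assumes the artifact can be loaded as a list of event dictionaries, and the field names `event_type` and `allocated_bytes` are hypothetical placeholders for whatever the real schema uses. Only the status names `collector_degraded` and `collector_recovered` come from this page.

```python
from collections import Counter

STATUS_EVENTS = ("collector_degraded", "collector_recovered")


def summarize_collector_health(events, type_key="event_type",
                               value_key="allocated_bytes"):
    """Count collector status events and zero-valued samples."""
    counts = Counter(e.get(type_key) for e in events)
    degraded = counts["collector_degraded"]
    recovered = counts["collector_recovered"]

    # Anything that is not a status event is treated as a memory sample.
    samples = [e for e in events if e.get(type_key) not in STATUS_EVENTS]
    zero_samples = sum(1 for e in samples if e.get(value_key) == 0)

    return {
        "degraded": degraded,
        "recovered": recovered,
        "unrecovered": max(degraded - recovered, 0),
        "zero_valued_samples": zero_samples,
        # Any degradation is actionable, even if the session ended cleanly.
        "actionable": degraded > 0,
    }


if __name__ == "__main__":
    demo = [
        {"event_type": "sample", "allocated_bytes": 1024},
        {"event_type": "collector_degraded"},
        {"event_type": "collector_recovered"},
    ]
    print(summarize_collector_health(demo))
```

A non-zero `zero_valued_samples` count alongside no status events is the pattern to look at more closely, since the expectation is status events rather than synthetic zeros.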
Retention-budget incident
Use this when
retained files or bytes exceed expectations
rollover or pruning looks abnormal
long runs create more artifact churn than the deployment budget allows
Immediate actions
python -m examples.cli.benchmark_harness \
--profile pr \
--mode soak \
--output artifacts/benchmarks/latest_v0.4.json
This command is source-checkout only. If you installed from PyPI, use a bounded artifact path first:
gpumemprof track --duration 30 --interval 0.5 --output ./track.json --format json
Next steps
Inspect rollover_count, pruned_segment_count, pruned_bytes, final_retained_files, and final_retained_bytes together.
Tighten retention and rollover settings before reducing sampling fidelity.
If the issue needs a repeatable gate, move to CI and Release Qualification.
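Reading the five retention counters together can be sketched as a small check against a budget. The counter names come from this page; everything else is an assumption: that the counters are available as one flat dictionary, the `max_files` and `max_bytes` budget parameters, and the consistency heuristic that pruned segments and pruned bytes should both be zero or both be non-zero.

```python
RETENTION_FIELDS = (
    "rollover_count",
    "pruned_segment_count",
    "pruned_bytes",
    "final_retained_files",
    "final_retained_bytes",
)


def check_retention(summary, max_files, max_bytes):
    """Return findings for a retention summary; an empty list means in budget."""
    missing = [f for f in RETENTION_FIELDS if f not in summary]
    if missing:
        raise KeyError(f"missing retention fields: {missing}")

    findings = []
    files = summary["final_retained_files"]
    size = summary["final_retained_bytes"]
    if files > max_files:
        findings.append(f"final_retained_files {files} exceeds budget {max_files}")
    if size > max_bytes:
        findings.append(f"final_retained_bytes {size} exceeds budget {max_bytes}")

    # Heuristic (assumption): pruning should report segments and bytes together.
    if (summary["pruned_segment_count"] == 0) != (summary["pruned_bytes"] == 0):
        findings.append("pruned_segment_count and pruned_bytes disagree")

    return findings


if __name__ == "__main__":
    demo = {
        "rollover_count": 4,
        "pruned_segment_count": 2,
        "pruned_bytes": 2048,
        "final_retained_files": 3,
        "final_retained_bytes": 5000,
    }
    print(check_retention(demo, max_files=2, max_bytes=10000))
```

Raising on missing fields is deliberate: a partial summary should not pass as "within budget".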
Troubleshooting
Symptom: the incident does not match one checklist cleanly
Likely cause: the first artifact is too small or the failure mode is mixed.
Fix: capture a bounded track artifact first, then reclassify based on the saved data.
Verify: the next decision is driven by saved evidence rather than a log fragment.