Incident Playbooks

Use these checklists when you have already identified a failure mode and need the shortest path to the next useful artifact or decision.

Audience: incident responders, ML engineers. Difficulty: intermediate.

Prerequisites

install the package first by following Installation
use pip install "stormlog[torch]" for gpumemprof paths
use pip install "stormlog[tf]" for tfmemprof paths
use pip install "stormlog[tui,torch]" if you want to pivot into the TUI diagnostics workflow
see the Command Line Guide and the Troubleshooting Guide for command and environment reference
enough disk space for new diagnose or telemetry artifacts
the relevant CLI entrypoint available in the current environment
a clear incident label before you start: OOM, hidden-memory gap, degraded collector, or retention budget
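
The disk-space and entrypoint prerequisites can be checked before an incident run. The following is a minimal sketch using only the Python standard library; the helper name `preflight` and the 1 GiB free-space threshold are illustrative assumptions, not values defined by stormlog.

```python
import shutil


def preflight(entrypoints=("gpumemprof", "tfmemprof"), path=".", min_free_bytes=1024**3):
    """Report which CLI entrypoints are on PATH and whether `path` has enough free space.

    The 1 GiB default is an arbitrary placeholder; size it to your own artifacts.
    """
    free = shutil.disk_usage(path).free
    return {
        # shutil.which returns None when the command is not found on PATH
        "entrypoints": {name: shutil.which(name) is not None for name in entrypoints},
        "free_bytes": free,
        "enough_disk": free >= min_free_bytes,
    }


if __name__ == "__main__":
    print(preflight())
```

Only the entrypoint that matches your framework needs to be present, so a `False` for the other one is not necessarily a problem.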

Success signal:

the incident ends with a saved artifact plus a concrete follow-on recipe

OOM incident

Use this when

the process already failed with an out-of-memory error
you need a portable artifact bundle, not just console logs

Immediate actions

Standard diagnose bundle:

gpumemprof diagnose --duration 0 --output ./diag_bundle

Diagnose bundle with native CUDA history:

gpumemprof diagnose --native-history --duration 0 --output ./diag_bundle_native

Next steps

Keep the standard diagnose bundle even if you also collect native CUDA history.
Check whether the bundle includes the owning session_id and whether it reports a clean or an incomplete session status.
If the workload is PyTorch CUDA, inspect the native-history artifacts before changing allocator settings.
If the issue is recurring, enable the OOM flight-recorder path from the PyTorch Production Recipes.
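
The session check above can be scripted once the bundle is on disk. This is a rough sketch that assumes the bundle directory contains JSON files; the actual bundle layout is not described here, so the code searches every JSON file recursively. `session_id` comes from this page, while the `status` key name is a guess and may differ in real bundles.

```python
import json
from pathlib import Path


def find_key(obj, key):
    """Yield every value stored under `key` anywhere in nested dicts and lists."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)


def summarize_bundle(bundle_dir):
    """Map each JSON file in the bundle to the session_id and status values found in it."""
    summary = {}
    for path in sorted(Path(bundle_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # unreadable or not valid JSON: skip it
        ids = list(find_key(data, "session_id"))
        statuses = list(find_key(data, "status"))  # "status" is an assumed key name
        if ids or statuses:
            summary[str(path.relative_to(bundle_dir))] = {
                "session_id": ids,
                "status": statuses,
            }
    return summary


if __name__ == "__main__":
    print(json.dumps(summarize_bundle("./diag_bundle"), indent=2))
```

An empty result means no JSON file in the bundle carried either key, which is itself worth noting before moving on.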

Hidden-memory-gap incident

Use this when

allocated memory does not explain the observed peak
the report includes gap_analysis
the multi-rank case suggests communication-driven spikes

Immediate actions

PyTorch (gpumemprof):

gpumemprof analyze ./track.json --format txt --output ./analysis.txt

TensorFlow (tfmemprof):

tfmemprof analyze --input ./tf_monitor.json --detect-leaks --optimize --report ./tf_report.txt

Next steps

Start with gap_analysis before assuming a leak.
If collective_attribution is populated in a multi-rank run, treat communication or synchronization as a first-class cause.
Load the artifacts into the TUI Diagnostics tab if you need rank filtering and timeline focus.
If the workload is distributed, move to the Distributed Diagnostics Recipes.
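
The ordering in these steps can be expressed as a small decision helper. The sketch below assumes the report has been parsed into a Python dict with `gap_analysis` and `collective_attribution` as top-level keys; that placement and the returned labels are illustrative assumptions, not part of the stormlog report format.

```python
def next_step(report):
    """Pick a next step from a parsed report dict (top-level key placement is assumed)."""
    gap = report.get("gap_analysis")
    collectives = report.get("collective_attribution")
    if not gap:
        # This checklist expects a report that includes gap_analysis.
        return "not_applicable"
    if collectives:
        # Populated attribution: treat communication or synchronization as a cause.
        return "communication"
    # Gap data only: review it before assuming a leak.
    return "gap_only"


if __name__ == "__main__":
    example = {"gap_analysis": {"findings": []}, "collective_attribution": []}
    print(next_step(example))
```

Empty containers count as "not populated" here, which matches the intent of checking whether collective_attribution actually carries data.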

Degraded collector incident

Use this when

collector_failure_event_count is non-zero
collector health is not healthy
status events show collector_degraded or collector_recovered

Immediate actions

PyTorch (gpumemprof):

gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --output ./track.json \
  --format json

TensorFlow (tfmemprof):

tfmemprof track --interval 0.5 --threshold 4096 --device /CPU:0 --output ./tf_track.json

Next steps

Confirm that the run stayed alive and emitted status events instead of synthetic zero-valued samples.
Treat any non-zero collector failure count as actionable even if the final session completed cleanly.
If the collector only fails in long runs, move to the Always-on Tracking page and qualify the deployment under sink retention.
If the failure is reproducible in a short window, archive the bounded artifact and compare it against a clean run.
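
The first two checks can be approximated with a short script over the tracked events. This sketch assumes the events are available as a list of dicts; the field names `event_type`, `allocated_bytes`, and `reserved_bytes` are hypothetical and should be replaced with whatever the track artifact actually uses. Only `collector_degraded` and `collector_recovered` are taken from this page.

```python
def collector_summary(events, memory_fields=("allocated_bytes", "reserved_bytes")):
    """Count collector status events and samples whose memory fields are all zero.

    `event_type` and the default `memory_fields` are assumed names, not a documented schema.
    """
    degraded = recovered = zero_samples = 0
    for event in events:
        kind = event.get("event_type")
        if kind == "collector_degraded":
            degraded += 1
        elif kind == "collector_recovered":
            recovered += 1
        elif all(event.get(field) == 0 for field in memory_fields):
            # A sample with every memory field at zero may be synthetic.
            zero_samples += 1
    return {
        "degraded": degraded,
        "recovered": recovered,
        "zero_samples": zero_samples,
        # Any degraded event is actionable, even if the session later recovered.
        "actionable": degraded > 0,
    }


if __name__ == "__main__":
    sample_events = [
        {"event_type": "sample", "allocated_bytes": 1024, "reserved_bytes": 2048},
        {"event_type": "collector_degraded"},
        {"event_type": "collector_recovered"},
    ]
    print(collector_summary(sample_events))
```

A non-zero `zero_samples` count alongside degraded events suggests the run emitted placeholder samples rather than status events, which is the case the first step asks you to rule out.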

Retention-budget incident

Use this when

retained files or bytes exceed expectations
rollover or pruning looks abnormal
long runs create more artifact churn than the deployment budget allows

Immediate actions

python -m examples.cli.benchmark_harness \
  --profile pr \
  --mode soak \
  --output artifacts/benchmarks/latest_v0.4.json

This command requires a source checkout. If you installed from PyPI, start with a bounded track artifact:

gpumemprof track --duration 30 --interval 0.5 --output ./track.json --format json

Next steps

Inspect rollover_count, pruned_segment_count, pruned_bytes, final_retained_files, and final_retained_bytes together.
Tighten retention and rollover settings before reducing sampling fidelity.
If the issue needs a repeatable gate, move to CI and Release Qualification.
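
Reading the five retention counters together can be reduced to a small budget check. In this sketch the counter names come from this page, but where they live in the benchmark output is not specified, so the function takes an already-extracted dict. The budget arguments and the "rollover without pruning" heuristic are illustrative assumptions.

```python
RETENTION_FIELDS = (
    "rollover_count",
    "pruned_segment_count",
    "pruned_bytes",
    "final_retained_files",
    "final_retained_bytes",
)


def check_retention(metrics, max_files, max_bytes):
    """Return a list of findings for a dict holding the five retention counters."""
    missing = [name for name in RETENTION_FIELDS if name not in metrics]
    if missing:
        raise ValueError(f"missing retention fields: {missing}")

    findings = []
    if metrics["final_retained_files"] > max_files:
        findings.append("retained file count exceeds budget")
    if metrics["final_retained_bytes"] > max_bytes:
        findings.append("retained bytes exceed budget")
    # Assumed heuristic: over budget with rollovers but no pruning hints at a pruning problem.
    if findings and metrics["rollover_count"] > 0 and metrics["pruned_segment_count"] == 0:
        findings.append("rollover occurred but nothing was pruned")
    return findings


if __name__ == "__main__":
    example = {
        "rollover_count": 3,
        "pruned_segment_count": 0,
        "pruned_bytes": 0,
        "final_retained_files": 9,
        "final_retained_bytes": 500,
    }
    print(check_retention(example, max_files=5, max_bytes=1000))
```

Requiring all five counters up front keeps the check from passing silently when part of the retention data is absent.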

Troubleshooting

Symptom: the incident does not match one checklist cleanly

Likely cause: the first artifact is too small or the failure mode is mixed.
Fix: capture a bounded track artifact first, then reclassify based on the saved data.
Verify: the next decision is driven by saved evidence rather than a log fragment.
