
Incident Playbooks

Use these checklists when you already have a failure mode and need the shortest path to the next useful artifact or decision.

Audience: incident responders, ML engineers. Difficulty: intermediate.

Prerequisites

  • install the package first; see Installation

  • use pip install "stormlog[torch]" for gpumemprof paths

  • use pip install "stormlog[tf]" for tfmemprof paths

  • use pip install "stormlog[tui,torch]" if you want to pivot into the TUI diagnostics workflow

  • see the Command Line Guide and Troubleshooting Guide for command or environment reference

  • confirm there is enough disk space for new diagnose or telemetry artifacts

  • confirm the relevant CLI entrypoint is available in the current environment

  • choose a clear incident label before you start: OOM, hidden-memory gap, degraded collector, or retention budget

Success signal:

  • the incident ends with a saved artifact plus a concrete follow-on recipe

OOM incident

Use this when

  • the process already failed with an out-of-memory error

  • you need a portable artifact bundle, not just console logs

Immediate actions

gpumemprof diagnose --duration 0 --output ./diag_bundle
gpumemprof diagnose --native-history --duration 0 --output ./diag_bundle_native

Next steps

  1. Keep the standard diagnose bundle even if you also collect native CUDA history.

  2. Check whether the bundle records the owning session_id and whether the session status is clean or incomplete.

  3. If the workload is PyTorch CUDA, inspect the native-history artifacts before changing allocator settings.

  4. If the issue is recurring, enable the OOM flight-recorder path from the PyTorch Production Recipes.
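
As a rough aid for step 2, the sketch below scans a bundle directory for session_id values. It is a minimal sketch under assumptions: the bundle layout is not described on this page, so the code simply reads every *.json file it finds and searches the parsed data recursively. The helper names find_key and summarize_bundle are hypothetical and are not part of stormlog.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def summarize_bundle(bundle_dir: str) -> dict:
    """Collect session_id values from the JSON files under a bundle directory.

    Assumption: the bundle contains JSON files. The exact layout is unknown
    here, so every *.json file is scanned and unreadable files are reported.
    """
    session_ids = set()
    unreadable = []
    for path in sorted(Path(bundle_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Truncated or partially written files are common after an OOM.
            unreadable.append(str(path))
            continue
        session_ids.update(str(v) for v in find_key(data, "session_id"))
    return {"session_ids": sorted(session_ids), "unreadable": unreadable}
```

For example, summarize_bundle("./diag_bundle") returns the distinct session IDs it found plus any files that failed to parse. More than one session ID, or an unreadable file, is a hint that the bundle deserves a closer look before you trust it.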

Hidden-memory-gap incident

Use this when

  • allocated memory does not explain the observed peak

  • the report includes gap_analysis

  • the multi-rank case suggests communication-driven spikes

Immediate actions

gpumemprof analyze ./track.json --format txt --output ./analysis.txt
tfmemprof analyze --input ./tf_monitor.json --detect-leaks --optimize --report ./tf_report.txt

Next steps

  1. Start with gap_analysis before assuming a leak.

  2. If collective_attribution is populated in a multi-rank run, treat communication or synchronization as a first-class cause.

  3. Load the artifacts into the TUI Diagnostics tab if you need rank filtering and timeline focus.

  4. If the workload is distributed, move to the Distributed Diagnostics Recipes.
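
The ordering of the steps above can be sketched as a small triage function. This is an illustration only: it assumes the report has been loaded as a dictionary with gap_analysis and collective_attribution as top-level keys, which this page does not confirm, and the function name next_step and its return labels are hypothetical.

```python
from typing import Any, Mapping


def next_step(report: Mapping[str, Any], multi_rank: bool) -> str:
    """Pick the next triage step from a parsed report.

    Assumption: gap_analysis and collective_attribution are top-level keys.
    Adjust the lookups if the real report nests them elsewhere.
    """
    gap = report.get("gap_analysis")
    attribution = report.get("collective_attribution")

    if not gap:
        # Without gap_analysis this checklist does not apply yet.
        return "missing_gap_analysis"
    if multi_rank and attribution:
        # Populated attribution in a multi-rank run points at communication
        # or synchronization before it points at a leak.
        return "check_collectives"
    return "review_gap_analysis"
```

A report with gap_analysis and populated collective_attribution from a multi-rank run yields "check_collectives", which corresponds to moving to the Distributed Diagnostics Recipes.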

Degraded collector incident

Use this when

  • collector_failure_event_count is non-zero

  • collector health is not healthy

  • status events show collector_degraded or collector_recovered

Immediate actions

gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --output ./track.json \
  --format json
tfmemprof track --interval 0.5 --threshold 4096 --device /CPU:0 --output ./tf_track.json

Next steps

  1. Confirm that the run stayed alive and emitted status events instead of synthetic zero-valued samples.

  2. Treat any non-zero collector failure count as actionable even if the final session completed cleanly.

  3. If the collector only fails in long runs, move to the Always-on Tracking page and qualify the deployment with sink retention configured.

  4. If the failure is reproducible in a short window, archive the bounded artifact and compare it against a clean run.
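
Steps 1 and 2 can be approximated with a short pass over the recorded events. The sketch below assumes the track artifact can be loaded as a list of dictionaries, that each status event carries its name in an event_type field, and that samples carry a numeric allocated_bytes field. Both field names are hypothetical placeholders, exposed as parameters so you can substitute the names your artifact actually uses.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Status event names taken from the "Use this when" list above.
COLLECTOR_EVENTS = ("collector_degraded", "collector_recovered")


def collector_summary(
    events: Iterable[Mapping[str, Any]],
    type_field: str = "event_type",        # hypothetical field name
    value_field: str = "allocated_bytes",  # hypothetical field name
) -> dict:
    """Count collector status events and zero-valued samples."""
    counts: Counter = Counter()
    zero_samples = 0
    for event in events:
        kind = event.get(type_field)
        if kind in COLLECTOR_EVENTS:
            counts[kind] += 1
        elif value_field in event and event[value_field] == 0:
            # A zero can be legitimate; a run of zeros around a failure
            # window is what deserves attention.
            zero_samples += 1
    degraded = counts["collector_degraded"]
    recovered = counts["collector_recovered"]
    return {
        "degraded": degraded,
        "recovered": recovered,
        "zero_valued_samples": zero_samples,
        "unrecovered": degraded > recovered,
    }
```

A non-zero degraded count is actionable even when unrecovered is False, in line with step 2.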

Retention-budget incident

Use this when

  • retained files or bytes exceed expectations

  • rollover or pruning looks abnormal

  • long runs create more artifact churn than the deployment budget allows

Immediate actions

python -m examples.cli.benchmark_harness \
  --profile pr \
  --mode soak \
  --output artifacts/benchmarks/latest_v0.4.json

The benchmark harness command above runs only from a source checkout. If you installed from PyPI, capture a bounded track artifact first:

gpumemprof track --duration 30 --interval 0.5 --output ./track.json --format json

Next steps

  1. Inspect rollover_count, pruned_segment_count, pruned_bytes, final_retained_files, and final_retained_bytes together.

  2. Tighten retention and rollover settings before reducing sampling fidelity.

  3. If the issue needs a repeatable gate, move to CI and Release Qualification.
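
To read the five counters from step 1 together rather than one at a time, a check along these lines may help. It is a sketch under assumptions: the counters are taken to be available as a flat dictionary of numbers, and check_retention, max_files, and max_bytes are hypothetical names for a budget you define yourself.

```python
from typing import List, Mapping

# Counter names taken from step 1 above.
RETENTION_FIELDS = (
    "rollover_count",
    "pruned_segment_count",
    "pruned_bytes",
    "final_retained_files",
    "final_retained_bytes",
)


def check_retention(
    metrics: Mapping[str, float], max_files: int, max_bytes: int
) -> List[str]:
    """Return human-readable findings; an empty list means within budget."""
    missing = [name for name in RETENTION_FIELDS if name not in metrics]
    if missing:
        raise KeyError(f"missing retention counters: {missing}")

    findings = []
    over_files = metrics["final_retained_files"] > max_files
    over_bytes = metrics["final_retained_bytes"] > max_bytes
    if over_files:
        findings.append("retained files exceed budget")
    if over_bytes:
        findings.append("retained bytes exceed budget")
    if (over_files or over_bytes) and metrics["pruned_segment_count"] == 0:
        # Over budget with no pruning suggests pruning never ran.
        findings.append("over budget but nothing was pruned")
    if metrics["rollover_count"] == 0 and metrics["pruned_segment_count"] > 0:
        findings.append("segments pruned without any rollover")
    return findings
```

The same function can serve as a repeatable gate: fail the job when the returned list is non-empty.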

Troubleshooting

Symptom: the incident does not match one checklist cleanly

Likely cause: the first artifact is too small or the failure mode is mixed.

Fix: capture a bounded track artifact first, then reclassify based on the saved data.

Verify: the next decision is driven by saved evidence rather than a log fragment.

