[← Back to Production Cookbook](index.md)

# Incident Playbooks

Use these checklists when you already have a failure mode and need the shortest path to the next useful artifact or decision.

Audience: incident responders, ML engineers. Difficulty: intermediate.

## Prerequisites

- install the package first with [Installation](../installation.md)
- use `pip install "stormlog[torch]"` for `gpumemprof` paths
- use `pip install "stormlog[tf]"` for `tfmemprof` paths
- use `pip install "stormlog[tui,torch]"` if you want to pivot into the TUI diagnostics workflow
- use [Command Line Guide](../cli.md) and [Troubleshooting Guide](../troubleshooting.md) for command or environment reference
- enough disk space for new diagnose or telemetry artifacts
- the relevant CLI entrypoint available in the current environment
- a clear incident label before you start: OOM, hidden-memory gap, degraded collector, or retention budget

Success signal:

- the incident ends with a saved artifact plus a concrete follow-on recipe

## OOM incident

### Use this when

- the process already failed with an out-of-memory error
- you need a portable artifact bundle, not just console logs

### Immediate actions

```bash
gpumemprof diagnose --duration 0 --output ./diag_bundle
```

```bash
gpumemprof diagnose --native-history --duration 0 --output ./diag_bundle_native
```

### Next steps

1. Keep the standard diagnose bundle even if you also collect native CUDA history.
2. Check whether the bundle includes the owning `session_id` and a clean or incomplete session status.
3. If the workload is PyTorch CUDA, inspect the native-history artifacts before changing allocator settings.
4. If the issue is recurring, enable the OOM flight-recorder path from the [PyTorch Production Recipes](pytorch.md).
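The session check in step 2 can be scripted when a bundle holds many files. The sketch below is only an illustration under stated assumptions: it assumes the diagnose output is a directory containing JSON files, and it searches for a `status` key whose name is a guess rather than a documented field. `find_key` and `summarize_bundle` are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def summarize_bundle(bundle_dir, keys=("session_id", "status")):
    """Return (found, unreadable) for the JSON files under bundle_dir.

    found maps each file path to the values located for each key.
    unreadable lists files that could not be parsed, which is itself
    a useful signal after a crash.
    The "status" key name is an assumption, not a documented field.
    """
    found, unreadable = {}, []
    for path in sorted(Path(bundle_dir).rglob("*.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            unreadable.append(str(path))
            continue
        hits = {key: list(find_key(data, key)) for key in keys}
        if any(hits.values()):
            found[str(path)] = hits
    return found, unreadable
```

Calling `summarize_bundle("./diag_bundle")` returns the matching values per file, so a missing `session_id` or a truncated file shows up without opening each artifact by hand. Adjust the key names to whatever the bundle actually contains.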
## Hidden-memory-gap incident

### Use this when

- allocated memory does not explain the observed peak
- the report includes `gap_analysis`
- the multi-rank case suggests communication-driven spikes

### Immediate actions

```bash
gpumemprof analyze ./track.json --format txt --output ./analysis.txt
```

```bash
tfmemprof analyze --input ./tf_monitor.json --detect-leaks --optimize --report ./tf_report.txt
```

### Next steps

1. Start with `gap_analysis` before assuming a leak.
2. If `collective_attribution` is populated in a multi-rank run, treat communication or synchronization as a first-class cause.
3. Load the artifacts into the TUI Diagnostics tab if you need rank filtering and timeline focus.
4. If the workload is distributed, move to the [Distributed Diagnostics Recipes](distributed.md).

## Degraded collector incident

### Use this when

- `collector_failure_event_count` is non-zero
- collector health is not `healthy`
- status events show `collector_degraded` or `collector_recovered`

### Immediate actions

```bash
gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --output ./track.json \
  --format json
```

```bash
tfmemprof track --interval 0.5 --threshold 4096 --device /CPU:0 --output ./tf_track.json
```

### Next steps

1. Confirm that the run stayed alive and emitted status events instead of synthetic zero-valued samples.
2. Treat any non-zero collector failure count as actionable even if the final session completed cleanly.
3. If the collector only fails in long runs, move to the [Always-on Tracking](always_on.md) page and qualify the deployment under sink retention.
4. If the failure is reproducible in a short window, archive the bounded artifact and compare it against a clean run.
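The collector checks above can be run against a saved track artifact. The sketch below is a minimal illustration and makes no claim about the real artifact schema: it assumes the artifact is JSON, that `collector_failure_event_count` appears as a numeric field somewhere in it, and that status events store `collector_degraded` or `collector_recovered` as a string value under some key. `walk`, `collector_summary`, and `check_artifact` are hypothetical helpers.

```python
import json
from collections import Counter

STATUS_EVENTS = ("collector_degraded", "collector_recovered")


def walk(node):
    """Yield (key, value) for every dict entry in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key, value
            yield from walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk(item)


def collector_summary(data):
    """Collect failure counts and status-event names from parsed JSON.

    Field placement is an assumption; the search is schema-agnostic
    and only looks at values stored directly under a dict key.
    """
    failure_counts = []
    events = Counter()
    for key, value in walk(data):
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key == "collector_failure_event_count" and is_number:
            failure_counts.append(value)
        if isinstance(value, str) and value in STATUS_EVENTS:
            events[value] += 1
    # Any non-zero failure count or any degraded event is actionable.
    actionable = any(count > 0 for count in failure_counts) or events["collector_degraded"] > 0
    return {
        "failure_counts": failure_counts,
        "events": dict(events),
        "actionable": actionable,
    }


def check_artifact(path):
    """Load a JSON artifact from disk and summarize collector health."""
    with open(path, encoding="utf-8") as handle:
        return collector_summary(json.load(handle))
```

For example, `check_artifact("./track.json")` flags the run as actionable whenever a failure count is positive, regardless of how the session ended, which matches step 2.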
## Retention-budget incident

### Use this when

- retained files or bytes exceed expectations
- rollover or pruning looks abnormal
- long runs create more artifact churn than the deployment budget allows

### Immediate actions

```bash
python -m examples.cli.benchmark_harness \
  --profile pr \
  --mode soak \
  --output artifacts/benchmarks/latest_v0.4.json
```

This command is source-checkout only. If you installed from PyPI, use a bounded artifact path first:

```bash
gpumemprof track --duration 30 --interval 0.5 --output ./track.json --format json
```

### Next steps

1. Inspect `rollover_count`, `pruned_segment_count`, `pruned_bytes`, `final_retained_files`, and `final_retained_bytes` together.
2. Tighten retention and rollover settings before reducing sampling fidelity.
3. If the issue needs a repeatable gate, move to [CI and Release Qualification](ci_release.md).

## Troubleshooting

### Symptom: the incident does not match one checklist cleanly

Likely cause: the first artifact is too small or the failure mode is mixed.

Fix: capture a bounded `track` artifact first, then reclassify based on the saved data.

Verify: the next decision is driven by saved evidence rather than a log fragment.

---

[← Back to Production Cookbook](index.md)