[← Back to main docs](../index.md)

# Production Cookbook

This cookbook packages Stormlog's profiling, tracking, diagnostic, and TUI flows into task-oriented recipes for production-facing work. Use these pages when the tool is already installed and you want the shortest path to a reliable operational workflow.

Audience: operators, ML engineers, release owners. Difficulty: intermediate.

## Before you choose a recipe

- Read the [Installation Guide](../installation.md) first if the environment is not already set up.
- Use the [Command Line Guide](../cli.md) if you need option-by-option reference instead of a task recipe.
- If you installed from PyPI, use the pip-safe CLI commands on each page.
- If you are working from a source checkout, you can also use the maintained `examples/` and `benchmark_harness` flows for qualification.
- If you need API signatures, go back to the [Usage Guide](../usage.md) or the generated API reference.

## Choose the right recipe

| Goal | Start here |
| --- | --- |
| keep long-running artifacts bounded | [Always-on Tracking](always_on.md) |
| respond to a PyTorch incident quickly | [PyTorch Production Recipes](pytorch.md) |
| respond to a TensorFlow incident quickly | [TensorFlow Production Recipes](tensorflow.md) |
| compare ranks or rebuild distributed timelines | [Distributed Diagnostics Recipes](distributed.md) |
| triage OOM or hidden-memory-gap findings | [Incident Playbooks](incidents.md) |
| qualify operational behavior in CI or before release | [CI and Release Qualification](ci_release.md) |

## Recipes

### Always-on tracking and bounded artifact budgets

Use the [Always-on Tracking](always_on.md) recipe when you want a long-running tracking session with append-only sink files, retention limits, and explicit guidance for degraded collectors.
### PyTorch production profiling and OOM capture

Use the [PyTorch Production Recipes](pytorch.md) page when you need to move from a live PyTorch issue to a saved telemetry or OOM artifact quickly.

### TensorFlow production profiling and diagnosis

Use the [TensorFlow Production Recipes](tensorflow.md) page when the workload is owned by TensorFlow and you need track, analyze, and diagnose guidance that matches the current `tfmemprof` behavior.

### Distributed and rank-aware diagnosis

Use the [Distributed Diagnostics Recipes](distributed.md) page when you need to track multiple ranks, preserve rank identity in artifacts, and rebuild rank-aware diagnostics later in the TUI.

### Incident triage playbooks

Use the [Incident Playbooks](incidents.md) page when the main question is what to do next after an OOM, hidden-memory-gap result, degraded collector, or always-on retention issue.

### CI and release qualification

Use the [CI and Release Qualification](ci_release.md) page when you need one place for source-checkout smoke commands, benchmark harness gates, and artifact archival guidance.

## Suggested reading order

### New production deployment

1. [Always-on Tracking](always_on.md)
2. [Incident Playbooks](incidents.md)
3. [CI and Release Qualification](ci_release.md)

### PyTorch incident response

1. [PyTorch Production Recipes](pytorch.md)
2. [Incident Playbooks](incidents.md)
3. [Distributed Diagnostics Recipes](distributed.md) if more than one rank is involved

### TensorFlow incident response

1. [TensorFlow Production Recipes](tensorflow.md)
2. [Incident Playbooks](incidents.md)
3. [Distributed Diagnostics Recipes](distributed.md) if more than one rank is involved