
Production Cookbook

This cookbook packages Stormlog’s profiling, tracking, diagnostic, and TUI flows into task-oriented recipes for production-facing work.

Use these pages when Stormlog is already installed and you want the shortest path to a reliable operational workflow.

Audience: operators, ML engineers, release owners. Difficulty: intermediate.

Before you choose a recipe

  • Read the Installation Guide first if the environment is not already set up.

  • Use the Usage Guide, Command Line Guide, or generated API reference if you need API signatures or option-by-option reference instead of a task recipe.

  • If you installed from PyPI, use the pip-safe CLI commands on each page.

  • If you are working from a source checkout, you can also use the maintained examples/ and benchmark_harness flows for qualification.

Choose the right recipe

| Goal | Start here |
| --- | --- |
| Keep long-running artifacts bounded | Always-on Tracking |
| Respond to a PyTorch incident quickly | PyTorch Production Recipes |
| Respond to a TensorFlow incident quickly | TensorFlow Production Recipes |
| Compare ranks or rebuild distributed timelines | Distributed Diagnostics Recipes |
| Triage OOM or hidden-memory-gap findings | Incident Playbooks |
| Qualify operational behavior in CI or before release | CI and Release Qualification |

Recipes

Always-on tracking and bounded artifact budgets

Use the Always-on Tracking recipe when you want a long-running tracking session with append-only sink files, retention limits, and explicit guidance for degraded collectors.
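To make the idea of append-only sink files with a retention limit concrete, here is a minimal Python sketch. It is not Stormlog's sink implementation: the BoundedJsonlSink class, the events-NNNNNN.jsonl naming, and the max_bytes_per_file and max_files parameters are all hypothetical names chosen for illustration. The Always-on Tracking recipe remains the reference for the real options.

```python
import json
from pathlib import Path


class BoundedJsonlSink:
    """Hypothetical append-only JSONL sink with a bounded disk footprint.

    Illustration only; this is not part of Stormlog's API.
    """

    def __init__(self, directory, max_bytes_per_file=1_000_000, max_files=5):
        if max_files < 1:
            raise ValueError("max_files must be at least 1")
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.max_bytes_per_file = max_bytes_per_file
        self.max_files = max_files
        # Resume after the newest existing file so restarts keep appending.
        existing = sorted(self.directory.glob("events-*.jsonl"))
        self._index = int(existing[-1].stem.split("-")[1]) if existing else 0

    def _current_path(self):
        return self.directory / f"events-{self._index:06d}.jsonl"

    def append(self, event):
        line = json.dumps(event) + "\n"
        path = self._current_path()
        # Roll over to a new file once this write would exceed the per-file budget.
        if path.exists() and path.stat().st_size + len(line.encode()) > self.max_bytes_per_file:
            self._index += 1
            path = self._current_path()
        # Existing lines are never rewritten; records are only appended.
        with path.open("a", encoding="utf-8") as handle:
            handle.write(line)
        self._enforce_retention()

    def _enforce_retention(self):
        # Zero-padded names sort chronologically, so the oldest files come first.
        files = sorted(self.directory.glob("events-*.jsonl"))
        for stale in files[:-self.max_files]:
            stale.unlink()
```

With this design the worst-case disk usage is roughly max_files × max_bytes_per_file, which is the kind of explicit budget an always-on session needs before it is left unattended.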

PyTorch production profiling and OOM capture

Use the PyTorch Production Recipes page when you need to move from a live PyTorch issue to a saved telemetry or OOM artifact quickly.

TensorFlow production profiling and diagnosis

Use the TensorFlow Production Recipes page when the workload runs on TensorFlow and you need guidance for the track, analyze, and diagnose steps that matches current tfmemprof behavior.

Distributed and rank-aware diagnosis

Use the Distributed Diagnostics Recipes page when you need to track multiple ranks, preserve rank identity in artifacts, and rebuild rank-aware diagnostics later in the TUI.
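One common way to preserve rank identity is to encode the rank in each artifact path so that files from different ranks never collide and can be regrouped later. The sketch below assumes a launcher that exports RANK and WORLD_SIZE environment variables, as torchrun does. The rank_artifact_path helper and the rankNN directory layout are hypothetical and may not match the layout Stormlog actually writes.

```python
import os
from pathlib import Path


def rank_artifact_path(base_dir, name, env=None):
    """Return a per-rank artifact path (hypothetical layout, not Stormlog's).

    Reads RANK and WORLD_SIZE from the environment and falls back to a
    single-process run (rank 0 of 1) when they are unset.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    # Zero-pad to the width of the highest rank so directories sort numerically.
    width = max(len(str(world_size - 1)), 1)
    return Path(base_dir) / f"rank{rank:0{width}d}" / name


if __name__ == "__main__":
    # Example: rank 3 in a 16-process job.
    print(rank_artifact_path("artifacts", "telemetry.jsonl",
                             {"RANK": "3", "WORLD_SIZE": "16"}))
```

Keeping the rank in the path, rather than only inside the file contents, means a later tool can group artifacts by rank without parsing every file first.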

Incident triage playbooks

Use the Incident Playbooks page when the main question is what to do next after an OOM, hidden-memory-gap result, degraded collector, or always-on retention issue.

CI and release qualification

Use the CI and Release Qualification page when you need one place for source-checkout smoke commands, benchmark harness gates, and artifact archival guidance.

Suggested reading order

New production deployment

  1. Always-on Tracking

  2. Incident Playbooks

  3. CI and Release Qualification

PyTorch incident response

  1. PyTorch Production Recipes

  2. Incident Playbooks

  3. Distributed Diagnostics Recipes if more than one rank is involved

TensorFlow incident response

  1. TensorFlow Production Recipes

  2. Incident Playbooks

  3. Distributed Diagnostics Recipes if more than one rank is involved

