Stormlog Documentation
Stormlog ships three surfaces that should be treated as one workflow:
- Python APIs for profiling or tracking inside code
- CLI commands for telemetry capture and artifact generation
- a Textual TUI for live monitoring, visualization export, and diagnostics
Use the guides below based on the job you are doing, not based on package internals.
Important: Pip vs Source Checkout
If you installed Stormlog via `pip install stormlog` (from PyPI):
- The `examples/` package is not included. Commands like `python -m examples.cli.quickstart` will fail with `ModuleNotFoundError`.
- Use the CLI commands and Python snippets in this documentation instead. The `gpumemprof`, `tfmemprof`, and `jaxmemprof` CLIs and the public Python APIs work with a pip install.
- Install `stormlog[tui,torch]` if you want the `stormlog` TUI entrypoint from a pip install.
- The TUI Capability Matrix and OOM scenario buttons run example modules. If those fail, use the inline command runner in the TUI with the equivalent CLI commands from this guide.
If you cloned the repository and installed with `pip install -e .`:
- You have access to the `examples/`, `tests/`, and `docs/` source trees. The example modules and scenario runners will work.
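The difference between the two install modes can be checked before running any example command. The following is a minimal sketch, not part of Stormlog itself: the helper names `module_available` and `describe_install` are hypothetical, and the check assumes that the `examples` package is only importable when you run from a source checkout. An unrelated `examples` package elsewhere on your path would give a false positive.

```python
import importlib.util
import shutil

# CLI names taken from this documentation.
STORMLOG_CLIS = ("gpumemprof", "tfmemprof", "jaxmemprof")


def module_available(name: str) -> bool:
    """Return True if a top-level module or package can be located."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for modules with a broken or missing spec.
        return False


def describe_install() -> dict:
    """Summarize which Stormlog surfaces look usable in this environment.

    Hypothetical helper: it only inspects the import path and PATH,
    and does not import or run anything from Stormlog.
    """
    return {
        # Expected True only from a source checkout (assumption, see lead-in).
        "examples_importable": module_available("examples"),
        # True when the console script is found on PATH.
        "clis_on_path": {cli: shutil.which(cli) is not None for cli in STORMLOG_CLIS},
    }


if __name__ == "__main__":
    info = describe_install()
    if not info["examples_importable"]:
        print("examples/ not found: use the CLI commands and Python snippets instead.")
    for cli, found in info["clis_on_path"].items():
        print(f"{cli}: {'found' if found else 'not on PATH'}")
```

If `examples_importable` is false, commands such as `python -m examples.cli.quickstart` would be expected to fail with `ModuleNotFoundError`, as described above.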
Guides
- Installation Guide
- Usage Guide
- Command Line Guide
- Terminal UI Guide
- Production Cookbook
- Always-on Tracking
- PyTorch Production Recipes
  - Prerequisites
  - Choose the first PyTorch recipe
  - Recipe: profile a single GPU step
  - Recipe: validate a real L4 training run and pull back the artifacts
  - Recipe: capture a bounded CLI artifact window
  - Recipe: export a diagnose bundle
  - Recipe: capture CUDA-native OOM evidence
  - Recipe: generate the annotated allocator-history HTML
  - Recipe: turn saved telemetry into a report
  - Recipe: rehearse the OOM workflow safely from a source checkout
  - What to look for in the report
  - What to do next
  - Troubleshooting
- TensorFlow Production Recipes
  - Prerequisites
  - Choose the first TensorFlow recipe
  - Recipe: profile a GPU matmul step
  - Recipe: capture a bounded CLI timeline
  - Recipe: track TensorFlow memory over time
  - Recipe: run TensorFlow analysis
  - Recipe: produce a diagnose bundle
  - Recipe: run the end-to-end TensorFlow flow from a source checkout
  - What to look for in the results
  - What to do next
  - Troubleshooting
- JAX Production Recipes
- Distributed Diagnostics Recipes
  - Prerequisites
  - Choose the distributed path
  - When this is the right recipe
  - Recipe: validate a reference `torchrun` DDP run on Jarvis
  - Recipe: capture rank-aware PyTorch artifacts
  - Recipe: capture rank-aware TensorFlow artifacts
  - Recipe: load multiple rank artifacts in the TUI
  - What to look for
  - What to do next
  - Troubleshooting
- Incident Playbooks
- CI and Release Qualification
- Examples Guide
- Testing and Validation Guide
- Troubleshooting Guide
- CPU Compatibility Guide
- Compatibility Matrix
- Benchmark Harness (v0.4)
- TelemetryEvent v3 Schema
  - Required fields
  - Session identity and lifecycle
  - Distributed identity fields
  - Collector values
  - Backend capability metadata
  - Collector health metadata
  - Structured phase metadata
  - Timeline markers
  - Analyzer phase attribution payloads
  - Legacy compatibility
  - Distributed env inference
  - Python API
  - Append-only sink layout
  - Diagnose and OOM manifests
  - Reconstructing a capture
- Stormlog Telemetry Projection
- Local Query Layer
- Durable Issue Fingerprinting
- PyTorch Guide
- TensorFlow Guide
- JAX Testing Guide
- Stormlog in real workflows
  - Why memory issues are hard to debug
  - The three daily workflows
    - 1. ML engineer instrumenting a training step
    - 2. Researcher or debugger chasing a memory regression
    - 3. CI or release owner triaging regressions
  - How the TUI fits the workflow
  - Distributed diagnostics
  - Choosing the right surface
  - What not to do
  - A practical starting point
  - Next steps
- Architecture Guide
- API Reference
- API Reference (Generated)
- GPU Stack Installation & CUDA Enablement
- Testing Guides (Markdown Edition)
Suggested reading order
New user
Debugging a real run
Release or CI validation
Framework-specific workflows
Notes
- `docs/_build/` is generated output and not maintained as source documentation.
- When docs and code disagree, treat the code and `--help` output as the source of truth and update the docs.
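One way to act on the last note is to read the `--help` text of whichever CLI is installed and compare it with the page you are following. This is a small sketch under stated assumptions: `help_text` is a hypothetical helper, and only the top-level `--help` flag named above is used, with no subcommands assumed.

```python
import shutil
import subprocess
from typing import Optional


def help_text(cli: str, timeout: float = 30.0) -> Optional[str]:
    """Return the --help output of a CLI, or None if it is not on PATH."""
    path = shutil.which(cli)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--help"],
        capture_output=True,
        text=True,
        check=False,  # some CLIs exit non-zero on --help; keep the text anyway
        timeout=timeout,
    )
    # Most argparse-style CLIs print help to stdout; fall back to stderr.
    return result.stdout or result.stderr


if __name__ == "__main__":
    for cli in ("gpumemprof", "tfmemprof", "jaxmemprof"):
        text = help_text(cli)
        print(f"== {cli} ==")
        print(text if text is not None else "not installed in this environment")
```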