[← Back to Production Cookbook](index.md)

# Distributed Diagnostics Recipes

Use this page when you need to preserve per-rank identity during capture and rebuild a rank-aware timeline later.

Audience: distributed-training owners, incident responders.

Difficulty: advanced.

## Prerequisites

- install the package first with [Installation](../installation.md)
- use `pip install "stormlog[torch]"` for PyTorch rank capture
- use `pip install "stormlog[tf]"` for TensorFlow rank capture
- use `pip install "stormlog[tui,torch]"` if you want the TUI Diagnostics workflow from a pip install
- use [TUI Guide](../tui.md) and [Command Line Guide](../cli.md) if you need UI or CLI reference details
- the artifact paths for each rank are writable and distinct
- rank metadata is either inferred from the environment or passed explicitly
- the TUI extra is installed if you want the interactive Diagnostics workflow

Success signal:

- each rank produces its own artifact with rank identity intact
- the TUI Diagnostics tab can load multiple rank files without flattening them

## Choose the distributed path

| If the job is... | Start with... |
| --- | --- |
| PyTorch rank capture | rank-aware `gpumemprof track` |
| TensorFlow rank capture | rank-aware `tfmemprof track` |
| artifact triage after capture | `stormlog` Diagnostics tab |

## When this is the right recipe

- you need one artifact per rank
- you want `job_id`, `rank`, `local_rank`, and `world_size` recorded explicitly
- you need TUI diagnostics to keep ranks separate instead of flattening them
- you want hidden-memory-gap or collective-attribution analysis with more than one rank

## Recipe: validate a reference `torchrun` DDP run on Jarvis

Use this when you want one real multi-GPU training run based on the official PyTorch DDP tutorial pattern, not a manually stitched set of rank-local captures.


This workflow was validated against:

- PyTorch DDP tutorial: `https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html`
- PyTorch tutorial-series example repo: `https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series`
- local adaptation: `examples.scenarios.torchrun_ddp_reference`

Validated environment:

- Jarvis container instance
- `2xL4`
- PyTorch template
- single node with `torchrun`

Assume the source checkout is already present on the instance at `/home/gpu-memory-profiler`.

```bash
jl create \
  --gpu L4 \
  --num-gpus 2 \
  --template pytorch \
  --region IN2 \
  --storage 80 \
  --name stormlog-ddp-reference \
  --yes \
  --json
```

Prepare the project environment on the instance:

```bash
cd /home/gpu-memory-profiler
python3 -m venv --system-site-packages .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .
```

Run the reference training job with `torchrun`:

```bash
cd /home/gpu-memory-profiler
. .venv/bin/activate
mkdir -p artifacts/jarvis_torchrun_reference

PYTHONPATH=/home/gpu-memory-profiler \
OMP_NUM_THREADS=1 \
python -m torch.distributed.run \
  --nnodes=1 \
  --nproc_per_node=2 \
  --master_addr=127.0.0.1 \
  --master_port=29501 \
  -m examples.scenarios.torchrun_ddp_reference \
  --epochs 2 \
  --batch-size 128 \
  --dataset-size 4096 \
  --interval 0.1 \
  --job-id jarvis-torchrun-reference \
  --output-dir artifacts/jarvis_torchrun_reference \
  2>&1 | tee artifacts/jarvis_torchrun_reference/run.log
```

Expected console output shape:

- one loss line per rank per epoch
- rank-local and global loss values
- a final `Reference summary saved to .../ddp_reference_summary.json`

Expected artifacts:

- `artifacts/jarvis_torchrun_reference/ddp_reference_summary.json`
- `artifacts/jarvis_torchrun_reference/rank0/telemetry_sink/`
- `artifacts/jarvis_torchrun_reference/rank1/telemetry_sink/`
- `artifacts/jarvis_torchrun_reference/reference_checkpoint.pt`
- `artifacts/jarvis_torchrun_reference/run.log`

The validated reference run produced:

- `world_size = 2`
- `rank_summaries = 2`
- `12` sample events per rank
- `133` `phase_enter` and `133` `phase_exit` events per rank

Download the artifact root locally, then load both rank sinks together in the TUI Diagnostics tab:

```bash
stormlog
```

Then:

1. Open `Diagnostics`.
2. Enter the two sink paths as a comma-separated list.
3. Click `Load Artifacts`.
4. Leave session selection on `auto`.
5. Confirm `present_ranks` shows `0,1`.

For this reference run, Diagnostics selected a merged synthetic session and reported:

- `present_ranks = [0, 1]`
- `expected_ranks = [0, 1]`
- `missing_ranks = []`
- one diagnostics row per rank

## Recipe: capture rank-aware PyTorch artifacts

```bash
gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --job-id train-42 \
  --rank 0 \
  --local-rank 0 \
  --world-size 2 \
  --output ./rank0.json \
  --format json
```

```bash
gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --job-id train-42 \
  --rank 1 \
  --local-rank 1 \
  --world-size 2 \
  --output ./rank1.json \
  --format json
```

## Recipe: capture rank-aware TensorFlow artifacts

```bash
tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --job-id train-42 \
  --rank 0 \
  --local-rank 0 \
  --world-size 2 \
  --output ./tf_rank0.json
```

```bash
tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --job-id train-42 \
  --rank 1 \
  --local-rank 1 \
  --world-size 2 \
  --output ./tf_rank1.json
```

Stop each TensorFlow rank cleanly with `Ctrl+C` after tracking has started so the per-rank output file is flushed before exit.

Keep the same `job_id` across every rank-local capture from one distributed run. Diagnostics uses that shared job identity to auto-select a merged cross-rank session when you load the artifacts together.

## Recipe: load multiple rank artifacts in the TUI

```bash
stormlog
```

Then:

1. Open `Diagnostics`.
2. Enter the artifact paths as a comma-separated list.
3. Click `Load Artifacts`.
4. Leave session selection on `auto` or `default` first.
   With a shared `job_id`, Diagnostics selects the merged cross-rank session automatically.
5. Choose an individual `session_id` only when you want to isolate one raw rank-local artifact.
6. Apply a rank filter such as `all` or `0,1`.

## What to look for

- `cross_rank_analysis` in PyTorch optimization reports when more than one rank is present
- rank-aware timeline differences in the TUI diagnostics pane
- `collective_attribution` when communication phases align with hidden-memory spikes
- session separation by `session_id` when the same sink directory or host is reused

## What to do next

- If one rank is the first cause, isolate that rank's artifact and analyze it independently.
- If all ranks spike together and `collective_attribution` is populated, treat the issue as a communication or synchronization candidate before changing model code.
- If the problem is operational rather than rank-local, move to [Always-on Tracking](always_on.md).

## Troubleshooting

### Symptom: only one rank appears in diagnostics

Likely cause: the wrong artifact set or session was loaded.

Fix: load every rank artifact together, then target the intended session before refreshing.

Verify: `present_ranks` matches the expected rank set.

### Symptom: ranks are present but the first cause is unclear

Likely cause: the issue is synchronized across ranks.

Fix: inspect `cross_rank_analysis` and `collective_attribution` before isolating one rank.

Verify: the next action comes from a rank-local or communication-attributed explanation, not guesswork.
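
For the "only one rank appears" symptom, one way to avoid loading a partial artifact set is to build the comma-separated path list from the expected world size and stop early if any rank file is missing. The following is a minimal sketch, not part of stormlog: the helper name `rank_paths` is hypothetical, and it assumes the `rankN.json` file naming used in the PyTorch capture recipe on this page.

```bash
# Hypothetical helper (not a stormlog command).
# Usage: rank_paths <artifact_dir> <world_size>
# Prints "dir/rank0.json,dir/rank1.json,..." for ranks 0..world_size-1.
# Returns non-zero and prints nothing to stdout if any expected file is missing.
rank_paths() {
  dir="$1"
  world_size="$2"
  rank=0
  list=""
  while [ "$rank" -lt "$world_size" ]; do
    # Assumed naming convention: one JSON artifact per rank.
    path="${dir}/rank${rank}.json"
    if [ ! -f "$path" ]; then
      echo "missing artifact for rank ${rank}: ${path}" >&2
      return 1
    fi
    # Append with a comma separator only after the first entry.
    list="${list:+${list},}${path}"
    rank=$((rank + 1))
  done
  printf '%s\n' "$list"
}
```

If both files from the capture recipe exist in the current directory, `rank_paths . 2` prints `./rank0.json,./rank1.json`, which can be pasted into the Diagnostics path field. A missing rank file produces an error on stderr instead of a shorter list, so a partial load is caught before it reaches the TUI.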