Distributed Diagnostics Recipes

Use this page when you need to preserve per-rank identity during capture and rebuild a rank-aware timeline later.

Audience: distributed-training owners, incident responders. Difficulty: advanced.

Prerequisites

  • install the package first; see Installation

  • use pip install "stormlog[torch]" for PyTorch rank capture

  • use pip install "stormlog[tf]" for TensorFlow rank capture

  • use pip install "stormlog[tui,torch]" if you want the TUI Diagnostics workflow from a pip install

  • see the TUI Guide and the Command Line Guide if you need UI or CLI reference details

  • the artifact paths for each rank are writable and distinct

  • rank metadata is either inferred from the environment or passed explicitly

  • the TUI extra is installed if you want the interactive Diagnostics workflow
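For the rank-metadata prerequisite, the sketch below shows one way to read rank identity from the environment and turn it into the explicit flags used later on this page. It relies on the RANK, LOCAL_RANK, and WORLD_SIZE variables that torchrun exports to each worker. The function names and the single-process fallbacks are illustrative assumptions, and whether stormlog infers identity from these same variables is not confirmed here.

```python
import os
from typing import Mapping


def rank_identity(job_id: str, env: Mapping[str, str] = os.environ) -> dict:
    """Build the four identity fields from torchrun-style environment variables."""
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to every worker process.
    # The single-process fallbacks (0, 0, 1) are an assumption of this sketch.
    return {
        "job_id": job_id,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
    }


def identity_flags(identity: dict) -> list:
    """Render the identity as the explicit CLI flags shown in the recipes."""
    return [
        "--job-id", identity["job_id"],
        "--rank", str(identity["rank"]),
        "--local-rank", str(identity["local_rank"]),
        "--world-size", str(identity["world_size"]),
    ]
```

Passing the flags explicitly is the safer choice when the capture process is not launched by torchrun, because the variables will not be set in that case.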

Success signal:

  • each rank produces its own artifact with rank identity intact

  • the TUI Diagnostics tab can load multiple rank files without flattening them

Choose the distributed path

If the job is…                   Start with…
PyTorch rank capture             rank-aware gpumemprof track
TensorFlow rank capture          rank-aware tfmemprof track
artifact triage after capture    stormlog Diagnostics tab

When this is the right recipe

  • you need one artifact per rank

  • you want job_id, rank, local_rank, and world_size recorded explicitly

  • you need TUI diagnostics to keep ranks separate instead of flattening them

  • you want hidden-memory-gap or collective-attribution analysis with more than one rank

Recipe: validate a reference torchrun DDP run on Jarvis

Use this when you want one real multi-GPU training run based on the official PyTorch DDP tutorial pattern, not a manually stitched set of rank-local captures.

This workflow was validated against:

  • PyTorch DDP tutorial: https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html

  • PyTorch tutorial-series example repo: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series

  • local adaptation: examples.scenarios.torchrun_ddp_reference

Validated environment:

  • Jarvis container instance

  • 2xL4

  • PyTorch template

  • single node with torchrun

Create the instance first. The steps after that assume the source checkout is already present on the instance at /home/gpu-memory-profiler.

jl create \
  --gpu L4 \
  --num-gpus 2 \
  --template pytorch \
  --region IN2 \
  --storage 80 \
  --name stormlog-ddp-reference \
  --yes \
  --json

Prepare the project environment on the instance:

cd /home/gpu-memory-profiler
python3 -m venv --system-site-packages .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Run the reference training job with torchrun:

cd /home/gpu-memory-profiler
. .venv/bin/activate
mkdir -p artifacts/jarvis_torchrun_reference
PYTHONPATH=/home/gpu-memory-profiler \
OMP_NUM_THREADS=1 \
python -m torch.distributed.run \
  --nnodes=1 \
  --nproc_per_node=2 \
  --master_addr=127.0.0.1 \
  --master_port=29501 \
  -m examples.scenarios.torchrun_ddp_reference \
  --epochs 2 \
  --batch-size 128 \
  --dataset-size 4096 \
  --interval 0.1 \
  --job-id jarvis-torchrun-reference \
  --output-dir artifacts/jarvis_torchrun_reference \
  2>&1 | tee artifacts/jarvis_torchrun_reference/run.log

Expected console output shape:

  • one loss line per rank per epoch

  • rank-local and global loss values

  • a final line of the form: Reference summary saved to .../ddp_reference_summary.json

Expected artifacts:

  • artifacts/jarvis_torchrun_reference/ddp_reference_summary.json

  • artifacts/jarvis_torchrun_reference/rank0/telemetry_sink/

  • artifacts/jarvis_torchrun_reference/rank1/telemetry_sink/

  • artifacts/jarvis_torchrun_reference/reference_checkpoint.pt

  • artifacts/jarvis_torchrun_reference/run.log
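The expected-artifact list above can be checked mechanically after the run. The sketch below builds the same paths from an artifact root and a world size, then reports any that do not exist. The function names are illustrative; the file and directory names are the ones listed above.

```python
from pathlib import Path


def expected_artifacts(root: Path, world_size: int) -> list:
    """Return the paths the reference run is expected to leave under root."""
    paths = [root / "ddp_reference_summary.json"]
    # One telemetry sink directory per rank: rank0/, rank1/, ...
    paths += [root / f"rank{rank}" / "telemetry_sink" for rank in range(world_size)]
    paths += [root / "reference_checkpoint.pt", root / "run.log"]
    return paths


def missing_artifacts(root: Path, world_size: int) -> list:
    """Return the expected paths that are absent on disk."""
    return [p for p in expected_artifacts(root, world_size) if not p.exists()]


if __name__ == "__main__":
    root = Path("artifacts/jarvis_torchrun_reference")
    for path in missing_artifacts(root, world_size=2):
        print(f"missing: {path}")
```

An empty result only confirms that the files exist; it does not validate their contents.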

The validated reference run produced:

  • world_size = 2

  • rank_summaries = 2

  • 12 sample events per rank

  • 133 phase_enter and 133 phase_exit events per rank

Download the artifact root locally, then load both rank sinks together in the TUI Diagnostics tab:

stormlog

Then:

  1. Open Diagnostics.

  2. Enter the two sink paths as a comma-separated list.

  3. Click Load Artifacts.

  4. Leave session selection on auto.

  5. Confirm present_ranks shows 0,1.
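For step 2, the comma-separated value can be generated rather than typed. This is a minimal sketch that assumes the rankN/telemetry_sink layout from the expected artifacts and a local copy of the artifact root; the function name is illustrative, and it emits no spaces around the commas because this page does not state whether the input field tolerates them.

```python
from pathlib import PurePosixPath


def sink_paths_field(root: str, world_size: int) -> str:
    """Join every rank's telemetry_sink path into one comma-separated string."""
    sinks = [
        PurePosixPath(root) / f"rank{rank}" / "telemetry_sink"
        for rank in range(world_size)
    ]
    return ",".join(str(path) for path in sinks)


if __name__ == "__main__":
    print(sink_paths_field("artifacts/jarvis_torchrun_reference", world_size=2))
```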

For this reference run, Diagnostics selected a merged synthetic session and reported:

  • present_ranks = [0, 1]

  • expected_ranks = [0, 1]

  • missing_ranks = []

  • one diagnostics row per rank

Recipe: capture rank-aware PyTorch artifacts

# Rank 0 capture
gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --job-id train-42 \
  --rank 0 \
  --local-rank 0 \
  --world-size 2 \
  --output ./rank0.json \
  --format json

# Rank 1 capture: same job id, distinct output path
gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --job-id train-42 \
  --rank 1 \
  --local-rank 1 \
  --world-size 2 \
  --output ./rank1.json \
  --format json
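For more than two ranks, writing each command by hand becomes error-prone. The sketch below generates one gpumemprof track command per rank using only the flags shown above. It assumes a single node, so the local rank equals the global rank; the function name and defaults are illustrative.

```python
import shlex


def track_commands(job_id: str, world_size: int,
                   duration: int = 30, interval: float = 0.5) -> list:
    """Return one gpumemprof track argv per rank, each with its own output file."""
    commands = []
    for rank in range(world_size):
        commands.append([
            "gpumemprof", "track",
            "--duration", str(duration),
            "--interval", str(interval),
            "--job-id", job_id,
            "--rank", str(rank),
            # Single-node assumption: local rank equals global rank.
            "--local-rank", str(rank),
            "--world-size", str(world_size),
            "--output", f"./rank{rank}.json",
            "--format", "json",
        ])
    return commands


if __name__ == "__main__":
    # Print shell-ready command lines; launching them is left to the caller.
    for argv in track_commands("train-42", world_size=2):
        print(shlex.join(argv))
```

Generating the commands from one job_id and one world_size keeps those two values identical across ranks, and the rank-derived file name keeps the output paths distinct.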

Recipe: capture rank-aware TensorFlow artifacts

# Rank 0 capture
tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --job-id train-42 \
  --rank 0 \
  --local-rank 0 \
  --world-size 2 \
  --output ./tf_rank0.json

# Rank 1 capture: same job id, distinct output path
tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --job-id train-42 \
  --rank 1 \
  --local-rank 1 \
  --world-size 2 \
  --output ./tf_rank1.json

Stop each TensorFlow rank cleanly with Ctrl+C after tracking has started so the per-rank output file is flushed before exit.

Keep the same job_id across every rank-local capture from one distributed run. Diagnostics uses that shared job identity to auto-select a merged cross-rank session when you load the artifacts together.
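A quick pre-load check can catch identity mistakes before the artifacts reach Diagnostics. The sketch below assumes the job_id, rank, and world_size of each artifact have already been read into plain dicts; how to extract those fields from a stormlog artifact is not shown here, and the function name is illustrative.

```python
from collections import Counter


def identity_problems(metas: list) -> list:
    """Report identity mismatches across rank artifacts from one run.

    metas holds one dict per artifact with job_id, rank and world_size keys.
    """
    problems = []
    job_ids = sorted({m["job_id"] for m in metas})
    if len(job_ids) > 1:
        problems.append(f"multiple job_id values: {job_ids}")
    rank_counts = Counter(m["rank"] for m in metas)
    duplicates = sorted(rank for rank, n in rank_counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate ranks: {duplicates}")
    world_sizes = sorted({m["world_size"] for m in metas})
    if len(world_sizes) > 1:
        problems.append(f"inconsistent world_size values: {world_sizes}")
    return problems
```

An empty list means the captures agree on job_id and world_size and that no rank appears twice.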

Recipe: load multiple rank artifacts in the TUI

stormlog

Then:

  1. Open Diagnostics.

  2. Enter the artifact paths as a comma-separated list.

  3. Click Load Artifacts.

  4. Leave session selection on auto or default first. With a shared job_id, Diagnostics selects the merged cross-rank session automatically.

  5. Choose an individual session_id only when you want to isolate one raw rank-local artifact.

  6. Apply a rank filter such as all or 0,1.
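The rank filter in step 6 accepts either all or a comma-separated rank list. The sketch below illustrates that meaning only; it is not the TUI's parser, and treating an empty string as all is an assumption of the example.

```python
from typing import Optional, Set


def parse_rank_filter(text: str) -> Optional[Set[int]]:
    """Return None for 'all', otherwise the set of ranks named in the filter."""
    cleaned = text.strip().lower()
    if cleaned in ("", "all"):
        return None
    # int() tolerates surrounding whitespace, so "0, 1" parses like "0,1".
    return {int(part) for part in cleaned.split(",") if part.strip()}


def keep_rank(rank: int, selected: Optional[Set[int]]) -> bool:
    """True when a row for this rank passes the filter."""
    return selected is None or rank in selected
```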

What to look for

  • cross_rank_analysis in PyTorch optimization reports when more than one rank is present

  • rank-aware timeline differences in the TUI diagnostics pane

  • collective_attribution when communication phases align with hidden-memory spikes

  • session separation by session_id when the same sink directory or host is reused

What to do next

  • If one rank is the first cause, isolate that rank’s artifact and analyze it independently.

  • If all ranks spike together and collective_attribution is populated, treat the issue as a communication or synchronization candidate before changing model code.

  • If the problem is operational rather than rank-local, move to Always-on Tracking.

Troubleshooting

Symptom: only one rank appears in diagnostics

Likely cause: the wrong artifact set or session was loaded.

Fix: load every rank artifact together, then target the intended session before refreshing.

Verify: present_ranks matches the expected rank set.
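The verification step is a set comparison. The sketch below reproduces the present_ranks, expected_ranks, and missing_ranks fields reported by Diagnostics, assuming the expected set is every rank from 0 to world_size - 1; whether Diagnostics derives expected_ranks that way is an assumption, and the function name is illustrative.

```python
def rank_coverage(present, world_size: int) -> dict:
    """Compare the ranks that were loaded with the ranks the job should have."""
    expected = list(range(world_size))
    present_sorted = sorted(set(present))
    missing = [rank for rank in expected if rank not in present_sorted]
    return {
        "present_ranks": present_sorted,
        "expected_ranks": expected,
        "missing_ranks": missing,
    }
```

A non-empty missing_ranks list points at the artifact to go back and load.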

Symptom: ranks are present but the first cause is unclear

Likely cause: the issue is synchronized across ranks.

Fix: inspect cross_rank_analysis and collective_attribution before isolating one rank.

Verify: the next action comes from a rank-local or communication-attributed explanation, not guesswork.

