Distributed Diagnostics Recipes

Use this page when you need to preserve per-rank identity during capture and rebuild a rank-aware timeline later.

Audience: distributed-training owners, incident responders. Difficulty: advanced.

Prerequisites

install the package first; see Installation
use `pip install "stormlog[torch]"` for PyTorch rank capture
use `pip install "stormlog[tf]"` for TensorFlow rank capture
use `pip install "stormlog[tui,torch]"` if you want the TUI Diagnostics workflow from a pip install
see the TUI Guide and Command Line Guide if you need UI or CLI reference details
the artifact paths for each rank are writable and distinct
rank metadata is either inferred from the environment or passed explicitly
the TUI extra is installed if you want the interactive Diagnostics workflow
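
The two rank-related prerequisites (distinct artifact paths, and rank metadata taken from the environment or passed explicitly) can be sketched in a few lines. This is a minimal illustration, not how the profiler itself resolves rank identity: it assumes the job is launched by torchrun, which exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, and the helper names and the `<root>/<job_id>/rank<N>.json` layout are hypothetical.

```python
from pathlib import Path
import os


def rank_identity(env=None):
    """Read rank metadata from torchrun-style environment variables.

    Falls back to a single-process identity when the variables are unset.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
    }


def rank_output_path(root, job_id, rank):
    """Build a per-rank artifact path (hypothetical layout) so ranks never share a file."""
    return Path(root) / job_id / f"rank{rank}.json"
```

Calling `rank_identity()` inside each worker and passing the values on explicitly keeps the recorded identity independent of whatever the capture tool infers by default.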

Success signal:

each rank produces its own artifact with rank identity intact
the TUI Diagnostics tab can load multiple rank files without flattening them

Choose the distributed path

If the job is…	Start with…
PyTorch rank capture	rank-aware `gpumemprof track`
TensorFlow rank capture	rank-aware `tfmemprof track`
artifact triage after capture	`stormlog` Diagnostics tab

When this is the right recipe

you need one artifact per rank
you want job_id, rank, local_rank, and world_size recorded explicitly
you need TUI diagnostics to keep ranks separate instead of flattening them
you want hidden-memory-gap or collective-attribution analysis with more than one rank

Recipe: validate a reference `torchrun` DDP run on Jarvis

Use this when you want one real multi-GPU training run based on the official PyTorch DDP tutorial pattern, not a manually stitched set of rank-local captures.

This workflow was validated against:

PyTorch DDP tutorial: https://docs.pytorch.org/tutorials/intermediate/ddp_tutorial.html
PyTorch tutorial-series example repo: https://github.com/pytorch/examples/tree/main/distributed/ddp-tutorial-series
local adaptation: examples.scenarios.torchrun_ddp_reference

Validated environment:

Jarvis container instance
2xL4
PyTorch template
single node with torchrun

Assume the source checkout is already present on the instance at /home/gpu-memory-profiler.

jl create \
  --gpu L4 \
  --num-gpus 2 \
  --template pytorch \
  --region IN2 \
  --storage 80 \
  --name stormlog-ddp-reference \
  --yes \
  --json

Prepare the project environment on the instance:

cd /home/gpu-memory-profiler
python3 -m venv --system-site-packages .venv
. .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Run the reference training job with torchrun:

cd /home/gpu-memory-profiler
. .venv/bin/activate
mkdir -p artifacts/jarvis_torchrun_reference
PYTHONPATH=/home/gpu-memory-profiler \
OMP_NUM_THREADS=1 \
python -m torch.distributed.run \
  --nnodes=1 \
  --nproc_per_node=2 \
  --master_addr=127.0.0.1 \
  --master_port=29501 \
  -m examples.scenarios.torchrun_ddp_reference \
  --epochs 2 \
  --batch-size 128 \
  --dataset-size 4096 \
  --interval 0.1 \
  --job-id jarvis-torchrun-reference \
  --output-dir artifacts/jarvis_torchrun_reference \
  2>&1 | tee artifacts/jarvis_torchrun_reference/run.log

Expected console output shape:

one loss line per rank per epoch
rank-local and global loss values
a final Reference summary saved to .../ddp_reference_summary.json

Expected artifacts:

artifacts/jarvis_torchrun_reference/ddp_reference_summary.json
artifacts/jarvis_torchrun_reference/rank0/telemetry_sink/
artifacts/jarvis_torchrun_reference/rank1/telemetry_sink/
artifacts/jarvis_torchrun_reference/reference_checkpoint.pt
artifacts/jarvis_torchrun_reference/run.log
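
Before downloading the artifact root, it can help to confirm that every expected path exists. The sketch below checks only the file and directory names listed above; the function name is hypothetical, and it assumes one `rank<N>/telemetry_sink/` directory per rank for ranks 0 through `world_size - 1`.

```python
from pathlib import Path


def missing_artifacts(root, world_size):
    """Return the expected artifact paths (relative to root) that are absent."""
    root = Path(root)
    expected = [
        "ddp_reference_summary.json",
        "reference_checkpoint.pt",
        "run.log",
    ]
    # One telemetry sink directory per rank, as in the listing above.
    expected += [f"rank{rank}/telemetry_sink" for rank in range(world_size)]
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list from `missing_artifacts("artifacts/jarvis_torchrun_reference", 2)` means the layout matches the reference run; it says nothing about the contents of each file.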

The validated reference run produced:

world_size = 2
rank_summaries = 2
12 sample events per rank
133 phase_enter and 133 phase_exit events per rank

Download the artifact root locally, then load both rank sinks together in the TUI Diagnostics tab:

stormlog

Then:

Open Diagnostics.
Enter the two sink paths as a comma-separated list.
Click Load Artifacts.
Leave session selection on auto.
Confirm present_ranks shows 0,1.
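
Typing the sink paths by hand gets error-prone as the rank count grows. A small helper can build the comma-separated value for the step above. This is a sketch that assumes the `rank<N>/telemetry_sink` layout of the reference run; the function name is hypothetical.

```python
import re
from pathlib import Path


def sink_paths_csv(root):
    """Join every rank<N>/telemetry_sink directory under root, ordered by rank number."""
    ranked = []
    for sink in Path(root).glob("rank*/telemetry_sink"):
        match = re.fullmatch(r"rank(\d+)", sink.parent.name)
        if match and sink.is_dir():
            ranked.append((int(match.group(1)), sink))
    # Sort numerically so rank10 follows rank2 instead of preceding it.
    return ",".join(str(sink) for _, sink in sorted(ranked))
```

Print the result of `sink_paths_csv("artifacts/jarvis_torchrun_reference")` and paste it into the Diagnostics path field.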

For this reference run, Diagnostics selected a merged synthetic session and reported:

present_ranks = [0, 1]
expected_ranks = [0, 1]
missing_ranks = []
one diagnostics row per rank
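
The relationship between the three rank fields can be reproduced offline. The sketch below is not the Diagnostics implementation; it assumes the expected ranks are simply 0 through `world_size - 1`, and the function name is hypothetical.

```python
def rank_coverage(present, world_size):
    """Compare the ranks found in loaded artifacts against the expected set."""
    expected = list(range(world_size))
    present_sorted = sorted(set(present))
    missing = [rank for rank in expected if rank not in present_sorted]
    return {
        "present_ranks": present_sorted,
        "expected_ranks": expected,
        "missing_ranks": missing,
    }
```

For the reference run, `rank_coverage([0, 1], 2)` yields an empty `missing_ranks`; loading only the rank 0 sink would yield `missing_ranks == [1]`.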

Recipe: capture rank-aware PyTorch artifacts

gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --job-id train-42 \
  --rank 0 \
  --local-rank 0 \
  --world-size 2 \
  --output ./rank0.json \
  --format json

gpumemprof track \
  --duration 30 \
  --interval 0.5 \
  --job-id train-42 \
  --rank 1 \
  --local-rank 1 \
  --world-size 2 \
  --output ./rank1.json \
  --format json
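
For more than two ranks, the per-rank commands can be generated rather than copied. The sketch below uses only the flags shown in the commands above. It assumes a single node, so `--local-rank` equals `--rank`, and the function name is hypothetical.

```python
import shlex


def track_command(rank, world_size, job_id, duration=30, interval=0.5):
    """Build the argv for one rank, using only the flags from the recipe above."""
    return [
        "gpumemprof", "track",
        "--duration", str(duration),
        "--interval", str(interval),
        "--job-id", job_id,
        "--rank", str(rank),
        "--local-rank", str(rank),  # single-node assumption: local rank == global rank
        "--world-size", str(world_size),
        "--output", f"./rank{rank}.json",
        "--format", "json",
    ]


# Print one shell-ready command per rank; every rank shares the same job id.
for rank in range(2):
    print(shlex.join(track_command(rank, 2, "train-42")))
```

Each printed line is meant to run in its own rank's process. On a multi-node job, pass the real local rank instead of reusing the global one.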

Recipe: capture rank-aware TensorFlow artifacts

tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --job-id train-42 \
  --rank 0 \
  --local-rank 0 \
  --world-size 2 \
  --output ./tf_rank0.json

tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --job-id train-42 \
  --rank 1 \
  --local-rank 1 \
  --world-size 2 \
  --output ./tf_rank1.json

Stop each TensorFlow rank cleanly with Ctrl+C after tracking has started so the per-rank output file is flushed before exit.

Keep the same job_id across every rank-local capture from one distributed run. Diagnostics uses that shared job identity to auto-select a merged cross-rank session when you load the artifacts together.

Recipe: load multiple rank artifacts in the TUI

stormlog

Then:

Open Diagnostics.
Enter the artifact paths as a comma-separated list.
Click Load Artifacts.
Leave session selection on auto or default at first. With a shared job_id, Diagnostics selects the merged cross-rank session automatically.
Choose an individual session_id only when you want to isolate one raw rank-local artifact.
Apply a rank filter such as all or 0,1.

What to look for

cross_rank_analysis in PyTorch optimization reports when more than one rank is present
rank-aware timeline differences in the TUI diagnostics pane
collective_attribution when communication phases align with hidden-memory spikes
session separation by session_id when the same sink directory or host is reused

What to do next

If one rank is the first cause, isolate that rank’s artifact and analyze it independently.
If all ranks spike together and collective_attribution is populated, treat the issue as a communication or synchronization candidate before changing model code.
If the problem is operational rather than rank-local, move to Always-on Tracking.
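
The first two branches above depend on timing: does one rank spike first, or do all ranks spike together? The sketch below illustrates that distinction only. Its input (a mapping from rank to the time in seconds of that rank's first spike), the tolerance value, and the function name are all hypothetical and not part of the tool's output. It does not check `collective_attribution`, which the second branch also requires.

```python
def triage(first_spike_s, tolerance_s=1.0):
    """Classify a run from hypothetical per-rank first-spike times (seconds)."""
    if not first_spike_s:
        return ("no-spike", None)
    earliest_rank = min(first_spike_s, key=first_spike_s.get)
    spread = max(first_spike_s.values()) - min(first_spike_s.values())
    if len(first_spike_s) > 1 and spread <= tolerance_s:
        # Ranks spiked together: look at communication or synchronization first.
        return ("synchronized", None)
    # One rank led the others: isolate that rank's artifact.
    return ("rank-local", earliest_rank)
```

A `("rank-local", 0)` result points at analyzing rank 0 on its own. A `("synchronized", None)` result points at `cross_rank_analysis` and `collective_attribution` before any model change.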

Troubleshooting

Symptom: only one rank appears in diagnostics

Likely cause: the wrong artifact set or session was loaded.
Fix: load every rank artifact together, then target the intended session before refreshing.
Verify: present_ranks matches the expected rank set.

Symptom: ranks are present but the first cause is unclear

Likely cause: the issue is synchronized across ranks.
Fix: inspect cross_rank_analysis and collective_attribution before isolating one rank.
Verify: the next action comes from a rank-local or communication-attributed explanation, not guesswork.
