[← Back to main docs](index.md)

# Command Line Guide

Stormlog currently exposes three console scripts:

- `gpumemprof`
- `tfmemprof`
- `jaxmemprof`
- `stormlog`

Use `gpumemprof` and `tfmemprof` for automation. Use `stormlog` when you want the Textual TUI.

If you want task-oriented operational recipes instead of option-by-option
guidance, use the [Production Cookbook](cookbook/index.md), especially
[Always-on Tracking](cookbook/always_on.md),
[Incident Playbooks](cookbook/incidents.md), and
[Distributed Diagnostics Recipes](cookbook/distributed.md).

## Verify the installed commands

```bash
gpumemprof --help
tfmemprof --help
jaxmemprof --help
```

If you are working from a repository checkout, `pip install -e .` also exposes
the source-only `examples/` package used in a few release-validation flows.

Install and launch the TUI with the current dependency set:

```bash
pip install "stormlog[tui,torch]"
stormlog
```

The `stormlog` command is also a small dispatcher. Running it without
arguments still launches the TUI, while `stormlog query ...` runs the local
artifact query CLI without importing Textual.

## `gpumemprof`

The current command groups are:

- `info`
- `monitor`
- `track`
- `analyze`
- `diagnose`

### Inspect environment

```bash
gpumemprof info
gpumemprof info --device 0 --detailed
```

`gpumemprof info` still reports the active PyTorch runtime first. When no
supported PyTorch GPU runtime is active, it now falls back to a best-effort host
GPU hardware probe so the command can show detected device names separately from
runtime availability. Supported PyTorch GPU runtimes remain NVIDIA CUDA, AMD
ROCm-backed PyTorch on Linux, and Apple MPS. In that unsupported-runtime mode,
`--device` is ignored because it only applies to an active PyTorch GPU runtime.

### Capture a bounded monitoring window

```bash
gpumemprof monitor --duration 30 --interval 0.5 --output monitor.csv --format csv
gpumemprof monitor --duration 30 --interval 0.5 --output monitor.json --format json
```

### Track events over time

```bash
gpumemprof track --duration 30 --interval 0.5 --output track.json --format json
gpumemprof track --warning-threshold 75 --critical-threshold 90 --output alerts.csv
gpumemprof track --job-id train-42 --rank 1 --local-rank 1 --world-size 8 --output rank1.json --format json
gpumemprof track --telemetry-sink-dir ./live_sink --telemetry-rollover-mb 32 --telemetry-retention-total-mb 256
```

Every `gpumemprof track` run now creates exactly one session identity. The
session begins after tracker startup succeeds and before the first record is
persisted, and it is marked `completed` only after clean shutdown finalization
finishes.

If your Python workload instruments phases with `tracker.phase(...)` or
`tracker.enter_phase(...)`, `track` persists the emitted `phase_enter` /
`phase_exit` records alongside the regular telemetry samples. The CLI does not
invent phase records on its own; it only preserves the structured phase events
your workload emitted. Phase records remain optional and do not change the
`track` CLI surface in v1.

For long-running tracking sessions, Stormlog now degrades gracefully when a
collector becomes unhealthy:

- the tracked workload keeps running
- exported telemetry includes `collector_degraded` / `collector_recovered` events
- per-event metadata marks partial or unhealthy collector state
- retries use bounded exponential backoff instead of crashing the tracker loop

For always-on sessions, `gpumemprof track` can also stream append-only
telemetry into a sink directory during the run instead of waiting for shutdown.
The sink writes JSONL segments plus a manifest, rolls segments when they hit the
configured size limit, and prunes the oldest closed segments to stay within the
retention budget.

The sink manifest also keeps a session ledger, so multiple runs can safely
share the same sink directory without merging captures. If a previous run was
still `running` when the process died, the next startup recovers it as
`interrupted` and starts a fresh session for the new run.

Useful sink options:

- `--telemetry-sink-dir`
- `--telemetry-flush-seconds`
- `--telemetry-rollover-mb`
- `--telemetry-retention-files`
- `--telemetry-retention-total-mb`

The always-on qualification harness assumes the default sink settings:

- flush every `2.0s`
- roll segments at `64 MB`
- retain at most `8` files
- retain at most `512 MB` total

When you inspect `track` output after a long run, look for these diagnostics in
the JSON or CLI summary:

- `rollover_count`
- `pruned_segment_count`
- `pruned_bytes`
- `final_retained_files`
- `final_retained_bytes`
- `history_retained_*`
- `history_dropped_*`

Optional OOM flight-recorder support:

```bash
gpumemprof track \
  --oom-flight-recorder \
  --oom-dump-dir ./oom_dumps \
  --oom-max-dumps 10 \
  --oom-max-total-mb 1024 \
  --output track.json --format json
```

### Analyze saved telemetry

```bash
gpumemprof analyze track.json --format txt --output analysis.txt
gpumemprof analyze track.json --visualization --plot-dir plots
gpumemprof analyze ./live_sink
gpumemprof analyze ./live_sink --session-id 2b30f4a4-7d2d-48f7-a9f6-7d40c14eb95e
```

`gpumemprof analyze` uses a positional input file. It can now read a normal JSON
telemetry export, a sink JSONL segment, or a sink directory containing the
current and rolled append-only outputs. If you add `--visualization`, plots are
written to the directory passed via `--plot-dir` or to `plots/` by default.

When the telemetry stream includes structured phase boundaries, the text summary
also includes phase-aware hints such as:

- `Top gap phase: train / forward`
- `Suspect phase: train / communication`

When Stormlog cannot prove a unique phase but can still surface a useful winner,
the summary uses a heuristic marker instead of pretending certainty:

- `Top gap phase: (likely) train / communication`

In JSON report payloads, that distinction is preserved as:

- canonical `phase_attribution.phase_resolution`
- canonical `phase_attribution.phase_source`
- optional `phase_attribution.phase_summary` only when the displayed winner is heuristic

When multiple sessions are present, `gpumemprof analyze` selects:

1. the newest `completed` session
2. otherwise the newest `interrupted` session
3. otherwise the newest `incomplete` session

Use `--session-id` to analyze a specific capture instead of the default one.

When phase records are present, `gpumemprof analyze` also reports the top
phase-attributed gap finding and the top first-cause suspect phase in the text
summary. The JSON report keeps the structured `phase_attribution` payload next
to:

- `gap_analysis`
- `collective_attribution`
- `cross_rank_analysis.first_cause_suspects`

For always-on deployment posture and incident response checklists, continue with
[Always-on Tracking](cookbook/always_on.md) and
[Incident Playbooks](cookbook/incidents.md).

### Produce a diagnose bundle

```bash
gpumemprof diagnose --duration 5 --interval 0.5 --output ./diag_bundle
gpumemprof diagnose --duration 0 --output ./diag_bundle_quick
gpumemprof diagnose --native-history --duration 0 --output ./diag_bundle_native
```

Use `--duration 0` when you want a fast artifact bundle without a new tracking window.

For task-oriented recipes that combine `track`, `analyze`, `diagnose`, and the
TUI into one workflow, continue with
[Incident Playbooks](cookbook/incidents.md) and
[Distributed Diagnostics Recipes](cookbook/distributed.md).

Each standalone diagnose bundle also owns its own session id. The bundle
manifest records whether the run finished `completed` or was left
`incomplete`, and synthesized timeline telemetry inherits that same session id
when reloaded later.

`--native-history` is a CUDA-only debug mode. It records allocator history for
the current `gpumemprof diagnose` process, then writes native snapshot artifacts
such as `cuda_allocator_snapshot.pickle`,
`cuda_allocator_state_history.html`,
`cuda_allocator_state_history_annotated.html`, and tensor-attribution JSON
alongside the normal diagnose bundle files. The annotated HTML is the
Stormlog-native view that exposes the timeline trace, segment explorer, and
active-memory table in one file. For a maintained workflow example of that
artifact, continue with [PyTorch Production Recipes](cookbook/pytorch.md). On
MPS, ROCm, or CPU-only runtimes, the command fails explicitly instead of
pretending support.

## `stormlog query`

`stormlog query` asks structured questions over local artifact directories. It
uses the same canonical telemetry loaders as `gpumemprof analyze`, but exposes
rows that are easier to filter, export, and reuse from automation.

The query surface is local-first and file-backed. It reads sink manifests and
bundle manifests before loading raw events, so listing sessions or OOM bundles
does not require parsing every JSONL segment in a large sink directory.

List sessions:

```bash
stormlog query sessions ./live_sink --status interrupted --json
stormlog query sessions ./artifacts --has-oom-bundle --table
```

Query events:

```bash
stormlog query events ./live_sink \
  --session-id 2b30f4a4-7d2d-48f7-a9f6-7d40c14eb95e \
  --rank 0 \
  --event-type collector_degraded \
  --limit 50
```

List OOM bundles:

```bash
stormlog query ooms ./artifacts --backend cuda --table
stormlog query ooms ./artifacts --created-after 2026-05-12T00:00:00Z --json
```

List grouped recurring issues:

```bash
stormlog query issues ./live_sink ./oom_dumps --kind oom --json
stormlog query issues ./artifacts --severity warning --session-id session-123
```

Run built-in summaries:

```bash
stormlog query summary ./live_sink \
  --metric peak_allocator_reserved_bytes \
  --group-by session

stormlog query summary ./live_sink \
  --metric hidden_memory_gap_growth \
  --group-by session-rank
```

Supported output formats:

- `--table`: readable table output, used by default
- `--json`: machine-readable rows
- `--csv`: row-query exports for `sessions`, `events`, and `ooms`

The Python API behind the CLI is available as:

```python
import stormlog.query

store = stormlog.query.open(["./live_sink", "./oom_dumps"])
sessions = store.list_sessions()
events = store.query_events()
ooms = store.list_oom_bundles()
issues = store.list_issues()
```

For engine-choice details and follow-on work, see
[Local Query Layer](query_layer.md).
For issue grouping rules and schema details, see
[Durable Issue Fingerprinting](issue_fingerprinting.md).

## `tfmemprof`

The current command groups are:

- `info`
- `monitor`
- `track`
- `analyze`
- `diagnose`

### Inspect environment

```bash
tfmemprof info
```

### Monitor TensorFlow memory usage

```bash
tfmemprof monitor --interval 0.5 --duration 30 --output tf_monitor.json
tfmemprof monitor --interval 0.5 --duration 30 --threshold 4096 --device /GPU:0 --output tf_monitor_threshold.json
```

For CPU-only TensorFlow or when the GPU backend is unavailable, use
`--device /CPU:0`:

```bash
tfmemprof monitor --interval 0.5 --duration 30 --device /CPU:0 --output tf_monitor.json
tfmemprof track --interval 0.5 --threshold 4096 --device /CPU:0 --output tf_track.json
```

### Track TensorFlow memory usage

```bash
tfmemprof track --interval 0.5 --threshold 4096 --output tf_track.json
tfmemprof track --interval 0.5 --threshold 4096 --job-id train-42 --rank 3 --local-rank 1 --world-size 8 --output tf_rank3.json
tfmemprof track --interval 0.5 --threshold 4096 --output tf_track.json --telemetry-sink-dir ./tf_live_sink
```

`tfmemprof track` follows the same degraded-mode rules as the PyTorch tracker:
collector failures pause new sample emission, status events remain visible in the
artifact stream, and normal sampling resumes automatically after recovery.
The same append-only sink options are available when you need bounded,
interrupt-tolerant TensorFlow telemetry during a long-running session.

TensorFlow tracking also keeps only a bounded recent history in memory. The
current CLI output and JSON exports surface the retained vs dropped sample,
event, and alert counts so long-running jobs can distinguish expected eviction
from a silent memory-growth regression.

The same session rules apply to TensorFlow tracking:

- one session id per `tfmemprof track` run
- sink recovery marks old running sessions as `interrupted`
- loaders and diagnostics separate same-host runs by `session_id`, not by job or rank alone

Like the PyTorch and CPU trackers, TensorFlow tracking also preserves optional
structured phase companion records when you instrument the tracker through the
Python API.

### Analyze TensorFlow results

```bash
tfmemprof analyze --input tf_monitor.json --detect-leaks --optimize
tfmemprof analyze --input tf_monitor.json --detect-leaks --optimize --visualize --report tf_report.txt
```

Unlike `gpumemprof analyze`, the TensorFlow analyzer uses `--input`.

### Produce a diagnose bundle

```bash
tfmemprof diagnose --duration 5 --interval 0.5 --output ./tf_diag
tfmemprof diagnose --duration 0 --output ./tf_diag_quick
```

## `jaxmemprof`

The current command groups are:

- `info`
- `monitor`
- `track`
- `analyze`
- `diagnose`

### Inspect environment

```bash
jaxmemprof info
```

### Monitor JAX memory usage

```bash
jaxmemprof monitor --interval 0.5 --duration 30 --output jax_monitor.json
jaxmemprof monitor --interval 0.5 --duration 30 --device gpu --output jax_monitor_gpu.json
```

For CPU-only JAX execution or when accelerators are unavailable, use `--device cpu`:

```bash
jaxmemprof monitor --interval 0.5 --duration 30 --device cpu --output jax_monitor.json
jaxmemprof track --interval 0.5 --device cpu --output jax_track.json
```

### Track JAX memory usage

```bash
jaxmemprof track --interval 0.5 --output jax_track.json
jaxmemprof track --interval 0.5 --job-id train-42 --rank 2 --local-rank 0 --world-size 8 --output jax_rank2.json
jaxmemprof track --interval 0.5 --output jax_track.json --telemetry-sink-dir ./jax_live_sink
```

`jaxmemprof track` shares the same robust, degraded-mode semantics as the PyTorch and TensorFlow trackers, allowing it to gracefully handle long-running runs, collector interruptions, and append-only sink persistence.

JAX tracking also logs structured phase boundaries when using `jaxmemprof.MemoryTracker` instrumentation and emits telemetry streams that are compatible with `gpumemprof analyze` and the Textual TUI.

### Analyze JAX results

```bash
jaxmemprof analyze --input jax_monitor.json --detect-leaks --optimize
jaxmemprof analyze --input jax_monitor.json --detect-leaks --optimize --visualize --report jax_report.txt
```

### Produce a diagnose bundle

```bash
jaxmemprof diagnose --duration 5 --interval 0.5 --output ./jax_diag
jaxmemprof diagnose --duration 0 --output ./jax_diag_quick
```

## TUI launch

```bash
pip install "stormlog[tui,torch]"
stormlog
```

Inside the TUI, the `CLI & Actions` tab exposes quick actions for:

- `gpumemprof info`
- `gpumemprof monitor`
- `tfmemprof monitor`
- `jaxmemprof monitor`
- `gpumemprof diagnose`
- sample workloads
- OOM scenario runner
- capability matrix smoke run

## Release-validation shortcuts

**Source checkout only.** These commands require the `examples/` package from a
repository clone:

```bash
python -m examples.cli.quickstart
python -m examples.cli.capability_matrix --mode smoke --target both --oom-mode simulated
```

**Pip users** should use this CLI-only sequence instead:

```bash
gpumemprof info
gpumemprof track --duration 2 --interval 0.5 --output track.json --format json
gpumemprof analyze track.json --format txt --output analysis.txt
gpumemprof diagnose --duration 0 --output ./diag

tfmemprof info
tfmemprof diagnose --duration 0 --output ./tf_diag

jaxmemprof info
jaxmemprof diagnose --duration 0 --output ./jax_diag
```

## Choosing the right command

### Use `monitor` when

- you want a bounded sample window
- you only need a simple CSV or JSON output

### Use `track` when

- you want event streams and alert thresholds
- you want later exports or distributed identity fields
- you want a reconstructable session that owns its sink, diagnose, or OOM artifacts

### Use `analyze` when

- you already have saved telemetry
- you want a report or plot output

### Use `diagnose` when

- you need a portable artifact bundle to archive or share
- you plan to inspect the output later in the TUI diagnostics flow

---

[← Back to main docs](index.md)