[← Back to Production Cookbook](index.md)

# Always-on Tracking

Use this recipe when you want Stormlog to behave like an operational service:
bounded history in memory, append-only sink artifacts on disk, and enough
session metadata to reconstruct one run later.

Audience: operators, platform owners.
Difficulty: intermediate.

## Prerequisites

- install the package first; see [Installation](../installation.md)
- run `pip install "stormlog[torch]"` if you plan to use `gpumemprof track`
- run `pip install "stormlog[tf]"` if you plan to use `tfmemprof track`
- keep the [Command Line Guide](../cli.md) at hand if you need per-flag reference
- a writable artifact directory for sink files
- enough runtime permissions to inspect the target device backend

Success signal:

- a sink manifest is created
- `analyze` can reload the sink
- collector health and retention counters are visible in the output

## When this is the right recipe

- you want long-running `track` sessions
- you need rollover and retention limits on artifacts
- you want the run to stay alive when a collector becomes unhealthy
- you want a stable path from live telemetry to later analysis or TUI loading

## PyTorch always-on baseline

```bash
gpumemprof track \
  --interval 0.5 \
  --warning-threshold 75 \
  --critical-threshold 90 \
  --telemetry-sink-dir ./live_sink \
  --telemetry-flush-seconds 2.0 \
  --telemetry-rollover-mb 64 \
  --telemetry-retention-files 8 \
  --telemetry-retention-total-mb 512
```

What this gives you:

- append-only JSONL sink segments plus a manifest
- one session identity for the run
- rollover and pruning under a bounded artifact budget
- `collector_degraded` and `collector_recovered` events instead of synthetic zero samples
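
The retention flags above imply a worst-case disk footprint. The following is a minimal sketch of that arithmetic. It assumes both retention limits are enforced together so that the tighter one wins, and it ignores the manifest and any overshoot from a segment that is still being written; neither assumption is confirmed by this page. `max_retained_mb` is a hypothetical helper, not part of Stormlog.

```python
def max_retained_mb(rollover_mb, retention_files, retention_total_mb):
    """Upper bound on retained segment data in MB (assumes both limits apply)."""
    # Limit implied by the file count: each segment rolls over at rollover_mb.
    by_file_count = rollover_mb * retention_files
    # The tighter of the two limits bounds what stays on disk.
    return min(by_file_count, retention_total_mb)


# Baseline flags from the command above: 64 MB segments, 8 files, 512 MB total.
budget = max_retained_mb(rollover_mb=64, retention_files=8, retention_total_mb=512)
print(budget)  # 512: eight 64 MB segments exactly fill the total cap
```

With the baseline flags the two limits agree, so changing either one alone makes that limit the effective cap only if it becomes the smaller of the two.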

## Add workload phases when timestamps are not enough

Use structured phases when you want long-running artifacts to answer what part
of the workload was active when a hidden-memory anomaly appeared.

```python
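from stormlog import MemoryTracker

# Assumes `model`, `loader`, `optimizer`, and `num_epochs` are defined elsewhere.
tracker = MemoryTracker(
    sampling_interval=0.5,
    # telemetry_sink_config=...,  # Optional: configure the append-only sink.
)

tracker.start_tracking()
try:
    for epoch in range(num_epochs):
        # iter() works for a DataLoader (iterable) and for a plain iterator.
        batches = iter(loader)
        with tracker.phase("train", metadata={"epoch": epoch}):
            with tracker.phase("load_batch"):
                batch = next(batches)
            with tracker.phase("forward"):
                loss = model(batch).sum()
            with tracker.phase("backward"):
                optimizer.zero_grad()  # avoid accumulating stale gradients
                loss.backward()
            with tracker.phase("optimizer_step"):
                optimizer.step()
finally:
    # Stop tracking even if the loop raises.
    tracker.stop_tracking()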
```

What changes when phases are present:

- `track` writes companion `phase_enter` / `phase_exit` records with deterministic nested paths
- `gpumemprof analyze` adds phase-aware summaries beside timestamps
- the TUI Diagnostics tab shows the first anomaly phase path for each rank
- when you omit instrumentation entirely, the same workflow stays valid and low-overhead

## TensorFlow always-on baseline

```bash
tfmemprof track \
  --interval 1.0 \
  --threshold 4096 \
  --device /CPU:0 \
  --output ./tf_track.json \
  --telemetry-sink-dir ./tf_live_sink \
  --telemetry-flush-seconds 2.0 \
  --telemetry-rollover-mb 64 \
  --telemetry-retention-files 8 \
  --telemetry-retention-total-mb 512
```

Use `/GPU:0` instead of `/CPU:0` when the TensorFlow runtime has a GPU device
available.
Stop the command cleanly with `Ctrl+C` so that it flushes the final output file
and session summary.

## Inspect the latest clean session

```bash
gpumemprof analyze ./live_sink --format txt --output ./live_analysis.txt
tfmemprof analyze --input ./tf_track.json --detect-leaks --optimize --report ./tf_report.txt
```

For PyTorch sink directories, default session selection prefers the newest clean
completed session and falls back to interrupted or incomplete sessions only when
needed.
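
The selection preference described above can be sketched as follows. This is an illustration of the policy only, not Stormlog's implementation: `SessionRecord`, its fields, and the status strings are hypothetical stand-ins for whatever the sink manifest actually records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionRecord:
    session_id: str
    status: str        # hypothetical values: "completed", "interrupted", "incomplete"
    started_at: float  # epoch seconds


def pick_default_session(sessions):
    """Prefer the newest completed session; otherwise fall back to the newest of any kind."""
    if not sessions:
        return None
    completed = [s for s in sessions if s.status == "completed"]
    pool = completed or list(sessions)
    return max(pool, key=lambda s: s.started_at)


sessions = [
    SessionRecord("run-a", "completed", 100.0),
    SessionRecord("run-b", "interrupted", 200.0),
]
print(pick_default_session(sessions).session_id)  # run-a
```

Note the consequence for reused sink directories: a newer interrupted run is passed over in favor of an older clean one, so check the selected session id when you expected the most recent run.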

## What to watch during long runs

Treat these values as operational signals, not just debug trivia:

- `rollover_count`
- `pruned_segment_count`
- `pruned_bytes`
- `final_retained_files`
- `final_retained_bytes`
- `history_retained_*`
- `history_dropped_*`
- `collector_failure_event_count`
- `session_status`
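
If you export these values to a dashboard or log line, a small filter keeps the watched set in one place. The sketch below assumes the session summary is available as a flat dictionary keyed by the names above; that shape, the `extract_signals` helper, and the `history_dropped_samples` key in the example are assumptions made for illustration.

```python
WATCHED_KEYS = (
    "rollover_count",
    "pruned_segment_count",
    "pruned_bytes",
    "final_retained_files",
    "final_retained_bytes",
    "collector_failure_event_count",
    "session_status",
)
WATCHED_PREFIXES = ("history_retained_", "history_dropped_")


def extract_signals(summary):
    """Return only the watched counters that are present in `summary`."""
    signals = {key: summary[key] for key in WATCHED_KEYS if key in summary}
    for key, value in summary.items():
        # str.startswith accepts a tuple, so this covers both history_* families.
        if key.startswith(WATCHED_PREFIXES):
            signals[key] = value
    return signals


# Hypothetical summary; real key suffixes may differ.
example = {
    "rollover_count": 3,
    "pruned_bytes": 0,
    "history_dropped_samples": 12,
    "unrelated_field": "ignored",
}
signals = extract_signals(example)
print(sorted(signals))
```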

## How to interpret degraded mode

If the collector becomes unhealthy during `track`:

- the process keeps running
- new sample emission pauses until recovery
- status events remain visible in the artifact stream
- the final report should still show the collector-health transition history

Treat either of these as actionable:

- non-zero `collector_failure_event_count`
- any final collector state other than `healthy`
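
The two conditions above translate into a small check that can gate an alert or a post-run script. This is a sketch under assumptions: it reads the summary as a dictionary, takes the final collector state from a `collector_health_status` key, and treats a missing state as actionable. `degraded_findings` is a hypothetical helper.

```python
def degraded_findings(summary):
    """Return a list of reasons the run needs attention; empty means none found."""
    findings = []
    failures = summary.get("collector_failure_event_count", 0)
    if failures != 0:
        findings.append(f"collector_failure_event_count is {failures}")
    # A missing state is reported as "unknown" rather than assumed healthy.
    status = summary.get("collector_health_status", "unknown")
    if status != "healthy":
        findings.append(f"final collector state is {status!r}")
    return findings


recovered_run = {"collector_failure_event_count": 2, "collector_health_status": "healthy"}
print(degraded_findings(recovered_run))
```

A run that degraded and then recovered still produces a finding here, because the failure count stays non-zero even when the final state is `healthy`.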

## Troubleshooting

### Symptom: sink files grow too quickly

Likely cause: retention is too loose for the deployment budget.
Fix: tighten retention and rollover settings before lowering sample fidelity.
Verify: `final_retained_*`, `pruned_*`, and `rollover_count` stabilize.

### Symptom: tracking stays alive but live telemetry looks partial

Likely cause: the collector entered degraded mode.
Fix: inspect `collector_failure_event_count` and the emitted status events.
Verify: `collector_health_status` returns to `healthy` and sampling resumes.

### Symptom: analysis loads the wrong run from a reused sink directory

Likely cause: more than one session is present in the sink.
Fix: inspect the discovered sessions and target the session you want explicitly.
Verify: the selected session id matches the intended run metadata.

## What to do next

- If the artifacts exceed your disk budget, tighten retention before you reduce
  sampling fidelity by raising `--interval`.
- If `collector_failure_event_count` is non-zero, move to the
  [Incident Playbooks](incidents.md) degraded-collector checklist.
- If the run needs to be qualified for CI or release use, move to the
  [CI and Release Qualification](ci_release.md) harness workflow.
- If the next question is rank-aware diagnosis, move to the
  [Distributed Diagnostics Recipes](distributed.md).

---

[← Back to Production Cookbook](index.md)