Always-on Tracking
Use this recipe when you want Stormlog to behave like an operational service: bounded history in memory, append-only sink artifacts on disk, and enough session metadata to reconstruct one run later.
Audience: operators, platform owners. Difficulty: intermediate.
Prerequisites
install the package first with Installation
use pip install "stormlog[torch]" for gpumemprof track
use pip install "stormlog[tf]" for tfmemprof track
use Command Line Guide if you need per-flag reference
a writable artifact directory for sink files
enough runtime permissions to inspect the target device backend
Success signal:
a sink manifest is created
analyze can reload the sink
collector health and retention counters are visible in the output
When this is the right recipe
you want long-running track sessions
you need rollover and retention limits on artifacts
you want the run to stay alive when a collector becomes unhealthy
you want a stable path from live telemetry to later analysis or TUI loading
PyTorch always-on baseline
gpumemprof track \
--interval 0.5 \
--warning-threshold 75 \
--critical-threshold 90 \
--telemetry-sink-dir ./live_sink \
--telemetry-flush-seconds 2.0 \
--telemetry-rollover-mb 64 \
--telemetry-retention-files 8 \
--telemetry-retention-total-mb 512
What this gives you:
append-only JSONL sink segments plus a manifest
one session identity for the run
rollover and pruning under a bounded artifact budget
collector_degraded and collector_recovered events instead of synthetic zero samples
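The retention flags in the baseline interact, so it can help to work out the disk ceiling before a long run. The sketch below is plain arithmetic, not a Stormlog API: the function name is hypothetical, and it assumes both limits are enforced with the tighter one winning, and that a segment rolls over at roughly the configured size.

```python
def worst_case_retained_mb(rollover_mb, retention_files, retention_total_mb):
    """Approximate upper bound on retained sink size in MB.

    Assumption: both retention limits apply and the tighter one wins.
    The active segment may briefly exceed this before pruning runs.
    """
    by_file_count = rollover_mb * retention_files
    return min(by_file_count, retention_total_mb)


# Values from the baseline command: 64 MB segments, 8 files, 512 MB total.
print(worst_case_retained_mb(64, 8, 512))
```

With the baseline values the two limits agree (8 segments of 64 MB is 512 MB), so neither flag silently overrides the other.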
Add workload phases when timestamps are not enough
Use structured phases when you want long-running artifacts to answer what part of the workload was active when a hidden-memory anomaly appeared.
from stormlog import MemoryTracker

# num_epochs, loader, model, and optimizer come from your training setup.
# loader must be an iterator (for example iter(data_loader)) for next() to work.
tracker = MemoryTracker(
    sampling_interval=0.5,
    # telemetry_sink_config=...,  # Optional: configure the append-only sink.
)
tracker.start_tracking()
for epoch in range(num_epochs):
    with tracker.phase("train", metadata={"epoch": epoch}):
        with tracker.phase("load_batch"):
            batch = next(loader)
        with tracker.phase("forward"):
            loss = model(batch).sum()
        with tracker.phase("backward"):
            loss.backward()
        with tracker.phase("optimizer_step"):
            optimizer.step()
tracker.stop_tracking()
What changes when phases are present:
track writes companion phase_enter / phase_exit records with deterministic nested paths
gpumemprof analyze adds phase-aware summaries beside timestamps
the TUI Diagnostics tab shows the first anomaly phase path for each rank
when you omit instrumentation entirely, the same workflow stays valid and low-overhead
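To make "deterministic nested paths" concrete, here is a toy recorder that builds a path from the stack of open phases. It is an illustration only, not Stormlog's implementation: the PhaseRecorder class, the "/" separator, and the tuple record layout are all assumptions.

```python
from contextlib import contextmanager


class PhaseRecorder:
    """Toy stand-in for a tracker; NOT Stormlog's implementation."""

    def __init__(self):
        self._stack = []    # names of the currently open phases
        self.records = []   # (event, path, metadata) tuples

    @contextmanager
    def phase(self, name, metadata=None):
        self._stack.append(name)
        path = "/".join(self._stack)  # assumed separator
        self.records.append(("phase_enter", path, metadata or {}))
        try:
            yield path
        finally:
            # Exit is recorded even if the body raises.
            self.records.append(("phase_exit", path, metadata or {}))
            self._stack.pop()


recorder = PhaseRecorder()
with recorder.phase("train", metadata={"epoch": 0}):
    with recorder.phase("forward"):
        pass
    with recorder.phase("backward"):
        pass

for event, path, _ in recorder.records:
    print(event, path)
```

Because the path depends only on which phases are open, the same code structure yields the same paths on every run, which is what lets later analysis group samples by phase.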
TensorFlow always-on baseline
tfmemprof track \
--interval 1.0 \
--threshold 4096 \
--device /CPU:0 \
--output ./tf_track.json \
--telemetry-sink-dir ./tf_live_sink \
--telemetry-flush-seconds 2.0 \
--telemetry-rollover-mb 64 \
--telemetry-retention-files 8 \
--telemetry-retention-total-mb 512
Use /GPU:0 instead of /CPU:0 when the TensorFlow runtime has a GPU device
available.
Stop the command cleanly with Ctrl+C when you want it to flush the final
output file and session summary.
Inspect the latest clean session
gpumemprof analyze ./live_sink --format txt --output ./live_analysis.txt
tfmemprof analyze --input ./tf_track.json --detect-leaks --optimize --report ./tf_report.txt
For PyTorch sink directories, default session selection prefers the newest clean completed session and falls back to interrupted or incomplete sessions only when needed.
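The selection rule described above can be sketched as a small function. This is a hedged illustration rather than the tool's actual logic: the SessionInfo type, its field names, and the status strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SessionInfo:
    session_id: str
    status: str        # hypothetical values: "completed", "interrupted", "incomplete"
    started_at: float  # epoch seconds


def pick_default_session(sessions: Sequence[SessionInfo]) -> Optional[SessionInfo]:
    """Prefer the newest clean completed session; otherwise fall back to the newest of any status."""
    if not sessions:
        return None
    clean = [s for s in sessions if s.status == "completed"]
    pool = clean or list(sessions)
    return max(pool, key=lambda s: s.started_at)


sessions = [
    SessionInfo("a", "completed", 100.0),
    SessionInfo("b", "interrupted", 200.0),
    SessionInfo("c", "completed", 150.0),
]
print(pick_default_session(sessions).session_id)
```

Note that the newest session overall ("b") is skipped because it did not complete cleanly, which is why a reused sink directory can surprise you.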
What to watch during long runs
Treat these values as operational signals, not just debug trivia:
rollover_count
pruned_segment_count
pruned_bytes
final_retained_files
final_retained_bytes
history_retained_*
history_dropped_*
collector_failure_event_count
session_status
How to interpret degraded mode
If the collector becomes unhealthy during track:
the process keeps running
new sample emission pauses until recovery
status events remain visible in the artifact stream
the final report should still show the collector-health transition history
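The behavior above can be pictured as a sampling loop that emits status events on health transitions and skips samples while unhealthy. The sketch below is a simplified model, not Stormlog's collector: the function, the dict record shape, and the use of RuntimeError to signal a failed read are assumptions.

```python
from typing import Callable, Iterator


def sample_stream(read_sample: Callable[[], int], ticks: int) -> Iterator[dict]:
    """Yield samples, replacing failed reads with health transition events."""
    healthy = True
    for tick in range(ticks):
        try:
            value = read_sample()
        except RuntimeError:
            if healthy:
                healthy = False
                yield {"tick": tick, "event": "collector_degraded"}
            continue  # paused: no synthetic zero sample is emitted
        if not healthy:
            healthy = True
            yield {"tick": tick, "event": "collector_recovered"}
        yield {"tick": tick, "event": "sample", "value": value}


# Fake collector that fails on the second and third reads.
readings = iter([100, RuntimeError("backend unavailable"),
                 RuntimeError("backend unavailable"), 120])


def fake_collector() -> int:
    item = next(readings)
    if isinstance(item, Exception):
        raise item
    return item


events = list(sample_stream(fake_collector, ticks=4))
for record in events:
    print(record)
```

The gap between ticks 1 and 3 stays visible in the stream as a pair of status events, so later analysis can tell "no data" apart from "zero memory".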
Treat either of these as actionable:
non-zero collector_failure_event_count
any final collector state other than healthy
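If you post-process reports in automation, the two conditions above reduce to a short check. This is a sketch under assumptions: it takes an already-parsed summary as a plain dict, and it assumes the final collector state is stored under collector_health_status, which may not match the real report layout.

```python
def collector_needs_attention(summary: dict) -> list:
    """Return the reasons a run's collector health is actionable (empty list if none)."""
    reasons = []
    failures = summary.get("collector_failure_event_count", 0)
    if failures:
        reasons.append(f"collector_failure_event_count={failures}")
    # Assumed key for the final collector state; a missing value is also flagged.
    status = summary.get("collector_health_status")
    if status != "healthy":
        reasons.append(f"collector_health_status={status!r}")
    return reasons


print(collector_needs_attention(
    {"collector_failure_event_count": 2, "collector_health_status": "healthy"}
))
```

A run that recovered still reports a non-zero failure count, so it is flagged even though its final state is healthy.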
Troubleshooting
Symptom: sink files grow too quickly
Likely cause: retention is too loose for the deployment budget.
Fix: tighten retention and rollover settings before lowering sample fidelity.
Verify: final_retained_*, pruned_*, and rollover_count stabilize.
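As an independent cross-check on the retention counters, you can measure the sink directory directly. The helper below uses only the standard library and makes no assumption about Stormlog's file naming; the function name and the demo file name are hypothetical.

```python
import tempfile
from pathlib import Path


def sink_usage(sink_dir):
    """Return (file_count, total_bytes) for every file under sink_dir."""
    files = [p for p in Path(sink_dir).rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return len(files), total_bytes


# Demo against a temporary directory holding one small placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "segment-0.jsonl").write_bytes(b'{"x": 1}\n')
    print(sink_usage(tmp))
```

Run it against the directory passed to --telemetry-sink-dir at a few points during a long run; the totals should plateau once pruning is active.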
Symptom: tracking stays alive but live telemetry looks partial
Likely cause: the collector entered degraded mode.
Fix: inspect collector_failure_event_count and the emitted status events.
Verify: collector_health_status returns to healthy and sampling resumes.
Symptom: analysis loads the wrong run from a reused sink directory
Likely cause: more than one session is present in the sink.
Fix: inspect the discovered sessions and target the session you want explicitly.
Verify: the selected session id matches the intended run metadata.
What to do next
If artifact usage exceeds your budget, tighten retention before you lower the sampling rate (that is, before you raise --interval).
If collector_failure_event_count is non-zero, move to the Incident Playbooks degraded-collector checklist.
If the run needs to be qualified for CI or release use, move to the CI and Release Qualification harness workflow.
If the next question is rank-aware diagnosis, move to the Distributed Diagnostics Recipes.