Architecture Guide
This page describes the current code-level architecture of Stormlog. It is a source-of-truth guide for how the repo is organized today, not a roadmap.
Repository surfaces
Stormlog has three user-facing surfaces that share the same core data model:
Python APIs for bounded profiling and background tracking
CLI entrypoints for capture, analysis, and diagnose flows
a Textual TUI for live monitoring, visualization export, and artifact review
Those surfaces are implemented under one package root:
stormlog for PyTorch, CPU fallback utilities, telemetry normalization, and the TUI
stormlog.tensorflow for TensorFlow profiling, tracking, and TensorFlow-specific analysis helpers
Package boundaries
stormlog
The stormlog package owns:
GPUMemoryProfiler for bounded PyTorch profiling
MemoryTracker for time-based tracking on CUDA, ROCm, MPS, or CPU fallback paths
CPUMemoryProfiler and CPUMemoryTracker for CPU-only workflows
MemoryVisualizer for PNG, HTML, heatmap, and dashboard-style exports
MemoryAnalyzer, GapFinding, and collective-attribution helpers
TelemetryEventV2 plus telemetry conversion and validation utilities
device collector abstractions in device_collectors.py
the Textual TUI under stormlog.tui
stormlog.tensorflow
The stormlog.tensorflow subpackage owns:
TFMemoryProfiler for bounded TensorFlow profiling
TensorFlowProfiler and ProfiledLayer in context_profiler.py
TensorFlowMemoryTracker (an exported alias of stormlog.tensorflow.tracker.MemoryTracker)
TensorFlowVisualizer
TensorFlowAnalyzer and TensorFlowGapFinding
TensorFlow runtime and backend diagnostics in stormlog.tensorflow.utils
The TensorFlow package does not ship a separate TUI. The shared terminal UI is the stormlog entrypoint implemented in stormlog.tui.
High-level layering
User code / shell
  |
  +-- Python APIs
  |     +-- stormlog.GPUMemoryProfiler
  |     +-- stormlog.MemoryTracker / CPUMemoryTracker
  |     +-- stormlog.tensorflow.TFMemoryProfiler
  |     +-- stormlog.tensorflow.TensorFlowMemoryTracker
  |
  +-- CLI entrypoints
  |     +-- gpumemprof
  |     +-- tfmemprof
  |     +-- stormlog
  |
  +-- Shared artifact layer
        +-- TelemetryEventV2 JSON/CSV exports
        +-- diagnose bundles
        +-- PNG / HTML visualization outputs
Core modules and responsibilities
Profilers
Bounded profilers are for “what happened inside this call or context?” questions.
stormlog.profiler.GPUMemoryProfiler
stormlog.cpu_profiler.CPUMemoryProfiler
stormlog.tensorflow.profiler.TFMemoryProfiler
They expose:
profile_function(...)
profile_context(...)
summary/result accessors such as get_summary() or get_results()
optional live monitoring helpers such as start_monitoring(...)
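The shape of that API can be sketched without a GPU. The stand-in below is not Stormlog code: SketchProfiler is a hypothetical class that borrows only the method names listed above, and it uses the standard-library tracemalloc module in place of framework-specific snapshots. The real signatures and result types may differ.

```python
import tracemalloc
from contextlib import contextmanager


class SketchProfiler:
    """Hypothetical stand-in for a bounded profiler; not the Stormlog class."""

    def __init__(self):
        self._results = []  # one entry per profiled call or context

    @contextmanager
    def profile_context(self, name):
        # Measure only what happens between entry and exit of this region.
        # Nested regions are not supported in this sketch.
        tracemalloc.start()
        before, _ = tracemalloc.get_traced_memory()
        try:
            yield
        finally:
            after, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            self._results.append(
                {"name": name, "delta_bytes": after - before, "peak_bytes": peak}
            )

    def profile_function(self, func, *args, **kwargs):
        # A profiled call is a context whose bounds are the call itself.
        with self.profile_context(func.__name__):
            return func(*args, **kwargs)

    def get_results(self):
        return list(self._results)

    def get_summary(self):
        peaks = [r["peak_bytes"] for r in self._results]
        return {"regions": len(peaks), "max_peak_bytes": max(peaks, default=0)}


def build_buffer():
    return bytearray(1_000_000)


profiler = SketchProfiler()
buf = profiler.profile_function(build_buffer)
with profiler.profile_context("second_region"):
    extra = [0] * 10_000
summary = profiler.get_summary()
```

The point of the sketch is the boundary: each result describes one call or one context, which is what separates a bounded profiler from a tracker.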
Trackers
Trackers are for “what happened over time?” questions.
stormlog.tracker.MemoryTracker
stormlog.cpu_profiler.CPUMemoryTracker
stormlog.tensorflow.tracker.MemoryTracker, exported as TensorFlowMemoryTracker
Trackers are responsible for:
background sampling
event generation
threshold-triggered alerts
timeline aggregation
exportable telemetry events
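As a rough illustration of those responsibilities, the sketch below samples a caller-supplied reader on a background thread and records threshold alerts next to the samples. SketchTracker, its constructor parameters, and the event dictionary keys are all hypothetical; they are not the MemoryTracker API.

```python
import itertools
import threading
import time


class SketchTracker:
    """Hypothetical tracker that samples a callable on a background thread."""

    def __init__(self, read_bytes, interval_s=0.01, alert_threshold_bytes=None):
        self._read_bytes = read_bytes
        self._interval_s = interval_s
        self._threshold = alert_threshold_bytes
        self._stop = threading.Event()
        self._thread = None
        self.events = []

    def _run(self):
        while not self._stop.is_set():
            used = self._read_bytes()
            self.events.append({"ts": time.time(), "type": "sample", "used_bytes": used})
            if self._threshold is not None and used >= self._threshold:
                self.events.append({"ts": time.time(), "type": "alert", "used_bytes": used})
            # Event.wait doubles as an interruptible sleep between samples.
            self._stop.wait(self._interval_s)

    def start(self):
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def timeline(self):
        # Timeline aggregation: keep only (timestamp, bytes) pairs from samples.
        return [(e["ts"], e["used_bytes"]) for e in self.events if e["type"] == "sample"]


# A fake reader that grows by 100 bytes per sample stands in for a real device query.
fake_usage = itertools.count(start=100, step=100)
tracker = SketchTracker(lambda: next(fake_usage), interval_s=0.005, alert_threshold_bytes=300)
tracker.start()
time.sleep(0.2)
tracker.stop()
alerts = [e for e in tracker.events if e["type"] == "alert"]
```

Passing the reader in as a callable keeps the sampling loop independent of any backend, which is the same separation the device collectors provide on the PyTorch side.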
Telemetry
stormlog.telemetry is the shared interchange layer used by trackers, CLI tools, diagnostics, and the TUI.
Key responsibilities:
normalize legacy records into canonical TelemetryEventV2
validate event shape
load saved event streams from disk
group saved artifacts into session-aware capture units
resolve distributed identity defaults from environment variables or explicit inputs
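A minimal sketch of normalization, validation, and identity resolution follows. Every field name here (timestamp, memory_mb, timestamp_ns, allocated_bytes, schema_version) is a hypothetical placeholder and not the TelemetryEventV2 schema. RANK and WORLD_SIZE are the variables that launchers such as torchrun set; whether Stormlog reads exactly these is an assumption.

```python
import os

# Hypothetical canonical fields; the real schema is defined in stormlog.telemetry.
REQUIRED_FIELDS = ("schema_version", "timestamp_ns", "allocated_bytes", "rank", "world_size")


def resolve_identity(rank=None, world_size=None, env=None):
    """Explicit inputs win; otherwise use the environment, then single-process defaults."""
    env = os.environ if env is None else env
    if rank is None:
        rank = int(env.get("RANK", 0))
    if world_size is None:
        world_size = int(env.get("WORLD_SIZE", 1))
    return {"rank": rank, "world_size": world_size}


def validate_event(event):
    """Raise ValueError when the event does not have the expected shape."""
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if event["allocated_bytes"] < 0:
        raise ValueError("allocated_bytes must be non-negative")
    if not 0 <= event["rank"] < event["world_size"]:
        raise ValueError("rank must be in [0, world_size)")


def normalize_record(record, env=None):
    """Map a legacy-style dict onto the hypothetical canonical shape."""
    if "timestamp_ns" in record:
        timestamp_ns = int(record["timestamp_ns"])
    else:
        # Assumed legacy form: float seconds under 'timestamp'.
        timestamp_ns = int(float(record["timestamp"]) * 1_000_000_000)
    if "allocated_bytes" in record:
        allocated = int(record["allocated_bytes"])
    else:
        # Assumed legacy form: mebibytes under 'memory_mb'.
        allocated = int(record["memory_mb"] * 1024 * 1024)
    event = {"schema_version": 2, "timestamp_ns": timestamp_ns, "allocated_bytes": allocated}
    event.update(resolve_identity(record.get("rank"), record.get("world_size"), env))
    validate_event(event)
    return event


legacy = {"timestamp": 12.5, "memory_mb": 2}
event = normalize_record(legacy, env={"RANK": "3", "WORLD_SIZE": "8"})
```

Normalizing at load time means every consumer downstream can assume one event shape instead of branching on the record's age.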
stormlog.session is the shared lifecycle contract used by trackers, append-only sinks, diagnose bundles, OOM bundles, CLI analysis, and TUI diagnostics.
It defines:
unique session_id generation
lifecycle states: running, completed, interrupted, incomplete
session summaries with host, pid, distributed identity, and source metadata
default session selection order for multi-session artifact roots
This shared schema is what allows Stormlog tracker exports, TensorFlow tracker exports, diagnose bundles, and TUI diagnostics loading to operate on the same underlying event model.
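The session contract can be pictured with a small data model. The four state names come from the list above; everything else (SessionSummary as written here, its field names, the uuid-based identifier, and the reading of incomplete as "reloaded without a terminal status") is an assumption made for illustration.

```python
import os
import socket
import uuid
from dataclasses import dataclass, field
from enum import Enum


class SessionStatus(str, Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    INTERRUPTED = "interrupted"
    INCOMPLETE = "incomplete"


@dataclass
class SessionSummary:
    """Hypothetical summary record; not the stormlog.session type."""

    source: str
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    host: str = field(default_factory=socket.gethostname)
    pid: int = field(default_factory=os.getpid)
    status: SessionStatus = SessionStatus.RUNNING

    def finish(self, clean=True):
        # A clean stop marks the session completed; anything else is interrupted.
        self.status = SessionStatus.COMPLETED if clean else SessionStatus.INTERRUPTED


def status_on_reload(saved_status):
    """Assumption: a saved session still marked 'running' never reached a terminal state."""
    status = SessionStatus(saved_status)
    return SessionStatus.INCOMPLETE if status is SessionStatus.RUNNING else status


first = SessionSummary(source="tracker")
second = SessionSummary(source="diagnose")
first.finish()
second.finish(clean=False)
```

Because the status enum subclasses str, its members serialize as plain strings in JSON manifests without a custom encoder.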
Device collectors
stormlog.device_collectors is the backend-aware abstraction for PyTorch-side device memory sampling.
Current collector contract:
sample() returns a normalized DeviceMemorySample
capabilities() reports backend metadata such as supports_device_total
name() identifies the runtime backend (cuda, rocm, mps)
Current concrete collectors:
CudaDeviceCollector
ROCmDeviceCollector
MPSDeviceCollector
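The three-method contract maps naturally onto a structural interface. In the sketch below, SketchSample and SketchCollector are hypothetical stand-ins for DeviceMemorySample and the collector base type, and the sample fields are guesses. Only the method names and the supports_device_total capability key are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class SketchSample:
    """Hypothetical stand-in for DeviceMemorySample; field names are assumptions."""

    allocated_bytes: int
    reserved_bytes: int
    total_bytes: Optional[int]  # None when the backend cannot report a device total


@runtime_checkable
class SketchCollector(Protocol):
    def sample(self) -> SketchSample: ...

    def capabilities(self) -> dict: ...

    def name(self) -> str: ...


class FakeCollector:
    """In-memory collector for tests that run without an accelerator."""

    def __init__(self, allocated_bytes, reserved_bytes):
        self._allocated = allocated_bytes
        self._reserved = reserved_bytes

    def sample(self):
        return SketchSample(self._allocated, self._reserved, total_bytes=None)

    def capabilities(self):
        return {"supports_device_total": False}

    def name(self):
        return "fake"


def describe(collector: SketchCollector) -> str:
    # Callers branch on declared capabilities rather than on the backend name.
    reading = collector.sample()
    if collector.capabilities().get("supports_device_total"):
        return f"{collector.name()}: {reading.allocated_bytes}/{reading.total_bytes} bytes"
    return f"{collector.name()}: {reading.allocated_bytes} bytes allocated"


collector = FakeCollector(allocated_bytes=512, reserved_bytes=1024)
line = describe(collector)
```

Reporting capabilities explicitly lets callers skip metrics a backend cannot provide instead of guessing from the backend name.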
Analyzers
Analyzers turn raw or normalized memory data into higher-level findings.
stormlog.analyzer.MemoryAnalyzer
stormlog.tensorflow.analyzer.MemoryAnalyzer
gap-analysis and collective-attribution helpers in stormlog
common metric formulas centralized in stormlog.derived_fields
These modules power:
leak and growth heuristics
hidden-memory gap analysis
distributed diagnostics summaries
recommendation text in CLI or artifact flows
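To make "growth heuristic" concrete, here is one simple possibility: fit a least-squares line through (seconds, bytes) samples and flag the series when the slope exceeds a threshold. This is an illustrative technique, not the heuristic MemoryAnalyzer implements. The function names, the finding keys, and the threshold are hypothetical.

```python
def growth_slope(samples):
    """Least-squares slope in bytes per second for (seconds, bytes) pairs."""
    if len(samples) < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_b = sum(b for _, b in samples) / len(samples)
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return 0.0  # all samples share one timestamp, so no slope is defined
    return sum((t - mean_t) * (b - mean_b) for t, b in samples) / spread


def find_growth(samples, min_slope_bytes_per_s=1.0):
    """Return a finding dict when memory grows steadily, otherwise None."""
    slope = growth_slope(samples)
    if slope < min_slope_bytes_per_s:
        return None
    return {
        "kind": "sustained_growth",
        "slope_bytes_per_s": slope,
        "recommendation": "Check for tensors or caches that are retained across steps.",
    }


growing = [(0, 100), (1, 200), (2, 300), (3, 400)]
flat = [(0, 100), (1, 100), (2, 100), (3, 100)]
finding = find_growth(growing)
no_finding = find_growth(flat)
```

A finding pairs a measured quantity with recommendation text, which is the general shape the CLI and artifact flows consume.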
Visualizers
Visualizers convert profiler or tracker output into human-readable plots.
stormlog.visualizer.MemoryVisualizer
stormlog.tensorflow.visualizer.MemoryVisualizer
The PyTorch-side visualizer also underpins the TUI plot export path for:
PNG timeline plots
HTML timeline plots
heatmaps
multi-panel dashboard exports
TUI architecture
The stormlog console script points to stormlog.tui:run_app.
The TUI is assembled from:
stormlog.tui.app for the main Textual application
stormlog.tui.monitor.TrackerSession for adapting tracker data into the UI
stormlog.tui.distributed_diagnostics for artifact loading and rank-level summaries
stormlog.tui.widgets.* for tables, panels, and timeline rendering
Current tabs are:
Overview
PyTorch
TensorFlow
Monitoring
Visualizations
Diagnostics
CLI & Actions
The TUI is not a separate analysis engine. It reuses:
tracker sessions for live data
TelemetryEventV2 records for artifact loading
session-aware artifact loading so users can switch between discovered captures
MemoryVisualizer-style plot generation for PNG/HTML export
Main runtime flows
1. Bounded profiling flow
User code
-> profiler.profile_function(...) or profiler.profile_context(...)
-> framework/runtime-specific snapshots
-> in-memory result object
-> summary/report accessors
2. Tracking flow
Tracker start
-> create session summary
-> periodic sampling
-> alert evaluation
-> event storage
-> append-only sink / OOM artifacts reference session_id
-> statistics / timeline / export helpers
-> clean stop marks session completed
3. Diagnose flow
CLI diagnose
-> create session summary
-> runtime/system info
-> telemetry capture
-> manifest records session_id and terminal status
-> artifact bundle on disk
-> later reload in TUI Diagnostics or analyzer paths
4. TUI flow
TrackerSession or artifact path input
-> normalized telemetry events
-> timeline / rank-table rendering
-> optional PNG or HTML export
Configuration model
Stormlog configuration is currently local to:
constructor arguments
method parameters
CLI flags
distributed identity environment inference inside telemetry helpers
There is no repo-level persistent config file format today.
Error-handling model
The codebase prefers capability-gated behavior over silent fallback.
Examples:
PyTorch-specific APIs raise import/runtime errors when torch is missing
GPUMemoryProfiler is for CUDA-backed profiling, while CPU-only workflows use separate CPU profiler classes
TUI startup currently hard-imports torch, so the stormlog command requires both the TUI and PyTorch dependencies to be installed
telemetry loaders collect warnings when artifact payloads are malformed or incomplete
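Two of those patterns can be sketched briefly: failing loudly when a required module is absent, and collecting warnings instead of aborting while loading artifacts. Both helpers below are hypothetical, and the one-JSON-object-per-line input is an assumption about the artifact format, not a description of Stormlog's loaders.

```python
import importlib.util
import json


def require_module(module_name, feature):
    """Raise a clear ImportError instead of silently degrading."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{feature} requires the '{module_name}' package, which is not installed."
        )


def load_events(lines):
    """Parse one JSON object per line and return (events, warnings)."""
    events, warnings = [], []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            payload = json.loads(line)
        except json.JSONDecodeError as exc:
            warnings.append(f"line {line_no}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(payload, dict):
            warnings.append(f"line {line_no}: expected a JSON object")
            continue
        events.append(payload)
    return events, warnings


require_module("json", "JSON loading")  # present in the standard library, so no error
try:
    require_module("no_such_backend_module", "GPU profiling")
except ImportError as exc:
    gate_message = str(exc)

events, warnings = load_events(['{"allocated_bytes": 1}', "not json", "[1, 2]"])
```

The gate fails before any work starts, while the loader keeps every valid event and reports the rest, so one bad line does not hide a whole capture.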
Test architecture
The repo test layout is split by behavior, not by package alone:
tests/
  test_*.py    # core, CLI, telemetry, analyzer, and framework tests
  tui/         # Textual pilot and snapshot coverage
  e2e/         # PTY smoke coverage
Current marker families used in the repo:
slow
integration
unit
tui_pilot
tui_pty
tui_snapshot
The operational guide for running those slices lives in the Testing and Validation Guide.
Extensibility points that exist today
The repo currently exposes a few real extension seams:
backend collection through DeviceMemoryCollector
telemetry normalization through stormlog.telemetry
new CLI/documentation workflows through example modules and diagnose artifacts
new TUI tables or views through stormlog.tui.widgets
Anything beyond those seams should be treated as new feature work, not assumed architecture.