stormlog
Stormlog - A comprehensive memory profiling tool.
- class stormlog.GPUMemoryProfiler(device=None, track_tensors=True, track_cpu_memory=True, collect_stack_traces=False)[source]
Bases: object
Comprehensive GPU memory profiler for PyTorch operations.
- Parameters:
device (str | int | torch.device | None)
track_tensors (bool)
track_cpu_memory (bool)
collect_stack_traces (bool)
- profile_function(func, *args, **kwargs)[source]
Profile a single function call.
- Parameters:
func (Callable[[...], Any]) – Function to profile
*args (Any) – Arguments to pass to function
**kwargs (Any) – Keyword arguments to pass to function
- Returns:
ProfileResult with profiling information
- Return type:
ProfileResult
- profile_context(name='context')[source]
Context manager for profiling a block of code.
- Parameters:
name (str) – Name for the profiled context
- Yields:
ProfileResult after the context exits
- Return type:
Any
- class stormlog.MemorySnapshot(timestamp, allocated_memory, reserved_memory, max_memory_allocated, max_memory_reserved, active_memory, inactive_memory, cpu_memory, device_id=0, operation=None, stack_trace=None)[source]
Bases: object
Represents a memory snapshot at a specific point in time.
- Parameters:
timestamp (float)
allocated_memory (int)
reserved_memory (int)
max_memory_allocated (int)
max_memory_reserved (int)
active_memory (int)
inactive_memory (int)
cpu_memory (int)
device_id (int)
operation (str | None)
stack_trace (str | None)
- timestamp: float
- allocated_memory: int
- reserved_memory: int
- max_memory_allocated: int
- max_memory_reserved: int
- active_memory: int
- inactive_memory: int
- cpu_memory: int
- device_id: int = 0
- operation: str | None = None
- stack_trace: str | None = None
- class stormlog.ProfileResult(function_name, execution_time, memory_before, memory_after, memory_peak, memory_allocated, memory_freed, tensors_created, tensors_deleted, call_count=1)[source]
Bases: object
Results from profiling a function or operation.
- Parameters:
function_name (str)
execution_time (float)
memory_before (MemorySnapshot)
memory_after (MemorySnapshot)
memory_peak (MemorySnapshot)
memory_allocated (int)
memory_freed (int)
tensors_created (int)
tensors_deleted (int)
call_count (int)
- function_name: str
- execution_time: float
- memory_before: MemorySnapshot
- memory_after: MemorySnapshot
- memory_peak: MemorySnapshot
- memory_allocated: int
- memory_freed: int
- tensors_created: int
- tensors_deleted: int
- call_count: int = 1
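As a reading aid, the sketch below mirrors a few of the documented fields in stand-in dataclasses and derives the net change in allocated memory from the before and after snapshots. `SnapshotSketch`, `ResultSketch`, and `net_allocated_change` are hypothetical names used only for illustration; they are not stormlog's classes, and the library may compute its own `memory_allocated` field differently.

```python
from dataclasses import dataclass


@dataclass
class SnapshotSketch:
    """Stand-in carrying a subset of the documented MemorySnapshot fields."""
    timestamp: float
    allocated_memory: int
    reserved_memory: int
    device_id: int = 0


@dataclass
class ResultSketch:
    """Stand-in carrying a subset of the documented ProfileResult fields."""
    function_name: str
    execution_time: float
    memory_before: SnapshotSketch
    memory_after: SnapshotSketch
    call_count: int = 1


def net_allocated_change(result: ResultSketch) -> int:
    """Bytes allocated after the call minus bytes allocated before it."""
    return (
        result.memory_after.allocated_memory
        - result.memory_before.allocated_memory
    )


before = SnapshotSketch(timestamp=0.0, allocated_memory=1_000, reserved_memory=4_096)
after = SnapshotSketch(timestamp=0.5, allocated_memory=3_500, reserved_memory=4_096)
result = ResultSketch("forward", 0.5, before, after)
print(net_allocated_change(result))
```

A negative value from this helper would mean the profiled call released more memory than it allocated.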
- stormlog.profile_context(name='context', device=None, profiler=None)[source]
Context manager for profiling a block of code.
- Parameters:
name (str) – Name for the profiled context
device (str | int | torch.device | None) – GPU device to use for profiling
profiler (GPUMemoryProfiler | None) – Custom profiler instance to use
- Yields:
ProfileResult after the context exits
- Return type:
Iterator[GPUMemoryProfiler]
Example
    with profile_context("model_forward") as prof:
        output = model(input)
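The general pattern of a profiling context manager that yields a result holder and fills it in on exit can be sketched with the standard library alone. This is not stormlog's implementation: it swaps in `tracemalloc` (Python heap tracking) for GPU memory queries, and `BlockStats` and `profile_block_sketch` are hypothetical names.

```python
import time
import tracemalloc
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class BlockStats:
    """Hypothetical result holder; its fields are filled when the block exits."""
    name: str
    execution_time: Optional[float] = None
    peak_bytes: Optional[int] = None


@contextmanager
def profile_block_sketch(name: str = "context") -> Iterator[BlockStats]:
    stats = BlockStats(name=name)
    # Only start (and later stop) tracing if nobody else is already tracing.
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    start = time.perf_counter()
    try:
        yield stats
    finally:
        # Runs even if the block raises, so the holder is always populated.
        stats.execution_time = time.perf_counter() - start
        _, stats.peak_bytes = tracemalloc.get_traced_memory()
        if started_here:
            tracemalloc.stop()


with profile_block_sketch("build_list") as stats:
    data = [0] * 100_000

print(stats.name, stats.execution_time, stats.peak_bytes)
```

The holder is only meaningful after the `with` block ends, which matches the "after the context exits" wording in the Yields description.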
- stormlog.profile_function(func=None, *, name=None, device=None, profiler=None)[source]
Decorator to profile a function’s GPU memory usage.
Can be used as @profile_function or @profile_function(name="custom_name").
- Parameters:
func (F | None) – Function to profile (when used as @profile_function)
name (str | None) – Custom name for the profiled function
device (str | int | torch.device | None) – GPU device to use for profiling
profiler (GPUMemoryProfiler | None) – Custom profiler instance to use
- Returns:
Decorated function or ProfileResult if called directly
- Return type:
Callable[[F], F] | F
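The `func=None, *, name=None` signature is the usual way to let one decorator work both bare and with keyword arguments. The sketch below shows that pattern with a simple timer; `timed_sketch` and its `calls` attribute are hypothetical and say nothing about how stormlog records results.

```python
import functools
import time
from typing import Any, Callable, Optional


def timed_sketch(func: Optional[Callable[..., Any]] = None, *, name: Optional[str] = None):
    """Decorator usable as @timed_sketch or @timed_sketch(name="...")."""

    def decorate(target: Callable[..., Any]) -> Callable[..., Any]:
        label = name or target.__name__

        @functools.wraps(target)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return target(*args, **kwargs)
            finally:
                # Record (label, elapsed seconds) for every call.
                wrapper.calls.append((label, time.perf_counter() - start))

        wrapper.calls = []
        return wrapper

    if func is not None:
        # Bare form: Python passed the function directly.
        return decorate(func)
    # Parameterized form: return the real decorator.
    return decorate


@timed_sketch
def add(a, b):
    return a + b


@timed_sketch(name="custom_name")
def mul(a, b):
    return a * b


print(add(2, 3), mul(2, 3))
print(add.calls[0][0], mul.calls[0][0])
```

Making `name` keyword-only is what keeps the two forms unambiguous: a positional argument can only ever be the function being decorated.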
- class stormlog.MemoryVisualizer(profiler=None)[source]
Bases: object
Comprehensive visualization tool for memory profiling data.
- Parameters:
profiler (GPUMemoryProfiler | None)
- plot_memory_timeline(results=None, snapshots=None, save_path=None, interactive=True)[source]
Plot memory usage over time.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to plot
snapshots (List[MemorySnapshot] | None) – List of MemorySnapshots to plot
save_path (str | None) – Path to save the plot
interactive (bool) – Whether to create interactive plot
- Returns:
Matplotlib or Plotly figure
- Return type:
matplotlib.pyplot.Figure | plotly.graph_objects.Figure
- plot_cross_rank_timeline(events, save_path=None)[source]
Plot a merged, aligned cross-rank device-memory timeline.
- Parameters:
events (List[TelemetryEventV2])
save_path (str | None)
- Return type:
matplotlib.pyplot.Figure
- plot_function_comparison(results=None, metric='memory_allocated', save_path=None, interactive=True)[source]
Compare memory usage across different functions.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to compare
metric (str) – Metric to compare ('memory_allocated', 'execution_time', 'peak_memory')
save_path (str | None) – Path to save the plot
interactive (bool) – Whether to create interactive plot
- Returns:
Matplotlib or Plotly figure
- Return type:
matplotlib.pyplot.Figure | plotly.graph_objects.Figure
- plot_memory_heatmap(results=None, save_path=None)[source]
Create a heatmap showing memory usage patterns.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to analyze
save_path (str | None) – Path to save the plot
- Returns:
Matplotlib figure
- Return type:
matplotlib.pyplot.Figure
- create_dashboard(results=None, snapshots=None, save_path=None)[source]
Create a comprehensive dashboard with multiple visualizations.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults
snapshots (List[MemorySnapshot] | None) – List of MemorySnapshots
save_path (str | None) – Path to save the dashboard
- Returns:
Plotly figure with subplots
- Return type:
plotly.graph_objects.Figure
- export_data(results=None, snapshots=None, format='csv', save_path='memory_profile_data')[source]
Export profiling data to various formats.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to export
snapshots (List[MemorySnapshot] | None) – List of MemorySnapshots to export
format (str) – Export format ('csv', 'json', 'excel')
save_path (str) – Base path for saved files
- Returns:
Path to saved file
- Return type:
str
- class stormlog.MemoryAnalyzer(profiler=None, collective_sensitivity='medium', collective_threshold_overrides=None)[source]
Bases: object
Advanced analyzer for memory profiling data.
- Parameters:
profiler (GPUMemoryProfiler | None)
collective_sensitivity (str)
collective_threshold_overrides (Mapping[str, Any] | None)
- analyze_memory_patterns(results=None)[source]
Detect memory usage patterns in profiling data.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to analyze
- Returns:
List of detected patterns
- Return type:
List[MemoryPattern]
- generate_performance_insights(results=None)[source]
Generate performance insights from profiling data.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to analyze
- Returns:
List of performance insights
- Return type:
List[PerformanceInsight]
- analyze_memory_gaps(events, *, phase_resolver=None)[source]
Classify allocator-vs-device hidden memory gaps over time.
- Parameters:
events (List[TelemetryEventV2]) – Chronologically ordered telemetry samples.
phase_resolver (PhaseReplayIndex | None)
- Returns:
Prioritized list of gap findings (severity desc, confidence desc).
- Return type:
List[GapFinding]
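The ordering "severity desc, confidence desc" can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: the severity labels and their ranking, the classification strings, the `FindingSketch` stand-in, and the definition of the gap as device-reported usage minus allocator-reserved bytes. None of it is taken from stormlog's source.

```python
from dataclasses import dataclass
from typing import List

# Assumed labels and ranking; the documented field is only `severity: str`.
SEVERITY_RANK = {"critical": 3, "warning": 2, "info": 1}


@dataclass
class FindingSketch:
    """Stand-in with three of the documented GapFinding fields."""
    classification: str
    severity: str
    confidence: float


def hidden_gap_bytes(device_used_bytes: int, allocator_reserved_bytes: int) -> int:
    """Device usage the allocator does not account for (assumed definition)."""
    return device_used_bytes - allocator_reserved_bytes


def prioritize(findings: List[FindingSketch]) -> List[FindingSketch]:
    """Sort by severity rank, then confidence, both descending."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_RANK.get(f.severity, 0), f.confidence),
        reverse=True,
    )


findings = [
    FindingSketch("example_a", "warning", 0.9),
    FindingSketch("example_b", "critical", 0.4),
    FindingSketch("example_c", "critical", 0.8),
]
ordered = prioritize(findings)
print([f.classification for f in ordered])
print(hidden_gap_bytes(device_used_bytes=6_000, allocator_reserved_bytes=4_500))
```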
- analyze_cross_rank_timeline(events, *, phase_resolver=None)[source]
Merge rank timelines and detect the earliest cluster-wide spike cause.
- Parameters:
events (List[TelemetryEventV2])
phase_resolver (PhaseReplayIndex | None)
- Return type:
Dict[str, Any]
- analyze_collective_attribution(events, *, phase_resolver=None)[source]
Attribute hidden-memory spikes to collective communication phases.
- Parameters:
events (List[TelemetryEventV2])
phase_resolver (PhaseReplayIndex | None)
- Return type:
- generate_optimization_report(results=None, events=None)[source]
Generate a comprehensive optimization report.
- Parameters:
results (List[ProfileResult] | None) – List of ProfileResults to analyze
events (List[TelemetryEventV2] | None) – Optional telemetry event series for gap analysis. When provided, the report includes a gap_analysis section.
- Returns:
Comprehensive optimization report
- Return type:
Dict[str, Any]
- class stormlog.GapFinding(classification, severity, confidence, evidence, description, remediation, evidence_timestamp_ns=None, phase_attribution=None)[source]
Bases: object
A classified finding from hidden-memory gap analysis.
- Parameters:
classification (str)
severity (str)
confidence (float)
evidence (dict[str, Any])
description (str)
remediation (List[str])
evidence_timestamp_ns (int | None)
phase_attribution (PhaseAttribution | None)
- classification: str
- severity: str
- confidence: float
- evidence: dict[str, Any]
- description: str
- remediation: List[str]
- evidence_timestamp_ns: int | None = None
- phase_attribution: PhaseAttribution | None = None
- class stormlog.MemoryTracker(device=None, sampling_interval=0.1, max_events=10000, enable_alerts=True, enable_oom_flight_recorder=False, oom_dump_dir='oom_dumps', oom_buffer_size=None, oom_max_dumps=5, oom_max_total_mb=256, job_id=None, rank=None, local_rank=None, world_size=None, enable_native_cuda_history=False, native_history_max_entries=100000, telemetry_sink_config=None)[source]
Bases: object
Real-time memory tracker with alerts and monitoring.
- Parameters:
device (str | int | torch.device | None)
sampling_interval (float)
max_events (int)
enable_alerts (bool)
enable_oom_flight_recorder (bool)
oom_dump_dir (str)
oom_buffer_size (int | None)
oom_max_dumps (int)
oom_max_total_mb (int)
job_id (str | None)
rank (int | None)
local_rank (int | None)
world_size (int | None)
enable_native_cuda_history (bool)
native_history_max_entries (int)
telemetry_sink_config (TelemetrySinkConfig | None)
- get_session_summary()[source]
Return the current or most recent tracking session summary.
- Return type:
SessionSummary | None
- property oom_buffer_size: int
Resolved OOM ring-buffer size.
- enter_phase(name, *, metadata=None)[source]
Enter one structured workload phase while tracking is active.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
- phase(name, *, metadata=None)[source]
Context manager that emits structured phase enter and exit records.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
Any
- handle_exception(exc, context=None, metadata=None)[source]
Capture OOM diagnostics for recognized OOM exceptions.
- Parameters:
exc (BaseException)
context (str | None)
metadata (Dict[str, Any] | None)
- Return type:
str | None
- capture_oom(context='runtime', metadata=None)[source]
Capture OOM diagnostic bundle if a tracked block raises OOM.
- Parameters:
context (str)
metadata (Dict[str, Any] | None)
- Return type:
Any
- add_alert_callback(callback)[source]
Add a callback function to be called on alerts.
- Parameters:
callback (Callable[[TrackingEvent], None])
- Return type:
None
- remove_alert_callback(callback)[source]
Remove an alert callback.
- Parameters:
callback (Callable[[TrackingEvent], None])
- Return type:
None
- get_events(event_type=None, last_n=None, since=None)[source]
Get tracking events with optional filtering.
- Parameters:
event_type (str | None) – Filter by event type
last_n (int | None) – Get last N events
since (float | None) – Get events since timestamp
- Returns:
List of filtered events
- Return type:
List[TrackingEvent]
- get_memory_timeline(interval=1.0)[source]
Get memory usage timeline with specified interval.
- Parameters:
interval (float) – Time interval in seconds for aggregation
- Returns:
Dictionary with timeline data
- Return type:
Dict[str, List]
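One plausible reading of "aggregation" by interval is to bucket samples into fixed-width time bins and average each bin. The sketch below does that for `(timestamp, allocated_bytes)` pairs. The function name, the input shape, the choice of the mean, and the dictionary keys are all assumptions, not the tracker's documented output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bucket_timeline_sketch(
    samples: List[Tuple[float, int]], interval: float = 1.0
) -> Dict[str, List[float]]:
    """Average (timestamp, allocated_bytes) samples per `interval` seconds."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not samples:
        return {"timestamps": [], "allocated": []}

    start = min(ts for ts, _ in samples)
    buckets = defaultdict(list)
    for ts, allocated in samples:
        # Index of the fixed-width bin this sample falls into.
        buckets[int((ts - start) // interval)].append(allocated)

    indices = sorted(buckets)
    return {
        "timestamps": [start + i * interval for i in indices],
        "allocated": [sum(buckets[i]) / len(buckets[i]) for i in indices],
    }


timeline = bucket_timeline_sketch(
    [(0.0, 100), (0.4, 300), (1.2, 500), (2.9, 700)], interval=1.0
)
print(timeline)
```

Empty bins are simply omitted in this sketch; a real implementation might instead fill them or carry the last value forward.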
- export_events(filename, format='csv')[source]
Export tracking events to file.
- Parameters:
filename (str) – Output filename
format (str) – Export format ('csv' or 'json')
- Return type:
None
- set_threshold(threshold_name, value)[source]
Set alert threshold.
- Parameters:
threshold_name (str) – Name of the threshold
value (int | float) – Threshold value
- Return type:
None
- get_alerts(last_n=None)[source]
Get all alert events (warnings, critical, errors).
- Parameters:
last_n (int | None)
- Return type:
List[TrackingEvent]
- class stormlog.OOMFlightRecorder(config)[source]
Bases: object
Bounded recorder that writes dump bundles on OOM.
- Parameters:
config (OOMFlightRecorderConfig)
- record_event(event)[source]
Append one event payload to the in-memory ring buffer.
- Parameters:
event (dict[str, Any])
- Return type:
None
- snapshot_events()[source]
Return buffered events in chronological order.
- Return type:
list[dict[str, Any]]
- dump(*, reason, exception, context, backend, metadata=None, session_summary=None)[source]
Write an OOM diagnostic bundle and enforce retention constraints.
- Parameters:
reason (str)
exception (BaseException)
context (str | None)
backend (str)
metadata (dict[str, Any] | None)
session_summary (SessionSummary | None)
- Return type:
str | None
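The "in-memory ring buffer" behavior described for `record_event` and `snapshot_events` can be sketched with `collections.deque` and a `maxlen`. `RingBufferSketch` is a hypothetical stand-in that only shows the bounded, chronological buffering; it does not write dump bundles or enforce retention, and the real recorder may be built differently.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class RingBufferSketch:
    """Keeps only the most recent `buffer_size` event payloads."""

    def __init__(self, buffer_size: int = 10000) -> None:
        # A deque with maxlen silently drops the oldest item when full.
        self._events: Deque[Dict[str, Any]] = deque(maxlen=buffer_size)

    def record_event(self, event: Dict[str, Any]) -> None:
        # Copy so later mutation by the caller does not alter the buffer.
        self._events.append(dict(event))

    def snapshot_events(self) -> List[Dict[str, Any]]:
        # Oldest first, i.e. chronological order of insertion.
        return list(self._events)


recorder = RingBufferSketch(buffer_size=3)
for seq in range(5):
    recorder.record_event({"seq": seq})

print([event["seq"] for event in recorder.snapshot_events()])
```

Bounding the buffer keeps memory overhead constant during long runs, at the cost of losing events older than the window.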
- class stormlog.OOMFlightRecorderConfig(enabled=False, dump_dir='oom_dumps', buffer_size=10000, max_dumps=5, max_total_mb=256)[source]
Bases: object
Runtime configuration for OOM flight recorder dumps.
- Parameters:
enabled (bool)
dump_dir (str)
buffer_size (int)
max_dumps (int)
max_total_mb (int)
- enabled: bool = False
- dump_dir: str = 'oom_dumps'
- buffer_size: int = 10000
- max_dumps: int = 5
- max_total_mb: int = 256
- class stormlog.OOMExceptionClassification(is_oom, reason)[source]
Bases: object
Normalized classification result for an exception.
- Parameters:
is_oom (bool)
reason (str | None)
- is_oom: bool
- reason: str | None
- stormlog.classify_oom_exception(exc)[source]
Classify whether an exception corresponds to an OOM condition.
- Parameters:
exc (BaseException)
- Return type:
OOMExceptionClassification
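A rough idea of what classifying an exception as OOM can look like is sketched below. The matching rules (Python's built-in `MemoryError`, or an "out of memory" substring in the message) and the `reason` strings are assumptions chosen for illustration; `classify_sketch` and `ClassificationSketch` are hypothetical and do not reproduce the library's actual heuristics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClassificationSketch:
    """Stand-in with the two documented fields: is_oom and reason."""
    is_oom: bool
    reason: Optional[str]


def classify_sketch(exc: BaseException) -> ClassificationSketch:
    # Host-side allocation failures surface as MemoryError in Python.
    if isinstance(exc, MemoryError):
        return ClassificationSketch(True, "python_memory_error")
    # Assumed heuristic: accelerator runtimes often mention this phrase.
    if "out of memory" in str(exc).lower():
        return ClassificationSketch(True, "message_match")
    return ClassificationSketch(False, None)


print(classify_sketch(RuntimeError("CUDA out of memory while allocating")))
print(classify_sketch(ValueError("bad shape")))
```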
- class stormlog.TelemetryEventV2(schema_version, timestamp_ns, event_type, collector, sampling_interval_ms, pid, host, device_id, allocator_allocated_bytes, allocator_reserved_bytes, allocator_active_bytes, allocator_inactive_bytes, allocator_change_bytes, device_used_bytes, device_free_bytes, device_total_bytes, context, job_id=None, rank=0, local_rank=0, world_size=1, metadata=<factory>)[source]
Bases: object
Legacy v2 telemetry event payload retained for backward-compatible writes/tests.
- Parameters:
schema_version (Literal[2])
timestamp_ns (int)
event_type (str)
collector (str)
sampling_interval_ms (int)
pid (int)
host (str)
device_id (int)
allocator_allocated_bytes (int)
allocator_reserved_bytes (int)
allocator_active_bytes (int | None)
allocator_inactive_bytes (int | None)
allocator_change_bytes (int)
device_used_bytes (int)
device_free_bytes (int | None)
device_total_bytes (int | None)
context (str | None)
job_id (str | None)
rank (int)
local_rank (int)
world_size (int)
metadata (dict[str, Any])
- schema_version: Literal[2]
- timestamp_ns: int
- event_type: str
- collector: str
- sampling_interval_ms: int
- pid: int
- host: str
- device_id: int
- allocator_allocated_bytes: int
- allocator_reserved_bytes: int
- allocator_active_bytes: int | None
- allocator_inactive_bytes: int | None
- allocator_change_bytes: int
- device_used_bytes: int
- device_free_bytes: int | None
- device_total_bytes: int | None
- context: str | None
- job_id: str | None = None
- rank: int = 0
- local_rank: int = 0
- world_size: int = 1
- metadata: dict[str, Any]
- class stormlog.DeviceMemoryCollector[source]
Bases: ABC
Backend-specific collector contract for device memory signals.
- abstract is_available()[source]
Return whether this collector can sample in the current runtime.
- Return type:
bool
- class stormlog.DeviceMemorySample(allocated_bytes, reserved_bytes, used_bytes, free_bytes, total_bytes, active_bytes, inactive_bytes, device_id)[source]
Bases: object
Normalized device-memory sample produced by a backend collector.
- Parameters:
allocated_bytes (int)
reserved_bytes (int)
used_bytes (int)
free_bytes (int | None)
total_bytes (int | None)
active_bytes (int | None)
inactive_bytes (int | None)
device_id (int)
- allocated_bytes: int
- reserved_bytes: int
- used_bytes: int
- free_bytes: int | None
- total_bytes: int | None
- active_bytes: int | None
- inactive_bytes: int | None
- device_id: int
- stormlog.build_device_memory_collector(device=None)[source]
Build a backend collector for CUDA/ROCm/MPS runtime environments.
- Parameters:
device (str | int | torch.device | None)
- Return type:
- stormlog.detect_torch_runtime_backend()[source]
Return the active torch runtime backend in this environment.
- Return type:
str
- class stormlog.CPUMemoryProfiler[source]
Bases: object
Lightweight CPU memory profiler mirroring the GPU API.
- class stormlog.CPUMemoryTracker(sampling_interval=0.5, max_events=10000, enable_alerts=True, job_id=None, rank=None, local_rank=None, world_size=None, telemetry_sink_config=None)[source]
Bases: object
CPU tracker offering a subset of the GPU tracker interface.
- Parameters:
sampling_interval (float)
max_events (int)
enable_alerts (bool)
job_id (Optional[str])
rank (Optional[int])
local_rank (Optional[int])
world_size (Optional[int])
telemetry_sink_config (Optional[TelemetrySinkConfig])
- get_session_summary()[source]
- Return type:
SessionSummary | None
- enter_phase(name, *, metadata=None)[source]
Enter one structured CPU tracking phase.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
- phase(name, *, metadata=None)[source]
Context manager that emits structured CPU phase telemetry.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
Any
- get_events(event_type=None, last_n=None, since=None)[source]
Get tracking events with optional filtering.
- Parameters:
event_type (str | None) – Filter by event type
last_n (int | None) – Get last N events
since (float | None) – Get events since timestamp
- Returns:
List of filtered events
- Return type:
List[TrackingEvent]
- get_memory_timeline(interval=1.0)[source]
- Parameters:
interval (float)
- Return type:
Dict[str, List[float]]
- stormlog.telemetry_event_from_record(record, permissive_legacy=True, default_collector='legacy.unknown', default_sampling_interval_ms=0, default_session_id=None)[source]
Create a canonical telemetry event from v3, v2, or legacy records.
- Parameters:
record (Mapping[str, Any])
permissive_legacy (bool)
default_collector (str)
default_sampling_interval_ms (int)
default_session_id (str | None)
- Return type:
- stormlog.telemetry_event_to_dict(event)[source]
Serialize a telemetry event to a plain dictionary.
- Parameters:
event (TelemetryEventV3 | TelemetryEventV2)
- Return type:
dict[str, Any]
- stormlog.validate_telemetry_record(record)[source]
Validate a v2 or v3 telemetry record.
- Raises:
ValueError – if the record is invalid or partial.
- Parameters:
record (Mapping[str, Any])
- Return type:
None
- stormlog.load_telemetry_events(path, permissive_legacy=True, events_key=None, session_id=None)[source]
Load telemetry events from JSON and return the selected session.
- Parameters:
path (str | Path)
permissive_legacy (bool)
events_key (str | None)
session_id (str | None)
- Return type:
list[TelemetryEventV3]
- stormlog.resolve_distributed_identity(*, job_id=None, rank=None, local_rank=None, world_size=None, metadata=None, env=None)[source]
Normalize distributed identity fields from explicit, metadata, or env inputs.
- Parameters:
job_id (Any)
rank (Any)
local_rank (Any)
world_size (Any)
metadata (Mapping[str, Any] | None)
env (Mapping[str, str] | None)
- Return type:
dict[str, Any]
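Resolving identity fields "from explicit, metadata, or env inputs" suggests a precedence chain. The sketch below assumes the order explicit argument, then metadata, then environment, and assumes the environment keys `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` (the names torchrun sets). The fallback values 0, 0, and 1 follow the defaults documented for `TelemetryEventV2`. `resolve_identity_sketch` is hypothetical and omits `job_id`.

```python
import os
from typing import Any, Dict, Mapping, Optional


def _first_int(*candidates: Any, default: int) -> int:
    """Return the first candidate that parses as an int, else the default."""
    for value in candidates:
        if value is None or value == "":
            continue
        try:
            return int(value)
        except (TypeError, ValueError):
            continue
    return default


def resolve_identity_sketch(
    *,
    rank: Any = None,
    local_rank: Any = None,
    world_size: Any = None,
    metadata: Optional[Mapping[str, Any]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, int]:
    metadata = metadata or {}
    env = os.environ if env is None else env
    # Assumed precedence: explicit argument > metadata > environment.
    return {
        "rank": _first_int(rank, metadata.get("rank"), env.get("RANK"), default=0),
        "local_rank": _first_int(
            local_rank, metadata.get("local_rank"), env.get("LOCAL_RANK"), default=0
        ),
        "world_size": _first_int(
            world_size, metadata.get("world_size"), env.get("WORLD_SIZE"), default=1
        ),
    }


identity = resolve_identity_sketch(
    rank=3,
    metadata={"rank": 1, "world_size": 8},
    env={"RANK": "7", "LOCAL_RANK": "2"},
)
print(identity)
```

Accepting an explicit `env` mapping, as the documented signature does, makes the resolution testable without touching the process environment.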
- class stormlog.TimelineMarker(session_id, start_ns, end_ns, kind, source, severity, label, rank=None, local_rank=None, world_size=None, event_type=None, metadata=<factory>)[source]
Bases: object
Normalized timeline landmark derived from telemetry or annotation sources.
- Parameters:
session_id (str)
start_ns (int)
end_ns (int | None)
kind (str)
source (str)
severity (str)
label (str)
rank (int | None)
local_rank (int | None)
world_size (int | None)
event_type (str | None)
metadata (dict[str, Any])
- session_id: str
- start_ns: int
- end_ns: int | None
- kind: str
- source: str
- severity: str
- label: str
- rank: int | None = None
- local_rank: int | None = None
- world_size: int | None = None
- event_type: str | None = None
- metadata: dict[str, Any]
- property is_interval: bool
Return whether the marker spans a non-point interval.
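A minimal sketch of the point-versus-interval distinction follows, assuming a marker counts as an interval when `end_ns` is set and strictly later than `start_ns`. `MarkerSketch` is a hypothetical stand-in with only three of the documented fields, and the real property may treat edge cases differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MarkerSketch:
    """Stand-in with a subset of the documented TimelineMarker fields."""
    start_ns: int
    end_ns: Optional[int]
    label: str

    @property
    def is_interval(self) -> bool:
        # Assumed rule: a missing or non-later end means a point marker.
        return self.end_ns is not None and self.end_ns > self.start_ns


point = MarkerSketch(start_ns=1_000, end_ns=None, label="alert")
span = MarkerSketch(start_ns=1_000, end_ns=5_000, label="forward")
print(point.is_interval, span.is_interval)
```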
- stormlog.derive_timeline_markers(events, *, include_phase_markers=True)[source]
Derive normalized timeline markers from telemetry events.
- Parameters:
events (Sequence[Any])
include_phase_markers (bool)
- Return type:
list[TimelineMarker]
- stormlog.derive_session_timeline_markers(session, *, include_phase_markers=True)[source]
Derive normalized markers from one loaded telemetry session.
- Parameters:
session (LoadedTelemetrySession)
include_phase_markers (bool)
- Return type:
list[TimelineMarker]
- stormlog.timeline_marker_to_dict(marker)[source]
Serialize a marker into a JSON-safe mapping.
- Parameters:
marker (TimelineMarker)
- Return type:
dict[str, Any]
- stormlog.get_gpu_info(device=None)[source]
Get comprehensive GPU information.
- Parameters:
device (str | int | torch.device | None) – GPU device to query (None for current device)
- Returns:
Dictionary with GPU information
- Return type:
Dict[str, Any]
- stormlog.format_bytes(bytes_value, precision=2)[source]
Format bytes into human-readable format.
- Parameters:
bytes_value (int) – Number of bytes
precision (int) – Decimal precision
- Returns:
Formatted string (e.g., “1.25 GB”)
- Return type:
str
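For illustration, a byte formatter that matches the documented "1.25 GB" style can be sketched as below. It assumes binary steps of 1024 and the unit labels B through TB; whether stormlog uses 1024 or 1000 is not stated in this reference, and `format_bytes_sketch` is a hypothetical name.

```python
def format_bytes_sketch(bytes_value: int, precision: int = 2) -> str:
    """Render a byte count with the largest unit that keeps the value < 1024."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(bytes_value)
    for unit in units[:-1]:
        if abs(value) < 1024.0:
            return f"{value:.{precision}f} {unit}"
        value /= 1024.0
    # Anything larger stays in the last unit.
    return f"{value:.{precision}f} {units[-1]}"


print(format_bytes_sketch(1_342_177_280))
print(format_bytes_sketch(512, precision=0))
```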
- stormlog.convert_bytes(value, from_unit, to_unit)[source]
Convert between different byte units.
- Parameters:
value (int | float) – Value to convert
from_unit (str) – Source unit (B, KB, MB, GB, TB)
to_unit (str) – Target unit (B, KB, MB, GB, TB)
- Returns:
Converted value
- Return type:
float
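Unit conversion of this kind usually reduces to a lookup table of factors. The sketch below assumes 1024-based factors for the documented units (B, KB, MB, GB, TB), accepts unit names case-insensitively, and raises `ValueError` for unknown units; these choices and the name `convert_bytes_sketch` are assumptions, not the library's documented behavior.

```python
from typing import Union

# Assumed binary factors for the documented unit names.
_UNIT_FACTORS = {
    "B": 1,
    "KB": 1024,
    "MB": 1024**2,
    "GB": 1024**3,
    "TB": 1024**4,
}


def convert_bytes_sketch(value: Union[int, float], from_unit: str, to_unit: str) -> float:
    """Convert `value` from one byte unit to another via a common base (bytes)."""
    try:
        source = _UNIT_FACTORS[from_unit.upper()]
        target = _UNIT_FACTORS[to_unit.upper()]
    except KeyError as exc:
        raise ValueError(f"Unknown unit: {exc.args[0]}") from None
    return float(value) * source / target


print(convert_bytes_sketch(1, "GB", "MB"))
print(convert_bytes_sketch(2048, "KB", "MB"))
```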
Modules
- Advanced analysis tools for memory profiling data.
- Stormlog-native memory visualisation with tensor attribution.
- Command-line interface for Stormlog.
- Heuristics for attributing hidden-memory spikes to collective communication.
- Shared collector-health state and retry helpers.
- Context profiler for easy function and code block profiling.
- CPU-only memory profiler and tracker.
- CUDA-native allocator history capture and attribution helpers.
- Registry-driven derived-field layer for Stormlog telemetry.
- Backend-aware device memory collector abstractions.
- Diagnostic bundle builder for the Stormlog diagnose command.
- Distributed telemetry analysis helpers.
- Shared hidden-memory gap analysis utilities.
- Durable issue fingerprints and grouped issue row models.
- JAX support for Stormlog.
- OOM flight recorder helpers for bounded event capture and dump artifacts.
- Structured phase telemetry helpers for trackers and analysis.
- Core Stormlog for PyTorch.
- Local query API for Stormlog artifact directories and telemetry files.
- Command-line interface for local Stormlog artifact queries.
- Helpers for deriving the next release version from Git tags.
- Shared session identity and lifecycle helpers.
- Canonical telemetry event schema and legacy conversion helpers.
- Backend-neutral projection over the persisted telemetry event schema.
- Append-only telemetry sink with rollover and retention bounds.
- TensorFlow support for Stormlog.
- Derived timeline marker helpers for telemetry sessions.
- Real-time memory tracking and monitoring.
- Textual-based terminal UI and top-level Stormlog dispatcher.
- Utility functions for GPU memory profiling.
- Visualization tools for GPU memory profiling data.
- Optional Weights & Biases export helpers for Stormlog outputs.