stormlog.tracker
Real-time memory tracking and monitoring.
Classes
MemoryTracker | Real-time memory tracker with alerts and monitoring.
MemoryWatchdog | Memory watchdog for automated memory management.
TrackingEvent | Represents a memory tracking event.
- class stormlog.tracker.TrackingEvent(timestamp, event_type, memory_allocated, memory_reserved, memory_change, device_id, session_id=None, context=None, job_id=None, rank=0, local_rank=0, world_size=1, metadata=None, active_memory=None, inactive_memory=None, device_used=None, device_free=None, device_total=None, backend='cuda')[source]
Bases: object
Represents a memory tracking event.
- Parameters:
timestamp (float)
event_type (str)
memory_allocated (int)
memory_reserved (int)
memory_change (int)
device_id (int)
session_id (str | None)
context (str | None)
job_id (str | None)
rank (int)
local_rank (int)
world_size (int)
metadata (Dict[str, Any] | None)
active_memory (int | None)
inactive_memory (int | None)
device_used (int | None)
device_free (int | None)
device_total (int | None)
backend (str)
- timestamp: float
- event_type: str
- memory_allocated: int
- memory_reserved: int
- memory_change: int
- device_id: int
- session_id: str | None = None
- context: str | None = None
- job_id: str | None = None
- rank: int = 0
- local_rank: int = 0
- world_size: int = 1
- metadata: Dict[str, Any] | None = None
- active_memory: int | None = None
- inactive_memory: int | None = None
- device_used: int | None = None
- device_free: int | None = None
- device_total: int | None = None
- backend: str = 'cuda'
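The fields above can be read as plain attributes. The following is a minimal sketch of a formatter that uses only attribute names listed for TrackingEvent. It assumes the memory_* fields are byte counts, which the reference does not state (it gives only `int`). The helper `describe_event`, the stand-in object, and the event type string "allocation" are illustrative and not part of stormlog.

```python
from types import SimpleNamespace

MIB = 1024 * 1024


def describe_event(event) -> str:
    """Format one event using only fields documented on TrackingEvent."""
    # Assumption: memory_* values are bytes; the reference only says `int`.
    allocated_mib = event.memory_allocated / MIB
    reserved_mib = event.memory_reserved / MIB
    change_mib = event.memory_change / MIB
    return (
        f"[{event.backend}:{event.device_id}] {event.event_type} "
        f"allocated={allocated_mib:.1f} MiB reserved={reserved_mib:.1f} MiB "
        f"change={change_mib:+.1f} MiB"
    )


# Stand-in with the documented attribute names (not a real TrackingEvent).
# The event_type value is a hypothetical example.
sample = SimpleNamespace(
    timestamp=0.0,
    event_type="allocation",
    memory_allocated=512 * MIB,
    memory_reserved=1024 * MIB,
    memory_change=64 * MIB,
    device_id=0,
    backend="cuda",
)
print(describe_event(sample))
```

Because the helper relies on attribute access only, it should also accept real TrackingEvent instances, provided the byte-count assumption holds.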
- class stormlog.tracker.MemoryTracker(device=None, sampling_interval=0.1, max_events=10000, enable_alerts=True, enable_oom_flight_recorder=False, oom_dump_dir='oom_dumps', oom_buffer_size=None, oom_max_dumps=5, oom_max_total_mb=256, job_id=None, rank=None, local_rank=None, world_size=None, enable_native_cuda_history=False, native_history_max_entries=100000, telemetry_sink_config=None)[source]
Bases: object
Real-time memory tracker with alerts and monitoring.
- Parameters:
device (str | int | torch.device | None)
sampling_interval (float)
max_events (int)
enable_alerts (bool)
enable_oom_flight_recorder (bool)
oom_dump_dir (str)
oom_buffer_size (int | None)
oom_max_dumps (int)
oom_max_total_mb (int)
job_id (str | None)
rank (int | None)
local_rank (int | None)
world_size (int | None)
enable_native_cuda_history (bool)
native_history_max_entries (int)
telemetry_sink_config (TelemetrySinkConfig | None)
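In a distributed job, the rank-related constructor arguments can be filled from the environment. The sketch below builds a keyword dictionary using the parameter names from the signature above. It assumes the RANK, LOCAL_RANK and WORLD_SIZE variables that torchrun sets; the JOB_ID variable name and the helper `distributed_tracker_kwargs` are hypothetical.

```python
import os


def distributed_tracker_kwargs(env=None):
    """Collect MemoryTracker keyword arguments from environment variables."""
    env = os.environ if env is None else env

    def _int_or_none(name):
        value = env.get(name)
        return int(value) if value is not None else None

    return {
        # Values below repeat the documented defaults.
        "sampling_interval": 0.1,
        "max_events": 10000,
        "enable_alerts": True,
        # Hypothetical variable name; use whatever your scheduler provides.
        "job_id": env.get("JOB_ID"),
        # Variables set by torchrun; None is passed when a variable is unset.
        "rank": _int_or_none("RANK"),
        "local_rank": _int_or_none("LOCAL_RANK"),
        "world_size": _int_or_none("WORLD_SIZE"),
    }


kwargs = distributed_tracker_kwargs(
    {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "4"}
)
print(kwargs)
# With stormlog installed, the dictionary could be passed as:
# tracker = MemoryTracker(**kwargs)
```

Passing None for an unset variable matches the documented defaults for rank, local_rank and world_size.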
- get_session_summary()[source]
Return the current or most recent tracking session summary.
- Return type:
SessionSummary | None
- property oom_buffer_size: int
Resolved OOM ring-buffer size.
- enter_phase(name, *, metadata=None)[source]
Enter one structured workload phase while tracking is active.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
- phase(name, *, metadata=None)[source]
Context manager that emits structured phase enter and exit records.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
Any
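Since phase() is described as a context manager, workload steps can be wrapped one at a time. The following is a sketch under that assumption; `run_phases`, the metadata key "step_index", and the stub tracker are illustrative and not part of stormlog. The stub only mimics the documented call shape so the example runs without the library.

```python
from contextlib import contextmanager


def run_phases(tracker, steps):
    """Run each (name, fn) pair inside tracker.phase(name)."""
    results = {}
    for index, (name, fn) in enumerate(steps):
        # metadata is keyword-only in the documented signature.
        with tracker.phase(name, metadata={"step_index": index}):
            results[name] = fn()
    return results


class _StubTracker:
    """Stand-in that records phase boundaries; not a stormlog class."""

    def __init__(self):
        self.log = []

    @contextmanager
    def phase(self, name, *, metadata=None):
        self.log.append(("enter", name))
        try:
            yield
        finally:
            self.log.append(("exit", name))


stub = _StubTracker()
out = run_phases(stub, [("forward", lambda: 1), ("backward", lambda: 2)])
print(out, stub.log)
```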
- handle_exception(exc, context=None, metadata=None)[source]
Capture OOM diagnostics for recognized OOM exceptions.
- Parameters:
exc (BaseException)
context (str | None)
metadata (Dict[str, Any] | None)
- Return type:
str | None
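One way to use handle_exception() is inside an except block that re-raises afterwards. The sketch below assumes a None return means the exception was not recognized as OOM. The reference does not say what the returned string contains, so the code only prints it. `guarded_call`, the context label, and the stub are hypothetical, and the stub treating MemoryError as OOM is purely for demonstration.

```python
def guarded_call(tracker, fn, *, context="training_step"):
    """Call fn(); on failure, ask the tracker for OOM diagnostics, then re-raise."""
    try:
        return fn()
    except Exception as exc:
        result = tracker.handle_exception(
            exc,
            context=context,
            metadata={"callable": getattr(fn, "__name__", "unknown")},
        )
        if result is not None:
            # The meaning of the string is not documented here; just report it.
            print(f"OOM diagnostics captured: {result}")
        raise


class _StubTracker:
    """Stand-in with the documented method signature; not a stormlog class."""

    def __init__(self):
        self.calls = []

    def handle_exception(self, exc, context=None, metadata=None):
        self.calls.append((type(exc).__name__, context))
        return "oom-report-0" if isinstance(exc, MemoryError) else None


def _simulate_oom():
    raise MemoryError("simulated out-of-memory")


stub = _StubTracker()
try:
    guarded_call(stub, _simulate_oom)
except MemoryError:
    print("exception propagated after diagnostics were requested")
```

Re-raising keeps the original failure visible to the caller; the tracker is consulted only as a side effect.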
- capture_oom(context='runtime', metadata=None)[source]
Capture an OOM diagnostic bundle if a tracked block raises OOM.
- Parameters:
context (str)
metadata (Dict[str, Any] | None)
- Return type:
Any
- add_alert_callback(callback)[source]
Add a callback function to be called on alerts.
- Parameters:
callback (Callable[[TrackingEvent], None])
- Return type:
None
- remove_alert_callback(callback)[source]
Remove an alert callback.
- Parameters:
callback (Callable[[TrackingEvent], None])
- Return type:
None
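Any callable that accepts one TrackingEvent and returns None matches the documented callback type. The following sketch is a bounded collector that could be registered and later removed. `AlertCollector` and its `limit` argument are hypothetical, and the events fed to it here are stand-ins rather than real TrackingEvent objects.

```python
from collections import deque
from types import SimpleNamespace


class AlertCollector:
    """Callable that keeps the most recent alert events in memory."""

    def __init__(self, limit=100):
        # deque drops the oldest entry automatically once maxlen is reached.
        self.events = deque(maxlen=limit)

    def __call__(self, event) -> None:
        self.events.append(event)


collector = AlertCollector(limit=2)
# With a real tracker the collector would be registered and removed like this:
# tracker.add_alert_callback(collector)
# ...
# tracker.remove_alert_callback(collector)

# Simulate three alerts with stand-in objects (event types are hypothetical).
for kind in ("warning", "critical", "error"):
    collector(SimpleNamespace(event_type=kind))

recent_types = [event.event_type for event in collector.events]
print(recent_types)
```

Keeping a reference to the same callable object matters, since removal is documented as taking the callback itself.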
- get_events(event_type=None, last_n=None, since=None)[source]
Get tracking events with optional filtering.
- Parameters:
event_type (str | None) – Filter by event type
last_n (int | None) – Get last N events
since (float | None) – Get events since timestamp
- Returns:
List of filtered events
- Return type:
List[TrackingEvent]
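The filters can be combined with ordinary list processing. As an illustration, the sketch below looks for the event with the highest memory_allocated among the last N events. `peak_allocated` and the stub tracker are hypothetical; the stub's filtering is only a guess at reasonable behavior and may differ from how stormlog applies the filters.

```python
from types import SimpleNamespace


def peak_allocated(tracker, last_n=100, event_type=None):
    """Return the event with the largest memory_allocated, or None if empty."""
    events = tracker.get_events(event_type=event_type, last_n=last_n)
    if not events:
        return None
    return max(events, key=lambda event: event.memory_allocated)


class _StubTracker:
    """Stand-in with the documented get_events signature; not a stormlog class."""

    def __init__(self, events):
        self._events = list(events)

    def get_events(self, event_type=None, last_n=None, since=None):
        events = self._events
        if event_type is not None:
            events = [e for e in events if e.event_type == event_type]
        if since is not None:
            events = [e for e in events if e.timestamp >= since]
        if last_n is not None:
            events = events[-last_n:] if last_n > 0 else []
        return events


# Event type strings are hypothetical examples.
stub = _StubTracker(
    [
        SimpleNamespace(timestamp=1.0, event_type="sample", memory_allocated=100),
        SimpleNamespace(timestamp=2.0, event_type="sample", memory_allocated=300),
        SimpleNamespace(timestamp=3.0, event_type="sample", memory_allocated=200),
    ]
)
peak = peak_allocated(stub)
print(peak.timestamp, peak.memory_allocated)
```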
- get_memory_timeline(interval=1.0)[source]
Get memory usage timeline with specified interval.
- Parameters:
interval (float) – Time interval in seconds for aggregation
- Returns:
Dictionary with timeline data
- Return type:
Dict[str, List]
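The keys of the returned dictionary are not listed in this reference, so a cautious first step is to inspect the result without assuming any of them. The helper `timeline_shape` and the sample keys below are hypothetical.

```python
def timeline_shape(timeline):
    """Map each timeline key to the number of values stored under it."""
    return {key: len(values) for key, values in timeline.items()}


# Stand-in data; the real key names come from get_memory_timeline().
example_timeline = {"series_a": [0.0, 1.0, 2.0], "series_b": [10, 20, 30]}
shape = timeline_shape(example_timeline)
print(shape)
# With a real tracker:
# shape = timeline_shape(tracker.get_memory_timeline(interval=1.0))
```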
- export_events(filename, format='csv')[source]
Export tracking events to file.
- Parameters:
filename (str) – Output filename
format (str) – Export format (‘csv’ or ‘json’)
- Return type:
None
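Both documented formats can be written in one pass. The sketch below assumes export_events() writes to the exact path it is given. `export_all_formats` and the stub tracker are hypothetical, and the stub writes placeholder text rather than real event data.

```python
import tempfile
from pathlib import Path


def export_all_formats(tracker, output_dir, stem="events"):
    """Export events once per documented format and return the paths used."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for fmt in ("csv", "json"):
        path = out / f"{stem}.{fmt}"
        # filename is documented as str, so convert the Path explicitly.
        tracker.export_events(str(path), format=fmt)
        paths.append(path)
    return paths


class _StubTracker:
    """Stand-in with the documented export_events signature; not a stormlog class."""

    def export_events(self, filename, format="csv"):
        Path(filename).write_text(f"stub {format} export\n")


with tempfile.TemporaryDirectory() as tmp:
    written = export_all_formats(_StubTracker(), tmp)
    names = [path.name for path in written]
    all_exist = all(path.exists() for path in written)

print(names, all_exist)
```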
- set_threshold(threshold_name, value)[source]
Set alert threshold.
- Parameters:
threshold_name (str) – Name of the threshold
value (int | float) – Threshold value
- Return type:
None
- get_alerts(last_n=None)[source]
Get all alert events (warnings, critical, errors).
- Parameters:
last_n (int | None)
- Return type:
List[TrackingEvent]
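Alert events can be summarized by their event_type field. The following is a small sketch; `alert_counts` and the stub are hypothetical, and the type strings "warning" and "critical" are assumed examples, since the exact values are not given in this reference.

```python
from collections import Counter
from types import SimpleNamespace


def alert_counts(tracker, last_n=None):
    """Count alert events per event_type."""
    return Counter(event.event_type for event in tracker.get_alerts(last_n=last_n))


class _StubTracker:
    """Stand-in with the documented get_alerts signature; not a stormlog class."""

    def __init__(self, alerts):
        self._alerts = list(alerts)

    def get_alerts(self, last_n=None):
        if last_n is None:
            return list(self._alerts)
        return self._alerts[-last_n:] if last_n > 0 else []


stub = _StubTracker(
    SimpleNamespace(event_type=kind)
    for kind in ("warning", "warning", "critical")
)
counts = alert_counts(stub)
print(dict(counts))
```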
- class stormlog.tracker.MemoryWatchdog(tracker, auto_cleanup=True, cleanup_threshold=0.9, aggressive_cleanup_threshold=0.95)[source]
Bases: object
Memory watchdog for automated memory management.
- Parameters:
tracker (MemoryTracker)
auto_cleanup (bool)
cleanup_threshold (float)
aggressive_cleanup_threshold (float)