stormlog.tensorflow.tracker
Real-time TensorFlow Memory Tracking
This module provides real-time monitoring of GPU memory usage during TensorFlow model training and inference, with configurable alerts and automatic cleanup.
Classes
MemoryTracker | Real-time TensorFlow GPU memory tracker.
MemoryWatchdog | Automatic memory management and cleanup for TensorFlow.
TrackingResult | Results from real-time memory tracking.
- class stormlog.tensorflow.tracker.TrackingResult(start_time, end_time, memory_usage=<factory>, timestamps=<factory>, events=<factory>, alerts_triggered=<factory>, peak_memory=0.0, average_memory=0.0, min_memory=inf, history_window_limit=0, history_retained_samples=0, history_dropped_samples=0, history_retained_events=0, history_dropped_events=0, history_retained_alerts=0, history_dropped_alerts=0)[source]
Bases: object
Results from real-time memory tracking.
- Parameters:
start_time (float)
end_time (float)
memory_usage (List[float])
timestamps (List[float])
events (List[Dict])
alerts_triggered (List[Dict])
peak_memory (float)
average_memory (float)
min_memory (float)
history_window_limit (int)
history_retained_samples (int)
history_dropped_samples (int)
history_retained_events (int)
history_dropped_events (int)
history_retained_alerts (int)
history_dropped_alerts (int)
- start_time: float
- end_time: float
- memory_usage: List[float]
- timestamps: List[float]
- events: List[Dict]
- alerts_triggered: List[Dict]
- peak_memory: float = 0.0
- average_memory: float = 0.0
- min_memory: float = inf
- history_window_limit: int = 0
- history_retained_samples: int = 0
- history_dropped_samples: int = 0
- history_retained_events: int = 0
- history_dropped_events: int = 0
- history_retained_alerts: int = 0
- history_dropped_alerts: int = 0
- property duration: float
Total tracking duration.
- property memory_growth_rate: float
Memory growth rate in MB/second.
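The two derived properties can be approximated from the recorded fields. The sketch below is not stormlog's implementation. It assumes that `duration` is `end_time - start_time` and that the growth rate is the net change between the first and last samples divided by the time between them. Both helper functions are hypothetical stand-ins written for illustration.

```python
from typing import List


def duration(start_time: float, end_time: float) -> float:
    """Elapsed tracking time in seconds (assumed definition)."""
    return end_time - start_time


def memory_growth_rate(memory_usage: List[float], timestamps: List[float]) -> float:
    """Net memory change in MB per second between the first and last sample.

    Returns 0.0 when there are fewer than two samples or no elapsed time,
    so the caller never divides by zero.
    """
    if len(memory_usage) < 2 or len(timestamps) < 2:
        return 0.0
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        return 0.0
    return (memory_usage[-1] - memory_usage[0]) / elapsed


# Made-up samples: memory in MB, timestamps in seconds.
samples_mb = [1000.0, 1100.0, 1300.0]
sample_times = [0.0, 1.0, 2.0]
rate = memory_growth_rate(samples_mb, sample_times)
```

A positive rate sustained over a long run is a common sign of a leak, while a rate near zero suggests usage has stabilized.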
- class stormlog.tensorflow.tracker.MemoryTracker(sampling_interval=1.0, alert_threshold_mb=None, device=None, enable_logging=True, max_history=10000, job_id=None, rank=None, local_rank=None, world_size=None, telemetry_sink_config=None)[source]
Bases: object
Real-time TensorFlow GPU memory tracker.
- Parameters:
sampling_interval (float)
alert_threshold_mb (float | None)
device (str | None)
enable_logging (bool)
max_history (int)
job_id (str | None)
rank (int | None)
local_rank (int | None)
world_size (int | None)
telemetry_sink_config (TelemetrySinkConfig | None)
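The `max_history` parameter and the `history_*` counters on `TrackingResult` suggest a bounded history that keeps recent samples and counts the ones it discards. The sketch below shows one way such a window could work, assuming the oldest sample is evicted first. The `BoundedHistory` class is hypothetical and is not part of stormlog.

```python
from collections import deque
from typing import Deque, List


class BoundedHistory:
    """Keeps the newest `limit` samples and counts the ones it evicts."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._samples: Deque[float] = deque(maxlen=limit)
        self.dropped = 0

    def append(self, value: float) -> None:
        # A full deque silently discards its oldest item on append,
        # so the eviction is counted before it happens.
        if len(self._samples) == self.limit:
            self.dropped += 1
        self._samples.append(value)

    @property
    def retained(self) -> int:
        return len(self._samples)

    def as_list(self) -> List[float]:
        return list(self._samples)


history = BoundedHistory(limit=3)
for sample_mb in [10.0, 20.0, 30.0, 40.0, 50.0]:
    history.append(sample_mb)
```

Under this assumption, the retained and dropped counts always sum to the total number of samples observed.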
- property memory_usage: List[float]
- property timestamps: List[float]
- property events: List[Dict[str, Any]]
- property alerts: List[Dict[str, Any]]
- get_session_summary()[source]
Return the active or most recent TensorFlow tracking session.
- Return type:
SessionSummary | None
- add_alert_callback(callback)[source]
Add callback function for memory alerts.
- Parameters:
callback (Callable[[Dict[str, Any]], None])
- Return type:
None
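A callback passed to `add_alert_callback` receives a single dictionary and returns nothing. The keys of that dictionary are not documented on this page, so the sketch below stores the payload whole instead of reading specific fields. The name `on_memory_alert` and the sample payload are made up for illustration.

```python
from typing import Any, Dict, List

received_alerts: List[Dict[str, Any]] = []


def on_memory_alert(alert: Dict[str, Any]) -> None:
    """Record the alert payload without assuming any particular keys."""
    received_alerts.append(dict(alert))  # copy so later mutation is harmless
    print(f"memory alert: {alert}")


# With a real tracker instance, registration would look like:
#     tracker.add_alert_callback(on_memory_alert)
# Here the callback is invoked directly with a made-up payload.
on_memory_alert({"example_key": "example_value"})
```

Keeping the callback short and free of exceptions is a sensible precaution, since it may run on whatever thread performs the sampling.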
- get_statistics()[source]
Return current tracker health and latest successful memory sample.
- Return type:
dict[str, Any]
- enter_phase(name, *, metadata=None)[source]
Enter one structured TensorFlow tracking phase.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
- phase(name, *, metadata=None)[source]
Context manager that emits structured TensorFlow phase telemetry.
- Parameters:
name (str)
metadata (Dict[str, Any] | None)
- Return type:
Any
- set_alert_threshold(threshold_mb)[source]
Update alert threshold.
- Parameters:
threshold_mb (float)
- Return type:
None
- class stormlog.tensorflow.tracker.MemoryWatchdog(max_memory_mb=8000, cleanup_threshold_mb=6000, check_interval=5.0)[source]
Bases: object
Automatic memory management and cleanup for TensorFlow.
- Parameters:
max_memory_mb (float)
cleanup_threshold_mb (float)
check_interval (float)
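This page does not describe how `MemoryWatchdog` uses its two thresholds. One plausible reading is that cleanup starts once usage reaches `cleanup_threshold_mb`, and that `max_memory_mb` is a hard ceiling. The sketch below encodes that assumption with a hypothetical `classify_usage` helper, using the documented defaults as its own defaults.

```python
def classify_usage(
    current_mb: float,
    cleanup_threshold_mb: float = 6000.0,
    max_memory_mb: float = 8000.0,
) -> str:
    """Map a memory reading to an action under the assumed two-threshold policy."""
    if current_mb >= max_memory_mb:
        return "over_limit"  # at or above the assumed hard ceiling
    if current_mb >= cleanup_threshold_mb:
        return "cleanup"     # high enough that cleanup would be attempted
    return "ok"


# Made-up readings in MB.
states = [classify_usage(mb) for mb in (2500.0, 6500.0, 8200.0)]
```

Keeping a gap between the two thresholds gives cleanup a chance to take effect before the ceiling is reached.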