stormlog.tensorflow.tracker

Real-time TensorFlow Memory Tracking

This module provides real-time monitoring of GPU memory usage during TensorFlow model training and inference, with configurable alerts and automatic cleanup.

Classes

MemoryTracker([sampling_interval, ...])

Real-time TensorFlow GPU memory tracker.

MemoryWatchdog([max_memory_mb, ...])

Automatic memory management and cleanup for TensorFlow.

TrackingResult(start_time, end_time[, ...])

Results from real-time memory tracking.

class stormlog.tensorflow.tracker.TrackingResult(start_time, end_time, memory_usage=<factory>, timestamps=<factory>, events=<factory>, alerts_triggered=<factory>, peak_memory=0.0, average_memory=0.0, min_memory=inf, history_window_limit=0, history_retained_samples=0, history_dropped_samples=0, history_retained_events=0, history_dropped_events=0, history_retained_alerts=0, history_dropped_alerts=0)

Bases: object

Results from real-time memory tracking.

Parameters:
  • start_time (float)

  • end_time (float)

  • memory_usage (List[float])

  • timestamps (List[float])

  • events (List[Dict])

  • alerts_triggered (List[Dict])

  • peak_memory (float)

  • average_memory (float)

  • min_memory (float)

  • history_window_limit (int)

  • history_retained_samples (int)

  • history_dropped_samples (int)

  • history_retained_events (int)

  • history_dropped_events (int)

  • history_retained_alerts (int)

  • history_dropped_alerts (int)

start_time: float
end_time: float
memory_usage: List[float]
timestamps: List[float]
events: List[Dict]
alerts_triggered: List[Dict]
peak_memory: float = 0.0
average_memory: float = 0.0
min_memory: float = inf
history_window_limit: int = 0
history_retained_samples: int = 0
history_dropped_samples: int = 0
history_retained_events: int = 0
history_dropped_events: int = 0
history_retained_alerts: int = 0
history_dropped_alerts: int = 0
property duration: float

Total tracking duration.

property memory_growth_rate: float

Memory growth rate in MB/second.
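
The summary fields and the two properties can be related to the raw samples roughly as follows. This is a minimal sketch, not stormlog's implementation. It assumes memory_usage holds values in MB, timestamps are in seconds, the duration is the span between the first and last sample, and the growth rate is the net change in memory divided by that span. The helper name summarize_samples is hypothetical.

```python
from statistics import mean
from typing import Dict, List


def summarize_samples(memory_usage: List[float], timestamps: List[float]) -> Dict[str, float]:
    """Derive TrackingResult-style summary values from raw samples.

    Assumption: memory values are in MB and timestamps are in seconds.
    """
    if not memory_usage or len(memory_usage) != len(timestamps):
        raise ValueError("need equal-length, non-empty sample lists")

    duration = timestamps[-1] - timestamps[0]
    # Assumed definition: net change in memory divided by elapsed time.
    growth = (memory_usage[-1] - memory_usage[0]) / duration if duration > 0 else 0.0
    return {
        "peak_memory": max(memory_usage),
        "average_memory": mean(memory_usage),
        "min_memory": min(memory_usage),
        "duration": duration,
        "memory_growth_rate": growth,
    }
```

With a single sample the elapsed time is zero, so the sketch reports a growth rate of 0.0 rather than dividing by zero.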

class stormlog.tensorflow.tracker.MemoryTracker(sampling_interval=1.0, alert_threshold_mb=None, device=None, enable_logging=True, max_history=10000, job_id=None, rank=None, local_rank=None, world_size=None, telemetry_sink_config=None)

Bases: object

Real-time TensorFlow GPU memory tracker.

Parameters:
  • sampling_interval (float)

  • alert_threshold_mb (float | None)

  • device (str | None)

  • enable_logging (bool)

  • max_history (int)

  • job_id (str | None)

  • rank (int | None)

  • local_rank (int | None)

  • world_size (int | None)

  • telemetry_sink_config (TelemetrySinkConfig | None)

property memory_usage: List[float]
property timestamps: List[float]
property events: List[Dict[str, Any]]
property alerts: List[Dict[str, Any]]
get_session_summary()

Return a summary of the active or most recent TensorFlow tracking session.

Return type:

SessionSummary | None

add_alert_callback(callback)

Add a callback function for memory alerts.

Parameters:

callback (Callable[[Dict[str, Any]], None])

Return type:

None
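
A callback matching the documented type, Callable[[Dict[str, Any]], None], might look like the sketch below. The keys of the alert dictionary are not documented here, so the example only stores the payload instead of reading specific fields. The helper name make_alert_collector is hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

AlertCallback = Callable[[Dict[str, Any]], None]


def make_alert_collector() -> Tuple[AlertCallback, List[Dict[str, Any]]]:
    """Return a callback plus the list it appends received alerts to."""
    received: List[Dict[str, Any]] = []

    def on_alert(alert: Dict[str, Any]) -> None:
        # Copy the payload so later mutation by the sender does not change our record.
        received.append(dict(alert))

    return on_alert, received


# Hypothetical usage, assuming `tracker` is a MemoryTracker instance:
#   on_alert, received = make_alert_collector()
#   tracker.add_alert_callback(on_alert)
```

Keeping the callback small is a deliberate choice in this sketch, on the assumption that it may be invoked from the tracker's sampling path.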

start_tracking()

Start real-time memory tracking.

Return type:

None

stop_tracking()

Stop tracking and return results.

Return type:

TrackingResult
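
The start/stop pair can be wrapped so that tracking is always stopped, even when the tracked work raises. The sketch below is one possible pattern, not part of stormlog: it accepts any object exposing start_tracking() and stop_tracking(), such as a MemoryTracker built by the caller. The helper name track_call is hypothetical.

```python
from typing import Any, Callable, Tuple


def track_call(tracker: Any, work: Callable[[], Any]) -> Tuple[Any, Any]:
    """Run work() between start_tracking() and stop_tracking().

    Returns a (work_output, tracking_result) tuple. If work() raises,
    the tracker is still stopped and the exception propagates.
    """
    tracker.start_tracking()
    try:
        output = work()
    finally:
        # Always stop, so the tracker is not left running after a failure.
        tracking_result = tracker.stop_tracking()
    return output, tracking_result


# Hypothetical usage, assuming `tracker` is a MemoryTracker instance and
# `train_one_epoch` is the caller's own function:
#   output, result = track_call(tracker, train_one_epoch)
#   print(result.peak_memory, result.duration)
```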

get_current_memory()

Get current memory usage.

Return type:

float

get_statistics()

Return the current tracker health and the latest successful memory sample.

Return type:

dict[str, Any]

enter_phase(name, *, metadata=None)

Enter one structured TensorFlow tracking phase.

Parameters:
  • name (str)

  • metadata (Dict[str, Any] | None)

Return type:

PhaseHandle

phase(name, *, metadata=None)

Context manager that emits structured TensorFlow phase telemetry.

Parameters:
  • name (str)

  • metadata (Dict[str, Any] | None)

Return type:

Any
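
Because phase() is documented as a context manager, named stages of a run can be wrapped in with blocks. The sketch below shows one way to do that for a list of steps. It works with any object exposing phase(name, *, metadata=None). The helper name run_phases and the step_index metadata key are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple


def run_phases(tracker: Any, steps: List[Tuple[str, Callable[[], Any]]]) -> Dict[str, Any]:
    """Run each (name, fn) step inside tracker.phase(name, metadata=...)."""
    outputs: Dict[str, Any] = {}
    for index, (name, fn) in enumerate(steps):
        # metadata is keyword-only in the documented signature.
        with tracker.phase(name, metadata={"step_index": index}):
            outputs[name] = fn()
    return outputs


# Hypothetical usage, assuming `tracker` is a MemoryTracker instance and the
# callables are the caller's own functions:
#   run_phases(tracker, [("load_data", load_data), ("train", train)])
```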

set_alert_threshold(threshold_mb)

Update alert threshold.

Parameters:

threshold_mb (float)

Return type:

None

check_alerts()

Check if any alerts have been triggered recently.

Return type:

bool

get_tracking_results()

Get current tracking results without stopping.

Return type:

TrackingResult
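
The read-only methods above can be combined to inspect a tracker while it keeps running. The following sketch gathers a few documented values into one dictionary. It assumes only the method names and TrackingResult fields listed on this page. The helper name snapshot and the dictionary keys are hypothetical.

```python
from typing import Any, Dict


def snapshot(tracker: Any) -> Dict[str, Any]:
    """Read documented values from a tracker without stopping it."""
    result = tracker.get_tracking_results()
    return {
        "current": tracker.get_current_memory(),
        "peak": result.peak_memory,
        "average": result.average_memory,
        "samples": len(result.memory_usage),
        "alerted": tracker.check_alerts(),
    }


# Hypothetical usage, assuming `tracker` is a started MemoryTracker:
#   info = snapshot(tracker)
#   if info["alerted"]:
#       tracker.set_alert_threshold(new_threshold_mb)
```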

class stormlog.tensorflow.tracker.MemoryWatchdog(max_memory_mb=8000, cleanup_threshold_mb=6000, check_interval=5.0)

Bases: object

Automatic memory management and cleanup for TensorFlow.

Parameters:
  • max_memory_mb (float)

  • cleanup_threshold_mb (float)

  • check_interval (float)

add_cleanup_callback(callback)

Add a cleanup callback function.

Parameters:

callback (Callable[[], None])

Return type:

None

start()

Start the memory watchdog.

Return type:

None

stop()

Stop the memory watchdog.

Return type:

None

force_cleanup()

Force immediate memory cleanup.

Return type:

None
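
A typical way to use the watchdog methods together is to register a cleanup callback, start the watchdog, and stop it in a finally block. The sketch below follows that pattern with any object exposing add_cleanup_callback(), start(), and stop(). When and how the watchdog invokes the callback is not specified on this page, so the example makes no assumption about it. The helper name run_guarded is hypothetical, and the callback only runs Python's garbage collector.

```python
import gc
from typing import Any, Callable


def run_guarded(watchdog: Any, work: Callable[[], Any]) -> Any:
    """Run work() while the watchdog is active, with a GC cleanup hook registered."""

    def collect_garbage() -> None:
        # Generic Python-level cleanup; a TensorFlow program might also
        # release framework state here (assumption, not shown).
        gc.collect()

    watchdog.add_cleanup_callback(collect_garbage)
    watchdog.start()
    try:
        return work()
    finally:
        # Stop the watchdog even if work() raises.
        watchdog.stop()


# Hypothetical usage, assuming `watchdog` is a MemoryWatchdog instance and
# `train` is the caller's own function:
#   run_guarded(watchdog, train)
```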