TelemetryEvent v3 Schema

TelemetryEvent v3 is the canonical event format for tracker exports, append-only sink segments, synthesized diagnose timelines, and loader output. The backend-neutral ProjectedTelemetryRecord described in Stormlog Telemetry Projection is an internal view derived from this persisted schema; it is not a replacement artifact format.

Schema files:

  • docs/schemas/telemetry_event_v3.schema.json

  • legacy compatibility: docs/schemas/telemetry_event_v2.schema.json

Required fields

  • schema_version (3)

  • session_id

  • timestamp_ns

  • event_type

  • collector

  • sampling_interval_ms

  • pid

  • host

  • device_id

  • allocator_allocated_bytes

  • allocator_reserved_bytes

  • allocator_active_bytes

  • allocator_inactive_bytes

  • allocator_change_bytes

  • device_used_bytes

  • device_free_bytes

  • device_total_bytes

  • context

  • metadata
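
As a rough illustration of the list above, the following sketch reports which required keys are absent from a record. It only checks key presence. The JSON schema file remains the authoritative definition, and the helper name missing_required_fields is hypothetical, not part of Stormlog.

```python
# Field names copied from the "Required fields" list above.
REQUIRED_FIELDS = (
    "schema_version", "session_id", "timestamp_ns", "event_type", "collector",
    "sampling_interval_ms", "pid", "host", "device_id",
    "allocator_allocated_bytes", "allocator_reserved_bytes",
    "allocator_active_bytes", "allocator_inactive_bytes",
    "allocator_change_bytes",
    "device_used_bytes", "device_free_bytes", "device_total_bytes",
    "context", "metadata",
)


def missing_required_fields(record: dict) -> list:
    """Return the required field names absent from record, in list order.

    Hypothetical helper: it checks key presence only, not types or values.
    """
    return [name for name in REQUIRED_FIELDS if name not in record]


# A deliberately incomplete record to show the report.
record = {"schema_version": 3, "session_id": "example", "timestamp_ns": 1}
missing = missing_required_fields(record)
print(missing)
```

For real validation, prefer validate_telemetry_record from stormlog.telemetry, which also applies the strict rules described later in this document.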

Session identity and lifecycle

Every long-running capture now belongs to exactly one session.

  • session_id is an opaque UUID string generated once per track run, TUI live session, or standalone diagnose bundle

  • session identity is the authoritative grouping key for telemetry and attached artifacts

  • host, job, and rank remain descriptive metadata, but they do not define capture ownership

Session lifecycle is recorded separately from the event stream:

  • running: session started but no terminal state has been persisted yet

  • completed: clean shutdown finished and terminal session state was written

  • interrupted: a previously running append-only sink session was recovered on the next startup

  • incomplete: loaders found partial or orphaned artifacts that cannot prove a clean or recovered stop

Default session selection when multiple sessions are present:

  1. newest completed

  2. newest interrupted

  3. newest incomplete
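
The selection order above can be sketched as a sort key. This is a minimal illustration, not Stormlog's loader code: SessionInfo and its started_ns recency field are hypothetical stand-ins, and the sketch assumes that a running session is never chosen by default because it does not appear in the documented order.

```python
from dataclasses import dataclass

# Lower value wins. "running" is intentionally absent (assumption: it is
# not eligible for default selection because the list above omits it).
STATUS_PRIORITY = {"completed": 0, "interrupted": 1, "incomplete": 2}


@dataclass(frozen=True)
class SessionInfo:
    """Hypothetical stand-in for a loaded session summary."""
    session_id: str
    status: str
    started_ns: int  # hypothetical recency key; larger means newer


def pick_default_session(sessions):
    """Return the preferred session, or None if no status is eligible."""
    candidates = [s for s in sessions if s.status in STATUS_PRIORITY]
    if not candidates:
        return None
    # Best status first, then newest within that status.
    return min(
        candidates,
        key=lambda s: (STATUS_PRIORITY[s.status], -s.started_ns),
    )


sessions = [
    SessionInfo("a", "completed", 100),
    SessionInfo("b", "interrupted", 300),
    SessionInfo("c", "completed", 200),
]
print(pick_default_session(sessions).session_id)
```

Here the newer completed session is preferred over the even newer interrupted one, because status outranks recency.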

Distributed identity fields

TelemetryEvent v3 also recognizes these top-level distributed identity fields:

  • job_id (string | null)

  • rank (integer)

  • local_rank (integer)

  • world_size (integer)

New exports always emit these fields. For single-process runs, the defaults are:

  • job_id -> null

  • rank -> 0

  • local_rank -> 0

  • world_size -> 1

TelemetryEvent v3 validation is strict:

  • unknown top-level fields are rejected

  • metadata must be a JSON object (dict in Python terms)

  • rank and local_rank must be >= 0

  • world_size must be >= 1

  • rank and local_rank must be < world_size
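
The numeric rules above are simple enough to state as code. The sketch below is an illustration of those constraints only; check_distributed_identity is a hypothetical name, and the error strings are made up for the example.

```python
def check_distributed_identity(rank, local_rank, world_size):
    """Return a list of rule violations (empty when the values are valid)."""
    errors = []
    values = (("rank", rank), ("local_rank", local_rank),
              ("world_size", world_size))
    for name, value in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{name} must be an integer")
    if errors:
        return errors  # range checks are meaningless on non-integers

    if rank < 0:
        errors.append("rank must be >= 0")
    if local_rank < 0:
        errors.append("local_rank must be >= 0")
    if world_size < 1:
        errors.append("world_size must be >= 1")
    if rank >= world_size:
        errors.append("rank must be < world_size")
    if local_rank >= world_size:
        errors.append("local_rank must be < world_size")
    return errors


print(check_distributed_identity(0, 0, 1))  # single-process defaults
print(check_distributed_identity(4, 0, 4))  # rank out of range
```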

Collector values

  • stormlog.cuda_tracker

  • stormlog.rocm_tracker

  • stormlog.mps_tracker

  • stormlog.cpu_tracker

  • stormlog.tensorflow.memory_tracker

Backend capability metadata

Tracker exports may include backend capability hints under metadata:

  • backend

  • supports_device_total

  • supports_device_free

  • sampling_source

Collector health metadata

Always-on tracker exports annotate collector degradation in metadata:

  • collector_health_status (healthy, degraded, unhealthy)

  • telemetry_partial (bool)

  • collector_partial_fields (list[str])

  • collector_last_error (string | null)

  • collector_consecutive_failures (integer)

  • collector_next_retry_epoch_s (float | null)

Tracker exports may also emit these lifecycle events:

  • collector_degraded

  • collector_recovered

  • start

  • stop

  • phase_enter

  • phase_exit

When the collector cannot produce core metrics, Stormlog pauses sample emission until the next retry window and records the degraded state instead of exporting synthetic zero-valued samples.

Structured phase metadata

Phase-aware trackers store workload boundaries under metadata["phase_scope"]. This payload is attached to phase_enter and phase_exit events and is replayed later by the analyzer and TUI.

The current shape is:

  • action (enter or exit)

  • name (leaf phase label)

  • path (list[str])

  • depth (int)

  • scope_id (string)

  • parent_scope_id (string | null)

  • thread_id (int)

  • thread_name (string)

  • sequence (int)

  • attributes (object, optional)

Example:

{
  "event_type": "phase_enter",
  "metadata": {
    "phase_scope": {
      "action": "enter",
      "name": "forward",
      "path": ["train", "forward"],
      "depth": 2,
      "scope_id": "session-1:2",
      "parent_scope_id": "session-1:1",
      "thread_id": 88,
      "thread_name": "MainThread",
      "sequence": 2,
      "attributes": {"epoch": 3}
    }
  }
}

Timeline markers

Timeline markers are a derived view over canonical telemetry, not a new TelemetryEvent top-level schema. This keeps v3 event validation strict while letting CLI, TUI, and query surfaces align important landmarks on one timeline.

The first marker contract is exposed through stormlog.timeline_markers:

  • TimelineMarker

  • derive_timeline_markers(events)

  • derive_session_timeline_markers(loaded_session)

  • timeline_marker_to_dict(marker)

TUI artifact loading exposes the same derived view without mutating session events: load_distributed_artifacts(...).markers contains markers for the selected session, and build_distributed_model(...).markers_by_rank groups them for rank timeline rendering.

Canonical marker fields:

  • session_id

  • start_ns

  • end_ns (null for point markers)

  • kind (lifecycle, collector, alert, oom, or phase)

  • source (telemetry_event or phase_replay)

  • severity (info, warning, or critical)

  • label

  • rank, local_rank, world_size

  • event_type

  • metadata

System-generated point markers are promoted from existing telemetry events:

  • start

  • stop

  • collector_degraded

  • collector_recovered

  • warning

  • critical

  • error

error events with OOM metadata such as metadata["oom_reason"] are promoted as oom markers. Structured phases are promoted as interval markers by replaying matching phase_enter and phase_exit records through PhaseReplayIndex.
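
The promotion rules for point markers can be illustrated with a short sketch. The real derivation is derive_timeline_markers in stormlog.timeline_markers; the code below is not that function. In particular, the mapping from event type to marker kind is an assumption made for this example (the document lists the kinds but not the mapping), and severity and label are left out because their rules are not described here.

```python
# Assumed mapping from event_type to marker kind, for illustration only.
POINT_MARKER_KINDS = {
    "start": "lifecycle",
    "stop": "lifecycle",
    "collector_degraded": "collector",
    "collector_recovered": "collector",
    "warning": "alert",
    "critical": "alert",
    "error": "alert",
}


def promote_point_markers(events):
    """Build point-marker dicts from telemetry event dicts (sketch)."""
    markers = []
    for event in events:
        event_type = event.get("event_type")
        kind = POINT_MARKER_KINDS.get(event_type)
        if kind is None:
            continue  # ordinary samples and phase events are not promoted here
        metadata = event.get("metadata") or {}
        # Documented rule: error events carrying OOM metadata become oom markers.
        if event_type == "error" and "oom_reason" in metadata:
            kind = "oom"
        markers.append({
            "session_id": event.get("session_id"),
            "start_ns": event["timestamp_ns"],
            "end_ns": None,  # point markers have no end
            "kind": kind,
            "source": "telemetry_event",
            "event_type": event_type,
        })
    return markers


events = [
    {"event_type": "sample", "timestamp_ns": 1, "metadata": {}},
    {"event_type": "start", "timestamp_ns": 2, "metadata": {}},
    {"event_type": "error", "timestamp_ns": 3,
     "metadata": {"oom_reason": "allocation failed"}},
    {"event_type": "error", "timestamp_ns": 4, "metadata": {}},
]
markers = promote_point_markers(events)
print([m["kind"] for m in markers])
```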

User-authored annotations should remain separate from raw telemetry events in a future sidecar or catalog layer. They can share the TimelineMarker shape, but they should use an annotation-specific source rather than mutating historical telemetry records.

Analyzer phase attribution payloads

Analyze and Diagnostics outputs may also include a derived phase_attribution object. This is not part of the raw event schema above; it is report-layer data produced after replaying the phase_scope boundaries.

Canonical fields:

  • phase_resolution (unique, ambiguous, or omitted when no attribution exists)

  • phase_source (exact, thread_local, heuristic, or omitted)

  • phase_path (string, only for unique attributions)

  • phase_paths (list[str], one or more candidate labels)

  • scope_id, thread_id, thread_name (present only when uniquely tied to one scope)

Optional presentation field:

  • phase_summary

    • phase_path

    • source

phase_summary is emitted only when the product wants one useful display label even though the canonical attribution remains ambiguous. For example, the CLI may show (likely) train / communication while the underlying phase_attribution.phase_resolution still stays ambiguous.

If ambiguity collapses to only one distinct label, Stormlog keeps the canonical ambiguity and omits phase_summary because there is no materially different winner to show.

Legacy compatibility

Legacy conversion is handled by stormlog.telemetry.telemetry_event_from_record and is permissive by default. It is attempted only when schema_version is absent.

If schema_version is present:

  • it must be an integer

  • it must be a supported schema version

  • any other value is rejected without legacy fallback
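
The routing decision above can be sketched as follows. This is an illustration rather than Stormlog's code: conversion_route is a hypothetical name, and the default set of supported versions is an assumption, since this document does not enumerate them.

```python
# Assumption for the sketch: only version 3 is treated as supported.
SUPPORTED_SCHEMA_VERSIONS = frozenset({3})


def conversion_route(record, supported=SUPPORTED_SCHEMA_VERSIONS):
    """Return "legacy" or "strict", or raise ValueError for a bad version."""
    if "schema_version" not in record:
        return "legacy"  # the only case where legacy conversion is tried
    version = record["schema_version"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(version, bool) or not isinstance(version, int):
        raise ValueError("schema_version must be an integer")
    if version not in supported:
        raise ValueError(f"unsupported schema_version: {version}")
    return "strict"


print(conversion_route({"host": "node-1"}))
print(conversion_route({"schema_version": 3}))
```

Note that a record with a malformed schema_version, such as the string "3", raises instead of falling back to the legacy path.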

Legacy defaults:

  • missing pid -> -1

  • missing host -> "unknown"

  • missing device_id -> inferred from device if possible, otherwise -1

  • missing allocator_reserved_bytes -> allocator_allocated_bytes

  • missing allocator_change_bytes -> 0

  • missing device_used_bytes -> allocator_allocated_bytes

  • missing device_total_bytes and device_free_bytes -> null

  • missing event_type -> type field if present, else "sample"

  • missing distributed identity -> job_id: null, rank: 0, local_rank: 0, world_size: 1

  • legacy metadata_* fields are folded into the canonical metadata object

  • legacy artifacts without session_id receive a deterministic synthetic session id during load

If a legacy record is missing a valid timestamp, conversion fails.
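
The defaults above can be sketched on plain dicts. This is an illustration under several assumptions, not the behavior of telemetry_event_from_record: the helper names are hypothetical, the device string format ("cuda:1") is assumed, folded metadata_* keys are assumed to lose their prefix, allocator_allocated_bytes is assumed to be present, and timestamp handling and synthetic session ids are left out because their details are not described here.

```python
import re


def infer_device_id(device):
    """Assumed inference: an int is used as-is, "cuda:1" yields 1, else -1."""
    if isinstance(device, int) and not isinstance(device, bool):
        return device
    if isinstance(device, str):
        match = re.search(r":(\d+)$", device)
        if match:
            return int(match.group(1))
    return -1


def apply_legacy_defaults(record):
    """Return a copy of record with the documented legacy defaults filled in.

    Source keys such as "device" and "type" are left in place by this sketch.
    """
    out = dict(record)
    allocated = out["allocator_allocated_bytes"]  # assumed to be present
    out.setdefault("pid", -1)
    out.setdefault("host", "unknown")
    if "device_id" not in out:
        out["device_id"] = infer_device_id(out.get("device"))
    out.setdefault("allocator_reserved_bytes", allocated)
    out.setdefault("allocator_change_bytes", 0)
    out.setdefault("device_used_bytes", allocated)
    out.setdefault("device_total_bytes", None)
    out.setdefault("device_free_bytes", None)
    if "event_type" not in out:
        out["event_type"] = out.get("type", "sample")
    out.setdefault("job_id", None)
    out.setdefault("rank", 0)
    out.setdefault("local_rank", 0)
    out.setdefault("world_size", 1)
    # Fold metadata_* keys into metadata (prefix stripping is an assumption).
    metadata = dict(out.get("metadata") or {})
    for key in [k for k in out if k.startswith("metadata_")]:
        metadata[key[len("metadata_"):]] = out.pop(key)
    out["metadata"] = metadata
    return out


legacy = {
    "allocator_allocated_bytes": 1024,
    "device": "cuda:1",
    "type": "alloc",
    "metadata_tag": "x",
}
converted = apply_legacy_defaults(legacy)
print(converted["device_id"], converted["event_type"], converted["metadata"])
```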

Distributed env inference

Tracker constructors can infer distributed identity from common launcher env vars:

  • PyTorch / torchrun: RANK, LOCAL_RANK, WORLD_SIZE, TORCHELASTIC_RUN_ID

  • Open MPI: OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, OMPI_COMM_WORLD_SIZE

  • Slurm: SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS, SLURM_JOB_ID

CLI and Python API callers can override these values explicitly.
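
As a sketch of how such inference might work, the following reads the variables listed above from a mapping. It is not the tracker's implementation: infer_distributed_identity is a hypothetical helper, and both the precedence order between launchers and the rule that a launcher counts as detected when its rank and world-size variables are set are assumptions.

```python
import os

# Variable names from the list above. Open MPI has no job id variable listed.
LAUNCHER_ENV = (
    {"rank": "RANK", "local_rank": "LOCAL_RANK",
     "world_size": "WORLD_SIZE", "job_id": "TORCHELASTIC_RUN_ID"},
    {"rank": "OMPI_COMM_WORLD_RANK",
     "local_rank": "OMPI_COMM_WORLD_LOCAL_RANK",
     "world_size": "OMPI_COMM_WORLD_SIZE", "job_id": None},
    {"rank": "SLURM_PROCID", "local_rank": "SLURM_LOCALID",
     "world_size": "SLURM_NTASKS", "job_id": "SLURM_JOB_ID"},
)

SINGLE_PROCESS = {"job_id": None, "rank": 0, "local_rank": 0, "world_size": 1}


def infer_distributed_identity(env=None):
    """Return identity fields from the first launcher whose vars are set.

    Malformed integer values raise ValueError from int().
    """
    env = os.environ if env is None else env
    for names in LAUNCHER_ENV:
        if names["rank"] in env and names["world_size"] in env:
            job_var = names["job_id"]
            return {
                "job_id": env.get(job_var) if job_var else None,
                "rank": int(env[names["rank"]]),
                "local_rank": int(env.get(names["local_rank"], 0)),
                "world_size": int(env[names["world_size"]]),
            }
    return dict(SINGLE_PROCESS)  # documented single-process defaults


slurm_env = {"SLURM_PROCID": "3", "SLURM_LOCALID": "1",
             "SLURM_NTASKS": "8", "SLURM_JOB_ID": "42"}
print(infer_distributed_identity(slurm_env))
print(infer_distributed_identity({}))
```

Passing the environment as a parameter keeps the function easy to test without modifying os.environ.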

Python API

Use the public conversion, validation, and loading helpers in stormlog.telemetry:

from stormlog.telemetry import (
    load_telemetry_events,
    load_telemetry_sessions,
    telemetry_event_from_record,
    telemetry_event_to_dict,
    validate_telemetry_record,
)

  • load_telemetry_sessions(path, permissive_legacy=True, events_key=None)

  • load_telemetry_events(path, permissive_legacy=True, events_key=None, session_id=None)

  • telemetry_event_from_record(record, permissive_legacy=True, ...)

  • validate_telemetry_record(record)

These APIs normalize legacy records to canonical schema_version: 3 events and enforce required fields.

Append-only sink layout

Always-on track sessions can also write append-only telemetry into a sink directory during the run:

telemetry_sink/
  manifest.json
  segment-000001.jsonl
  segment-000002.jsonl

  • each JSONL line is one canonical telemetry record

  • manifest.json is schema v2 and tracks segment ordering, per-segment session_id, and a session ledger

  • closed segments may be pruned when file-count or total-size retention limits are hit

  • on recovery, previously running sessions are closed as interrupted and new writes start in a new session

load_telemetry_events() can read:

  • a normal JSON export

  • a sink directory

  • manifest.json

  • an individual JSONL segment

If the final JSONL line is truncated because the process was interrupted during a write, the loader ignores that incomplete tail and still returns the fully written records ahead of it. If the artifacts cannot prove recovered ownership, the loader classifies the older capture as incomplete.
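
The truncated-tail behavior can be illustrated with a small standalone reader. This sketch is not the Stormlog loader: read_jsonl_ignoring_truncated_tail is a hypothetical helper, and it assumes that only the last non-empty line may be incomplete, so a malformed line anywhere else is still treated as an error.

```python
import json
import os
import tempfile


def read_jsonl_ignoring_truncated_tail(path):
    """Parse a JSONL file, dropping an unparseable final line."""
    with open(path, "r", encoding="utf-8") as handle:
        lines = [line for line in handle.read().splitlines() if line.strip()]
    records = []
    for index, line in enumerate(lines):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            if index == len(lines) - 1:
                break  # interrupted write: ignore the incomplete tail
            raise  # corruption in the middle of the file is not tolerated
    return records


# Demo: two complete records followed by a write that was cut short.
with tempfile.TemporaryDirectory() as tmp:
    segment = os.path.join(tmp, "segment-000001.jsonl")
    with open(segment, "w", encoding="utf-8") as handle:
        handle.write('{"timestamp_ns": 1}\n{"timestamp_ns": 2}\n{"timestamp_')
    records = read_jsonl_ignoring_truncated_tail(segment)

print(len(records))
```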

Diagnose and OOM manifests

Standalone diagnose bundles now write manifest schema v2 and include:

  • session_id

  • session_status

  • session

OOM flight-recorder bundles now write manifest schema v2, and their manifest and metadata reference the session_id of the owning tracking session directly. That makes it possible to tie a bundle back to the exact capture that emitted the OOM.

Reconstructing a capture

from stormlog.telemetry import load_telemetry_events, load_telemetry_sessions

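# List the sessions found in the sink directory.
sessions = load_telemetry_sessions("./live_sink")
# This example simply takes the first returned session.
selected = sessions[0]
# Load only the events that belong to the selected session.
events = load_telemetry_events(
    "./live_sink",
    session_id=selected.summary.session_id,
)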

print(selected.summary.session_id)
print(selected.summary.status)
print(len(events))