# Stormlog Telemetry Projection

Stormlog uses a telemetry-first internal projection to give runtime capture,
append-only sink persistence, artifact loading, live display, and offline
analysis one shared event model.

`TelemetryEvent v3` remains Stormlog's canonical persisted event schema for
artifacts, append-only sink segments, exports, and loader output. The
backend-neutral envelope described here is an internal projection over that
schema, implemented in `stormlog.telemetry_model` and exposed through
`stormlog.telemetry`.

## Decision

Stormlog should use a telemetry-first projection as the shared internal model
for analysis and UI code. Runtime trackers should keep emitting compact
tracker-local events on the hot path and normalize only at tracker or loader
edges, before persistence, display, or analysis fan-out.

The current implementation takes the first migration step: it keeps persisted
records on `TelemetryEvent v3`, then projects those records into
`ProjectedTelemetryRecord` for live and loaded sessions. Future sink migrations
can persist the projected envelope directly after benchmark gates prove the
runtime cost is acceptable.

## Current State Map

Stormlog currently has several event shapes and normalization boundaries:

- `TelemetryEventV3` is the stable memory telemetry contract used by artifact
  exports and loader output.
- Tracker `TrackingEvent` values stay lightweight for runtime capture and are
  normalized through `_telemetry_record_from_event(...)`.
- TensorFlow capture builds compatible telemetry dictionaries in the
  TensorFlow tracker path before exporting or loading them.
- Append-only sink JSONL entries persist normalized telemetry records and keep
  recovery, rollover, pruning, and manifest behavior separate from the record
  projection.
- Artifact loaders group loaded records into `LoadedTelemetrySession` objects
  after legacy V2, V3, JSON, and JSONL compatibility handling.
- The live TUI uses `TrackerSession` as the display-facing session model.

The projected telemetry envelope does not remove those boundaries yet. It gives
them a shared target shape so migration can happen one layer at a time without
changing the existing artifact schema.

## Implemented Model

The projected telemetry envelope is `ProjectedTelemetryRecord`. It is a small
immutable event envelope with these fields:

- `schema_version`: internal projected telemetry envelope version.
- `record_id`: deterministic identifier derived from the source telemetry
  record.
- `timestamp_ns`: event time from the source.
- `observed_timestamp_ns`: time Stormlog observed or normalized the event.
- `session_id`: capture/session identity.
- `source_kind`: backend family such as `cpu`, `cuda`, `rocm`, `mps`,
  `tensorflow`, generic `gpu`, or `other`.
- `event_type`: generic classification such as `sample`, `start`, `stop`,
  `phase_enter`, `phase_exit`, `warning`, `critical`, or `error`.
- `stage`: optional lifecycle or workload stage.
- `severity` and `severity_text`: normalized severity when meaningful.
- `body`: primary message or payload.
- `resource`: runtime identity such as host, process, backend, device, job, and
  rank.
- `attributes`: extensible metadata and backend-specific measurements.
- `correlation`: session, job, rank, phase, and future trace/span alignment
  fields.

The projection keeps backend-specific details out of top-level fields. Memory
counters from `TelemetryEvent v3`, collector health metadata, phase metadata,
and future backend details are represented as attributes, resources, or
correlation fields.
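
As a rough illustration of the field list above, the envelope could be modeled
as a frozen dataclass. This is a sketch under assumptions, not the class in
`stormlog.telemetry_model`: the name `ProjectedRecordSketch`, the field types,
the defaults, and the example attribute keys are all hypothetical.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Mapping, Optional


# Hypothetical stand-in for ProjectedTelemetryRecord. Field names follow the
# list above; the types and defaults are assumptions made for illustration.
@dataclass(frozen=True)
class ProjectedRecordSketch:
    schema_version: int
    record_id: str
    timestamp_ns: int
    observed_timestamp_ns: int
    session_id: str
    source_kind: str
    event_type: str
    stage: Optional[str] = None
    severity: Optional[int] = None
    severity_text: Optional[str] = None
    body: Any = None
    resource: Mapping[str, Any] = field(default_factory=dict)
    attributes: Mapping[str, Any] = field(default_factory=dict)
    correlation: Mapping[str, Any] = field(default_factory=dict)


record = ProjectedRecordSketch(
    schema_version=1,
    record_id="example-record",
    timestamp_ns=1_000,
    observed_timestamp_ns=1_250,
    session_id="session-a",
    source_kind="cuda",
    event_type="sample",
    # Backend-specific measurements stay out of the top-level fields.
    attributes={"allocated_bytes": 1024},
    resource={"source_kind": "cuda", "device_id": 0},
    correlation={"session_id": "session-a", "rank": 0},
)
```

Freezing the dataclass blocks attribute reassignment, so a consumer cannot
rebind `record.session_id` after projection. The mapping fields are still
ordinary dictionaries in this sketch, so deep immutability would need extra
work.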

The smallest useful contract for runtime, sink, loader, TUI, and exported
artifact views is:

- identity: `schema_version`, `record_id`, and `session_id`,
- time: `timestamp_ns` and `observed_timestamp_ns`,
- classification: `source_kind`, `event_type`, `stage`, and severity fields,
- payload: `body`,
- projection data: `resource`, `attributes`, and `correlation`.

`source_kind` must stay a backend family. Distributed identity such as host,
process id, device id, rank, local rank, world size, and future node ids belongs
in `resource` or `correlation`, not in `source_kind`. Future backend families
can be added when Stormlog adds collector support for them.
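
One way to picture this rule is a helper that splits a device string such as
`cuda:1` into a backend family and a resource entry. The helper name, the
`backend` and `device_id` keys, and the fallback to `other` for unknown
families are assumptions for this sketch, not Stormlog behavior.

```python
from typing import Any, Dict, Tuple

# Backend families named in the field list above.
KNOWN_SOURCE_KINDS = frozenset(
    {"cpu", "cuda", "rocm", "mps", "tensorflow", "gpu", "other"}
)


def split_device_string(device: str) -> Tuple[str, Dict[str, Any]]:
    """Return (source_kind, resource) for a string like 'cuda:1'."""
    family, _, index = device.partition(":")
    family = family.lower()
    # source_kind carries only the backend family.
    source_kind = family if family in KNOWN_SOURCE_KINDS else "other"
    # Device identity goes into resource, never into source_kind.
    resource: Dict[str, Any] = {"backend": family}
    if index.isdigit():
        resource["device_id"] = int(index)
    return source_kind, resource
```

With this shape, `split_device_string("cuda:1")` yields the family `cuda` and
keeps the device index in `resource`, so two devices of the same family share
one `source_kind`.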

Some identity values intentionally appear in more than one place. For example,
`session_id` is top-level and also available in `correlation`, while
`source_kind` is top-level and also available in `resource`. These copies are
derived from the same normalized source record and must not drift; adapters that
change one identity value must update every projected copy in the same
projection step.
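
A minimal sketch of the no-drift rule, assuming the duplicated values are
stored under `session_id` in `correlation` and `source_kind` in `resource`.
Those key names, the `IdentitySketch` class, and both helpers are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict


# Hypothetical reduced envelope that keeps only the duplicated identity fields.
@dataclass(frozen=True)
class IdentitySketch:
    session_id: str
    source_kind: str
    resource: Dict[str, Any]
    correlation: Dict[str, Any]


def with_session_id(record: IdentitySketch, session_id: str) -> IdentitySketch:
    """Change the session id and its projected copy in one step."""
    return replace(
        record,
        session_id=session_id,
        correlation={**record.correlation, "session_id": session_id},
    )


def identity_is_consistent(record: IdentitySketch) -> bool:
    """Check that the projected copies match the top-level values."""
    return (
        record.correlation.get("session_id") == record.session_id
        and record.resource.get("source_kind") == record.source_kind
    )


original = IdentitySketch(
    session_id="session-a",
    source_kind="cpu",
    resource={"source_kind": "cpu"},
    correlation={"session_id": "session-a"},
)
renamed = with_session_id(original, "session-b")
```

A check like `identity_is_consistent` fits naturally into projection tests,
where it can catch an adapter that updates only the top-level value.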

## Projection Versioning

`TELEMETRY_PROJECTION_SCHEMA_VERSION` versions only the internal projected
envelope. It is separate from `TelemetryEvent v3` and does not change artifact
compatibility by itself.

When the projection shape changes incompatibly, update the version constant,
the `ProjectedTelemetryRecord.schema_version` type annotation, serialization
tests, this document, and compatibility behavior together. Compatibility code
should be explicit about whether it can read both projection versions or must
re-project from the persisted `TelemetryEvent v3` source.
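
The "read or re-project" choice can be made explicit in code. The sketch below
assumes projected records are plain mappings with a `schema_version` key and
that a re-projection callable is available; the version numbers and the
function name `resolve_projection` are hypothetical.

```python
from typing import Any, Callable, Dict, Mapping, Optional

# Hypothetical set of projection versions this reader understands directly.
READABLE_PROJECTION_VERSIONS = frozenset({1, 2})


def resolve_projection(
    cached: Optional[Mapping[str, Any]],
    v3_source: Mapping[str, Any],
    reproject: Callable[[Mapping[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    """Use a projected record if its version is readable, else re-project."""
    if cached is not None and cached.get("schema_version") in READABLE_PROJECTION_VERSIONS:
        return dict(cached)
    # Unknown or missing version: fall back to the persisted V3 source.
    return reproject(v3_source)
```

Keeping the readable versions in one named constant makes the compatibility
decision visible in review when the projection version is bumped.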

## Data Flow

The current flow is:

1. Trackers capture compact runtime events or tracker-local records.
2. Existing normalizers produce `TelemetryEvent v3` records for exports, sink
   writes, loaders, and TUI adapters.
3. `project_telemetry_event(...)` projects V3 records into
   `ProjectedTelemetryRecord`.
4. `LoadedTelemetrySession.telemetry_records()` exposes projected telemetry
   records for loaded artifacts.
5. `TrackerSession.telemetry_records()` exposes projected telemetry records for
   live TUI sessions.

Because projection currently happens after existing V3 normalization, the
capture path stays unchanged while downstream code gets one stable
backend-neutral view. A future sink migration that persists the projected
envelope directly still needs benchmark gates for capture latency, allocations,
queue depth, and sink throughput before it can claim the same hot-path cost.
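
Step 3 can be sketched as a pure function over a V3-style dictionary. This is
not the real `project_telemetry_event(...)`: the input field names, the
hashing scheme for `record_id`, the version value, and the dictionary output
are assumptions chosen to show a deterministic projection.

```python
import hashlib
import json
import time
from typing import Any, Dict, Mapping, Optional

PROJECTION_VERSION = 1  # hypothetical value

# Hypothetical V3 keys that map to top-level envelope fields.
_TOP_LEVEL_KEYS = ("timestamp_ns", "session_id", "source_kind", "event_type")


def derive_record_id(v3_record: Mapping[str, Any]) -> str:
    """Derive a stable id from the source record, independent of key order."""
    canonical = json.dumps(
        v3_record, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def project_sketch(
    v3_record: Mapping[str, Any],
    observed_timestamp_ns: Optional[int] = None,
) -> Dict[str, Any]:
    """Project a V3-style mapping into an envelope-shaped dictionary."""
    if observed_timestamp_ns is None:
        observed_timestamp_ns = time.time_ns()
    return {
        "schema_version": PROJECTION_VERSION,
        "record_id": derive_record_id(v3_record),
        "timestamp_ns": v3_record["timestamp_ns"],
        "observed_timestamp_ns": observed_timestamp_ns,
        "session_id": v3_record["session_id"],
        "source_kind": v3_record.get("source_kind", "other"),
        "event_type": v3_record.get("event_type", "sample"),
        # Everything backend-specific lands in attributes.
        "attributes": {
            key: value
            for key, value in v3_record.items()
            if key not in _TOP_LEVEL_KEYS
        },
    }
```

Because `record_id` depends only on the source record, projecting the same V3
record twice gives the same identifier even when `observed_timestamp_ns`
differs.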

## Compatibility

Stormlog preserves existing compatibility boundaries:

- Existing V2, V3, and legacy artifacts remain loadable.
- `TelemetryEvent v3` remains the persisted artifact and append-only sink
  record format.
- Legacy artifact upcasting stays in loader and normalizer code.
- Telemetry projection is additive and does not change CLI, TUI, or Python API
  behavior.
- Legacy export shapes remain explicit compatibility paths.

This lets analysis and UI code adopt the projected telemetry envelope without
breaking older artifacts or changing the on-disk schema.

## Live and Loaded Sessions

Live and loaded data now share the same projected telemetry record shape:

- Loaded artifacts use `LoadedTelemetrySession.telemetry_records()`.
- Loaded artifacts use `LoadedTelemetrySession.resources()` for unique observed
  resources.
- Loaded artifacts use `LoadedTelemetrySession.correlations()` for unique
  correlation contexts.
- Live TUI sessions use `TrackerSession.telemetry_records()`.

The TUI can keep rendering lightweight view models, while analysis and future
query code can use the projected telemetry records regardless of whether the
source is a live tracker or an artifact.
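
The benefit for analysis code can be sketched with a structural type: any
object with a `telemetry_records()` method works, live or loaded. The stub
classes, the `RecordStub` type, and the assumption that the method returns an
iterable of records are hypothetical and stand in for `TrackerSession` and
`LoadedTelemetrySession`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class RecordStub:
    """Hypothetical reduced record carrying only event_type."""

    event_type: str


class TelemetrySource(Protocol):
    """Anything that exposes projected records, live or loaded."""

    def telemetry_records(self) -> Iterable[RecordStub]: ...


class LoadedStub:
    """Stands in for a loaded artifact session."""

    def __init__(self, records: List[RecordStub]) -> None:
        self._records = list(records)

    def telemetry_records(self) -> List[RecordStub]:
        return list(self._records)


class LiveStub:
    """Stands in for a live tracker session."""

    def __init__(self) -> None:
        self._history: List[RecordStub] = []

    def capture(self, event_type: str) -> None:
        self._history.append(RecordStub(event_type))

    def telemetry_records(self) -> List[RecordStub]:
        return list(self._history)


def count_event_types(source: TelemetrySource) -> Counter:
    """One analysis helper that serves both session kinds."""
    return Counter(record.event_type for record in source.telemetry_records())
```

The helper never checks which session type it received, which is the property
the shared record shape is meant to provide.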

## Performance Policy

Always-on capture protects application liveness first. The hot path should do
only the work required to capture local fields, timestamp events, look up
session/correlation identity, and append to bounded in-memory history or a
queue.

Capture code should avoid:

- blocking on sink persistence,
- heap-heavy serialization,
- reflection-heavy shaping,
- formatting strings before normalization when raw fields are enough.

Sink persistence should remain batched and append-only. Under pressure,
Stormlog should prefer explicit sampling or low-priority event shedding over
blocking the application being measured.

Pressure means that a bounded buffer, queue, history, sink, or long-running
session cannot accept events at the current production rate without unbounded
memory growth or blocking the measured application. Examples include bounded
history overflow, sink queue backlog, slow disk flushes, bursty alert streams,
and always-on sessions that run longer than the retention budget.

Any sampling or shedding policy must be visible in telemetry and diagnostics.
At minimum, adapters and sinks should expose:

- the active policy,
- queue or history depth,
- dropped sample count,
- dropped event count,
- dropped alert count when alerts are handled separately,
- a reason for each class of drop.
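
A non-blocking queue that sheds low-priority samples first and reports what it
dropped might look like the sketch below. The class name, the policy name, the
event kinds, and the drop reasons are assumptions, not an existing Stormlog
component.

```python
from collections import Counter, deque
from typing import Any, Deque, Dict, Tuple


class SheddingQueue:
    """Bounded queue that never blocks and counts every drop by reason."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._items: Deque[Tuple[str, Any]] = deque()
        self.dropped: Counter = Counter()  # keyed by (kind, reason)

    def offer(self, kind: str, payload: Any) -> bool:
        """Try to enqueue; return False instead of blocking when full."""
        if len(self._items) < self.max_depth:
            self._items.append((kind, payload))
            return True
        if kind != "sample":
            # Make room for a higher-priority event by evicting the
            # oldest queued sample, if any.
            for index, (queued_kind, _) in enumerate(self._items):
                if queued_kind == "sample":
                    del self._items[index]
                    self.dropped[("sample", "evicted_for_priority")] += 1
                    self._items.append((kind, payload))
                    return True
        self.dropped[(kind, "queue_full")] += 1
        return False

    def diagnostics(self) -> Dict[str, Any]:
        """Expose the policy, depth, and per-class drop counts."""
        return {
            "policy": "shed_samples_first",
            "depth": len(self._items),
            "max_depth": self.max_depth,
            "dropped": {
                f"{kind}:{reason}": count
                for (kind, reason), count in self.dropped.items()
            },
        }
```

The `diagnostics()` payload carries the active policy, the current depth, and
a count per drop class and reason, which matches the minimum visibility
described above.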

## Benchmark Plan

Benchmark validation should compare the existing V3 flow against the projection
path before any persisted-format migration.

Measure:

- events per second sustained,
- average and p95 event capture latency,
- allocations per event,
- bytes allocated per second,
- queue or bounded-history depth under burst load,
- sink flush throughput,
- TUI update latency in live mode,
- artifact load time for large sessions,
- memory growth over long always-on runs.

Coverage should include:

- CPU-only workloads,
- GPU-heavy workloads,
- mixed backend workloads,
- quiet long-running always-on sessions,
- bursty error-heavy sessions.

The acceptance bar is that the projected telemetry envelope must not materially
harm the hot path, as judged by the capture-latency, allocation, and
queue-depth measurements above. Any extra projection cost must be offset by
simpler shared sinks, loaders, UI adapters, and analysis code.
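
Two of the listed measurements, capture latency and allocation pressure, can
be sketched with the standard library alone. The function name and the result
keys are hypothetical, and this is not the `examples.cli.benchmark_harness`
code; peak traced memory is used here as a simple stand-in for per-event
allocation counts.

```python
import statistics
import time
import tracemalloc
from typing import Any, Callable, Dict, Iterable, List


def measure_capture(
    capture: Callable[[Any], Any], events: Iterable[Any]
) -> Dict[str, float]:
    """Time each capture call, then measure peak traced memory separately."""
    event_list = list(events)

    # Pass 1: latency without tracemalloc overhead.
    latencies_ns: List[int] = []
    for event in event_list:
        start = time.perf_counter_ns()
        capture(event)
        latencies_ns.append(time.perf_counter_ns() - start)

    # Pass 2: peak memory traced while replaying the same events.
    tracemalloc.start()
    try:
        for event in event_list:
            capture(event)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return {
        "events": len(event_list),
        "mean_ns": statistics.fmean(latencies_ns),
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        "p95_ns": statistics.quantiles(latencies_ns, n=20)[-1],
        "peak_traced_bytes": peak_bytes,
    }
```

Running the same event list through the existing V3 normalizer and through the
projection path gives two result dictionaries that can be compared against a
budget. Note that `capture` is called twice per event, so stateful capture
functions should be reset between comparisons.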

## Migration Plan

### Phase 1: Telemetry Projection

- Keep `TelemetryEvent v3` as the persisted artifact format.
- Project V3 records into `ProjectedTelemetryRecord`.
- Expose projected telemetry records from loaded sessions and live TUI sessions.
- Cover projection behavior with deterministic tests.

### Phase 2: Tracker Adapters

- Keep tracker-local runtime events compact.
- Move PyTorch, CPU, and TensorFlow normalization into shared adapter helpers.
- Centralize timestamp rules, session identity, severity mapping, backend tags,
  collector health, and distributed correlation.
- Add capture and enqueue counters for benchmark visibility.

### Phase 3: Sink Migration

- Add a future sink schema version for persisted projected telemetry records.
- Keep append-only JSONL semantics and deterministic serialization.
- Preserve rollover, pruning, recovery, and manifest behavior.
- Keep old sink loading paths for existing artifacts.

### Phase 4: Loader Migration

- Dispatch by artifact and sink schema version.
- Parse V2, V3, legacy JSON, JSONL, and future persisted projected telemetry
  records into the same internal stream.
- Keep compatibility transforms separate from primary parsing.
- Preserve old fixtures through loader adapters instead of rewriting user data.
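
Version dispatch with a separate compatibility transform could be sketched as
follows. The `schema_version` key, the V2 field name `ts`, and the function
names are hypothetical; the real V2 and V3 layouts may differ.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Mapping


def parse_v3(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Primary parsing path: V3 records pass through as copies."""
    return dict(record)


def upcast_v2(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Compatibility transform kept apart from primary parsing."""
    upcast = dict(record)
    upcast["schema_version"] = 3
    # Hypothetical rename of a legacy timestamp field.
    if "ts" in upcast:
        upcast["timestamp_ns"] = upcast.pop("ts")
    return upcast


PARSERS: Dict[int, Callable[[Mapping[str, Any]], Dict[str, Any]]] = {
    2: upcast_v2,
    3: parse_v3,
}


def load_stream(records: Iterable[Mapping[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Dispatch each record by schema version into one internal stream."""
    for record in records:
        version = record.get("schema_version")
        parser = PARSERS.get(version)
        if parser is None:
            raise ValueError(f"unsupported schema_version: {version!r}")
        yield parser(record)
```

Each parser copies its input, so old fixtures and user data are never modified
in place. Supporting a future persisted projection format would mean adding
one entry to `PARSERS`.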

### Phase 5: TUI and Session Unification

- Render live monitoring and diagnostics from projected telemetry records or
  derived view models built from projected telemetry records.
- Keep TUI-specific formatting outside the core telemetry model.
- Reuse the same query and marker logic for live and offline sessions.

### Phase 6: Benchmarks and Regression Gates

- Extend the benchmark harness with capture-latency, allocation, queue-depth,
  sink-throughput, load-time, and TUI-latency metrics.
- Compare current V3 normalization against the projected telemetry envelope and
  future persisted-envelope writes.
- Require regression and budget gates before enabling new persisted formats.

## Follow-On Tasks

- Tracker layer: introduce shared runtime-event-to-projection adapter helpers for
  PyTorch, CPU, and TensorFlow paths.
- Sink layer: add projected-record sink versioning while preserving append-only
  recovery and retention behavior.
- Loader layer: add version dispatch that upcasts V2, V3, and future persisted
  projected telemetry records into one internal stream.
- TUI/session layer: move more live and loaded views onto projected telemetry
  records.
- Benchmark layer: extend `examples.cli.benchmark_harness` and docs benchmark
  assets with overhead, allocation, queue-depth, load-time, and TUI-latency
  checks.
- Compatibility layer: add explicit legacy import/export adapters and document
  retirement criteria.

## Non-Goals

- Rewriting every tracker to emit fully normalized records immediately.
- Changing the existing `TelemetryEvent v3` JSON schema.
- Designing a new external protocol.
- Changing user-facing CLI, TUI, or Python API behavior.
- Optimizing for every future backend-specific field up front.