[← Back to main docs](index.md)

# Benchmark Harness (v0.4)

> **Source checkout only.** `python -m examples.cli.benchmark_harness` requires
> the repository `examples/` package and `docs/benchmarks/`. It is not shipped
> in the PyPI package.

The v0.4 benchmark harness measures always-on monitoring against an
operability budget, rather than as a single point-in-time benchmark.

It still supports the same two gate modes:

- `budget`: compare current metrics to absolute max thresholds
- `regression`: compare current metrics to a checked-in baseline plus allowed deltas
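
The two gate modes differ only in where the limit comes from. The sketch below illustrates that difference and is not the harness's actual code: the function names are hypothetical, and it assumes an allowed delta is an absolute amount added to the baseline (the checked-in tolerance files may express it differently).

```python
def check_budget(current: float, max_allowed: float) -> bool:
    """Budget mode: pass when the metric stays at or below an absolute ceiling."""
    return current <= max_allowed


def check_regression(current: float, baseline: float, allowed_delta: float) -> bool:
    """Regression mode: pass when the metric stays within baseline + allowed delta.

    Assumption: the delta is absolute, not a percentage of the baseline.
    """
    return current <= baseline + allowed_delta
```

With a baseline of `4.0` and an allowed delta of `1.0`, a reading of `4.5` passes the regression gate even if it would fail a stricter absolute budget of `4.2`.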

On top of that, v0.4 extends the benchmark coverage to:

- `gpumemprof` CPU fallback (`gpumemprof_cpu`)
- `tfmemprof --device /CPU:0` (`tfmemprof_cpu`)
- accelerated soak qualification
- rollover and retention validation
- history truncation diagnostics
- actionable failure reporting

## Default operating assumptions

The harness models the always-on default runtime mode as:

- `track` with append-only sink enabled
- `flush_every_seconds=2.0`
- `rollover_max_bytes=64 MB`
- `retention_max_files=8`
- `retention_max_total_bytes=512 MB`

Retention validation also runs a forced-churn subtest with tighter limits so
rollover and pruning are exercised even in fast local runs.
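
One way to picture the retention limits is an oldest-first pruning pass over rolled-over segments. This is a minimal sketch under stated assumptions, not the runtime's implementation: `prune_segments` is a hypothetical helper, segments are assumed to be ordered oldest first, and "MB" is treated as MiB.

```python
from typing import List, Tuple

MB = 1024 * 1024  # assumption: the documented "MB" is interpreted as MiB

Segment = Tuple[str, int]  # (segment name, size in bytes)


def prune_segments(
    segments: List[Segment],
    max_files: int = 8,
    max_total_bytes: int = 512 * MB,
) -> Tuple[List[Segment], List[Segment]]:
    """Drop the oldest segments until both retention limits are satisfied.

    `segments` must be ordered oldest first. Returns (kept, pruned).
    """
    kept = list(segments)
    pruned: List[Segment] = []
    while kept and (
        len(kept) > max_files
        or sum(size for _, size in kept) > max_total_bytes
    ):
        pruned.append(kept.pop(0))  # evict the oldest remaining segment
    return kept, pruned
```

Passing much smaller limits, for example `max_files=2`, mirrors the idea of the forced-churn subtest: pruning triggers after only a handful of segments.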

## Profiles

- `pr`: accelerated `6h`-equivalent soak plus default-interval overhead checks
- `nightly`: accelerated `24h`-equivalent soak plus the same overhead checks

“24h-equivalent” means the harness does not sleep between samples. Instead, it
collects the same number of samples that a 24-hour run would emit at the
runtime’s default interval (the `6h`-equivalent `pr` soak scales the same way):

- `gpumemprof_cpu`: `864000` samples at `0.1s`
- `tfmemprof_cpu`: `86400` samples at `1.0s`
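
The sample counts above follow from dividing the simulated duration by the sampling interval. The helper below is a hypothetical illustration of that arithmetic, not a function from the harness; the `6h` figures it can produce assume the `pr` profile applies the same rule.

```python
def equivalent_samples(hours: float, interval_seconds: float) -> int:
    """Number of samples a run of `hours` would emit at the given interval."""
    # round() guards against floating-point noise from intervals like 0.1.
    return round(hours * 3600 / interval_seconds)


# 24h-equivalent counts, matching the values listed for each runtime.
gpumemprof_24h = equivalent_samples(24, 0.1)  # 864000
tfmemprof_24h = equivalent_samples(24, 1.0)   # 86400
```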

## Modes

- `overhead`: run only the unprofiled vs tracked overhead comparison
- `soak`: run only the accelerated soak and retention validation
- `all`: run both

## Run the harness

```bash
python -m examples.cli.benchmark_harness \
  --profile pr \
  --mode all \
  --output artifacts/benchmarks/latest_v0.4.json
```

## Enforce the regression gate

```bash
python -m examples.cli.benchmark_harness \
  --check \
  --profile pr \
  --mode all \
  --gate-mode regression \
  --iterations 5000 \
  --baseline docs/benchmarks/v0.4_baseline.json \
  --tolerances docs/benchmarks/v0.4_tolerances.json \
  --output artifacts/benchmarks/latest_v0.4_regression.json
```

This is the policy used by the pull-request memory gate in CI.
The checked-in regression assets intentionally cover only the default `pr`
profile.

## Enforce budgets

```bash
python -m examples.cli.benchmark_harness \
  --check \
  --profile pr \
  --mode all \
  --gate-mode budget \
  --iterations 5000 \
  --budgets docs/benchmarks/v0.4_operating_budget.json \
  --output artifacts/benchmarks/latest_v0.4_budget.json
```

Use budget mode when you want a short benchmark run checked against absolute
operating thresholds rather than baseline deltas.

## Nightly operating-budget gate

```bash
python -m examples.cli.benchmark_harness \
  --check \
  --profile nightly \
  --mode all \
  --gate-mode budget \
  --iterations 5000 \
  --budgets docs/benchmarks/v0.4_operating_budget.json \
  --output artifacts/benchmarks/latest_v0.4_nightly.json
```

This keeps budget enforcement in place for the longer nightly soak profile, so
short-run and long-run checks are judged against the same operating-budget
file.

## What it measures

- `runtime_overhead_pct`: wall-clock overhead of the tracked default mode vs the unprofiled workload.
- `cpu_overhead_pct`: CPU-time overhead of the tracked default mode vs the unprofiled workload.
- `artifact_growth_bytes`: tracked-output size minus the unprofiled output size.
- `rss_growth_per_24h_equiv`: RSS delta normalized to a 24-hour-equivalent run.
- `max_rss_delta_bytes`: largest observed RSS increase above the soak baseline.
- `final_retained_files`: retained append-only segment count after pruning.
- `final_retained_bytes`: retained append-only bytes after pruning.
- `rollover_count`, `pruned_segment_count`, `pruned_bytes`: sink churn under sustained load.
- `history_dropped_*`: bounded-history eviction counts surfaced by the runtime.
- `collector_failure_event_count`: degraded/recovered collector transitions seen during the run.
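
To make the overhead and normalization metrics concrete, here is a rough sketch of how such values could be derived. The formulas are assumptions inferred from the metric descriptions (a relative percentage and a linear scale-up by sample count); the harness may compute them differently, and both function names are hypothetical.

```python
def overhead_pct(unprofiled: float, tracked: float) -> float:
    """Relative overhead of the tracked run vs the unprofiled run, in percent.

    Works for wall-clock seconds (runtime) or CPU seconds alike.
    """
    if unprofiled <= 0:
        raise ValueError("unprofiled measurement must be positive")
    return (tracked - unprofiled) / unprofiled * 100.0


def rss_growth_per_24h_equiv(
    rss_delta_bytes: float,
    samples_collected: int,
    samples_per_24h: int,
) -> float:
    """Scale an observed RSS delta linearly to a 24-hour-equivalent sample count."""
    if samples_collected <= 0:
        raise ValueError("samples_collected must be positive")
    return rss_delta_bytes * samples_per_24h / samples_collected
```

Under these assumptions, a `6h`-equivalent soak at `0.1s` (216000 samples) that grows RSS by 1000 bytes would be reported as 4000 bytes per 24-hour equivalent.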

## Output format

The v0.4 report includes:

- `profile`, `mode`, `gate_mode`
- `config`: comparison config plus runtime-specific sample counts
- `runtimes`: per-runtime overhead, soak, retention-validation, and diagnostic data
- `metrics`: flattened per-runtime metrics used for gating
- `budget_checks` or `regression_checks`
- `failure_diagnostics`: actionable failures with collector, sink, and history context
- `passed`
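
A report with these top-level keys can be inspected with a few lines of Python. The sketch below relies only on the key names listed above; it assumes `failure_diagnostics` is a list and treats each entry as opaque, since the entry schema is not described here. `load_report` and `summarize_report` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import List


def load_report(path: str) -> dict:
    """Read a harness report written via --output."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_report(report: dict) -> List[str]:
    """Build a short, human-readable summary from the top-level report keys."""
    lines = [
        f"profile={report.get('profile')} "
        f"mode={report.get('mode')} "
        f"gate_mode={report.get('gate_mode')}"
    ]
    lines.append("PASSED" if report.get("passed") else "FAILED")
    # Assumption: failure_diagnostics is list-like; entries are printed as-is.
    for item in report.get("failure_diagnostics") or []:
        lines.append(f"  failure: {item}")
    return lines
```

For example, `print("\n".join(summarize_report(load_report("artifacts/benchmarks/latest_v0.4.json"))))` prints a summary of the most recent run.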

## Interpreting failures

Failure lines are intentionally verbose. A budget or regression failure includes:

- the failing metric and threshold
- the runtime name
- collector health state
- collector failure count
- rollover and prune counts
- retained file and byte totals
- retained and dropped history counters

Typical examples:

- overhead regression: runtime or CPU overhead jumped materially above baseline
- retention failure: retained files or bytes exceeded the configured sink budget
- collector failure: degraded-mode transitions occurred during the soak
- history drift: dropped-event or dropped-sample counts grew beyond the expected envelope

## Tuning order

When a run fails, adjust knobs in this order:

1. sampling interval
2. sink flush cadence
3. rollover size or rollover event count
4. retention file and byte limits
5. TensorFlow `max_history` if sample/event windows are too large for the deployment

## Versioned assets

The v0.4 harness reads:

- `docs/benchmarks/v0.4_operating_budget.json`
- `docs/benchmarks/v0.4_baseline.json`
- `docs/benchmarks/v0.4_tolerances.json`

Update these files only with an intentional benchmark refresh. Run the harness
with the same profile and config as CI, inspect the new metrics, then commit the
asset update separately from unrelated code.
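
When inspecting new metrics before a refresh, it can help to list exactly which values changed. The helper below is a hypothetical sketch that assumes the metrics being compared form a flat mapping of metric name to value; adapt it if the files nest metrics per runtime.

```python
from typing import Any, Dict, Tuple


def diff_metrics(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {metric: (old_value, new_value)} for every metric that changed.

    Metrics present on only one side appear with None for the missing value.
    """
    changes: Dict[str, Tuple[Any, Any]] = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

Reviewing this diff alongside the commit makes it easier to confirm that every change in the refreshed asset is intentional.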