[← Back to Production Cookbook](index.md)

# TensorFlow Production Recipes

Use these recipes when the runtime is TensorFlow and you need production-safe
capture, analysis, and diagnosis flows that match the current `tfmemprof`
behavior.

Audience: ML engineers, release owners.
Difficulty: intermediate.

## Prerequisites

- install the package first with [Installation](../installation.md)
- use `pip install "stormlog[tf]"` for the TensorFlow CLI paths
- use [Command Line Guide](../cli.md) if you need per-flag reference
- pick `/CPU:0` or `/GPU:0` explicitly for the current runtime
- for GPU recipes, run a small GPU op first instead of relying only on
  `tf.config.list_physical_devices("GPU")`

Success signal:

- the first workload-backed recipe records non-zero GPU memory
- a monitor, track, or diagnose artifact is written successfully
- the analyzer returns a report with a clear next action

## Choose the first TensorFlow recipe

| If the main goal is... | Start with... |
| --- | --- |
| check a small in-process workload first | profile a GPU matmul step |
| capture a bounded sample window | `monitor` |
| keep an event stream and session status | `track` |
| get a report with leak and optimization signals | `analyze` |
| save a portable bundle fast | `diagnose --duration 0` |

## Recipe: profile a GPU matmul step

```python
import tensorflow as tf
from stormlog.tensorflow import TFMemoryProfiler

profiler = TFMemoryProfiler(device="/GPU:0", enable_tensor_tracking=True)

with profiler.profile_context("matmul_step"):
    a = tf.random.normal((4096, 4096))
    b = tf.random.normal((4096, 4096))
    c = tf.matmul(a, b)
    _ = tf.reduce_mean(c).numpy()

results = profiler.get_results()
print(f"Peak memory: {results.peak_memory_mb:.2f} MB")
print(f"Snapshots captured: {len(results.snapshots)}")
```

Use this when you want a small TensorFlow workload on `/GPU:0` without
depending on the cuDNN training path.

## Recipe: capture a bounded CLI timeline

```bash
tfmemprof monitor --interval 0.5 --duration 30 --device /CPU:0 --output ./tf_monitor.json
```

Switch to `/GPU:0` when the TensorFlow runtime exposes a GPU device.

Treat this as a CLI artifact-flow command. On an otherwise idle runtime it will
often record zeros even when the tracker itself is functioning correctly.

## Recipe: track TensorFlow memory over time

```bash
tfmemprof track \
  --interval 0.5 \
  --threshold 4096 \
  --device /CPU:0 \
  --output ./tf_track.json
```

Use `track` when you need retained vs dropped history counters, session status,
and an event stream you can reload later.

Stop the command cleanly with `Ctrl+C` so the output file is flushed before the
process exits.

## Recipe: run TensorFlow analysis

```bash
tfmemprof analyze --input ./tf_monitor.json --detect-leaks --optimize --report ./tf_report.txt
```

The current TensorFlow analyzer uses `--input`, not the positional-input style
from `gpumemprof analyze`.

## Recipe: produce a diagnose bundle

```bash
tfmemprof diagnose --duration 0 --output ./tf_diag_bundle
```

## Recipe: run the end-to-end TensorFlow flow from a source checkout

```bash
python -m examples.scenarios.tf_end_to_end_scenario
```

This is source-checkout only. Pip installs do not include `examples/`.

## What to look for in the results

- leak findings from `--detect-leaks`
- optimization recommendations from `--optimize`
- `collector_failure_event_count`
- `session_status`
- `gap_analysis` when telemetry events are available
- `collective_attribution` when cross-rank communication likely explains hidden-memory spikes

## What to do next

- If leak detection reports high-severity issues, move to the memory-growth path
  in [Incident Playbooks](incidents.md).
- If `collector_failure_event_count` is non-zero, use the degraded-collector
  checklist in [Incident Playbooks](incidents.md).
- If the run is part of a larger distributed job, move to the
  [Distributed Diagnostics Recipes](distributed.md).
- If the deployment is long-running, move to [Always-on Tracking](always_on.md).

## Troubleshooting

### Symptom: `track` stops without writing the output file

Likely cause: the process was interrupted before the tracker reached its normal shutdown path.
Fix: wait until tracking has started, then stop it cleanly with `Ctrl+C`.
Verify: the output file is written and `session_status` is present.

### Symptom: TensorFlow sees `/GPU:0` but a training step fails with `DNN library initialization failed`

Likely cause: the TensorFlow, CUDA, cuDNN, and driver stack is not aligned for
training-backed ops.
Fix: rerun the minimal matmul snippet above first, then repair the TensorFlow
runtime before moving to Keras or cuDNN-dependent workloads.
Verify: the matmul snippet records non-zero GPU memory and the training path no
longer raises a DNN initialization error.

### Symptom: `monitor` or `track` on `/GPU:0` records only zeros

Likely cause: the CLI process is idle and not exercising a TensorFlow workload.
Fix: use the workload-backed `TFMemoryProfiler` snippet or the source-checkout
scenario before relying on the artifact for performance conclusions.
Verify: the workload-backed path records non-zero GPU memory.

### Symptom: `analyze` reports no GPU

Likely cause: the TensorFlow runtime is CPU-only or the selected device is unavailable.
Fix: rerun with `--device /CPU:0` or fix the runtime environment first.
Verify: `tfmemprof info` and the chosen capture command agree on the active device.

---

[← Back to Production Cookbook](index.md)