# Performance guide

MHX performance reporting is designed for reproducible engineering comparisons,
not for hardware-independent pass/fail claims. The active benchmark matrix is
small enough for CI and produces artifacts that can be downloaded and compared
between commits.

## Run timing artifacts

```bash
mhx benchmark timing --outdir outputs/benchmarks/timing --repeats 3 --warmups 1
```

Expected files:

- `outputs/benchmarks/timing/timing.json`
- `outputs/benchmarks/timing/timing.md`
- `outputs/benchmarks/timing/figures/timing_summary.png`
- `outputs/benchmarks/timing/manifest.json`

The JSON schema is `mhx.benchmark.timing.v1`. Each case records raw repeat
durations, median/min/max wall time, peak Python allocations from
`tracemalloc`, and environment metadata including Python, JAX, NumPy, and the
selected JAX backend.

```{image} _static/performance/timing_summary.png
:alt: MHX FAST benchmark timing summary
:width: 780px
```

## Current cases

| Case | What it exercises |
| --- | --- |
| `linear_tearing_fast` | Config loading, periodic spectral derivatives, RK4 stepping, diagnostics, and reduced-MHD RHS evaluation. |
| `resistive_decay_fast` | Exact Fourier-mode resistive diffusion gate with numerical error diagnostics. |
| `reconnection_scaling` | Analytic FKR, Sweet-Parker plasmoid, and ideal-tearing scaling scaffolds. |

## Interpreting the numbers

- Compare timings only on the same machine class or the same GitHub Actions
  runner type.
- CI verifies finite positive timings and required artifact files; it does not
  enforce absolute runtime thresholds.
- `tracemalloc` reports Python allocations, not GPU/TPU device memory. Future
  accelerator benchmarks should add backend-specific memory probes.
- JAX compilation and caching can dominate small cases. Use `--warmups` to
  remove first-call overhead when comparing local changes.

## Cheap CI coverage

The documentation/CI checks are intentionally cheap enough to run on every push:

- `python -m ruff check src tests examples tools` catches import/style drift
  before tests start.
- `python tools/check_legacy_imports.py` prevents new imports from the archived
  implementation.
- `python -m pytest tests/test_docs_links.py` checks that required docs pages are
  in the Sphinx toctree and that reviewer-facing source links still point at
  repository paths.
- `python -m pytest tests/test_readme_media.py` checks README GIF links,
  compactness, visual-QA metadata, and minimum simulation durations for landing
  page media.
- `sphinx-build -W -b html docs docs/_build/html` builds docs with warnings as
  failures.

The expensive physics artifact matrix remains in `benchmark-artifacts`. That
job records timing artifacts, but still avoids absolute runtime thresholds
because GitHub-hosted runner performance is not stable enough for hardware-free
claims.

## Performance knobs

The active TOML config exposes the first controls users should tune:

```toml
[mesh]
shape = [32, 32]

[time]
t1 = 0.1
dt = 0.01
save_every = 1

[numerics]
enable_x64 = true
enable_jit = true
```

Larger `mesh.shape` values increase spectral FFT cost and trajectory storage.
Smaller `dt` improves temporal resolution but increases the number of RHS
evaluations. Larger `save_every` reduces IO and plotting memory. X64 is used in
physics validation gates; exploratory performance runs may use X32 after a
regression check confirms diagnostics remain stable.

## Long-run trajectory memory

The fixed-step RK4 integrator stores only saved states. Internally it advances
`save_every` RK4 steps inside each saved-sample scan chunk, rather than storing
all internal steps and slicing afterward. This matters for long nonlinear
campaigns: a `160×160`, `t_end=220`, `dt=0.02`, `save_every=110`
double-Harris replay initially requested about 2.1 GiB for one full internal
trajectory buffer on the `office` RTX A4000 node. After chunked saving, the same
bounded validation run completed and wrote 101 saved samples with finite
diagnostics.

Practical guidance:

- Increase `save_every` when the analysis only needs coarse movies or growth
  histories.
- Keep `dt` controlled by physics/stability, not by output cadence.
- For GPU runs, set `XLA_PYTHON_CLIENT_PREALLOCATE=false` when sharing a GPU.
- Treat very long reverse-mode differentiable runs separately: checkpointing or
  custom adjoints are still needed for memory-efficient gradients through
  production trajectories.

## Source links

- [Timing implementation](https://github.com/uwplasma/MHX/blob/main/src/mhx/benchmarks/timing.py)
- [Fixed-step RK4 integrator](https://github.com/uwplasma/MHX/blob/main/src/mhx/time_integrators/fixed_step.py)
- [Timing tests](https://github.com/uwplasma/MHX/blob/main/tests/test_timing_benchmark.py)
- [CI artifact workflow](https://github.com/uwplasma/MHX/blob/main/.github/workflows/ci.yml)