# Performance guide

MHX performance reporting is designed for reproducible engineering comparisons, not for hardware-independent pass/fail claims. The active benchmark matrix is small enough for CI and produces artifacts that can be downloaded and compared between commits.

## Run timing artifacts

```bash
mhx benchmark timing --outdir outputs/benchmarks/timing --repeats 3 --warmups 1
```

Expected files:

- `outputs/benchmarks/timing/timing.json`
- `outputs/benchmarks/timing/timing.md`
- `outputs/benchmarks/timing/figures/timing_summary.png`
- `outputs/benchmarks/timing/manifest.json`

The JSON schema is `mhx.benchmark.timing.v1`. Each case records raw repeat durations, median/min/max wall time, peak Python allocations from `tracemalloc`, and environment metadata including Python, JAX, NumPy, and the selected JAX backend.

```{image} _static/performance/timing_summary.png
:alt: MHX FAST benchmark timing summary
:width: 780px
```

## Current cases

| Case | What it exercises |
| --- | --- |
| `linear_tearing_fast` | Config loading, periodic spectral derivatives, RK4 stepping, diagnostics, and reduced-MHD RHS evaluation. |
| `resistive_decay_fast` | Exact Fourier-mode resistive diffusion gate with numerical error diagnostics. |
| `reconnection_scaling` | Analytic FKR, Sweet-Parker plasmoid, and ideal-tearing scaling scaffolds. |

## Interpreting the numbers

- Compare timings only on the same machine class or the same GitHub Actions runner type.
- CI verifies finite positive timings and required artifact files; it does not enforce absolute runtime thresholds.
- `tracemalloc` reports Python allocations, not GPU/TPU device memory. Future accelerator benchmarks should add backend-specific memory probes.
- JAX compilation and caching can dominate small cases. Use `--warmups` to remove first-call overhead when comparing local changes.
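The warmup, repeat, and `tracemalloc` points above can be illustrated with a small standard-library sketch. It is not the MHX timing implementation: the helper `time_case` and the dictionary keys are hypothetical names chosen for illustration, and the real `mhx.benchmark.timing.v1` field names may differ.

```python
import statistics
import time
import tracemalloc


def time_case(fn, repeats=3, warmups=1):
    """Hypothetical helper: time fn() after untimed warmup calls."""
    for _ in range(warmups):
        fn()  # untimed, absorbs first-call costs such as compilation

    durations = []
    tracemalloc.start()  # tracks Python allocations only, not device memory
    try:
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            durations.append(time.perf_counter() - start)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()

    return {
        "durations_s": durations,  # raw repeat durations
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
        "peak_python_bytes": peak_bytes,
    }


# Stand-in workload; a real case would run a short simulation instead.
summary = time_case(lambda: sum(i * i for i in range(100_000)))
print(summary["median_s"], summary["peak_python_bytes"])
```

Two caveats apply to a sketch like this. Tracing allocations adds overhead, so absolute durations measured while `tracemalloc` is active are inflated. For JAX workloads, the timed callable should also block on its result (for example with `block_until_ready()`), since JAX dispatches work asynchronously.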
## Cheap CI coverage

The documentation/CI checks are intentionally cheap enough to run on every push:

- `python -m ruff check src tests examples tools` catches import/style drift before tests start.
- `python tools/check_legacy_imports.py` prevents new imports from the archived implementation.
- `python -m pytest tests/test_docs_links.py` checks that required docs pages are in the Sphinx toctree and that reviewer-facing source links still point at repository paths.
- `python -m pytest tests/test_readme_media.py` checks README GIF links, compactness, visual-QA metadata, and minimum simulation durations for landing page media.
- `sphinx-build -W -b html docs docs/_build/html` builds docs with warnings as failures.

The expensive physics artifact matrix remains in `benchmark-artifacts`. That job records timing artifacts, but still avoids absolute runtime thresholds because GitHub-hosted runner performance is not stable enough for hardware-free claims.

## Performance knobs

The active TOML config exposes the first controls users should tune:

```toml
[mesh]
shape = [32, 32]

[time]
t1 = 0.1
dt = 0.01
save_every = 1

[numerics]
enable_x64 = true
enable_jit = true
```

Larger `mesh.shape` values increase spectral FFT cost and trajectory storage. Smaller `dt` improves temporal resolution but increases the number of RHS evaluations. Larger `save_every` reduces IO and plotting memory. X64 is used in physics validation gates; exploratory performance runs may use X32 after a regression check confirms diagnostics remain stable.

## Long-run trajectory memory

The fixed-step RK4 integrator stores only saved states. Internally it advances `save_every` RK4 steps inside each saved-sample scan chunk, rather than storing all internal steps and slicing afterward. This matters for long nonlinear campaigns: a `160×160`, `t_end=220`, `dt=0.02`, `save_every=110` double-Harris replay initially requested about 2.1 GiB for one full internal trajectory buffer on the `office` RTX A4000 node.
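The roughly 2.1 GiB figure quoted above is consistent with a back-of-the-envelope estimate. The sketch below assumes one real float64 field of shape `160×160` stored per RK4 step. That storage layout is an assumption made for illustration, not something taken from the integrator source, and the helper name `trajectory_bytes` is hypothetical.

```python
GIB = 2**30
MIB = 2**20


def trajectory_bytes(n_states, shape=(160, 160), bytes_per_value=8):
    """Bytes for n_states snapshots of one real field (float64 assumed)."""
    ny, nx = shape
    return n_states * ny * nx * bytes_per_value


t_end, dt, save_every = 220.0, 0.02, 110

n_steps = round(t_end / dt)          # total RK4 steps over the run
n_saved = n_steps // save_every + 1  # saved chunks plus the initial state

# Keeping every internal step versus keeping only the saved samples.
full_gib = trajectory_bytes(n_steps + 1) / GIB
saved_mib = trajectory_bytes(n_saved) / MIB

print(f"steps={n_steps}, saved samples={n_saved}")
print(f"full buffer ~ {full_gib:.2f} GiB, saved-only ~ {saved_mib:.1f} MiB")
```

Under these assumptions the saved-sample count comes out at 101, and storing only saved states shrinks the per-field buffer from gigabytes to tens of megabytes.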
After chunked saving, the same bounded validation run completed and wrote 101 saved samples with finite diagnostics.

Practical guidance:

- Increase `save_every` when the analysis only needs coarse movies or growth histories.
- Keep `dt` controlled by physics/stability, not by output cadence.
- For GPU runs, set `XLA_PYTHON_CLIENT_PREALLOCATE=false` when sharing a GPU.
- Treat very long reverse-mode differentiable runs separately: checkpointing or custom adjoints are still needed for memory-efficient gradients through production trajectories.

## Source links

- [Timing implementation](https://github.com/uwplasma/MHX/blob/main/src/mhx/benchmarks/timing.py)
- [Fixed-step RK4 integrator](https://github.com/uwplasma/MHX/blob/main/src/mhx/time_integrators/fixed_step.py)
- [Timing tests](https://github.com/uwplasma/MHX/blob/main/tests/test_timing_benchmark.py)
- [CI artifact workflow](https://github.com/uwplasma/MHX/blob/main/.github/workflows/ci.yml)