Campaign runner operations

This page describes how MHX should move from FAST validation artifacts to long nonlinear reconnection campaigns. It is intentionally operational: reviewers should be able to see what was run, why it was long enough, what files were written, and why a result is or is not a production physics claim.

Current runner status

MHX currently ships six campaign-level commands:

mhx campaign rutherford-template --outdir outputs/campaigns/rutherford_template
mhx campaign rutherford-run-fast --outdir outputs/campaigns/rutherford_fast
mhx campaign rutherford-plan-production --outdir outputs/campaigns/rutherford_production_plan
mhx campaign rutherford-resume-plan outputs/campaigns/rutherford_production_plan
mhx campaign rutherford-execute outputs/campaigns/rutherford_production_plan --max-steps 128
mhx campaign rutherford-promotion-check outputs/campaigns/rutherford_production_plan

The first command writes a duration-guarded production template. The second runs a tiny deterministic nonlinear trajectory and writes the same diagnostic vocabulary used by the future Rutherford campaign. Its output stays at claim_level = "validation" unless it is explicitly configured as smoke; it is not a production nonlinear result.

The third and fourth commands write a duration-guarded production plan, runbook, scheduler-neutral walltime chunks, checkpoint index, required output list, and resume selection contract. The fifth command is the actual restartable executor: it advances a real reduced-MHD chunk, writes a state checkpoint, appends production-history arrays, refreshes the resume plan, and writes figures. A partial chunk remains claim_level = "validation"; a completed target run can only emit claim_level = "production" if the explicit production-claim gate is enabled, all execution checks pass, and a passing promotion-readiness report is attached under <run-dir>/promotion/. The sixth command writes that promotion-readiness report and exits nonzero until the missing evidence is attached.
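The claim-level rule for the executor can be restated as a small decision function. This is only a sketch of the rule as described above: the function name and its boolean arguments are hypothetical and are not part of the MHX API.

```python
def resolve_claim_level(target_complete, execution_checks_pass,
                        promotion_report_passes, production_gate_enabled):
    """Return the claim level implied by the documented executor rule.

    Hypothetical helper: "production" requires every condition at once;
    anything else (including a partial chunk) stays at "validation".
    """
    all_conditions = (
        target_complete
        and execution_checks_pass
        and promotion_report_passes
        and production_gate_enabled
    )
    return "production" if all_conditions else "validation"


# A completed run without the explicit production-claim gate stays validation.
print(resolve_claim_level(True, True, True, False))
# A partial chunk stays validation even with the gate enabled.
print(resolve_claim_level(False, True, True, True))
```

The point of writing it as a single conjunction is that no one condition can promote a run by itself.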

Duration gate

For a mode with linear growth rate \(\gamma\), any growth or island-formation claim must satisfy

\[ t_\mathrm{end} \ge s_f\frac{N_e}{\gamma}. \]

Here \(N_e\) is the number of required e-folds and \(s_f\) is a safety factor. The default Harris reference used by MHX is \(\gamma\simeq0.0131\), so:

\[ 10/\gamma \approx 763.4,\qquad 3\times10/\gamma\approx2290.1. \]

The first number is a minimum linear-growth observation window. The second is the default Rutherford-template window, leaving room for a resolved linear phase plus nonlinear island tracking. A shorter run may still be useful as a code-validity gate, but it must not be labeled as nonlinear reconnection evidence.
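The duration gate is simple enough to check by hand or in a few lines of Python. The sketch below reproduces the two reference numbers; the function names are illustrative and are not MHX functions.

```python
def required_end_time(gamma, n_efolds, safety_factor):
    """Minimum t_end from the duration gate: s_f * N_e / gamma."""
    if gamma <= 0.0:
        raise ValueError("gamma must be positive for a growth-based duration gate")
    return safety_factor * n_efolds / gamma


def passes_duration_gate(t_end, gamma, n_efolds, safety_factor):
    """True when the run is long enough for the declared gamma, N_e, and s_f."""
    return t_end >= required_end_time(gamma, n_efolds, safety_factor)


gamma = 0.0131  # default Harris reference growth rate quoted above
print(round(required_end_time(gamma, 10, 1), 1))  # 763.4
print(round(required_end_time(gamma, 10, 3), 1))  # 2290.1
print(passes_duration_gate(800.0, gamma, 10, 3))  # False
```

A run ending at t = 800 clears the bare ten e-fold window but not the default Rutherford-template window, so under the rule above it could serve only as a code-validity gate.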

FAST runner artifacts

The FAST runner writes:

  • rutherford_fast_histories.npz

  • diagnostics.json

  • validation.json

  • campaign_template.json

  • manifest.json

  • figures/rutherford_fast_histories.png

  • optionally figures/flux_movie.gif

The NPZ history keys are:

| Key | Meaning |
| --- | --- |
| time | Saved times for each trajectory. |
| seed | Seed associated with each saved trajectory row. |
| reconnected_flux | Reconnecting-mode flux proxy. |
| rutherford_island_width | $W=4\sqrt{ |
| reconnection_rate_proxy | Finite-difference time derivative of reconnecting flux. |
| magnetic_energy, kinetic_energy, total_energy | Reduced-MHD energy diagnostics. |
| magnetic_divergence_linf | Spectral solenoidal-field check. |
| current_density_linf | Current-density magnitude proxy. |

These names are intentionally the same names that should appear in a future production Rutherford campaign. That lets plotting, manifests, and reviewers reuse one schema.
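Because the schema is shared, a reviewer can check any history file against the key list with NumPy alone. The following is a minimal sketch: the helper name is hypothetical, and the demo file is a synthetic placeholder filled with zeros, not MHX output.

```python
import os
import tempfile

import numpy as np

# Key names copied from the list above.
EXPECTED_KEYS = (
    "time", "seed", "reconnected_flux", "rutherford_island_width",
    "reconnection_rate_proxy", "magnetic_energy", "kinetic_energy",
    "total_energy", "magnetic_divergence_linf", "current_density_linf",
)


def missing_history_keys(npz_path):
    """Return the expected keys that are absent from an NPZ history file."""
    with np.load(npz_path) as data:
        return [key for key in EXPECTED_KEYS if key not in data.files]


# Demo on a synthetic placeholder that deliberately omits "seed".
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rutherford_fast_histories.npz")
    placeholder = {k: np.zeros((1, 4)) for k in EXPECTED_KEYS if k != "seed"}
    np.savez(path, **placeholder)
    demo_missing = missing_history_keys(path)

print(demo_missing)  # ['seed']
```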

Production planning and execution artifacts

The production planner writes these files before running the expensive PDE:

  • campaign_plan.json with schema mhx.campaign.rutherford_production_plan.v1;

  • campaign_config.toml with the effective long-run configuration;

  • validation.json with duration, resolution, walltime, and artifact gates;

  • runbook.md with the launch/restart checklist;

  • job_array.json with scheduler-neutral walltime chunks;

  • checkpoints/checkpoint_index.json with schema mhx.campaign.rutherford_checkpoint_index.v1;

  • manifest.json with claim_level = "production_template".

The executor then writes:

  • production_history.npz with schema mhx.campaign.rutherford_history.v1;

  • diagnostics.json with schema mhx.campaign.rutherford_execution.v1;

  • validation.json with schema mhx.campaign.rutherford_execution.gates.v1;

  • checkpoints/state_step_*.npz with schema mhx.campaign.rutherford_state.v1;

  • checkpoints/step_*.json checkpoint metadata with hashes;

  • resume_plan.json;

  • figures/production_histories.png;

  • figures/current_sheet_aspect_ratio.png;

  • optional figures/fixed_scale_flux_movie.gif and figures/fixed_scale_current_density_movie.gif;

  • artifact_manifest.json and an updated manifest.json.

The promotion checker then writes:

  • promotion/promotion_readiness.json with schema mhx.campaign.rutherford_promotion.v1;

  • promotion/validation.json with schema mhx.campaign.rutherford_promotion.gates.v1;

  • promotion/figures/promotion_matrix.png;

  • promotion/artifact_manifest.json and promotion/manifest.json.

The promotion gate is deliberately stricter than the executor. It requires a completed target, finite histories, current-sheet geometry, detected X/O critical-point counts, fixed-scale movies unless explicitly disabled, convergence evidence, seed-QI evidence, and tolerances on energy-budget residuals and magnetic-divergence error. It also requires positive nonlinear response: by default, both the peak/initial reconnecting-flux amplification and the peak/initial Rutherford-width amplification must exceed 1.05. These thresholds are configurable with --min-reconnected-flux-amplification and --min-island-width-amplification on mhx campaign rutherford-promotion-check.

For production-candidate Rutherford/island runs, plan the executor with the same unstable periodic double-Harris equilibrium used by the nonlinear validation lane:

mhx campaign rutherford-plan-production \
  --outdir outputs/campaigns/rutherford_production \
  --equilibrium periodic_double_harris \
  --width 0.36 --perturbation-amplitude 0.004 \
  --mode-x 2 --mode-y 1 --eta 0.0045 --nu 0.0045 \
  --nx 128 --ny 128 --dt 0.02 --target-saved-frames 121

The older cosine_tearing default is retained as a decaying executor/checkpoint schema sanity path. It is not sufficient for a nonlinear response promotion because its reconnecting-flux and island-width histories are expected to fail the amplification gates.
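The amplification gates compare the peak of a history with its initial value. The sketch below illustrates why a decaying history cannot pass. It assumes "peak" means the maximum of a positive history; the function names and the sample numbers are made up for illustration and do not come from an MHX run.

```python
def amplification(history):
    """Peak/initial ratio of a positive diagnostic history (assumed definition)."""
    values = [float(v) for v in history]
    if not values or values[0] <= 0.0:
        raise ValueError("history must be non-empty with a positive initial value")
    return max(values) / values[0]


def passes_response_gate(flux, width, min_flux_amp=1.05, min_width_amp=1.05):
    """True when both amplifications exceed their thresholds (default 1.05)."""
    return (amplification(flux) > min_flux_amp
            and amplification(width) > min_width_amp)


growing = [1.0, 1.2, 1.5]    # illustrative growing history
decaying = [1.0, 0.9, 0.8]   # illustrative decaying history

print(passes_response_gate(growing, growing))    # True
print(passes_response_gate(decaying, decaying))  # False
```

For a monotonically decaying history the peak is the initial value, so the ratio is exactly 1 and falls below the 1.05 default.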

For a laptop-safe closed-lane example:

mhx campaign rutherford-plan-production \
  --outdir outputs/campaigns/rutherford_executor_demo \
  --nx 8 --ny 8 --dt 1e-2 --target-saved-frames 120 \
  --min-production-resolution 8

mhx campaign rutherford-execute \
  outputs/campaigns/rutherford_executor_demo \
  --max-steps 8 --movies

mhx campaign rutherford-promotion-check \
  outputs/campaigns/rutherford_executor_demo \
  --no-require-movies --min-history-samples 2 || true

The final command is expected to fail for this tiny demo because the target is not complete and no convergence/seed-QI bundles are attached. The purpose is to write promotion/figures/promotion_matrix.png so the missing evidence is explicit rather than implicit.

Figures: restartable Rutherford production histories; fixed-scale flux movie; fixed-scale current-density movie.

The checkpoint index starts empty. Long-run executors register restartable state files with:

from mhx.campaigns import write_checkpoint_metadata

write_checkpoint_metadata(
    "outputs/campaigns/rutherford_production_plan",
    step=1000,
    time=100.0,
    state_path="checkpoints/state_0000001000.npz",
    history_path="histories.npz",
    metrics={
        "total_energy": 0.997,
        "magnetic_divergence_linf": 2.0e-13,
    },
)

Each checkpoint record stores the step, physical time, walltime spent, metrics, artifact paths, file sizes, and SHA-256 hashes. mhx campaign rutherford-resume-plan <run-dir> selects the latest valid checkpoint at or before the target step. If files are missing or hashes change, the resume plan marks the checkpoint invalid and falls back to step zero rather than silently continuing from a corrupted state.
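The resume rule can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the MHX implementation: the record fields (step, state_path, sha256) and function names are hypothetical, and the sketch assumes that an invalid checkpoint is skipped in favor of an earlier valid one, with step zero used only when none is valid.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid(record):
    """A record is valid if its state file exists and its hash is unchanged."""
    path = record["state_path"]
    return os.path.isfile(path) and sha256_of(path) == record["sha256"]


def select_resume_step(records, target_step):
    """Latest valid checkpoint step at or before target_step, else 0."""
    steps = [r["step"] for r in records
             if r["step"] <= target_step and is_valid(r)]
    return max(steps, default=0)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "state_a.npz")
    with open(good, "wb") as handle:
        handle.write(b"placeholder state bytes")
    records = [
        {"step": 1000, "state_path": good, "sha256": sha256_of(good)},
        {"step": 2000, "state_path": os.path.join(tmp, "missing.npz"),
         "sha256": "0" * 64},
    ]
    resume_step = select_resume_step(records, target_step=5000)
    # Simulate corruption: the file changes after its hash was recorded.
    with open(good, "ab") as handle:
        handle.write(b"!")
    fallback_step = select_resume_step(records, target_step=5000)

print(resume_step, fallback_step)  # 1000 0
```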

Production campaign acceptance criteria

A production Rutherford or plasmoid campaign should pass all of the following:

| Requirement | Reviewer reason |
| --- | --- |
| Duration guard passes for the declared \(\gamma\), \(N_e\), and \(s_f\). | Prevents short runs from being interpreted as nonlinear evolution. |
| At least two spatial resolutions and two time steps are archived. | Separates physics from discretization artifacts. |
| Energy-budget residual remains below the documented tolerance. | Checks bracket cancellation and dissipation signs during the long run. |
| Magnetic divergence remains near spectral roundoff or a documented tolerance. | Catches projection/derivative mistakes. |
| Seed-robust QI is run on the production diagnostic family. | Checks that the reported metrics are not seed accidents. |
| mhx campaign rutherford-promotion-check passes. | Blocks production claims until convergence, seed-QI, current-sheet geometry, X/O point counts, fixed-scale media, tolerances, and positive reconnecting-flux/island-width response are present. |
| Flux/current movies use fixed color limits and include timestamps. | Makes visual comparisons honest across resolutions and seeds. |
| Artifact manifests include hashes, config, git commit, API version, and dependencies. | Makes reviewer reruns and diffs possible. |

Dry-run production automation lane

tools/run_nonlinear_production_campaign.py is the dry-run-first launcher for a true nonlinear production lane. By default it does not run the solver; it writes production_campaign_manifest.json and run_commands.sh with exact commands, timeouts, expected outputs, and claim boundaries:

python tools/run_nonlinear_production_campaign.py \
  --outdir outputs/campaigns/nonlinear_production_campaign \
  --dry-run

The manifest includes Rutherford duration planning/execution, a seeded double-Harris production-duration movie run, two convergence bundles, seed-QI, sheet-width plus aspect-ratio evidence, eta/Lundquist sweeps, fixed-scale movie gates, double-Harris validation promotion, Rutherford promotion, and a final --allow-production-claim refresh command. The double-Harris promotion command can only promote media to convergence-backed validation. Rutherford artifacts remain claim_level = "validation" unless the target duration completes, the promotion report passes with convergence/seed/movie/response evidence, and the final zero-step production-claim refresh succeeds.

By default the campaign fits the early growth rate only up to the larger of three save intervals and two declared Harris e-folding times, capped at the end of the run: fit_stop = min(t_end, max(3*save_interval, 2/gamma)). This keeps the growth fit out of nonlinear saturation for bounded GPU campaigns. Use --fit-stop when a case-specific linear window is known from an eigenmode or pilot run. When --save-interval is omitted, the convergence and sweep stages inherit the long-run saved-frame cadence, save_interval = save_every*dt, so the default campaign samples at the same physical cadence across the long-run and auxiliary gates. The seed-QI command also receives the same --save-every stride, which keeps long seed ensembles from storing every RK4 step on GPU.
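The two default formulas can be evaluated directly. The sketch below uses the Harris reference growth rate from the duration gate; the function names are illustrative, and save_every = 100 is an example value, not an MHX default.

```python
def default_save_interval(save_every, dt):
    """Physical time between saved frames: save_every * dt."""
    return save_every * dt


def default_fit_stop(t_end, save_interval, gamma):
    """fit_stop = min(t_end, max(3*save_interval, 2/gamma))."""
    return min(t_end, max(3.0 * save_interval, 2.0 / gamma))


save_interval = default_save_interval(save_every=100, dt=0.02)  # 2.0
fit_stop = default_fit_stop(t_end=2290.1, save_interval=save_interval,
                            gamma=0.0131)
print(round(fit_stop, 1))  # 152.7

# For a very short run, the cap at t_end takes over.
print(default_fit_stop(t_end=50.0, save_interval=save_interval, gamma=0.0131))  # 50.0
```

With these numbers the two e-folding times (about 152.7) dominate the three save intervals (6.0), so the fit window is set by the growth rate rather than by the output cadence.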

Execution is intentionally opt-in and bounded per command:

python tools/run_nonlinear_production_campaign.py \
  --outdir outputs/campaigns/nonlinear_production_campaign \
  --execute \
  --timeout-seconds 43200 \
  --gate-timeout-seconds 1800

Use the generated shell script only when an external scheduler owns walltime. The Python --execute path enforces portable subprocess timeouts and updates the manifest with pass/fail/timeout records after each command.
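A portable per-command timeout of this kind can be built on subprocess.run from the Python standard library. The following is a minimal sketch of that pattern, not the launcher's actual code: the function name and the record fields are hypothetical.

```python
import subprocess
import sys


def run_with_timeout(command, timeout_seconds):
    """Run one command and return a pass/fail/timeout record (hypothetical schema)."""
    try:
        completed = subprocess.run(
            command, timeout=timeout_seconds, capture_output=True, text=True
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return {"command": command, "status": "timeout", "returncode": None}
    status = "pass" if completed.returncode == 0 else "fail"
    return {"command": command, "status": status,
            "returncode": completed.returncode}


# Stand-in commands that use the current interpreter instead of the solver.
ok = run_with_timeout([sys.executable, "-c", "print('ok')"], 30)
bad = run_with_timeout([sys.executable, "-c", "import sys; sys.exit(3)"], 30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)

print(ok["status"], bad["status"], slow["status"])  # pass fail timeout
```

Recording the outcome after every command, rather than once at the end, means a timeout in a late stage does not erase the evidence from the stages that already finished.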

Proposed production directory

outputs/campaigns/rutherford_production/
  campaign_config.toml
  campaign_plan.json
  runbook.md
  job_array.json
  diagnostics.json
  validation.json
  production_history.npz
  resume_plan.json
  checkpoints/
    checkpoint_index.json
    state_step_*.npz
    step_*.json
  convergence/
    resolution_sweep/
    timestep_sweep/
  seed_qi/
  figures/
    production_histories.png
    current_sheet_aspect_ratio.png
    fixed_scale_flux_movie.gif
    fixed_scale_current_density_movie.gif
  promotion/
    promotion_readiness.json
    validation.json
    figures/promotion_matrix.png
    artifact_manifest.json
    manifest.json
  artifact_manifest.json
  manifest.json

This layout is now automated for chunked execution through mhx campaign rutherford-execute. The remaining hard boundary is not the executor itself; it is running enough chunks at production resolution, then attaching convergence sweeps, seed-QI checks, fixed-scale movies, and a passing promotion report before claiming a paper-grade nonlinear result.

Reviewer questions to answer before claiming production

  1. Which growth rate set the duration?

  2. How many e-folds were covered before nonlinear island tracking?

  3. Did island width show the expected algebraic phase after the linear phase?

  4. Did the energy budget remain closed over the full run?

  5. Did the result persist under resolution, time-step, and seed changes?

  6. Are flux/current movies plotted with fixed ranges?

  7. Are all files reproducible from the checked-in command sequence?

If the answer to any question is unknown, the result should remain a validation artifact, not a paper claim.