Architecture: BMA + isotonic + conformal¶

The framework is a deliberately small pipeline of three composed stages:

component model outputs (helios-spaceweather-connectors) │ ▼ ┌─────────────┐ │ BMA fuse │ weights from rolling 90-day skill (HSS-weighted) └─────────────┘ │ fused point estimate (probability or continuous) ▼ ┌──────────────────────────────┐ │ Reliability calibration │ severity-stratified isotonic regression └──────────────────────────────┘ │ calibrated probability ▼ ┌──────────────────────────────┐ │ Conformal interval │ Mondrian (per-Kp-stratum) split conformal └──────────────────────────────┘ │ ▼ FusedOutput with lineage + conformal_interval

Each stage is independent: the BMA orchestrator does not know about calibration; calibrators do not know about conformal prediction; the conformal regressor does not know about the upstream estimator. This separation is intentional — it makes each stage testable in isolation and lets the same framework score deterministic point predictions (skipping conformal), pure probability forecasts (skipping conformal width), or fully probabilistic continuous-quantity forecasts (the full stack).

Why this stack, vs. alternatives¶

Why BMA over a single ensemble model?¶

Component models in heliophysics have heterogeneous error structures across event types (CME-driven vs. flare-driven SEP onsets), severity regimes (quiet vs. extreme Kp), and time horizons. A single ML ensemble would require retraining whenever a component model updates its own underlying physics. BMA treats components as black-box probability streams and recombines them at runtime, so the framework continues to work when an upstream model is replaced or temporarily unavailable.

Why isotonic over Platt?¶

The proposal §2 Obj. 2 explicitly rejects Platt scaling. The rationale is implemented in tests/test_calibration.py::test_platt_worsens_calibration_at_extremes: on a synthetic stream that is well-calibrated at moderate probabilities but miscalibrated at extremes (the regime where SEP all-clear-revocation decisions live), Platt's two-parameter sigmoid cannot fit the localised tail miscalibration without distorting the moderate-probability middle. Isotonic regression is non-parametric in the relevant sense (monotone non-decreasing) and handles this regime gracefully.

The PlattCalibrator class is kept in the framework deliberately, so users can verify the rejection rationale on their own data.

Why severity-stratified isotonic?¶

Operational decisions at extreme-Kp conditions are exactly the decisions that matter most. An unstratified isotonic calibrator pools across the Kp distribution, where quiet samples dominate by volume — the extreme stratum contributes too few samples to the global fit to influence its knots. The proposal §2 Obj. 2 explicitly calls for severity-stratified calibration to "prevent calibration collapse on the events that matter most."

SeverityStratifiedCalibrator holds one IsotonicCalibrator per stratum. Each stratum's calibration knots are fitted only on samples from that stratum.

Why Mondrian conformal over standard split conformal?¶

Standard split conformal achieves marginal coverage — the requested 1 - alpha coverage rate holds on average across all samples. It does NOT guarantee per-stratum coverage. The OSF pre-registration requires the reliability slope to fall within 0.15 of 1.0 per stratum; the matching conformal-coverage discipline is Mondrian conformal (Vovk 2003), which carries a separate residual quantile per stratum.

MondrianConformalRegressor is built on top of SplitConformalRegressor and routes per-sample stratum labels to the matching sub-regressor at both fit time and predict time.

Composition rules¶

Three rules govern how to compose the stages safely:

Disjoint data per stage: the BMA orchestrator's verification window, the calibrator's fit set, and the conformal regressor's calibration set must be disjoint (or, at least, the conformal calibration set must be disjoint from the data used to fit the calibrator). Otherwise the conformal coverage guarantee is invalidated.
Order matters: BMA → calibrate → conform. Calibrating BEFORE fusing would calibrate each component independently and lose the cross-model weighting; conforming BEFORE calibrating would produce intervals on a miscalibrated point estimate.
Stratum labels travel: the severity stratum label is carried on every record so the right calibrator and the right conformal regressor are selected per sample.

Lineage¶

Every fused output records a LineageStep per transformation. The bma_fuse step records the weights used (post-renormalisation if any configured models were missing) and the list of excluded models. The calibration and conformal steps can be appended to the lineage by callers that compose the stages downstream of the orchestrator.

The schema version on every record matches the eventual helios-provenance-spec v0.1 contract — see types.py for the verbatim field shapes.