When we say 60%, does it happen 60% of the time?
This is the dated bet: a forecast is only honest if it's scored, and scored under a proper rule. The walk-forward backtest below trains the model on history up to a cutoff and grades it on the next six months — twelve times, sliding forward — so the headline metrics are out-of-sample rather than self-reported. Live in-tournament scoring is wired up below the backtest and switches on once Group A kicks off.
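The fold construction described above can be sketched as follows. This is a minimal illustration, not the site's actual pipeline: the function name `walk_forward_folds`, the 180-day horizon argument, and the toy match dates are all assumptions made for the example.

```python
from datetime import date, timedelta

def walk_forward_folds(match_dates, cutoffs, horizon_days=180):
    """Yield (cutoff, train_idx, test_idx) for each cutoff (hypothetical helper).

    Train: matches dated on or before the cutoff.
    Test:  matches dated after the cutoff, up to horizon_days later.
    """
    for cutoff in cutoffs:
        end = cutoff + timedelta(days=horizon_days)
        train_idx = [i for i, d in enumerate(match_dates) if d <= cutoff]
        test_idx = [i for i, d in enumerate(match_dates) if cutoff < d <= end]
        yield cutoff, train_idx, test_idx

# Toy data (illustrative only): one match every 30 days from 2017-01-01.
match_dates = [date(2017, 1, 1) + timedelta(days=30 * k) for k in range(40)]
cutoffs = [date(2018, 1, 1), date(2019, 1, 1)]

folds = list(walk_forward_folds(match_dates, cutoffs))
for cutoff, train_idx, test_idx in folds:
    # The model would be fitted on train_idx and scored on test_idx only.
    print(cutoff, len(train_idx), len(test_idx))
```

Because the test window always lies strictly after the cutoff, no test match can leak into the training set of its own fold.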
Accuracy over time
By rolling fold
A flat line is the dream: it means model quality is stable as new tournaments are added. A trend tells you something is shifting (better data, a regime change, or, awkwardly, overfitting).
RPS & sharpness per fold
Walk-forward
Both scores are proper: neither rewards over-confidence. RPS respects the ordering of outcomes: when the home side wins, probability that had been placed on the draw costs less than probability placed on an away win. Sharpness penalises any miss equally.
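A minimal sketch of the ranked probability score (RPS) for a single three-outcome match, assuming outcomes are ordered home win, draw, away win and using the common normalisation by K − 1. The `brier` function is included only as an order-blind contrast; the page does not state how its Sharpness column is computed, so that link is an assumption.

```python
def rps(probs, outcome):
    """Ranked probability score for one match (lower is better).

    probs:   probabilities in ordinal order (home win, draw, away win).
    outcome: index of the observed result.
    """
    k = len(probs)
    cum_p = 0.0  # cumulative forecast probability
    cum_o = 0.0  # cumulative observed indicator
    total = 0.0
    for i in range(k - 1):
        cum_p += probs[i]
        cum_o += 1.0 if i == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (k - 1)

def brier(probs, outcome):
    """Multi-class squared error; ignores the ordering of outcomes."""
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))

# Two forecasts that both miss a home win (outcome index 0).
near_miss = [0.2, 0.6, 0.2]  # most mass on the draw
far_miss = [0.2, 0.2, 0.6]   # most mass on the away win

print(rps(near_miss, 0), rps(far_miss, 0))      # RPS is lower for the near miss
print(brier(near_miss, 0), brier(far_miss, 0))  # squared error treats them alike
```

The two forecasts contain the same probabilities in a different order, so an order-blind score cannot separate them while RPS can.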
Walk-forward folds
Train ≤ cutoff · score next 180 days

| Cutoff | Train n | Test n | RPS | Sharpness | Accuracy | Fit time (s) |
|---|---|---|---|---|---|---|
| 2018-01-01 | 39815 | 448 | 0.1718 | 0.5123 | 62.1% | 0.44 |
| 2019-01-01 | 40681 | 842 | 0.1547 | 0.459 | 64.7% | 0.47 |
| 2020-01-01 | 41758 | 243 | 0.1902 | 0.5809 | 51.0% | 0.45 |
| 2021-01-01 | 42104 | 900 | 0.1454 | 0.4543 | 65.0% | 0.43 |
| 2022-01-01 | 43220 | 610 | 0.186 | 0.5447 | 57.7% | 0.44 |
| 2023-01-01 | 44174 | 752 | 0.1679 | 0.4903 | 61.6% | 0.46 |
| 2024-01-01 | 45177 | 984 | 0.1649 | 0.5099 | 58.9% | 0.43 |
| 2025-01-01 | 46401 | 756 | 0.1528 | 0.4615 | 63.0% | 0.49 |
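
The folds differ a lot in size (243 to 984 test matches), so a plain average of the per-fold scores would over-weight the small folds. One way to summarise the table, sketched here as an assumption rather than as the method behind the page's headline numbers, is to weight each fold by its test count. The values are copied from the table above.

```python
# (cutoff, test_n, rps, accuracy_pct) copied from the walk-forward table.
fold_rows = [
    ("2018-01-01", 448, 0.1718, 62.1),
    ("2019-01-01", 842, 0.1547, 64.7),
    ("2020-01-01", 243, 0.1902, 51.0),
    ("2021-01-01", 900, 0.1454, 65.0),
    ("2022-01-01", 610, 0.1860, 57.7),
    ("2023-01-01", 752, 0.1679, 61.6),
    ("2024-01-01", 984, 0.1649, 58.9),
    ("2025-01-01", 756, 0.1528, 63.0),
]

total_n = sum(n for _, n, _, _ in fold_rows)

# Test-size-weighted averages: each scored match counts once.
pooled_rps = sum(n * r for _, n, r, _ in fold_rows) / total_n
pooled_acc = sum(n * a for _, n, _, a in fold_rows) / total_n

# Unweighted mean, for comparison.
simple_rps = sum(r for _, _, r, _ in fold_rows) / len(fold_rows)

print(total_n, round(pooled_rps, 4), round(pooled_acc, 1), round(simple_rps, 4))
```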
Calibration · predicted vs observed
Pooled across folds
Dots near the diagonal mean "when we say 30%, it happens 30% of the time". Above the line = under-confident; below = over-confident. Symmetry across the three outcomes tells you the model isn't dragging probability between draws and decisive results.
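The dots in a reliability plot come from binning forecasts by predicted probability and comparing each bin's mean prediction with the observed frequency. A minimal sketch for one outcome, assuming ten equal-width bins; the function name `reliability_table` and the toy data are illustrative, not taken from the site.

```python
def reliability_table(probs, hits, n_bins=10):
    """Compare predicted probability with observed frequency per bin.

    probs: predicted probabilities for one outcome (e.g. home win).
    hits:  1 if that outcome occurred, else 0.
    Returns (mean_predicted, observed_rate, count) for each non-empty bin.
    """
    sums = [0.0] * n_bins
    wins = [0] * n_bins
    counts = [0] * n_bins
    for p, h in zip(probs, hits):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        sums[b] += p
        wins[b] += h
        counts[b] += 1
    return [(sums[b] / counts[b], wins[b] / counts[b], counts[b])
            for b in range(n_bins) if counts[b]]

# Toy data: eight forecasts of 0.25 (two came true) and eight of 0.75 (six came true).
probs = [0.25] * 8 + [0.75] * 8
hits = [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0]

table = reliability_table(probs, hits)
for mean_pred, observed, count in table:
    # observed > mean_pred: under-confident; observed < mean_pred: over-confident
    print(mean_pred, observed, count)
```

In this toy case both bins sit exactly on the diagonal. With real data the count per bin matters: a dot backed by a handful of matches can sit far off the line by chance.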
What's tunable
6 yr · how quickly old matches fade. Lower = more responsive but higher variance. The 24-cell sweep showed negative log-likelihood (NLL) is flat across [4, 8] years.
0.3 · pulls under-observed teams toward zero. Critical for tiny FAs like Bhutan.
Friendly 0.50 · WC main 1.60, matching Nate Silver's PELE midpoints.
Travel distance HFA · altitude · negative-binomial draw correlation.
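To make the first and third knobs concrete, here is a rough sketch of how a per-match training weight could combine them. It assumes that "6 yr" is an exponential half-life and that the friendly and World Cup figures are multiplicative importance weights; the page states neither, and the names `match_weight` and `IMPORTANCE` are hypothetical.

```python
# Importance multipliers as listed above (only the two values the page gives).
IMPORTANCE = {"friendly": 0.50, "wc_main": 1.60}

def match_weight(age_years, importance, half_life_years=6.0):
    """Training weight for one match (assumed form, not confirmed by the page).

    The weight halves every half_life_years and scales with match importance.
    """
    return importance * 0.5 ** (age_years / half_life_years)

# A friendly played 12 years ago vs. a World Cup match played today.
old_friendly = match_weight(12.0, IMPORTANCE["friendly"])
recent_wc = match_weight(0.0, IMPORTANCE["wc_main"])
print(old_friendly, recent_wc)

# Shortening the half-life makes old matches fade faster.
print(match_weight(6.0, 1.0, half_life_years=4.0) < match_weight(6.0, 1.0))
```

Under this assumed form, a shorter half-life concentrates weight on recent matches, which is the "more responsive, more variance" trade-off described above.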
Live calibration starts 2026-06-11
Once matches start, every fixture's pre-match probability is locked in and scored against the actual W/D/L outcome. A running reliability plot (this tournament only) appears here, separately from the multi-decade backtest above, so a reader can see whether the model holds up on this draw.
The companion model-vs-market page scores the same fixtures against the closing market line, so both "are we calibrated?" and "do we beat the market?" are answered on the same fixture set. Post-tournament, this page becomes the survivor artifact for the lab.