Applied AI / Agent Lab
Section 06 · Calibration scorecard

When we say 60%, does it happen 60% of the time?

This is the dated bet: a forecast is only honest if it's scored, and scored under a proper rule. The walk-forward backtest below trains the model on history up to a cutoff and grades it on the next six months (180 days). It does this eight times, moving the cutoff forward a year each time, so the headline metrics are out-of-sample rather than self-reported. Live in-tournament scoring is wired up below the backtest and switches on once Group A kicks off.
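A minimal sketch of how walk-forward folds like these could be cut, assuming each match carries a date and that the test window is the 180 days strictly after the cutoff. The function name `walk_forward_folds` and the toy dates are illustrative, not taken from the lab's code.

```python
from datetime import date, timedelta


def walk_forward_folds(match_dates, cutoffs, horizon_days=180):
    """Yield (cutoff, train_idx, test_idx) for each cutoff.

    Train on matches dated on or before the cutoff; test on matches in the
    following `horizon_days` days. Nothing after the cutoff reaches training.
    """
    for cutoff in cutoffs:
        end = cutoff + timedelta(days=horizon_days)
        train = [i for i, d in enumerate(match_dates) if d <= cutoff]
        test = [i for i, d in enumerate(match_dates) if cutoff < d <= end]
        yield cutoff, train, test


# Toy data: six match dates and two yearly cutoffs (hypothetical values).
match_dates = [
    date(2017, 6, 1), date(2018, 1, 1), date(2018, 3, 15),
    date(2018, 6, 30), date(2018, 7, 1), date(2019, 2, 1),
]
cutoffs = [date(2018, 1, 1), date(2019, 1, 1)]
folds = list(walk_forward_folds(match_dates, cutoffs))

for cutoff, train, test in folds:
    print(cutoff, "train:", train, "test:", test)
```

The match on 2018-07-01 falls outside the first test window (which ends 2018-06-30) and only enters the model as training data at the next cutoff.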

Folds: 8 (5535 test matches)
Mean RPS: 0.1629 (lower = sharper)
Accuracy: 61.5% (argmax W/D/L)
ECE: calibration error
Sharpness: 0.491 (climate ≈ 0.667)
RPS spread (fold std): ± 0.0158 (min 0.1454 · max 0.1902)
Accuracy spread (fold std): ± 4.6pp (across 8 folds)
Aggregation: size-weighted (larger folds count more)

Where do those numbers sit?
Accuracy 61.5%
Random guessing gets 33%. Always-pick-home gets ~46% in international matches. A standard Elo rating gets ~55%. Sharp betting markets get ~62–65%.
Mean RPS 0.1629
Lower is better; 0 is perfect, 0.222 is random. Elo-only is ~0.19. Sharp markets are ~0.14–0.16 — one credible step ahead of us. Closing the gap needs roster info, not better statistics.
ECE —
"When we say 60% chance, it should happen 60% of the time." 0 is perfect; 0.05 is good; 0.10+ is poor. Lower means our probabilities are honest.

Accuracy over time

By rolling fold

A flat line is the dream — it means model quality is stable as new tournaments are added. A trend tells you something is shifting (better data, regime change, or, awkwardly, overfit).

RPS & sharpness per fold

Walk-forward

Both scores are proper: neither rewards over-confidence. RPS respects the ordering of the outcomes, so a forecast that leans towards a home win is penalised less when the match is drawn than when the away side wins. Sharpness ignores that ordering and penalises any miss equally.
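The difference between the two scores can be sketched in a few lines. This assumes the standard ranked probability score over the ordered outcomes (home, draw, away). It also assumes that "sharpness" on this page is a multiclass Brier score, an inference from the 0.667 baseline, which is what a uniform forecast scores under Brier. The function names are illustrative.

```python
def rps(probs, outcome):
    """Ranked probability score for ordered outcomes (home, draw, away).

    probs: forecast probabilities in outcome order; outcome: observed index.
    Compares cumulative forecast and cumulative outcome, so near misses cost less.
    """
    cum_p = cum_o = total = 0.0
    for k in range(len(probs) - 1):
        cum_p += probs[k]
        cum_o += 1.0 if k == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)


def brier(probs, outcome):
    """Multiclass Brier score: squared error per outcome, no ordering."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in enumerate(probs))


# Home win observed (index 0). Both forecasts give the home side 0.4,
# but one puts the rest mostly on the draw, the other mostly on the away win.
near_miss = (0.4, 0.5, 0.1)
far_miss = (0.4, 0.1, 0.5)
print(round(rps(near_miss, 0), 3), round(rps(far_miss, 0), 3))      # 0.185 0.305
print(round(brier(near_miss, 0), 3), round(brier(far_miss, 0), 3))  # 0.62 0.62

# Uniform forecast, averaged over the three possible results.
uniform = (1 / 3, 1 / 3, 1 / 3)
rps_baseline = sum(rps(uniform, o) for o in range(3)) / 3
brier_baseline = sum(brier(uniform, o) for o in range(3)) / 3
print(round(rps_baseline, 3), round(brier_baseline, 3))             # 0.222 0.667
```

The uniform baseline reproduces the 0.222 "random" RPS quoted above, under the additional assumption that the three results are equally likely.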

Walk-forward folds

Train ≤ cutoff · score next 180 days
Cutoff      Train n  Test n  RPS     Sharpness  Accuracy  Fit (s)
2018-01-01  39815    448     0.1718  0.5123     62.1%     0.44
2019-01-01  40681    842     0.1547  0.459      64.7%     0.47
2020-01-01  41758    243     0.1902  0.5809     51.0%     0.45
2021-01-01  42104    900     0.1454  0.4543     65.0%     0.43
2022-01-01  43220    610     0.186   0.5447     57.7%     0.44
2023-01-01  44174    752     0.1679  0.4903     61.6%     0.46
2024-01-01  45177    984     0.1649  0.5099     58.9%     0.43
2025-01-01  46401    756     0.1528  0.4615     63.0%     0.49
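The headline figures can be rechecked from the fold table. The sketch below assumes "size-weighted" means weighting each fold by its test-match count, and that the fold spread is a sample standard deviation (n − 1 in the denominator). Both assumptions reproduce the published numbers from the rounded table values.

```python
from statistics import stdev

# (cutoff, test_n, rps, accuracy_pct), copied from the fold table
folds = [
    ("2018-01-01", 448, 0.1718, 62.1),
    ("2019-01-01", 842, 0.1547, 64.7),
    ("2020-01-01", 243, 0.1902, 51.0),
    ("2021-01-01", 900, 0.1454, 65.0),
    ("2022-01-01", 610, 0.1860, 57.7),
    ("2023-01-01", 752, 0.1679, 61.6),
    ("2024-01-01", 984, 0.1649, 58.9),
    ("2025-01-01", 756, 0.1528, 63.0),
]

total_n = sum(n for _, n, _, _ in folds)

# Size-weighted means: larger test folds count more.
mean_rps = sum(n * r for _, n, r, _ in folds) / total_n
mean_acc = sum(n * a for _, n, _, a in folds) / total_n

# Unweighted sample standard deviation across the eight folds.
rps_spread = stdev([r for _, _, r, _ in folds])
acc_spread = stdev([a for _, _, _, a in folds])

print(total_n)               # 5535
print(round(mean_rps, 4))    # 0.1629
print(round(mean_acc, 1))    # 61.5
print(round(rps_spread, 4))  # 0.0158
print(round(acc_spread, 1))  # 4.6
```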

Calibration · predicted vs observed

Pooled across folds

Dots near the diagonal mean "when we say 30%, it happens 30% of the time". Above the line = under-confident; below = over-confident. Symmetry across the three outcomes tells you the model isn't dragging probability between draws and decisive results.
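A sketch of how the dots on a reliability plot, and the ECE that summarises them, could be computed. It assumes equal-width probability bins and treats every (match, outcome) pair as one binary forecast. The page does not state its binning or pooling scheme, so `n_bins` and both function names are illustrative.

```python
def reliability_bins(probs, hits, n_bins=10):
    """Group forecasts into equal-width bins.

    probs: predicted probabilities in [0, 1]; hits: 1 if the event happened, else 0.
    Returns (mean predicted, observed frequency, count) per non-empty bin.
    """
    acc = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of p, hits, count]
    for p, h in zip(probs, hits):
        i = min(int(p * n_bins), n_bins - 1)    # p == 1.0 lands in the last bin
        acc[i][0] += p
        acc[i][1] += h
        acc[i][2] += 1
    return [(sp / c, sh / c, c) for sp, sh, c in acc if c > 0]


def expected_calibration_error(bins):
    """Size-weighted mean distance of the dots from the diagonal."""
    n = sum(c for _, _, c in bins)
    return sum(c / n * abs(mean_p - freq) for mean_p, freq, c in bins)


# Toy data: ten forecasts at 0.3 (3 happen) and ten at 0.6 (8 happen).
probs = [0.3] * 10 + [0.6] * 10
hits = [1] * 3 + [0] * 7 + [1] * 8 + [0] * 2

bins = reliability_bins(probs, hits)
ece = expected_calibration_error(bins)
for mean_p, freq, count in bins:
    print(round(mean_p, 2), round(freq, 2), count)
print(round(ece, 3))  # 0.1
```

In the toy data the 0.3 dot sits on the diagonal, while the 0.6 dot sits above it (observed 0.8), which is the under-confident case.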

What's tunable

Half-life

6 yr · how quickly old matches fade. Lower = more responsive, more variance. The 24-cell sweep showed negative log-likelihood (NLL) is flat for half-lives between 4 and 8 years.

L2 ridge

0.3 · pulls the strength estimates of under-observed teams toward zero. Critical for tiny football associations (FAs) like Bhutan.
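To see why a ridge penalty matters most for teams with few matches, here is a toy illustration. It is not the lab's model: it assumes a single team strength fitted by squared error to per-match margins, where the penalised estimate has the closed form sum(y) / (n + λ). The real likelihood and the scale of the 0.3 penalty are not given on this page.

```python
def ridge_rating(margins, lam=0.3):
    """Minimise sum((y - r)^2) + lam * r^2 over a single rating r.

    Setting the derivative to zero gives r = sum(y) / (n + lam), so the
    pull toward zero is strongest when n is small.
    """
    return sum(margins) / (len(margins) + lam)


one_match = ridge_rating([1.0])             # a team seen once
many_matches = ridge_rating([1.0] * 100)    # same average margin, 100 matches

print(round(one_match, 4))     # 0.7692
print(round(many_matches, 4))  # 0.997
```

With the same average margin, the one-match team is shrunk by about 23% and the hundred-match team by about 0.3%.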

Competition weights

Friendly 0.50 · WC main 1.60, matching Nate Silver's PELE midpoints.
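A sketch of how the half-life and competition weights might combine into a single per-match training weight. It assumes exponential decay with a 6-year half-life and that the two factors are multiplied. The page states neither, and only the two competition weights quoted above are known, so the dictionary keys and the function name are hypothetical.

```python
# Only these two weights are stated on the page; the key names are illustrative.
COMPETITION_WEIGHTS = {"friendly": 0.50, "wc_main": 1.60}


def match_weight(age_years, competition, half_life_years=6.0):
    """Training weight for one match: competition weight times time decay.

    The decay halves the weight every `half_life_years` (assumed exponential form).
    """
    decay = 0.5 ** (age_years / half_life_years)
    return COMPETITION_WEIGHTS[competition] * decay


print(match_weight(0, "friendly"))   # 0.5
print(match_weight(6, "wc_main"))    # 0.8
print(match_weight(12, "wc_main"))   # 0.4
```

Under these assumptions a World Cup match from twelve years ago still counts almost as much as a friendly played today.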

Not modelled

Travel-distance home-field advantage (HFA) · altitude · negative-binomial draw correlation.

In-tournament calibration

Live calibration starts 2026-06-11

Once matches start, every fixture's pre-match probability is locked in and scored against the actual W/D/L outcome. A running reliability plot (this tournament only) appears here, separately from the multi-decade backtest above, so a reader can see whether the model holds up on this draw.

The companion model-vs-market page scores the same fixtures against the closing market line, so both "are we calibrated?" and "do we beat the market?" are answered on the same fixture set. Post-tournament, this page becomes the survivor artifact for the lab.