Section 06 · Calibration scorecard

When we say 60%,
does it happen 60% of the time?

This is the dated bet: a forecast is only honest if it's scored, and scored under a proper rule. The walk-forward backtest below trains the model on history up to a cutoff and grades it on the next 180 days, eight times over with the cutoff sliding forward a year each time, so the headline metrics are out-of-sample rather than self-reported. Live in-tournament scoring is wired up below the backtest and switches on once Group A kicks off.
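
As a rough sketch of how such a walk-forward split can be generated: the yearly 1 January cutoffs and the 180-day test window come from the fold table on this page, while the function names and the `(match_date, payload)` tuple shape are illustrative assumptions, not the lab's actual code.

```python
from datetime import date, timedelta

def walk_forward_folds(first_year, n_folds, test_days=180):
    """Yield (cutoff, test_end) pairs, one per fold, with yearly cutoffs."""
    for year in range(first_year, first_year + n_folds):
        cutoff = date(year, 1, 1)
        yield cutoff, cutoff + timedelta(days=test_days)

def split(matches, cutoff, test_end):
    """Split (match_date, payload) tuples into train and test sets.

    Train uses everything dated on or before the cutoff; test uses only
    matches strictly after the cutoff and up to test_end, so no test match
    ever leaks into training.
    """
    train = [m for m in matches if m[0] <= cutoff]
    test = [m for m in matches if cutoff < m[0] <= test_end]
    return train, test

folds = list(walk_forward_folds(2018, 8))

# Hypothetical three-match history, just to show the split.
toy_matches = [
    (date(2017, 12, 31), "a"),
    (date(2018, 3, 1), "b"),
    (date(2018, 7, 15), "c"),
]
toy_train, toy_test = split(toy_matches, *folds[0])
```

In this toy history the first fold trains on match "a", scores match "b", and ignores match "c" because it falls after the 180-day window.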

Folds

8

5535 test matches

Mean RPS

0.1629

Lower = better

Accuracy

61.5%

Argmax W/D/L

ECE

—

Calibration error

Sharpness

0.491

Climatology baseline ≈ 0.667

RPS spread (fold std)

± 0.0158

Min 0.1454 · max 0.1902

Accuracy spread (fold std)

± 4.6pp

Across 8 folds

Aggregation

size-weighted

Larger folds count more

Where do those numbers sit?

Accuracy 61.5%

Random guessing gets 33%. Always-pick-home gets ~46% in international matches. A standard Elo rating gets ~55%. Sharp betting markets get ~62–65%.

Mean RPS 0.1629

Lower is better; 0 is perfect, and 0.222 is what a uniform one-in-three forecast scores when the three outcomes are equally likely. Elo-only is ~0.19. Sharp markets are ~0.14–0.16, one credible step ahead of us. Closing the gap needs roster info, not better statistics.
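
For readers who want to check the 0.222 reference point, here is a minimal RPS sketch for an ordered home/draw/away forecast. The function name `rps` and the outcome encoding (0 = home win, 1 = draw, 2 = away win) are assumptions made for illustration.

```python
def rps(probs, outcome):
    """Ranked probability score for one match.

    probs   -- (p_home, p_draw, p_away), in that order
    outcome -- index of the actual result (0, 1 or 2)
    """
    cum_p = 0.0   # cumulative forecast probability
    cum_o = 0.0   # cumulative observed outcome (step from 0 to 1)
    total = 0.0
    for k in range(len(probs) - 1):
        cum_p += probs[k]
        cum_o += 1.0 if k == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

uniform = (1 / 3, 1 / 3, 1 / 3)
# Average score of the uniform forecast if each result is equally likely.
uniform_baseline = sum(rps(uniform, o) for o in range(3)) / 3
```

The uniform forecast scores about 0.278 on a home or away win and about 0.111 on a draw, which averages to 0.222 under equally likely results.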

ECE —

"When we say 60% chance, it should happen 60% of the time." 0 is perfect; 0.05 is good; 0.10+ is poor. Lower means our probabilities are honest.

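The scorecard does not say how its ECE is binned, so the sketch below assumes one common variant: take the model's favourite outcome in each match, bin by the probability given to that favourite (ten equal-width bins), and average the gap between stated confidence and hit rate, weighted by bin size. The function name and the toy data are hypothetical.

```python
def top_label_ece(forecasts, outcomes, n_bins=10):
    """Expected calibration error on the argmax (favourite) outcome."""
    bins = [[] for _ in range(n_bins)]
    for probs, outcome in zip(forecasts, outcomes):
        pick = max(range(len(probs)), key=lambda k: probs[k])
        conf = probs[pick]
        hit = 1 if pick == outcome else 0
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, hit))
    n = len(forecasts)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            hit_rate = sum(h for _, h in members) / len(members)
            ece += len(members) / n * abs(mean_conf - hit_rate)
    return ece

# Toy data: ten matches, each with a 60% home favourite.
toy_forecasts = [(0.6, 0.25, 0.15)] * 10
honest = top_label_ece(toy_forecasts, [0] * 6 + [1] * 4)         # favourite wins 6 of 10
overconfident = top_label_ece(toy_forecasts, [0] * 3 + [2] * 7)  # favourite wins 3 of 10
```

With the favourite winning six times in ten the error is essentially zero; with three wins in ten it is about 0.30, well into the "poor" range quoted above.
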
Accuracy over time

By rolling fold

A flat line is the dream: it means model quality is stable as new tournaments are added. A trend tells you something is shifting (better data, regime change, or, awkwardly, overfitting).

RPS & sharpness per fold

Walk-forward

Both scores are proper: neither rewards over-confidence. RPS is sensitive to the ordering of outcomes, so a near miss costs less than a far one (when the away side wins, probability placed on the draw is penalised less than probability placed on a home win). Sharpness penalises any miss equally.
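
To make the ordering point concrete, the sketch below scores two forecasts that put the same 0.3 on the actual result, an away win, but differ in where the rest of the probability goes. It assumes the Sharpness column is a multiclass Brier score, which is consistent with the 0.667 reference value (the Brier score of a uniform one-in-three forecast) but is not stated on this page. Function names and the outcome encoding are illustrative.

```python
def rps(probs, outcome):
    """Ranked probability score; outcomes ordered home, draw, away."""
    cum_p = cum_o = 0.0
    total = 0.0
    for k in range(len(probs) - 1):
        cum_p += probs[k]
        cum_o += 1.0 if k == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

def brier(probs, outcome):
    """Multiclass Brier score: squared error summed over all outcomes."""
    return sum((p - (1.0 if k == outcome else 0.0)) ** 2
               for k, p in enumerate(probs))

AWAY_WIN = 2
near_miss = (0.1, 0.6, 0.3)  # most of the mass on the draw
far_miss = (0.6, 0.1, 0.3)   # most of the mass on the home win

brier_near, brier_far = brier(near_miss, AWAY_WIN), brier(far_miss, AWAY_WIN)
rps_near, rps_far = rps(near_miss, AWAY_WIN), rps(far_miss, AWAY_WIN)
```

Both forecasts get a Brier score of 0.86, while RPS gives 0.25 to the draw-leaning forecast and 0.425 to the home-leaning one.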

Walk-forward folds

Train ≤ cutoff · score next 180 days

Cutoff	Train n	Test n	RPS	Sharpness	Accuracy	Fit (s)
2018-01-01	39815	448	0.1718	0.5123	62.1%	0.44
2019-01-01	40681	842	0.1547	0.4590	64.7%	0.47
2020-01-01	41758	243	0.1902	0.5809	51.0%	0.45
2021-01-01	42104	900	0.1454	0.4543	65.0%	0.43
2022-01-01	43220	610	0.1860	0.5447	57.7%	0.44
2023-01-01	44174	752	0.1679	0.4903	61.6%	0.46
2024-01-01	45177	984	0.1649	0.5099	58.9%	0.43
2025-01-01	46401	756	0.1528	0.4615	63.0%	0.49
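
The headline cards can be re-derived from the fold table above. The sketch below copies the Test n, RPS, Sharpness and Accuracy columns, takes test-size-weighted means, and uses the sample standard deviation (n − 1) for the fold spreads. Variable names are illustrative, and the choice of sample rather than population standard deviation is an inference from the published ± figures.

```python
from statistics import stdev

# (test_n, rps, sharpness, accuracy_pct) per fold, copied from the table.
fold_rows = [
    (448, 0.1718, 0.5123, 62.1),
    (842, 0.1547, 0.4590, 64.7),
    (243, 0.1902, 0.5809, 51.0),
    (900, 0.1454, 0.4543, 65.0),
    (610, 0.1860, 0.5447, 57.7),
    (752, 0.1679, 0.4903, 61.6),
    (984, 0.1649, 0.5099, 58.9),
    (756, 0.1528, 0.4615, 63.0),
]

total_test_n = sum(row[0] for row in fold_rows)

def size_weighted_mean(col):
    """Mean of one column, weighting each fold by its number of test matches."""
    return sum(row[0] * row[col] for row in fold_rows) / total_test_n

mean_rps = size_weighted_mean(1)
mean_sharpness = size_weighted_mean(2)
mean_accuracy = size_weighted_mean(3)

# Fold-to-fold spread, unweighted.
rps_spread = stdev([row[1] for row in fold_rows])
accuracy_spread = stdev([row[3] for row in fold_rows])
```

Rounded to the precision shown on the cards, this gives 5535 test matches, mean RPS 0.1629, Sharpness 0.491, accuracy 61.5%, and spreads of 0.0158 and 4.6pp.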

Calibration · predicted vs observed

Pooled across folds

Dots near the diagonal mean "when we say 30%, it happens 30% of the time". Above the line = under-confident (the outcome happens more often than predicted); below = over-confident (it happens less often than predicted). Symmetry across the three outcomes tells you the model isn't dragging probability between draws and decisive results.
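
A minimal sketch of how the dots on such a plot can be computed, assuming each match contributes three (probability, hit) pairs, one per outcome, and that pairs are grouped into ten equal-width bins. The actual pooling and binning used for the chart are not described here, and the helper names are hypothetical.

```python
def pool_pairs(forecasts, outcomes):
    """Flatten W/D/L forecasts into (probability, hit) pairs, three per match."""
    pairs = []
    for probs, outcome in zip(forecasts, outcomes):
        for k, p in enumerate(probs):
            pairs.append((p, 1 if k == outcome else 0))
    return pairs

def reliability_points(pairs, n_bins=10):
    """Per non-empty bin: (mean predicted, observed frequency, count, verdict)."""
    bins = [[] for _ in range(n_bins)]
    for p, hit in pairs:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, hit))
    points = []
    for members in bins:
        if not members:
            continue
        predicted = sum(p for p, _ in members) / len(members)
        observed = sum(hit for _, hit in members) / len(members)
        if observed > predicted:
            verdict = "under-confident"  # dot sits above the diagonal
        elif observed < predicted:
            verdict = "over-confident"   # dot sits below the diagonal
        else:
            verdict = "on the diagonal"
        points.append((predicted, observed, len(members), verdict))
    return points

# Toy data: ten 30% calls, of which five came true.
toy_pairs = [(0.3, 1)] * 5 + [(0.3, 0)] * 5
toy_points = reliability_points(toy_pairs)
```

The toy bin predicts 30% and observes 50%, so its dot would sit above the diagonal and be read as under-confident.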

What's tunable

Half-life

6 yr · how quickly old matches fade. Lower = more responsive, more variance. The 24-cell sweep showed NLL is flat across [4, 8] years.

L2 ridge

0.3 · pulls the strength estimates of under-observed teams toward zero. Critical for tiny FAs like Bhutan.
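
Why a ridge penalty matters most for rarely seen teams can be shown with a toy one-parameter case. The sketch assumes a squared-error loss with an L2 penalty on a single rating, which is a deliberate simplification and not the model's actual likelihood; only the 0.3 value comes from this page.

```python
def ridge_mean(observations, lam=0.3):
    """Minimiser of sum((y - r)**2 for y in observations) + lam * r**2.

    Setting the derivative to zero gives r = sum(y) / (n + lam), so the
    penalty acts like `lam` extra observations of exactly zero.
    """
    return sum(observations) / (len(observations) + lam)

# Two hypothetical teams whose raw evidence points to the same rating of 1.0.
few_matches = ridge_mean([1.0] * 2)     # only two matches on record
many_matches = ridge_mean([1.0] * 200)  # two hundred matches on record
```

The two-match team is pulled down to about 0.87, while the two-hundred-match team stays at about 0.999: the same penalty shrinks thin evidence far more than thick evidence.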

Competition weights

Friendly 0.50 · WC main 1.60, matching Nate Silver's PELE midpoints.
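
One plausible way the half-life and the competition weights combine into a per-match training weight is sketched below. The 6-year half-life and the two competition weights are the values quoted above; multiplying them together, the dictionary keys, and the function name are assumptions, and weights for other competitions are not listed on this page, so they are left out.

```python
HALF_LIFE_YEARS = 6.0
COMPETITION_WEIGHT = {"friendly": 0.50, "wc_main": 1.60}  # only the two quoted values

def match_weight(competition, age_years, half_life=HALF_LIFE_YEARS):
    """Training weight for one match: competition weight times time decay.

    The decay halves every `half_life` years, so a match exactly one
    half-life old counts half as much as the same match played today.
    """
    decay = 0.5 ** (age_years / half_life)
    return COMPETITION_WEIGHT[competition] * decay

todays_world_cup_match = match_weight("wc_main", 0.0)
six_year_old_friendly = match_weight("friendly", 6.0)
```

Under these assumptions a World Cup match played today carries weight 1.6, and a friendly from six years ago carries 0.25.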

Not modelled

Travel distance HFA · altitude · negative-binomial draw correlation.

In-tournament calibration

Live calibration starts 2026-06-11

Once matches start, every fixture's pre-match probability is locked in and scored against the actual W/D/L outcome. A running reliability plot (this tournament only) appears here, separately from the multi-year backtest above, so a reader can see whether the model holds up on this draw.

The companion model-vs-market page scores the same fixtures against the closing market line, so both "are we calibrated?" and "do we beat the market?" are answered on the same fixture set. Post-tournament, this page becomes the survivor artifact for the lab.