Agent Lab — a live forecast exhibit · Applied AI
Method · Working note

How this forecast is built

A working note on the statistical model, the comparisons we publish, the context we monitor, and the boundary between them.

The model

Statistical baseline

The forecast starts with a Poisson regression on every international match played since the early 1990s — 47,000+ matches as of 2026-06-13, refit as new results come in. Each team has two latent skills: an attack rating (its tendency to score) and a defense rating (its tendency to concede). Matches are weighted by competition importance — World Cup finals carry far more signal than friendlies, on a multiplier scale adopted from Nate Silver's PELE methodology — and by recency, with a half-life of six years.
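The weighting described above can be sketched as an importance multiplier times an exponential recency decay. This is a minimal illustration only: the six-year half-life comes from the text, while the function name `match_weight` and the multiplier values in `IMPORTANCE` are hypothetical placeholders, not the published scale.

```python
from datetime import date

HALF_LIFE_YEARS = 6.0  # recency half-life stated in the text

# Hypothetical multipliers for illustration; not the scale actually used.
IMPORTANCE = {"world_cup_finals": 4.0, "qualifier": 2.0, "friendly": 1.0}


def match_weight(played: date, as_of: date, competition: str) -> float:
    """Weight of one historical match: importance times recency decay."""
    age_years = (as_of - played).days / 365.25
    recency = 0.5 ** (age_years / HALF_LIFE_YEARS)  # halves every six years
    return IMPORTANCE[competition] * recency
```

Under these assumptions a friendly played six years before the fit date counts about half as much as one played on the fit date.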

For each future fixture the model produces a pair of expected goal rates (λ_home, λ_away). These rates fully determine the win / draw / loss probabilities you see throughout the site.
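To make the step from goal rates to outcome probabilities concrete, here is a small sketch that assumes the two goal counts are independent Poisson draws and sums the joint probabilities over a truncated score grid. The function names and the truncation at 15 goals are choices made for this illustration, not details taken from the site's code.

```python
from math import exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals at expected rate lam."""
    return exp(-lam) * lam**k / factorial(k)


def wdl(lam_home: float, lam_away: float, max_goals: int = 15):
    """Return (home win, draw, away win) probabilities from two goal rates."""
    home = [poisson_pmf(k, lam_home) for k in range(max_goals + 1)]
    away = [poisson_pmf(k, lam_away) for k in range(max_goals + 1)]
    win = draw = loss = 0.0
    for h, p_h in enumerate(home):
        for a, p_a in enumerate(away):
            p = p_h * p_a  # independence assumption
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalise the tiny truncated tail
    return win / total, draw / total, loss / total
```

Calling `wdl(1.5, 1.5)` returns equal win and loss probabilities with a substantial draw share, which is the low-scoring behaviour the rest of this note relies on.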

The model structure (independent Poissons with team-level attack and defense parameters, fit by penalised maximum likelihood) is a standard form going back to Maher (1982) and Dixon and Coles (1997).

Simulation

Simulation and uncertainty

A single match has a probability distribution; a 104-match tournament has a much wider one. To estimate champion odds we run 20,000 Monte Carlo tournaments, drawing each match's goals from its Poisson distribution and advancing teams through the group stage (with FIFA Annex C tiebreakers), the round of 32, and the rest of the knockout bracket.
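The simulation loop can be illustrated on a toy bracket. The sketch below is deliberately simplified and rests on assumptions that are not the site's: it skips the group stage and tiebreakers, derives goal rates from a hypothetical `strength` ratio with a made-up `BASE_RATE`, and settles level knockout scores with a coin flip as a stand-in for extra time and penalties.

```python
import math
import random

BASE_RATE = 1.3  # hypothetical average goals per team per match


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw a Poisson variate with Knuth's method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def play_knockout(a: str, b: str, strength: dict, rng: random.Random) -> str:
    """Return the winner of one knockout match."""
    goals_a = sample_poisson(BASE_RATE * strength[a] / strength[b], rng)
    goals_b = sample_poisson(BASE_RATE * strength[b] / strength[a], rng)
    if goals_a == goals_b:  # coin flip stands in for extra time and penalties
        return a if rng.random() < 0.5 else b
    return a if goals_a > goals_b else b


def simulate_bracket(teams: list, strength: dict, rng: random.Random) -> str:
    """Play a single-elimination bracket; len(teams) must be a power of two."""
    alive = list(teams)
    while len(alive) > 1:
        alive = [play_knockout(alive[i], alive[i + 1], strength, rng)
                 for i in range(0, len(alive), 2)]
    return alive[0]


def champion_odds(teams: list, strength: dict, n_sims: int = 20000, seed: int = 0) -> dict:
    """Share of simulated tournaments won by each team."""
    rng = random.Random(seed)
    wins = {t: 0 for t in teams}
    for _ in range(n_sims):
        wins[simulate_bracket(teams, strength, rng)] += 1
    return {t: w / n_sims for t, w in wins.items()}
```

For example, `champion_odds(["A", "B", "C", "D"], {"A": 2.0, "B": 1.0, "C": 1.0, "D": 0.5})` returns a dictionary of champion shares that sum to one, with the strongest side on top but well short of certainty.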

Team strengths are themselves uncertain: they are fit on a finite sample of matches, and skill drifts over time. A match-resampling bootstrap (≈50 replicates) turns that uncertainty into a band on each team's champion odds, which is why every champion number on the site is shown with a range rather than as a bare point estimate.
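A match-resampling bootstrap can be sketched generically. In the sketch below `estimate` is a hypothetical callable standing in for "refit the model on the resampled matches and rerun the simulation", and the 5% and 95% cut-offs are an assumption, since the note does not state which percentiles define the published band.

```python
import random


def bootstrap_band(matches, estimate, n_reps=50, lower=0.05, upper=0.95, seed=0):
    """Percentile band for estimate(matches) under resampling with replacement."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_reps):
        # Resample whole matches, keeping the sample size fixed.
        resample = [rng.choice(matches) for _ in matches]
        values.append(estimate(resample))
    values.sort()
    lo_idx = round(lower * (n_reps - 1))  # nearest-rank percentile
    hi_idx = round(upper * (n_reps - 1))
    return values[lo_idx], values[hi_idx]
```

As a toy usage, `bootstrap_band([0, 1, 1, 2, 3, 0, 2, 1, 4, 1], lambda xs: sum(xs) / len(xs))` gives a band on mean goals per match; in a real pipeline the same loop would wrap the far more expensive refit and simulation.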

Comparison

Model vs market

The /vs-market page compares the model's probabilities head-to-head with prices implied by public markets. Markets aggregate information the model cannot see; the model is calibrated to historical results in ways markets are not. Neither is treated as ground truth.

Where they agree, a forecast has two independent sources behind it. Where they diverge, the gap marks a match worth a closer look — which is where the agent comes in.
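As an illustration of the comparison, the sketch below converts three-way decimal odds into fair probabilities by proportional normalisation and then takes the model-minus-market gap per outcome. The normalisation method is an assumption made for this example (the note does not say how prices are converted), and the function names are hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Strip the bookmaker margin by rescaling inverse odds to sum to one."""
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)  # exceeds 1 by the margin built into the prices
    return [r / total for r in raw]


def gaps(model_probs, market_probs):
    """Model minus market, per outcome (home win, draw, away win)."""
    return [m - k for m, k in zip(model_probs, market_probs)]
```

Prices that are already quoted as probabilities, as on a prediction market, would skip the inversion and only need the rescaling step.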

Calibration

Public calibration

A forecast that says "70%" should be right about 70% of the time. We test this with a walk-forward backtest over 2,117 competitive matches across eight folds, and publish the reliability diagram and scoring rules — ranked probability score (RPS) and log loss — on the /calibration page, with the underlying numbers, not only the chart.
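For readers who want to reproduce the scoring rules, here is a minimal sketch of both for a single match, using the standard definitions. It assumes outcomes are ordered as home win, draw, away win and indexed 0, 1, 2; the function names and the `eps` floor are choices made for this example.

```python
from math import log


def rps(probs, outcome):
    """Ranked probability score for ordered outcomes; 0 is a perfect forecast."""
    cum_p = cum_o = total = 0.0
    for i, p in enumerate(probs[:-1]):  # the last cumulative term is always 0
        cum_p += p
        cum_o += 1.0 if i == outcome else 0.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)


def log_loss(probs, outcome, eps=1e-15):
    """Negative log of the probability given to what actually happened."""
    return -log(max(probs[outcome], eps))
```

RPS respects the ordering of outcomes, so a confident home-win forecast is penalised less when the match ends in a draw than when the away side wins; log loss looks only at the probability assigned to the realised result.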

Why it's hard

What good accuracy looks like

Single-match soccer is genuinely hard to forecast, and it is worth being precise about why. It is a low-event game — two or three goals a match — so one deflection, red card, penalty call, or finishing fluke can swing the result; the law of large numbers never gets the innings or possessions it gets in baseball or basketball. Goals are well modelled as near-independent Poisson processes with team-specific rates (Heuer, Müller & Rubner), which is exactly why draws are common and why the three-way win / draw / loss problem is harder than a binary win-or-lose framing makes it look. And the outcome noise is large enough that elaborate, feature-heavy models routinely fail to beat simple team-strength baselines ("Luck is Hard to Beat", Aoki et al.) — which is the empirical case for the disciplined, penalised baseline used here rather than an over-fit kitchen sink.

Hard is not the same as random. Soccer is not the least predictable sport: in a cross-discipline study of more than 300,000 matches across nine sports (Coscia, 2024) it sits mid-pack, with an AUC ≈ 0.71, about level with basketball, above baseball, hockey and American football, and below volleyball and handball. But that study discards draws and scores only home win versus home loss, which understates soccer's real difficulty: draws are a first-class outcome here, and soccer produces more of them than any other sport in the comparison. The honest benchmark is the sharp betting market. Even sharp prices leave plenty of match-level uncertainty, they are hard to beat once converted to fair probabilities, and they carry known biases (favourite-longshot bias in 1X2 markets; Asian-handicap markets are often sharper; see Hegarty & Whelan and Angelini & De Angelis). So we anchor expectations rather than hide them: the sharp market is the practical benchmark, a naive prior is the floor, the model is judged by where it lands between them, and the agent is judged only by whether its adjustments improve on the model's score. A World Cup sits at the hard end even of this (a short tournament, national sides with few minutes together, knockouts, extra time, penalties), which is the honest reason not to promise high accuracy.
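One way to express "where it lands between them" as a number is a simple position score on any lower-is-better metric such as RPS. This is an illustrative framing rather than a published formula, and both function names are hypothetical.

```python
def position_between(floor_score, model_score, market_score):
    """0.0 = no better than the naive floor, 1.0 = level with the sharp market.

    All three inputs are scores where lower is better (for example mean RPS).
    """
    span = floor_score - market_score
    if span <= 0:
        raise ValueError("the market benchmark must beat the naive floor")
    return (floor_score - model_score) / span


def agent_improves(model_score, agent_score):
    """The agent is credited only if its score is strictly better than the model's."""
    return agent_score < model_score
```

With made-up scores of 0.24 for the floor, 0.21 for the model and 0.20 for the market, the model would sit three quarters of the way from the floor to the benchmark.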

Sources & transparency

Where the inputs come from

Every external input — the historical results dataset, FIFA ranking snapshots, club-strength inputs, market-price feeds — is recorded with a fetch date and an integrity hash. No hidden inputs. If a number on the site changes, the registry shows what changed and when.
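A registry record of this kind can be sketched in a few lines. The field names and the choice of SHA-256 are assumptions for illustration; the note says only that each input carries a fetch date and an integrity hash.

```python
import hashlib
from datetime import date


def registry_entry(name: str, payload: bytes, fetched: date) -> dict:
    """Record what was fetched, when, and a digest of its exact bytes."""
    return {
        "name": name,
        "fetched": fetched.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),  # assumed hash function
    }


def content_changed(old: dict, new: dict) -> bool:
    """True when two snapshots of the same input differ in content."""
    return old["sha256"] != new["sha256"]
```

Comparing two entries for the same input is then enough to tell whether the underlying data changed between fetches, without storing or diffing the full payloads.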

The agent's context (described under "The agent") is held to the same standard. For each national team we keep a curated source registry: federation channels, native-language outlets, and English fallbacks, each URL verified live. Only the outlets listed there can be cited; nothing is scraped from anywhere else.
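The citation rule amounts to an allow-list check. The sketch below assumes the registry is keyed by team and stores hostnames; the team name, the hostnames and the function name are placeholders invented for this example.

```python
from urllib.parse import urlparse

# Hypothetical registry: team -> hostnames of its approved outlets.
REGISTRY = {
    "Exampleland": {"federation.example.org", "news.example.com"},
}


def citable(team: str, url: str) -> bool:
    """A URL may be cited only if its host is in that team's registry."""
    host = (urlparse(url).hostname or "").lower()
    return host in REGISTRY.get(team, set())
```

A stricter variant could match full URL prefixes rather than hostnames; which level the real registry enforces is not stated here.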

Browse the full per-team registry at the source list, or jump straight to a team's outlets from its Sources link on the team page.

The agent

The statistical model cannot see things that decide matches: injuries, lineup choices, suspensions, recent club-level form, manager changes, tactical shifts. Markets usually can. The largest model-vs-market gaps are where that difference shows up.

The agent is an LLM-based pipeline that reads the live context the model can't see, then publishes its own adjusted win / draw / loss forecast beside the model — together with a sourced, structured note explaining the gap, in three parts:

  • What the model weights heavily — drawn from the model's actual inputs.
  • What the model cannot see — structured signals (injuries, lineup news, club form) collected from defined sources, with citations. For each fixture the agent decides what to look up — which late-fitness, lineup, or venue questions are worth a web search — and the call shows those self-chosen queries and why it ran them.
  • What the market may be weighting differently — Polymarket price context and a sourced reading of the move.
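The three-part note maps naturally onto a small record type. The sketch below is one possible shape, not the pipeline's actual schema: every class and field name is hypothetical, chosen to mirror the three bullets above.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """One thing the model cannot see, with where it came from."""
    claim: str
    source_url: str
    query: str = ""  # the self-chosen search that surfaced it, if any


@dataclass
class AgentNote:
    fixture: str
    model_wdl: tuple[float, float, float]  # served model numbers, never edited
    agent_wdl: tuple[float, float, float]  # the agent's own adjusted forecast
    model_weights: list[str] = field(default_factory=list)      # part 1
    unseen_signals: list[Signal] = field(default_factory=list)  # part 2
    market_context: str = ""                                    # part 3

    def is_sourced(self) -> bool:
        """Every unseen signal must carry a citation."""
        return all(s.source_url for s in self.unseen_signals)
```

Keeping the model and agent probabilities as separate fields reflects the separation described in this section: the note explains the gap between the two forecasts without overwriting either.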

No silent model changes. The served model numbers come only from the Poisson fit and simulation; the agent never edits them. The agent publishes its own forecast in clearly-labelled, separate surfaces — its per-match call and its own champion-odds column — shown beside the model, never substituted for it.

After the final

Post-tournament scoring

After the final, we publish per-match and tournament-level scoring of every probability we issued: RPS, log loss, and calibration against the realised outcomes.

The agent's notes get a separate, qualitative review — whether each note called out the factors that actually decided the match. This is not built yet; it ships after the tournament, once there are resolved matches to review against.

About this work

This site demonstrates one capability: structured, sourced explanation of disagreement between a quantitative baseline and an external signal.

Many forecasting settings have both — a statistical model (actuarial, demand, credit, underwriting) and a parallel external signal (peer benchmark, competitor pricing, consensus estimate, market price). When they diverge, the explanation is usually reconstructed by hand, case by case, with no durable record of what drove it. The pattern shown here — baseline plus a sourced note on the gap — is the same one that applies in those settings. The World Cup makes it concrete.