I Pitted XGBoost Against Logistic Regression on 358 Matches. The Boring Model Won.

Most of us share a reflex on a new modelling problem: reach for the model that wins. These days that’s gradient boosting, and the reflex is usually right: XGBoost earns its reputation on a staggering range of problems.

So when I lined up five classifiers on the same task and the one-line linear model beat the Kaggle champion, the result was the kind that surprises exactly nobody who has shipped models on real data, and almost everybody still learning.

Five classifiers, same task, same features: predict whether an international match ends in a home win, draw, or away win. The contenders ran from a humble logistic regression up through a random forest, KNN, a small neural network, and XGBoost.

The simplest one won. More interesting than the fact that it won is why, and the why is one of the most useful ideas in applied machine learning. Here’s the experiment, the result, and the concept that cracks it open.

The setup

This came out of building a set of 11 World Cup models, where I needed a result classifier and wanted to know which family to trust. Each model saw the same three features for 358 historical internationals (the 2010–2022 World Cups plus the 2020 and 2024 Euros): the strength gap between the teams, their combined strength, and a knockout flag. The target is the three-way result.

I scored them with 5-fold cross-validation, and the primary metric is log-loss, not accuracy. That choice does a lot of work in this article, so it’s worth being explicit about it up front. Accuracy only asks whether the top-ranked class was correct. Log-loss grades the full probability vector and punishes confident mistakes hard:

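from sklearn.model_selection import cross_val_predict
from sklearn.metrics import log_loss, accuracy_score

# Out-of-fold class probabilities from 5-fold CV (model, X, y are defined elsewhere)
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")
# argmax gives column indices, so this assumes y is integer-encoded as 0, 1, 2
print(log_loss(y, proba), accuracy_score(y, proba.argmax(1)))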

For a forecasting model whose entire job is to emit calibrated probabilities, log-loss is the honest scorecard and accuracy is a sanity check. The number to keep in your pocket is ln(3) ≈ 1.099, the log-loss you’d get by shrugging and predicting a uniform 1/3 across the three classes. Beat 1.099 and your model knows something. Score above it and you’d have been better off guessing.
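As a quick check on that baseline, here is a minimal sketch in plain Python. The labels are made up for illustration; the only thing that matters is the uniform 1/3 forecast, which scores ln(3) whatever the outcomes are.

```python
import math

def mean_log_loss(y_true, proba):
    """Average negative log-probability assigned to the class that occurred."""
    return -sum(math.log(p[y]) for y, p in zip(y_true, proba)) / len(y_true)

# Hypothetical labels: 0 = home win, 1 = draw, 2 = away win
y_true = [0, 2, 1, 0, 0, 2]
uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in y_true]

baseline = mean_log_loss(y_true, uniform)
print(round(baseline, 3), round(math.log(3), 3))  # 1.099 1.099
```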

The consequence

There are two things in the results below that should bother you.

The first is the podium: a plain logistic regression posted the best log-loss, and XGBoost, the model that wins Kaggle competitions, came last. The second is stranger and easy to skim past. XGBoost didn’t just lose; it scored above 1.099, the uniform-guessing baseline. A model with a respectable-looking 48% accuracy was, by the metric that actually matters here, worse than a coin with three sides.

Cross-validated log-loss by model. Image by author

Model	CV log-loss (lower is better)	CV accuracy
Logistic regression	1.001	54%
Random Forest	1.011	56%
KNN	1.013	53%
Neural community	1.115	52%
XGBoost	1.169	48%

Both of these facts have the same root cause, and it’s the most useful idea in this whole article.

Why the boring model won: bias and variance

The clean way to think about this is the bias–variance decomposition. A model’s expected out-of-sample error splits into three parts:

Error = Bias² + Variance + Irreducible noise

Bias is error from wrong assumptions: too rigid a model misses real structure in the data.
Variance is error from sensitivity to the particular training sample: too flexible a model fits noise that won’t recur next time.
Irreducible noise is the genuine randomness of the thing you’re predicting. In soccer it’s enormous: a single deflected shot decides a knockout tie. No model touches this term, which is why even the best classifier here sits near 50% accuracy.

The whole game is the trade-off between the first two. High-capacity models, such as boosted trees or neural nets, buy low bias by being flexible enough to bend to almost any shape in the data. The bill for that flexibility is variance, and it only comes due when you don’t have enough data to pin the model down.

And that is exactly our situation. With 358 examples split across a three-way target, you have roughly 120 matches per class on average. An XGBoost ensemble, meanwhile, has thousands of effective parameters spread across its trees. There simply isn’t enough signal to discipline all of them, so they latch onto quirks that happen to appear in one cross-validation fold and vanish in the next. That’s textbook overfitting, and it explains the first surprise: cross-validation is doing its job by catching the flexible models red-handed on data they haven’t seen.

So why did XGBoost score worse than random rather than just landing mid-table? This is where the choice of log-loss pays off. The penalty for a single example is −ln(p_true_class), and it’s brutally convex.

Predict the eventual outcome at a hedged 0.5 and you eat −ln(0.5) = 0.69. Predict it at a confident-but-wrong 0.1 and you eat −ln(0.1) = 2.30, more than three times the pain for being sure and wrong. An over-flexible model on small data doesn’t just make mistakes; it makes them with conviction, issuing sharp 60–70% probabilities and getting enough of them wrong that the convex penalty pushes its average loss above that of the timid 1/3-1/3-1/3 baseline.
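To make the convexity argument concrete, here is a small sketch under stated assumptions: a hypothetical forecaster that always puts 0.70 on its pick, splits the remainder evenly across the other two classes, and is right 48% of the time. These figures are illustrative, not XGBoost’s actual predictions.

```python
import math

def expected_log_loss(p_pick, hit_rate, n_classes=3):
    """Expected log-loss for a forecaster that always puts p_pick on its chosen
    class, spreads the rest evenly, and picks the right class hit_rate of the time."""
    p_other = (1 - p_pick) / (n_classes - 1)
    return -(hit_rate * math.log(p_pick) + (1 - hit_rate) * math.log(p_other))

uniform = math.log(3)                                  # timid 1/3-1/3-1/3 baseline
bold = expected_log_loss(p_pick=0.70, hit_rate=0.48)   # hypothetical bold forecaster

# Hit rate at which the 0.70 forecaster merely ties the uniform baseline
p_other = 0.15
break_even = (math.log(3) + math.log(p_other)) / (math.log(p_other) - math.log(0.70))

print(round(uniform, 3), round(bold, 3), round(break_even, 3))
```

Under these assumptions the bold forecaster needs to be right a little over half the time just to tie with shrugging, which is the trap a sharp but inaccurate model falls into.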

The proper name for this failure is confident miscalibration, and it’s the signature of too much model for too little data. Whatever XGBoost gained on the occasional bold call couldn’t pay back what its overconfidence cost everywhere else.

Why logistic regression specifically

Knowing that the flexible models would struggle is only half the story. The linear model didn’t just avoid the trap; it was, for this problem, the right tool. Two structural facts make that so:

The true relationship is close to linear in the log-odds. Most of what predicts a result is “how big is the strength gap,” and the probability of winning rises smoothly and monotonically with it, which is exactly the functional form logistic regression assumes. When a model’s inductive bias matches the data-generating process, you need far less data to estimate it well. The trees, by contrast, have to discover that smooth curve out of piecewise-constant splits, spending precious data to approximate something logistic regression gets for free.
Three features, weak interactions. Trees and nets earn their keep by hunting down interactions among many features. With only three features and little interaction between them, there’s nothing for that machinery to find, so it adds variance with no signal to show for it.

There’s a rule of thumb from classical statistics worth carrying around: you want on the order of 10–20 observations per parameter for stable estimates.

Logistic regression estimates a handful of coefficients against 358 matches, comfortably within that budget. A boosted ensemble is orders of magnitude over it. The mismatch was baked in before a single model trained.
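A back-of-the-envelope version of that budget is sketched below. The logistic regression count follows from three features and three classes. The boosted-ensemble settings (100 rounds, depth 6, one tree per class per round) are an assumption for illustration, since the settings used here are not stated, and the leaf count is a crude upper bound rather than a true effective parameter count.

```python
n_matches = 358
n_features = 3
n_classes = 3

# Multinomial logistic regression: one weight per feature plus an intercept, per class
logreg_params = n_classes * (n_features + 1)

# ASSUMPTION: hypothetical boosted ensemble with 100 rounds, one tree per class
# per round, and depth-6 trees (at most 2**6 leaf weights each)
n_rounds, max_depth = 100, 6
boosted_leaf_weights = n_rounds * n_classes * 2 ** max_depth

print(logreg_params, round(n_matches / logreg_params, 1))
print(boosted_leaf_weights, round(n_matches / boosted_leaf_weights, 3))
```

Even allowing that real trees use far fewer leaves than the upper bound, the ratio for the ensemble lands well under one observation per parameter, while logistic regression sits above the 10–20 range.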

How to read the scoreboard honestly

Before drawing conclusions from that table, two cautions about reading it, because the same small dataset that sank XGBoost also makes the numbers noisier than they look.

The first is the metric’s own variance. With 358 matches, each of the five folds holds out only ~72 games, so the CV score itself wobbles. The gaps among logistic regression, random forest, and KNN (1.001 vs. 1.011 vs. 1.013) are well within that wobble. They’re effectively tied.
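A rough sense of the size of that wobble can be had from a standard-error calculation. The per-match standard deviation below is an assumed, illustrative value, not something measured on this dataset, and a paired comparison between models could be somewhat tighter than this suggests.

```python
import math

n_matches = 358
fold_size = n_matches / 5            # about 72 held-out games per fold

# ASSUMPTION: per-match log-loss standard deviation of 0.45 (illustrative only)
per_match_sd = 0.45

se_fold = per_match_sd / math.sqrt(fold_size)    # noise in one fold's score
se_all = per_match_sd / math.sqrt(n_matches)     # noise in the pooled CV score
gap = 1.013 - 1.001                              # KNN vs. logistic regression, from the table

print(round(se_fold, 3), round(se_all, 3), round(gap, 3))
```

With that assumed spread, the 0.012 gap is smaller than the standard error of even the pooled score, which is the sense in which the top three are tied.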

What’s robust and repeatable is the two ends of the table: the simple linear model is reliably at the top, and the most flexible models reliably at the bottom. Read the podium, not the photo finish.

The second is the accuracy column, which you should resist over-reading entirely. Three-way soccer outcomes are intrinsically hard because the draw is a real third outcome with no strong predictor: historically about 27% of these matches ended in draws, and draws are nearly impossible to call in advance from team strength alone.

A model that knew each team’s true win probability still couldn’t push accuracy much past the high 50s, because the irreducible-noise term is so large. Seen that way, logistic regression’s 54% isn’t mediocre; it’s near the practical ceiling for this feature set. The real differentiator between models was never how often they top-picked the winner; it was calibration, which is precisely what log-loss measures and accuracy hides. So: lead with the proper scoring rule; keep accuracy as a gut check.

Could the trees be rescued? With discipline, yes.

None of this is an indictment of XGBoost. It’s a statement about configuration relative to data size, and the same algorithm, handled differently, could close most of the gap. The lever is regularization: giving up some variance in exchange for a little bias.

For XGBoost: shallower trees (max_depth=2–3), a stronger min_child_weight, subsample and colsample_bytree below 1, an L2 penalty (lambda), a low learning rate with early stopping on a validation fold, and fewer rounds.
For logistic regression: the default L2 penalty (controlled by C, the inverse regularization strength in scikit-learn) is already doing quiet regularization in the background, which is part of why it’s so stable straight out of the box.
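The XGBoost levers listed above might translate into a starting configuration like the sketch below. The specific values are hypothetical starting points to tune from, not the settings behind the table, and the keys follow the xgboost scikit-learn wrapper, where the L2 penalty lambda is spelled reg_lambda.

```python
# Hypothetical starting point for a heavily regularized XGBClassifier.
# Values are illustrative and should be tuned on held-out log-loss.
regularized_params = {
    "max_depth": 2,               # shallower trees
    "min_child_weight": 5,        # require more evidence per leaf
    "subsample": 0.8,             # row subsampling below 1
    "colsample_bytree": 0.8,      # column subsampling below 1
    "reg_lambda": 5.0,            # L2 penalty on leaf weights
    "learning_rate": 0.05,        # low learning rate
    "n_estimators": 200,          # upper cap; early stopping picks the real count
    "early_stopping_rounds": 20,  # stop when validation loss stalls
}

# Usage, assuming xgboost is installed and X_train, y_train, X_val, y_val exist:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**regularized_params)
#   model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(regularized_params)
```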

Tuned hard enough, a regularized gradient-boosting model would likely match logistic regression here. But notice that “match the one-liner after careful tuning” is itself the lesson, not a counterexample to it.

(The caveat in the other direction: very large, over-parameterized models can re-enter a “double descent” regime where error falls again past the interpolation threshold, but that lives at data and parameter scales far beyond 358 matches.)

So how would you know, empirically, when the trees are finally worth it? Plot a learning curve: held-out log-loss against training-set size, for each model.

Two patterns are diagnostic. A high-bias model like logistic regression plateaus early: more data barely helps, because the bias floor dominates. A high-variance model like XGBoost starts worse but keeps improving as data grows, because extra examples are exactly what tame its variance. The point where the two curves cross is the data budget at which the flexible model starts to win.
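One way to produce such curves is scikit-learn’s learning_curve helper, sketched below under two assumptions: the data is a synthetic stand-in (three features, three classes) rather than the match dataset, and scikit-learn’s GradientBoostingClassifier stands in for XGBoost so the example has no extra dependency. Which model wins on the synthetic data says nothing about the real one; the point is the mechanics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data: three features, three classes (NOT the match data)
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

curves = {}
for name, model in models.items():
    sizes, _, test_scores = learning_curve(
        model, X, y, cv=5, scoring="neg_log_loss",
        train_sizes=[0.2, 0.4, 0.7, 1.0], shuffle=True, random_state=0,
    )
    # The scorer returns negated log-loss, so flip the sign and average over folds
    curves[name] = -test_scores.mean(axis=1)
    print(name, dict(zip(sizes.tolist(), np.round(curves[name], 3).tolist())))
```

Plotting each entry of curves against sizes gives the picture described above; a crossover, if there is one, shows up where the flexible model’s held-out loss drops below the linear model’s.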

On 358 international matches we’re sitting clearly to the left of that crossover. Feed the same XGBoost tens of thousands of club matches with richer features (xG, rest days, lineups) and it would very likely overtake. Same algorithm, different data regime, opposite conclusion. That contingency is the point.

The bottom line: choose the model with your data

Model complexity should match the data, not the hype. On large, messy, feature-rich problems, gradient boosting and deep nets routinely dominate; that’s why they’re famous, and why the reflex to reach for them is usually a good one.

But on a small, clean, low-dimensional problem like this, the reflex is wrong, and the discipline is to start simple, establish a strong baseline, measure with a proper scoring rule, and add complexity only when held-out data says it earned its place. Logistic regression isn’t the consolation prize here. Given the data, it’s the right answer.

This discipline (start simple, validate honestly with log-loss and calibration, scale complexity deliberately) runs through the modeling chapters of Soccer Analytics with Machine Learning (O’Reilly, 2026, fresh off the press!): logistic regression and classification in Chapter 5, the tree-based methods (XGBoost included) and exactly when their extra firepower pays off in Chapter 6.

So before you reach for the biggest model on your next project, ask two questions: how much data do you actually have, and how will you know if the complexity helped? Sometimes the line of best fit is also the finish line.