Predicting Sports Outcomes with a Simple Probability Model

Published in

Data Science Rabbit Hole

5 min read · Jun 9, 2024

Why You Need to Know the Odds, Not the Probability

Once upon a time, I was both a sports fan and a data scientist — a combination that naturally led to a lot of analysis (and possibly a little gambling). But as life’s priorities shifted, I drifted away from sports. That is, until my teenage son developed an obsession with the NBA, and I found myself scrambling to catch up.

Suddenly, I was confronted with a barrage of the unfamiliar. The Clippers were good now? Oklahoma City had a team? And what exactly was a “zone defense” doing in the NBA? I felt like Rip Van Winkle waking up in a world where basketball had become a foreign language. But one thing hasn’t changed: We all want to answer “Who’s going to win?”

As a data scientist, I knew that the answer to predicting game outcomes lay in the numbers.

The solution is here somewhere… I know it!

My first instinct was to use each team’s winning percentage. If the Celtics had won 80% of their games and the Heat had won 40%, then surely the Celtics’ chance of winning when the two teams faced off could be calculated by dividing the Celtics’ “strength” by the combined strength of both teams: 80 / (80 + 40) = 67%. It made sense, but something didn’t feel quite right.

If the Celtics had an 80% win percentage against teams that, on average, won 50% of their games (since for every win in the league there must be a loss), then shouldn’t their chances against a team with a 40% record be higher than 80%, not lower? This was as confusing as trying to understand why my daughter has so many TikTok followers.

I considered averaging one team’s winning record with the other team’s losing record. If the Celtics won 80% of their games and the Heat’s opponents won 60% of theirs, then the Celtics’ chance of winning would be the average of 80% and 60%, or 70%. Better, but still not quite there. Like my attempts to understand what an “influencer” does all day.
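Both of these first attempts are easy to check numerically. Here is a minimal Python sketch (the function names are my own, purely for illustration) that computes each one and then runs the sanity check from above: against a perfectly average 50% team, a sensible formula should hand the Celtics back their own 80%.

```python
def strength_ratio(p_a: float, p_b: float) -> float:
    """First attempt: team A's win rate divided by the combined win rates."""
    return p_a / (p_a + p_b)


def average_with_opponent_losses(p_a: float, p_b: float) -> float:
    """Second attempt: average A's win rate with B's losing rate."""
    return (p_a + (1 - p_b)) / 2


celtics, heat = 0.80, 0.40

print(f"Strength ratio: {strength_ratio(celtics, heat):.0%}")                # 67%
print(f"Averaging:      {average_with_opponent_losses(celtics, heat):.0%}")  # 70%

# Sanity check: against a 50% team, the answer should be the Celtics' own 80%.
print(f"Strength ratio vs. a 50% team: {strength_ratio(celtics, 0.50):.0%}")                # 62%
print(f"Averaging vs. a 50% team:      {average_with_opponent_losses(celtics, 0.50):.0%}")  # 65%
```

Neither attempt returns 80% against an average opponent, which is a compact way of saying why both felt off.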

Now I was invested. My data scientist brain clicked fully into gear. What do I usually do when my probabilities aren’t working out? Maybe work in some kind of Bayesian estimate? Maybe use a Beta probability distribution? I’m trying to keep this simple, dammit! I have a teenager to impress, not a dissertation to defend.

I can impress a teenager by doing math, right?

Then it hit me: odds. In logistic regression, for example, odds are a more practical stand-in for probability. (Log-odds would be too much to do in my head!) If I used odds instead of win percentages in my original formula, I might be onto something. For the Celtics, an 80% win rate equates to 4-to-1 odds in favor. The Heat’s 40% win rate is 3-to-2 odds against, or 2-to-3 in favor. Plugging these into the formula, I got 4 / (4 + 2/3), or 86% (when rounded). Now we were cooking with gas, or at least with a high-efficiency, data-driven stove.
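For anyone who would rather not do the odds conversion in their head, here is a small Python sketch of the same calculation. The helper names are hypothetical, and it assumes both win rates are strictly between 0 and 1 so that the odds are finite.

```python
def to_odds(p: float) -> float:
    """Convert a win probability into odds in favor (0.80 becomes 4.0, i.e. 4-to-1)."""
    return p / (1 - p)


def win_probability_from_odds(p_a: float, p_b: float) -> float:
    """Team A's odds divided by the combined odds of both teams."""
    odds_a = to_odds(p_a)
    odds_b = to_odds(p_b)
    return odds_a / (odds_a + odds_b)


celtics, heat = 0.80, 0.40

print(f"Celtics odds in favor: {to_odds(celtics):.2f}")  # 4.00
print(f"Heat odds in favor:    {to_odds(heat):.2f}")     # 0.67
print(f"Celtics beat Heat:     {win_probability_from_odds(celtics, heat):.0%}")  # 86%

# Against a 50% team (odds of 1.0), the Celtics get their own 80% back.
print(f"Celtics beat a 50% team: {win_probability_from_odds(celtics, 0.50):.0%}")  # 80%
```

The last line is the check that the win-percentage version of the formula fails: a team's estimate against an average opponent should simply be its own record.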

The formula, (Pc / (1 - Pc)) / ((Pc / (1 - Pc)) + (Ph / (1 - Ph))), where Pc and Ph are the Celtics’ and Heat’s win percentages, was a bit unwieldy, but with some simplification, it became the much more elegant (Pc * (1 - Ph)) / ((Pc * (1 - Ph)) + (Ph * (1 - Pc))). Ah, simplicity! The holy grail of overthinking.
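For readers who want to see the simplification step, here is a sketch in LaTeX (it assumes the amsmath package for the aligned environment). The only trick is multiplying the numerator and denominator by (1 - Pc)(1 - Ph); the last line plugs in the Celtics at 0.80 and the Heat at 0.40.

```latex
\[
\begin{aligned}
P(\text{Celtics win})
  &= \frac{\dfrac{P_c}{1 - P_c}}
          {\dfrac{P_c}{1 - P_c} + \dfrac{P_h}{1 - P_h}}
     \cdot \frac{(1 - P_c)(1 - P_h)}{(1 - P_c)(1 - P_h)} \\[6pt]
  &= \frac{P_c \, (1 - P_h)}
          {P_c \, (1 - P_h) + P_h \, (1 - P_c)} \\[6pt]
  &= \frac{0.80 \times 0.60}{0.80 \times 0.60 + 0.40 \times 0.20}
   = \frac{0.48}{0.56} \approx 0.857
\end{aligned}
\]
```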

Oh, that is nice aesthetically! It combines my two initial ideas. Look at the two key elements: (Pc * (1 - Ph)) and (Ph * (1 - Pc)). Each one is the product of the two ways of looking at a team’s winning chances: its own win percentage and its opponent’s losing percentage. And the overall formula is still the strength of one team divided by the sum of the strengths of the two teams.

Curious, I thought a bit more deeply. Why is this working? As I was taught in school, make a graph if you really want to understand. Here it is.

This chart is the key to understanding the whole thing!

The x-axis represents the Celtics’ chances of winning or losing, while the y-axis represents the Heat’s. Each team’s performance can be thought of as a roll of the dice, with the “Celtics win” area representing the region where the Celtics’ roll calls for a win and the Heat’s roll calls for a loss. The “Heat win” area represents the opposite scenario. It’s like playing basketball Dungeons & Dragons, but without the dragons and with slightly less chance of a magic critical hit.

The other two areas are where the two rolls contradict each other: either both teams roll a win or both roll a loss, which can’t happen in a real game. In my imaginary NBA game, the rule then calls for a re-roll. So the chances of the Celtics winning can be found by calculating the percent of the two highlighted areas that belongs to the “Celtics win” zone.
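The dice game is simple enough to simulate. The Python sketch below is my own illustration of the re-roll rule, not a standard implementation: each team rolls independently, impossible outcomes are re-rolled, and the Celtics' share of the surviving games is compared with the closed-form answer. The function names, the fixed seed, and the number of simulated games are all arbitrary choices.

```python
import random


def simulate_reroll_game(p_a: float, p_b: float,
                         n_games: int = 100_000, seed: int = 0) -> float:
    """Estimate team A's win chance by playing the re-roll dice game.

    Assumes at least one consistent outcome is possible; otherwise the
    re-roll loop would never end (e.g. both teams at exactly 100%).
    """
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_games):
        while True:
            a_rolls_win = rng.random() < p_a
            b_rolls_win = rng.random() < p_b
            # Keep the roll only when exactly one team rolled a win.
            if a_rolls_win != b_rolls_win:
                break
        wins_a += a_rolls_win
    return wins_a / n_games


def area_ratio(p_a: float, p_b: float) -> float:
    """Share of the two valid areas that belongs to the 'A wins' zone."""
    a_wins_area = p_a * (1 - p_b)   # A rolls a win, B rolls a loss
    b_wins_area = p_b * (1 - p_a)   # B rolls a win, A rolls a loss
    return a_wins_area / (a_wins_area + b_wins_area)


celtics, heat = 0.80, 0.40
print(f"Simulated: {simulate_reroll_game(celtics, heat):.3f}")
print(f"Area ratio: {area_ratio(celtics, heat):.3f}")  # 0.857
```

With the Celtics at 80% and the Heat at 40%, the two valid areas are 0.48 and 0.08, so the simulated figure should land close to 0.48 / 0.56.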

At this point I finally went to the trouble of looking online, having realized that I was probably not the first person ever to think about this. For the academically minded among you, I’d refer you to the Bradley-Terry model. And of course baseball’s Bill James worked on this a million years ago (his version is known as log5). I found an excellent article, built on baseball statistics, at https://sabr.org/journal/article/probabilities-of-victory-in-head-to-head-team-matchups/ that I recommend.

For a quick estimate of winning chances in any sport with a balanced schedule, this model works remarkably well. Be careful if not many games have been played, or if one team or the other has a record close to 100% or 0%, but for the most part we’ve got this down!
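To see why the extremes deserve caution, here is a short Python sketch (the function name and the error handling are my own assumptions) that feeds the formula a perfect record and a few near-perfect ones.

```python
def matchup_probability(p_a: float, p_b: float) -> float:
    """Chance that team A beats team B, given each team's win percentage."""
    a_strength = p_a * (1 - p_b)
    b_strength = p_b * (1 - p_a)
    if a_strength + b_strength == 0:
        # Happens when both records are 100% or both are 0%.
        raise ValueError("undefined: both teams have the same perfect or winless record")
    return a_strength / (a_strength + b_strength)


# A 3-0 start is a 100% record, and the formula takes it literally.
print(matchup_probability(3 / 3, 0.40))  # 1.0

# Near the extremes, a few points of win percentage move the estimate a lot.
print(round(matchup_probability(0.95, 0.60), 3))  # 0.927
print(round(matchup_probability(0.99, 0.60), 3))  # 0.985
```

A team that is 3-0 is not literally unbeatable, which is the practical reason to wait for a decent number of games before trusting the output.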

So the next time you find yourself debating who’s going to win the big game, the answer might be simpler than you think. With a little bit of data and a dash of calculation, you too can become the oracle of the sports world. As for me, I’m off to impress my son with my newfound NBA knowledge — and maybe place a bet or two along the way.

AI-generated mash-up of me and Doc Strange

Michael Bagalman is a professional data scientist who had a hard time understanding the idea of odds. Now that he has finally done it, he wants to use odds every chance he gets. He’ll be writing more about the NBA Finals in the coming days.

Written by Michael Bagalman