So You’re Telling Me There’s a Chance?

Allen Jarvis
Analyzing NCAA Basketball with GCP
5 min read · Mar 19, 2018

Your men’s college basketball team is up 5 with 2:30 left in the second half. How nervous should you be?

As humans we tend to look at signals to “objectively” modulate our feelings about an outcome. In this case, you might be feeling nervous about:

  • The frequency with which your team loses close games
  • Whether your star defender will foul out
  • That your top free throw shooter has gone 1–9 from the line
  • That you have to leave to pick your kids up from school and traffic is already bad

Any of these signals might help you make a decision. But signals like these can still create bias because they may not be relevant to the actual probability of the outcome. Moreover, these biases, especially when they masquerade as pertinent data, can produce poor results when estimating probability. (Check out this excellent post on cognitive bias busting if you’re curious.)

You might think that scouring the internet for box scores or shooting percentages will give you insight beyond a gut feeling into your team’s current situation. But those pieces of data — nice though they may be — aren’t the ones you need. What you need is scoring information relative to time and to the context of the game, which you’d be hard-pressed to find in a quick search. You need play-by-play data.

Fortunately, Google Cloud, the NCAA, and Sportradar have put together a BigQuery public dataset comprising more than 42,000,000 plays from every men’s and women’s college basketball game since 2009. With this data warehouse, you can answer lots of interesting questions. For example, in just a few days, we (three Google nerds — @easchmidt, @elissa.lerner, and I) found each team’s record in close and late games, built a machine learning model that calculates the likelihood of a game being close and late, and found every buzzer-beater in NCAA basketball this year. We did all of this with BigQuery. No virtual machines, no special indexing, no sharding. Just “run query.”
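If you want to kick the tires before writing anything fancy, a single aggregation is enough to see the scale of the thing. This sketch assumes the men’s play-by-play table is named `mbb_pbp_sr` and has a `season` column, which matches our reading of the public dataset’s schema — double-check in the BigQuery console before relying on it:

```sql
-- Count the available men's play-by-play events per season.
-- Table and column names reflect the public dataset's schema as we
-- understand it; verify in the BigQuery console if anything has moved.
SELECT
  season,
  COUNT(*) AS plays
FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
GROUP BY season
ORDER BY season;
```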

Calculating Win Percentage Based on Time and Score

So, back to the game. Given all this data and BigQuery horsepower, how might you keep anxiety at bay by objectively calculating win percentage based solely on time and score?

First, we calculated the score margin at every second in the last five minutes of each of the approximately 48,000 D1 vs. D1 games available. This wasn’t pretty — it required 301 SQL queries to capture the score margin in every game at each of the 301 seconds (5 minutes × 60 seconds, plus the 0:00 mark) in the final five minutes. Then we unioned those queries to get a complete picture of the final 301 seconds of every game in our investigation. In case you’re wondering what anxiety analysis looks like, for our purposes it looks like more than 14 million rows of data.
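Here’s a sketch of one of those 301 queries, hardcoded for the 2:30 mark. It assumes the play-by-play table carries an `elapsed_time_sec` column (seconds since tip-off) and a `points_scored` value per play, and that regulation runs 2,400 seconds — adjust the names to match the schema you actually see:

```sql
-- Score margin in every game at exactly 2:30 (150 seconds) remaining.
-- Column names are assumptions based on our reading of the schema.
WITH scores AS (
  SELECT
    game_id,
    team_id,
    SUM(points_scored) AS points  -- NULLs (non-scoring plays) are ignored
  FROM `bigquery-public-data.ncaa_basketball.mbb_pbp_sr`
  WHERE elapsed_time_sec <= 2400 - 150  -- everything through 2:30 remaining
  GROUP BY game_id, team_id
)
SELECT
  game_id,
  150 AS seconds_remaining,
  MAX(points) - MIN(points) AS margin  -- works because each game has two teams
FROM scores
GROUP BY game_id;
```

Repeat with UNION ALL for each of the 301 values of `seconds_remaining` and you have the full picture. (You could also cross join against `GENERATE_ARRAY(0, 300)` and do it in one query, but the union approach got the job done.)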

Next, we had to look at each of those 301 entries for each game and determine whether the team that was leading was still leading when the clock hit “00:00” (also known as “winning”).

Finally, we grouped all of these results by the score margin and the game clock, yielding a clear win probability per second.
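In sketch form, with the first step’s output written to a hypothetical intermediate table we’ll call `margins` — one row per game per second, with `game_id`, `seconds_remaining`, `margin` (the leader’s lead at that second), and `final_margin` (that same leader’s margin when regulation ends) — the last two steps collapse into a single GROUP BY:

```sql
-- Win probability (in regulation) by margin and seconds remaining.
-- `margins` is our hypothetical intermediate table, not part of the
-- public dataset.
SELECT
  seconds_remaining,
  margin,
  COUNT(*) AS games,
  ROUND(100 * COUNTIF(final_margin > 0) / COUNT(*), 2) AS win_pct
FROM margins
WHERE margin > 0  -- only look at moments where someone actually leads
GROUP BY seconds_remaining, margin
ORDER BY seconds_remaining, margin;
```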

It turns out that, on average, teams that are up by 5 with 2:30 left win in regulation 86.77% of the time.

If you’re more data scientist than basketball fan, you could use this to optimize when to leave the game and head to the parking lot (plus or minus some additional datasets on traffic patterns, naturally). But seeing as it’s hardly a done deal, we’d recommend staying put. After all, what if the game goes to overtime?

With play-by-play data, we can look at that, too. To get a sense of how leads change in the final seconds of regulation, let’s narrow our scope to the last minute and the point margins with the greatest probability of yielding overtime (5 and under).
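With the same hypothetical `margins` table as above, that’s just a tighter filter and a different numerator — a `final_margin` of 0 means the game was tied at the end of regulation, which is to say, overtime:

```sql
-- Probability of overtime by margin in the final minute of regulation.
-- `margins` is the hypothetical intermediate table sketched earlier.
SELECT
  seconds_remaining,
  margin,
  ROUND(100 * COUNTIF(final_margin = 0) / COUNT(*), 2) AS overtime_pct
FROM margins
WHERE seconds_remaining <= 60
  AND margin <= 5
GROUP BY seconds_remaining, margin
ORDER BY seconds_remaining, margin;
```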

We can immediately see that a tied game (margin = 0) has the greatest likelihood of going to overtime. (Duh.) But another pattern emerges: a three-point deficit has a better chance of forcing overtime than a one-point deficit. This may be fairly intuitive (it’s rare to see a team leading by one point foul the other team and give them a chance to tie or take the lead), but seeing it illustrated in the data hammers it home.

In any case, if you’re still worried about the game going to overtime, we can run that query too. We see that when a team leads by 5 with 2:30 remaining, the game goes to overtime 8.20% of the time. It also means the leading team blows the game entirely 4.63% of the time.
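In sketch form, this is a one-row breakdown over the same hypothetical `margins` table: at the end of regulation the team that led by 5 is either still ahead, tied, or behind, which is one plausible way to bucket the win, overtime, and blown-lead cases:

```sql
-- What happens to a 5-point lead at the 2:30 mark, as of regulation's end.
SELECT
  ROUND(100 * COUNTIF(final_margin > 0) / COUNT(*), 2) AS won_regulation_pct,
  ROUND(100 * COUNTIF(final_margin = 0) / COUNT(*), 2) AS overtime_pct,
  ROUND(100 * COUNTIF(final_margin < 0) / COUNT(*), 2) AS blew_lead_pct
FROM margins
WHERE seconds_remaining = 150
  AND margin = 5;
```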

Granted, this doesn’t tell us anything about what those overtime games look like. For the truly anxious fan worried about that 8.20% of overtime situations, we’ll dig into momentum in another post.

Sidenote: Lest you think this information is only useful for the anxiety-prone or traffic-averse fan, there’s plenty of untapped potential here for coaches, for starters. Sports fans endlessly debate when coaches should call for strategic fouls late in a game. Part of this decision is about allocating possessions, which requires separate analysis. But another part is about how the potential points will affect the overall likelihood of winning. For example, if you foul a 50% free throw shooter when the other team is in the double bonus, you can expect the margin to change by 1. (We know that 50% is a terrible free throw percentage, but the expected gain for a 75% shooter is 1.5, and score margins are whole numbers.) So once you know how much the margin change from fouling decreases your likelihood of winning, you only have to debate whether potentially gaining possession is worth it.
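If you want a feel for those numbers, the expected points from sending a shooter to the line for two free throws is just twice their free throw percentage, and a throwaway query makes the table:

```sql
-- Expected points from two free throws (double bonus), by FT percentage.
-- Pure arithmetic; no dataset required.
SELECT
  ft_pct,
  2 * ft_pct AS expected_points
FROM UNNEST([0.50, 0.60, 0.75, 0.90]) AS ft_pct;
```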

In Summary

Equipped with high-fidelity outcome data, you can start layering in more informative signals to build a probability estimator suited to your objectives. If you built a model that was team-, player-, and location-specific, you’d have an even better sense of whether to sit tight or run to the car! Less emotional bias + better features == winning. Have fun writing those queries.

Note: This post was inspired by one of the great data scientists of all time. Thank you Lloyd Christmas.
