Predicting March Madness with Logistic Regression

Fernando Murias
Mar 20, 2019 · 5 min read

Every year, after the tournament teams are announced on Selection Sunday, millions of people fill out brackets and try to predict the next Cinderella story. Everyone has their own strategy, whether it involves picking a favorite player or the best mascot, but below we’ll review how to use data to find value across the sports betting market. No one can predict what happens in March, but analyzing the data can help us identify teams that aren’t getting enough love. In this case, we’re going to use a logistic regression model built on a few key metrics to predict the winner of a hypothetical match-up between two teams.

The Model

A logistic regression model uses input variables to estimate the probability that a binary (yes / no) event will occur; here, the event is whether a given team wins a game. The model uses only the specified input variables and takes nothing else into consideration. This removes subjective bias from the prediction, so things like which team has the higher seed never influence it. The model is fit on every game played throughout the season and learns how important each variable is in predicting the winner. Below are the variables used by the model when making a prediction:

Four Factors

  • Offensive & Defensive Effective FG %
  • Offensive & Defensive Turnover %
  • Offensive & Defensive Rebounding %
  • Free Throw %

Strength of Schedule

Margin of Victory

The first four variables listed are known as the Four Factors and have been used for years to predict NBA games. The NBA has more parity than college basketball, so for college teams we also need to account for strength of schedule. Adding the KenPom rank as an input variable means that each game throughout the season is weighted by the level of competition. It is more impressive for a team like Duke, which plays in the ACC, to be statistically dominant than it is for a team like Stephen F. Austin. The final input variable is margin of victory, so a team gets more credit for a 30-point win than for a 3-point win.
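To make the mechanics concrete, here is a minimal sketch of how a logistic regression turns stat differences between two teams into a win probability. The coefficients, feature names, and team numbers are all made up for illustration; they are not the fitted values from my model, and the feature set is simplified (no strength-of-schedule term).

```python
import math

# Hypothetical coefficients, NOT the article's fitted values. Each feature is
# the difference (team A minus team B) in a season-long metric.
COEFFICIENTS = {
    "efg_pct": 8.0,
    "turnover_pct": -6.0,   # turning the ball over more hurts
    "rebound_pct": 4.0,
    "ft_pct": 1.5,
    "margin_of_victory": 0.05,
}
INTERCEPT = 0.0  # zero intercept keeps the two sides of a match-up symmetric


def win_probability(team_a, team_b):
    """Probability that team_a beats team_b under the toy model."""
    z = INTERCEPT
    for name, weight in COEFFICIENTS.items():
        z += weight * (team_a[name] - team_b[name])
    # The logistic (sigmoid) function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Made-up season averages for two illustrative teams.
team_a = {"efg_pct": 0.55, "turnover_pct": 0.16, "rebound_pct": 0.33,
          "ft_pct": 0.72, "margin_of_victory": 12.0}
team_b = {"efg_pct": 0.51, "turnover_pct": 0.19, "rebound_pct": 0.30,
          "ft_pct": 0.70, "margin_of_victory": 5.0}

p_a = win_probability(team_a, team_b)
p_b = win_probability(team_b, team_a)
print(f"P(A beats B) = {p_a:.3f}, P(B beats A) = {p_b:.3f}")
```

Because the features are differences and the intercept is zero, the two probabilities always sum to 1. In a real fit, the coefficients would be learned from the season's game results rather than typed in by hand.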

How to Use it

If you used the model to predict the outcome of every game this year, you would have correctly predicted the winner in 76% of all games. The model also has a log loss under 0.10. Log loss penalizes confident wrong predictions most heavily, so a low value indicates that the model’s high-confidence predictions tend to be correct.
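For readers who want to see how these two metrics are computed, here is a small sketch using plain Python. The predictions and outcomes are invented for illustration and are not taken from my model.

```python
import math


def accuracy(probs, outcomes, threshold=0.5):
    """Share of games where the predicted side (p >= threshold) matched the result."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return correct / len(outcomes)


def log_loss(probs, outcomes, eps=1e-15):
    """Average negative log-likelihood of the actual outcomes (lower is better)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(outcomes)


# Hypothetical predicted win probabilities and actual results (1 = win).
probs = [0.9, 0.8, 0.6, 0.3]
outcomes = [1, 1, 0, 0]

print(f"accuracy = {accuracy(probs, outcomes):.2f}")
print(f"log loss = {log_loss(probs, outcomes):.4f}")
```

Accuracy only checks which side of 50% a prediction lands on, while log loss also rewards being confident when right and punishes being confident when wrong.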

To find undervalued teams and good betting opportunities, we can compare the model’s predicted win probabilities to the implied win probabilities from actual Vegas lines. The difference between the two is our edge over the house: the bigger the difference, the bigger the betting opportunity. This edge determines how much to bet on each game.
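The comparison described above can be sketched in a few lines. The conversion from American moneyline odds to implied probability is standard; the 40% model probability is a made-up number. Note that implied probabilities taken straight from a posted line include the sportsbook’s margin, which this sketch does not strip out.

```python
def implied_probability(moneyline):
    """Convert American moneyline odds to the implied win probability."""
    if moneyline > 0:
        # Underdog: risk 100 to win `moneyline`.
        return 100.0 / (moneyline + 100.0)
    # Favorite: risk `-moneyline` to win 100.
    return -moneyline / (-moneyline + 100.0)


def edge(model_prob, moneyline):
    """Model win probability minus the probability implied by the line."""
    return model_prob - implied_probability(moneyline)


# Hypothetical example: the model gives an underdog a 40% chance at +256.
example_edge = edge(0.40, 256)
print(f"implied = {implied_probability(256):.3f}, edge = {example_edge:.3f}")
```

A positive edge means the model likes the team more than the market does. How the edge is translated into a bet size is a separate staking decision that this sketch leaves out.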

Before diving into the plays for Round 1, we’ll analyze the model’s performance in the 2018 tournament. All plays are expressed in units, where one unit is the amount you typically bet on a game.

All of the model plays are on a Moneyline basis, meaning we are picking the winner of the game (not the team that covers the spread). Casual sports bettors have a tendency to bet on favorites, the teams with a higher probability of winning. Vegas sportsbooks adjust their lines to capitalize on this tendency, and this dynamic often creates significant value in betting underdogs. When reviewing the plays below, it’s important to look at the odds, which dictate how much a bettor wins on a $100 bet (for example, +256 pays $256 in profit on a $100 stake). The model had a record of 9–9 last year, but it was still very profitable since most of the plays were on large underdogs. Over the entire tournament there were 18 official plays with a total of 25 units invested. The average odds on these 18 plays were +256, so even hitting 50% generated a profit of 16 units, an ROI of 64% on the 25 units invested.
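To see why a .500 record on underdogs can still be profitable, here is a rough sketch that settles a card of moneyline bets. The flat one-unit card at +256 is a hypothetical simplification; the actual 2018 plays had varying stakes and odds, which is why the numbers differ from the reported 16 units.

```python
def profit_per_unit(moneyline):
    """Profit on a winning 1-unit bet at American odds."""
    if moneyline > 0:
        return moneyline / 100.0
    return 100.0 / -moneyline


def settle(plays):
    """Return (net profit, ROI) for a list of (stake, moneyline, won) tuples."""
    staked = sum(stake for stake, _, _ in plays)
    profit = sum(
        stake * profit_per_unit(line) if won else -stake
        for stake, line, won in plays
    )
    return profit, profit / staked


# Hypothetical flat-bet card: 18 one-unit plays at +256, half of them winners.
flat_card = [(1.0, 256, i < 9) for i in range(18)]
flat_profit, flat_roi = settle(flat_card)
print(f"flat card: profit = {flat_profit:.2f} units, ROI = {flat_roi:.0%}")

# The article's reported totals: 16 units of profit on 25 units invested.
reported_roi = 16 / 25
print(f"reported ROI = {reported_roi:.0%}")
```

Each win at +256 returns 2.56 units of profit while each loss costs only the 1 unit staked, so going 9–9 on the flat card still nets roughly 14 units.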

2018 Model Official Plays — Full Tournament

The plays for Round 1 of the 2019 tournament can be found below.

2019 Model Official Plays — Round 1

The most underrated team in the tournament according to my model is the Houston Cougars. If you are looking for a futures bet with positive expected value, consider taking Houston to win the Midwest at +500. This is a 0.5 unit play to win 2.5 units.

Power Rankings

We can also use the model for more detailed analysis beyond head-to-head match-ups. The model makes a prediction for every possible match-up regardless of where teams are located on the bracket. Since there are 68 teams in the tournament, the model generates a predicted win probability for each team against all 67 possible opponents. For each team, you can sum the predicted win probabilities across all 67 match-ups and divide by 67 to get the average win %. The average win % tells you which teams the model predicts will have the most success throughout the tournament.
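The averaging step can be sketched as follows. The team names, ratings, and the toy win-probability function are placeholders for illustration; in practice the probabilities would come from the fitted model, and the field would have 68 teams rather than four.

```python
import math

# Hypothetical one-number ratings standing in for the real model inputs.
ratings = {"Team A": 2.0, "Team B": 1.0, "Team C": 0.0, "Team D": -1.0}


def toy_win_prob(team, opponent):
    """Placeholder for the model: logistic function of the rating gap."""
    return 1.0 / (1.0 + math.exp(-(ratings[team] - ratings[opponent])))


def average_win_pct(teams, win_prob):
    """Average predicted win probability of each team against every other team."""
    result = {}
    for team in teams:
        opponents = [o for o in teams if o != team]
        result[team] = sum(win_prob(team, o) for o in opponents) / len(opponents)
    return result


power = average_win_pct(list(ratings), toy_win_prob)
ranking = sorted(power, key=power.get, reverse=True)
for name in ranking:
    print(f"{name}: {power[name]:.3f}")
```

Sorting by average win % gives the power ranking. With a symmetric model like this toy one, the averages across all teams always come out to 50%.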
