Villanova will win the 2016 March Madness tournament, according to my machine learning model
Kaggle is holding a great contest that asks people to submit predictions for the 2016 NCAA March Madness tournament. Best part: They provide a slew of game data dating back to 1985 (thanks, Ken Massey!) that you can use to build your model.
My method for predicting the winner is simple:
Looping through every regular season game, starting in 1985, we…
- Add the feature row described below to our X array and the win or loss to our y array
- Update each team’s Elo score after the win or loss (every team starts at 1600)
- Average 13 stat categories (like field goals attempted, offensive rebounds, etc.) from the team’s previous five games; these averages become that team’s part of the feature row the next time it plays, as described in the first step
So a feature row looks like:
team 1 elo, stat 1, stat 2…stat 13, team 2 elo, stat 1, stat 2…stat 13
And the label is 1 for “team 1” win or 0 for “team 2” win.
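A minimal sketch of that loop might look like the following. It is illustrative rather than the code behind these results: the K-factor of 20, the `build_training_rows` name, the input tuple layout, and the choice to emit each game in both team orderings are all assumptions. Only the 1600 starting Elo and the five-game window come from the description above.

```python
from collections import defaultdict, deque

START_ELO = 1600  # starting rating, as described in the post
K = 20            # assumed K-factor; the post does not state one
WINDOW = 5        # average stats over the previous five games


def expected_score(elo_a, elo_b):
    """Standard Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400))


def build_training_rows(games):
    """games: iterable of (winner, loser, winner_stats, loser_stats), in date order.

    Hypothetical input layout: each *_stats is a list of that team's
    box-score numbers for the game.
    """
    elo = defaultdict(lambda: START_ELO)
    recent = defaultdict(lambda: deque(maxlen=WINDOW))
    X, y = [], []

    def averages(team):
        rows = recent[team]
        return [sum(col) / len(rows) for col in zip(*rows)]

    for winner, loser, w_stats, l_stats in games:
        # Only emit a row once both teams have earlier games to average.
        if recent[winner] and recent[loser]:
            w_row = [elo[winner]] + averages(winner)
            l_row = [elo[loser]] + averages(loser)
            # Assumption: emit both orderings so labels 1 and 0 are balanced.
            X.append(w_row + l_row)
            y.append(1)
            X.append(l_row + w_row)
            y.append(0)

        # Update Elo after the result, then record this game's stats.
        delta = K * (1 - expected_score(elo[winner], elo[loser]))
        elo[winner] += delta
        elo[loser] -= delta
        recent[winner].append(w_stats)
        recent[loser].append(l_stats)

    return X, y, dict(elo)
```

Features are built before the Elo and stat updates, so each row only uses information that was available before tip-off.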
Once we’ve looped through all of the games, we use our X and y arrays to fit a logistic regression model. (I’m using scikit-learn in Python.) All told, the model is trained on 68,306 regular season games and nets a cross-validated accuracy score of 0.726.
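For readers who haven’t used scikit-learn, fitting and cross-validating a logistic regression looks roughly like this. The data here is synthetic (random numbers shaped like 28-column feature rows), and the `StandardScaler` step and 5-fold split are assumptions, not details taken from the real model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 2 teams x (1 Elo + 13 stats) = 28 columns.
n_games, n_features = 500, 28
X = rng.normal(size=(n_games, n_features))
# Fake labels: team 1 tends to win when column 0 (its "Elo") beats column 14.
y = (X[:, 0] - X[:, 14] + rng.normal(size=n_games) > 0).astype(int)

# Assumption: scale features first, since Elo and box-score stats
# live on very different ranges.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy (fold count is an assumption).
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f}")

# Fit on everything, then ask for P(team 1 wins) for a single matchup row.
model.fit(X, y)
proba = model.predict_proba(X[:1])[0, 1]
```

`predict_proba` returns one column per class, so column 1 is the probability of the label 1, i.e. a “team 1” win.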
Once the model is trained, I use it to compute the probability of every tournament team beating every other team that made the tournament. Then it’s just a matter of using the probabilities to fill out a bracket.
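One simple way to turn pairwise probabilities into a bracket is to always advance the favorite, sketched below. The team names and ratings are made up, `elo_prob` is a stand-in for the trained model’s probability output, and the always-pick-the-favorite rule is an assumption about how to fill the bracket.

```python
from itertools import permutations


def pairwise_probabilities(teams, win_prob):
    """Return {(a, b): P(a beats b)} for every ordered pair of distinct teams."""
    return {(a, b): win_prob(a, b) for a, b in permutations(teams, 2)}


def pick_bracket(bracket, probs):
    """Advance the more likely winner of each adjacent pairing, round by round.

    Assumes len(bracket) is a power of two and teams are listed in bracket order.
    """
    rounds = []
    while len(bracket) > 1:
        bracket = [
            a if probs[(a, b)] >= 0.5 else b
            for a, b in zip(bracket[::2], bracket[1::2])
        ]
        rounds.append(bracket)
    return rounds


# Hypothetical ratings; a real run would call the trained model instead.
ratings = {"A": 1800, "B": 1500, "C": 1700, "D": 1650}


def elo_prob(a, b):
    return 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))


probs = pairwise_probabilities(list(ratings), elo_prob)
rounds = pick_bracket(["A", "B", "C", "D"], probs)
print(rounds)  # [['A', 'C'], ['A']]
```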
After doing so, I concluded that Villanova is going to beat Michigan State to win it all!


I’m not sure how accurate this is, really. If you plan to use this data for your own pool, please keep in mind that I’ve watched exactly one basketball game in my entire life, so I’m as close to being an expert as your local tadpole. But I’m posting my results here in an effort to keep myself honest. If it turns out to be a winner, I’ll do a full writeup and release the code.
My calculated probability of Nova winning each matchup is:
UNCA — 98%
Iowa — 88%
Arizona — 64%
Kansas — 59%
Oregon — 58%
Michigan State — 52%
I’m not a mathematician, but multiplying those six probabilities together suggests I’m giving Nova around a 10% chance of winning the tournament.
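The arithmetic behind that estimate is just the product of the six matchup probabilities. This assumes the games are independent and that Nova faces exactly these six opponents:

```python
from math import prod

# Nova's win probability in each projected matchup, from the list above.
nova_win_probs = {
    "UNCA": 0.98,
    "Iowa": 0.88,
    "Arizona": 0.64,
    "Kansas": 0.59,
    "Oregon": 0.58,
    "Michigan State": 0.52,
}

# Chance of winning all six games in a row, assuming independence.
title_prob = prod(nova_win_probs.values())
print(f"{title_prob:.3f}")  # 0.098
```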
Update, April 5: I’ve done a follow-up post, with more information and links to the code.
Note: If you’re curious, the screenshot is from USA TODAY Sports’ NCAA Bracket game. Disclosure: I work at USA TODAY Sports and was heavily involved in the creation of this game. Double disclosure: I do not have access to information about the tournament that is not public. All data used for my model was provided in the Kaggle competition dataset.