Predicting the 2022 NCAA Basketball Tournament Using Data Science

Hunter Kempf
9 min read · Mar 17, 2022


Photo by Dan Carlson on StockSnap

Every year, many Americans fill out a tournament bracket and compete with friends and family to guess which teams will win each game. People use all sorts of strategies to pick their brackets: following the picks of an expert or group of experts, watching many hours of basketball and picking based on the strengths they perceive in each team, or even wackier approaches like fully random selections or letting their pets pick for them. This year I was inspired to use regular-season game data and data science techniques to predict the tournament game outcomes.

Available Data Sources

While data for NCAA games is available in many places around the internet, I wanted to start with a basic dataset that would be relatively clean and easy to use, so I decided to use the Kaggle dataset released for this year's tournament:

March Machine Learning Mania 2022 — Men’s | Kaggle

This choice has a few nice features: it locks in my model before the tournament, automatically provides a scoring metric for my model as the games are played, and offers a variety of interesting datasets to create features from.

One downside of choosing Kaggle is that the submission is not in a bracket format: it is a matrix of all possible combinations of teams, which Kaggle then scores against the outcomes of the games actually played. To create a bracket from my model, I will have to do a bit of post-processing.
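To illustrate the submission format, here is a minimal sketch of building that matrix of matchup IDs. It assumes the competition's usual ID convention of `Season_Team1_Team2` with the lower Team ID first; the function name is my own.

```python
from itertools import combinations

def make_submission_ids(season, team_ids):
    """Build the Kaggle submission rows: one ID per unordered pair of
    tournament teams, with the lower TeamID first, as 'Season_Team1_Team2'."""
    return [f"{season}_{a}_{b}" for a, b in combinations(sorted(team_ids), 2)]

# Three example teams produce all three possible pairings:
ids = make_submission_ids(2022, [1104, 1101, 1112])
# ['2022_1101_1104', '2022_1101_1112', '2022_1104_1112']
```

With 68 tournament teams this expands to 2,278 rows, one predicted win probability per possible pairing, which is why a bracket has to be reconstructed afterwards.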

General Approach

My general approach for this challenge was to create feature data for each team from the results of their regular season (games played that season up to the tournament) and use it to predict their probability of winning each tournament matchup that postseason. One potential downside of this approach is that the model will not consider a team's earlier tournament results when predicting later rounds, but it should help me avoid overfitting, since the results for the whole tournament have to be predicted before any of the tournament games have been played.

For this model I decided on a consistent approach: creating feature DataFrames keyed by Team ID and Season for every set of features I wanted to build from the source tables. Doing this allowed me to create reusable methods to join all of these datasets to the training data without having to write complex join logic.
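A reusable join method along these lines might look like the following minimal sketch; the function and column names (`Team1ID`, `T1_` prefixes, etc.) are my own illustration, not the exact names from my code.

```python
import pandas as pd

def join_team_features(games, features, team_col, prefix):
    """Left-join a (Season, TeamID)-keyed feature table onto the games
    frame for one side of the matchup, prefixing the feature columns."""
    renamed = features.rename(
        columns={c: f"{prefix}{c}" for c in features.columns
                 if c not in ("Season", "TeamID")})
    return games.merge(renamed, how="left",
                       left_on=["Season", team_col],
                       right_on=["Season", "TeamID"]).drop(columns="TeamID")

# The same helper is applied once per side of each matchup:
# games = join_team_features(games, elo_df, "Team1ID", "T1_")
# games = join_team_features(games, elo_df, "Team2ID", "T2_")
```

Because every feature table shares the same (Season, TeamID) key, each new feature set only needs two calls to this one helper.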

Finally, once the feature DataFrame was complete, I trained a model on all tournament matchups from 2003 to 2021. To help avoid overfitting I used 10-fold cross-validation, stratified on the column we are predicting (win/loss). Once that model was trained, I used the 2022 feature data to predict the probabilities of the outcomes of all potential combinations of teams.

Feature Generation

I got inspiration from published models submitted to previous competitions and in cases where I used features described in those notebooks I will link to the original author’s work.

Team Power Rankings/Quality

This was inspired by a notebook on Kaggle by Raddar. The general idea is to fit a simple binomial GLM using only the two Team IDs playing in each game as features (treating them as factors). This approach gives a relative team strength knowing nothing else about the team, and is a nice baseline result to include as a feature for the more complicated model.

ELO

This was inspired by how chess rankings are calculated: a team gains fewer points for beating opponents the model expects it to beat, and more points for beating opponents it doesn't expect it to beat. Ultimately this rating should give a relative strength score for each team, but the relatively short season can cause issues, since all teams start at the same Elo at the beginning of every season.
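The standard Elo update makes the asymmetry concrete; this is a generic sketch (the K-factor of 32 and 1500 starting rating are common defaults, not necessarily the ones I used).

```python
def elo_expected(r_a, r_b):
    """Probability team A beats team B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_won, k=32.0):
    """Update both ratings after one game: beating a stronger opponent
    moves the ratings more than beating a weaker one."""
    exp_a = elo_expected(r_a, r_b)
    delta = k * ((1.0 if a_won else 0.0) - exp_a)
    return r_a + delta, r_b - delta
```

An even matchup (1500 vs 1500) moves 16 points on a win, while a 1400-rated team upsetting a 1600-rated team gains about 24, which is exactly the "more points for unexpected wins" behavior described above.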

Score Stats

I came up with these stats myself. They aim to compare how much a team scores in its games to how much its opponents typically allow, and vice versa, how much the opponent scores compared to how much the team normally allows. I create percentages for each game based on the min, mean, and max of those scores, then take the min, mean, and max over the season of those percentages. Since I came up with this feature on my own I can't link to a better explanation, but hopefully it makes sense.
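As a rough illustration of the first step, here is a minimal sketch that aggregates per-game scoring with min/mean/max per (Season, TeamID); the column names (`Score`, `OppScore`) are my own, and a fuller version would first convert each game's score into a percentage of the opponent's typical points allowed before aggregating.

```python
import pandas as pd

def score_stat_features(games):
    """games: one row per (Season, TeamID) game with 'Score' (points
    scored) and 'OppScore' (points allowed). Summarise each team's
    season with min/mean/max of both columns."""
    grouped = games.groupby(["Season", "TeamID"])[["Score", "OppScore"]]
    feats = grouped.agg(["min", "mean", "max"])
    feats.columns = [f"{col}_{stat}" for col, stat in feats.columns]
    return feats.reset_index()
```

The result is one row per team and season, which fits the (Season, TeamID) keying used by all the other feature tables.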

Advanced Stats

These features were inspired by a notebook Laksan Nathan shared on Kaggle. They cover advanced statistics such as offensive efficiency, defensive efficiency, assist ratio, and turnover ratio, and hopefully help improve the comparison of any two teams' strengths and weaknesses.
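For a flavor of these stats, here is a sketch of the efficiency pair using the common possession estimate (the 0.44 free-throw coefficient is the widely used convention, and may differ slightly from the notebook's exact formulas):

```python
def advanced_stats(fga, fta, orb, to, points, opp_points):
    """Estimate possessions from box-score counts, then offensive and
    defensive efficiency as points scored/allowed per 100 possessions."""
    poss = fga - orb + to + 0.44 * fta       # estimated possessions
    off_eff = 100.0 * points / poss
    def_eff = 100.0 * opp_points / poss
    return poss, off_eff, def_eff
```

Normalizing by possessions matters because raw points per game conflate efficiency with pace: a fast team can score more points while being less efficient per possession.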

RPI (Team and Conference)

In previous years, teams used to be seeded using RPI, which tries to balance winning percentage with the quality of the opponents a team has faced. I created this metric for all teams, and Ken Jee helped me extend it to compare conference strengths.
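The RPI formula itself is a fixed weighted blend; this sketch just shows the combination step (computing the three inputs from game results, with the detail that a team's own games are excluded from its opponents' winning percentage, takes most of the actual work).

```python
def rpi(wp, owp, oowp):
    """RPI = 25% of the team's winning percentage (WP), 50% of its
    opponents' winning percentage (OWP), and 25% of its opponents'
    opponents' winning percentage (OOWP)."""
    return 0.25 * wp + 0.50 * owp + 0.25 * oowp
```

Half the weight sits on opponent quality, which is why a mediocre record against a brutal schedule can out-rank an excellent record against weak teams.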

Quadrant Wins/Quadrant Score

I was interested in understanding how the new NET Rank determines quadrant wins, and found a great chart on BracketResearch.com, which I used to create a similar count of quadrant wins and a quadrant score using RPI rank instead of NET rank.
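The quadrant logic can be sketched as a lookup on the opponent's rank and the game location. The cutoffs below are my reading of the published NCAA quadrant chart (Q1 is a home game vs. a top-30 team, neutral vs. top-50, or road vs. top-75, and so on); double-check them against the BracketResearch chart before reuse.

```python
# Upper rank bounds for quadrants 1-3 by location; Q4 is everything else.
QUAD_CUTOFFS = {"H": (30, 75, 160),   # home
                "N": (50, 100, 200),  # neutral
                "A": (75, 135, 240)}  # away

def quadrant(opp_rank, loc):
    """Return 1-4: which quadrant a game falls in, given the opponent's
    rank (RPI rank here, NET rank officially) and location H/N/A."""
    for quad, cutoff in enumerate(QUAD_CUTOFFS[loc], start=1):
        if opp_rank <= cutoff:
            return quad
    return 4
```

Counting wins per quadrant (and weighting Q1 wins more heavily than Q4 wins for a "quadrant score") then follows directly from this classification.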

Team Quadrant Wins and Losses Tracker | BracketResearch.com

Massey Ordinals

These features are pulled directly from a dataset curated by Kenneth Massey that has a detailed history of Public Rankings of teams going back to the 2002–2003 Season.

Tournament Seeds

These are the most straightforward features I used: the numeric seeds (1–16) the selection committee assigned to each team within its region.
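In the Kaggle data the seeds arrive as strings like "W01" or "Z16a" (region letter, two-digit seed, optional play-in suffix), so a small parser is needed to get the number; this sketch assumes that format.

```python
import re

def seed_to_number(seed):
    """Extract the numeric seed from a Kaggle seed string such as
    'W01' (region W, 1-seed) or 'Z16a' (play-in 16-seed)."""
    match = re.match(r"[WXYZ](\d{2})[ab]?$", seed)
    return int(match.group(1))
```

Parsing to an integer also makes the later difference features (e.g. seed difference between the two teams) trivial to compute.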

Difference Features

Finally, I took the difference between the matching features of the two teams, creating an additional feature for each pair. The goal was feature reduction: focusing the model on the comparison between the two teams rather than on the absolute strength or weakness of either one.
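Given per-team columns joined under paired prefixes, the difference features reduce to one subtraction per feature name; the `T1_`/`T2_`/`Diff_` naming here is my own illustration.

```python
import pandas as pd

def add_diff_features(matchups, feature_names):
    """For each feature present as 'T1_<name>' and 'T2_<name>' columns,
    add a single 'Diff_<name>' column so the model sees the comparison
    between the two teams directly."""
    for name in feature_names:
        matchups[f"Diff_{name}"] = (matchups[f"T1_{name}"]
                                    - matchups[f"T2_{name}"])
    return matchups
```

Dropping the per-team columns afterwards and keeping only the differences is exactly the dimensionality reduction compared later in the model-metrics section.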

Future Feature/Improvement Ideas

While I spent a good amount of time working on this, I didn't get to create all the features I wanted, and I have some ideas for feature creation I would like to do for a model next year!

Additional Models/Ensembles

Due to time constraints, this year I only created one type of model (LightGBM), which did fairly well, but I would love to explore other types of models, compare their performance, and ultimately ensemble the results of a few of the best methods.

Player Based Features

All the features I created are based on teams or conferences, and I think a great way to improve my model would be to add player-based features such as recruit rankings, the class makeup of a team (freshman, sophomore, junior, or senior), individual statistics, and how many players on average play meaningful minutes for a team. Measurables like height, weight, wingspan, and other physical attributes would also be interesting to add.

Game Control Features

I envision these features capturing how long, and by how many points, teams lead their games, which would require detailed play-by-play data to create. It would also be interesting to measure scoring consistency, or conversely the scoring droughts teams go through during the regular season, and use those to help predict games.

Final Model Metrics

I trained two models: the first with all features, and the second with only the difference features.

The model with all features had an average cross-validation log loss of 0.5545. The top three features were all difference features, and many of the top features were advanced stats, although RPI, RPI rank, and RPI quad score were all in the top 30 most important features.

Feature importances for the model with all features

The model with only the difference features performed slightly better in cross-validation, with an average log loss of 0.54768. Many of the features that were important in the all-features model were important for this model too. By selecting just the difference features, the model gets one feature per comparison instead of three (higher, lower, and difference), and this dimensionality reduction seems to have helped.

Feature importances for the difference-only model

Comparison to a model run on the Women’s NCAA Tournament

A casual fan might assume that the men's and women's tournaments could use the same model to predict tournament performance. I trained a model using similar feature-generation code, but the women's tournament model had a glaring difference.

Top Feature Importances for Women’s Tournament Model

In case you didn't catch it: Tournament Seed is the biggest predictor by a wide margin for the women's tournament, but doesn't even make the top features for the men's tournament.

Top Feature Importances for Men’s Tournament Model

Compare that to the men's tournament model, and you can see that even though the top four features have a wider margin of importance, there isn't a single feature that dominates the model.

My 2022 Bracket

Using my model to predict each of the tournament games, it didn't pick many upsets (unsurprisingly, since seeding is correlated with perceived team strength), but some interesting ones are Loyola Chicago and Virginia Tech winning in the first round and LSU beating Wisconsin in the second round. Many bracket-pickers try to find 5-seeds that will be upset by 12-seeds and 6-seeds that will be upset by 11-seeds, and this model found one such upset. Next year, it would be interesting to add a model that specifically predicts early-round games, which may be more likely to be upsets.

2022 Tournament Results

Visualization of my prediction performance — Kaggle Brackets (wncviz.com)

I placed in the top 16% of all 2022 NCAA Men's Tournament predictions submitted to Kaggle, and while I predicted quite a few of the games correctly, there were also some big upsets I wasn't able to assign a high likelihood to (like Saint Peter's winning three games, or Iowa losing to Richmond in the first round).

Wrap Up

I really enjoyed working on this data science project and hope you found it interesting and informative. I would like to thank Ken Jee, who worked with me on creating some of the features for this model.

My Data Science Setup

I am a Z by HP Global Data Science Ambassador which means Z by HP sponsors my content and has provided me with the following hardware that I used to create this Data Science Project and the others I work on!

Desktop: HP Z8

https://www.hp.com/us-en/workstations/z8.html

Having a powerful desktop is very important for iterating quickly while working on Data Science projects. Not having to wait as long between updates to my code and outputs lets me get more done. The powerful GPUs in my Z8 are very helpful when training large deep learning models.

Laptop: ZBook Studio 15

https://www.hp.com/us-en/workstations/zbook-studio.html

Being able to work on Data Science projects when I am away from my home office helps me to maximize my productivity. The ZBook Studio I have has a powerful enough CPU and GPU to handle a lot of the iterations and initial models I run.

Monitor: HP Z38c

Having a big, high-resolution display like the Z38c improves my productivity by letting me run multiple windows at the same time. I usually have a Python notebook open on one side of the screen and a browser, chat app, or video call on the other.

Preloaded Software Stack

Being able to have the software I use for Data Science come preinstalled and configured to manage package updates helps me get started on projects and worry less about package compatibility issues that arise from updating packages one by one.

