New and Improved March Madness Neural Network for 2020

reHOOPerate

Jan 15, 2020

It’s been almost 2 years since I first trained a neural network to fill out my March Madness bracket. I used the same neural network implemented with Google’s TensorFlow to predict my March Madness bracket last year, but due to my busy schedule (I was actually getting married around March Madness time last year!), I didn’t have time to write a blog post about it. Long story short, the neural network gave strong signs that Oregon would upset Wisconsin in the first round, that Ohio State would upset Iowa State in the first round, and that UVA would win it all. I actually had a friend help me put down some bets on Oregon upsetting Wisconsin and Ohio State upsetting Iowa State, as well as on UVA winning it all. In the process I managed to turn $120 into over $500, not bad for my nifty little neural network.

To review, my approach treats predicting winners in the tournament as a value problem. Just like how Warren Buffett and Benjamin Graham look at investing as finding undervalued stocks and avoiding overvalued ones, I treat the seed of each NCAA tournament team as a pre-set “value” for that team. Then I use machine learning to find what combinations of advanced stats correspond to teams that “underperform” (i.e. are overvalued) or “overperform” (i.e. are undervalued) relative to their seed. When an overvalued team plays against an undervalued team, to quote the great Dick Vitale, “Upset city, baby!” As I wrote in my original neural network for March Madness blog post:

“I decided to train my machine learning algorithms to use advanced statistics to classify which high seeded teams tend to be “upsettable”, and which low seeded teams have “upset potential”. To do this, I came up with a metric, “Wins Above Seeding”, that measures both how well a team resists being upset and how good a team is at upsetting higher seeds. In short, Wins Above Seeding (WAS) measures how many more games a team wins than its seeding would suggest. For example, a 7th seed would be expected to win a game against a 10th seed in the first round, but would be expected to lose a game against a 2nd seed in the second round. If that 7th seed makes it past the 2nd seed and advances to the Sweet Sixteen, it would have a WAS of +1. If it won another game and advanced to the Elite Eight, it would have a WAS of +2, and if it made it to the Final Four (as South Carolina did as a 7th seed last year), it would have a WAS of +3. Conversely, if a 1st seeded team loses in the second round, it would have a WAS of -3 (since a top seed implies the team is expected to make the Final Four, advancing 3 games past the second round). If I could get data on both how past NCAA teams tended to play, and on how many Wins Above Seeding they were able to obtain, a machine learning algorithm would be able to find patterns that correlate with March Madness success.”
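The WAS metric quoted above can be sketched in a few lines of Python. The seed-to-expected-wins table here is my own reading of the examples in the quote (a 1 seed is expected to reach the Final Four, a 7 seed to win one game, and so on), not code from the original post:

```python
# Hypothetical sketch of Wins Above Seeding (WAS), inferred from the
# examples above -- not the original post's code.

# Expected tournament wins implied by each seed line: a 1 seed is
# expected to reach the Final Four (4 wins), a 2 seed the Elite Eight
# (3 wins), 3-4 seeds the Sweet Sixteen (2 wins), 5-8 seeds one win,
# and 9-16 seeds none.
EXPECTED_WINS = {1: 4, 2: 3, 3: 2, 4: 2, 5: 1, 6: 1, 7: 1, 8: 1}

def wins_above_seeding(seed: int, wins: int) -> int:
    """How many more games a team won than its seed suggests."""
    return wins - EXPECTED_WINS.get(seed, 0)

# A 7 seed reaching the Sweet Sixteen (2 wins) earns a WAS of +1;
# a 1 seed losing in the second round (1 win) sits at -3;
# a 7 seed reaching the Final Four (4 wins) earns a WAS of +3.
print(wins_above_seeding(7, 2))  # 1
print(wins_above_seeding(1, 1))  # -3
print(wins_above_seeding(7, 4))  # 3
```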

In short, I’m taking advantage of the fact that the “seeds” in the NCAA tournament bracket are all set up by a bunch of old guys in Indianapolis. While overall they get things right (as evidenced by higher seeded teams winning about three quarters of the time), I believe they have certain biases that cause them to overvalue certain teams (for example, teams that pass the ball a lot might naturally appeal to the old folks running the NCAA who are used to a certain style of play). In addition, I believe there are certain traits that make teams more likely to succeed in March Madness, when their opponents start really clamping down on defense and both teams start fighting for every loose ball. By using machine learning, I believe I can “correct” these biases, and I can also figure out what specific type of advanced stats profile corresponds to an edge (or a disadvantage) in college basketball’s postseason.

In the last few weeks, I decided to “clean up” and revamp my March Madness neural network. My original dataset took advanced stats data directly from Sports Reference, but this data was all in different formats: strength of schedule is a number roughly between negative 12 and 12, Free Throw Rate and Three Point Attempt Rate are listed as decimals, while PACE and Offensive Rating are numbers greater than 50. I decided to clean up this dataset using feature normalization: for every season that I had available data (from 2010 to 2019), I would take the highest value for each category among all March Madness bound teams and subtract the lowest value among those teams. I would then subtract the minimum from each team’s value and divide by that difference. For example, if I was looking at Free Throw Rate in the 2018–2019 season, I would take the highest Free Throw Rate among March Madness teams, UCF at 0.453, and subtract the lowest Free Throw Rate among March Madness teams, Liberty at 0.265 (keep in mind that all tournament bound teams are marked with an “NCAA” tag next to them on Sports Reference). In this case, after feature normalization my dataset would end up with UCF having the value 1 for Free Throw Rate, Liberty having the value 0, and a team like Kentucky, with a Free Throw Rate of 0.419, having the normalized value:

(Kentucky FTR − Minimum FTR) / (Maximum FTR − Minimum FTR)
= (0.419 − 0.265) / (0.453 − 0.265) ≈ 0.819

Normalizing Kentucky’s Free Throw Rate relative to the best and worst free throw rate teams in the tourney

I performed this feature normalization for each of the 15 advanced stats that made up my dataset. With feature normalization, I’m hoping that stats Sports Reference happens to report on a large number range don’t exert a disproportionate impact on the neural network’s training, resulting in more accurate predictions for the 2020 tournament.
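This per-stat scaling is plain min-max normalization. A minimal sketch using the Free Throw Rate numbers above:

```python
def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Scale a stat so the season's tournament minimum maps to 0 and the maximum to 1."""
    return (value - lo) / (hi - lo)

# 2018-19 Free Throw Rate: UCF had the tourney max (0.453), Liberty the min (0.265)
kentucky = min_max_normalize(0.419, lo=0.265, hi=0.453)
print(round(kentucky, 3))  # 0.819
```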

To accomplish this in Python, I first read in my original dataset using Jupyter Notebook:

Original March Madness dataset for teams from 2010 to 2019

and then looped through the entire dataset, grouping the data for each season together, finding the maximum and minimum in different stat categories for each season, and then normalizing using the equation shown above:

Part of the loop in the Python script to normalize the existing dataset

The normalized data looked something like this:
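The season-by-season loop can be expressed compactly with a pandas groupby. This is a sketch of the idea, not the original script, and the column names here are hypothetical stand-ins for the dataset’s actual fields:

```python
import pandas as pd

# Hypothetical column names -- the real dataset's fields may differ.
df = pd.DataFrame({
    "Season": [2019, 2019, 2019, 2018, 2018],
    "FTR":    [0.453, 0.419, 0.265, 0.40, 0.30],
    "SOS":    [6.2, 8.1, -1.3, 10.0, 2.0],
})

stat_cols = ["FTR", "SOS"]
# Min-max normalize each stat within its own season, so every season's
# tournament field spans 0 to 1 independently.
df[stat_cols] = df.groupby("Season")[stat_cols].transform(
    lambda s: (s - s.min()) / (s.max() - s.min())
)
print(df)
```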

With this new data, I once again tried some simpler machine learning techniques like Random Forests and Decision Trees. Just like before, these classifiers gave results that were below the 76% accuracy that simply picking higher seeds would have given (in fact, most of these classifiers were correct less than 50% of the time!). It seems like neural networks are definitely the way to go for this dataset!

Instead of TensorFlow, my revamped neural network uses Keras, another open source neural network library that works at a higher level than TensorFlow. Essentially, Keras wraps around TensorFlow, making it easier to put together neural networks. Like TensorFlow, Keras was also developed at Google, in this case by engineer François Chollet. One of the big benefits of Keras is that it’s very easy for me to add regularization to prevent overfitting, where my neural network corresponds too closely to the existing data and loses its ability to predict results more generally. In this case, I added both dropout regularization and L2 regularization. In dropout, during training some of the output features of a neural network layer are randomly, well, dropped out (i.e. set to zero) to fight overfitting. In L2 regularization, a cost is added proportional to the square of the weight coefficients in the neural network, in an effort to keep the values of these weights small.

It was easy to put together my neural network using Keras, as you can see in the code snippet below. After some trial and error, I found that I was seeing solid results for a neural network with 256 nodes in the input layer (chosen because there are 15 data fields in each training instance, and 15² = 225, which is close to 256, the nearest power of two), with 128 nodes, 64 nodes and 32 nodes in the consecutive hidden layers. Note that we add a bit of L2 regularization to each layer, and add dropout set at a ratio of 0.1 after each layer. Since I approach machine learning for March Madness as fundamentally a classification problem (the Wins Above Seeding value for a given team corresponds to its category), the hidden layers use a Rectified Linear Unit (or ReLU) activation function. Finally, the last layer uses a softmax activation paired with a categorical cross entropy loss function to produce a probability for each category.
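Based on the description above, the architecture might look like the following in Keras. The layer sizes, dropout ratio, and loss come from the text; the L2 coefficient (0.001) and the Adam optimizer are assumed defaults on my part, not details from the original code:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2

# Sketch of the architecture described above. The l2 factor of 0.001 is an
# assumption -- the post doesn't state the exact coefficient.
model = Sequential([
    Dense(256, activation="relu", kernel_regularizer=l2(0.001), input_shape=(15,)),
    Dropout(0.1),
    Dense(128, activation="relu", kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(64, activation="relu", kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(32, activation="relu", kernel_regularizer=l2(0.001)),
    Dropout(0.1),
    Dense(11, activation="softmax"),  # one output per WAS category
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```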

Neural Network getting assembled in Keras

This use of categorical cross entropy in the final layer means that we now need our training data to be one hot encoded, i.e. each Wins Above Seeding value must be represented as an 11 bit binary value, with only one bit set to 1 (and the rest set to 0) for each category. In this case, a WAS of -4 (corresponding to a 1 seed ignominiously losing in the first round, *cough* *cough* 2018 UVA) would have a value of binary 00000000001, while a WAS of +2 (corresponding to, for example, a 7 seed making it 2 games beyond expectations to the Elite Eight) would have a value of binary 00001000000. The code to set up the training data now consists of adding 4 to the WAS value (since -4 is the lowest possible WAS, this makes sure all the values are non-negative) and then converting it to a one hot binary representation. Luckily, Keras provides a to_categorical function that really makes this process easier:
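The shift-and-encode step is equivalent to indexing into an identity matrix. Here is a NumPy sketch of the idea, mirroring what Keras’s `to_categorical` utility does (not the original script):

```python
import numpy as np

def was_to_one_hot(was: int, num_classes: int = 11) -> np.ndarray:
    """Shift WAS (minimum -4) to a non-negative category index, then one-hot encode.

    Equivalent to keras.utils.to_categorical(was + 4, num_classes).
    """
    return np.eye(num_classes, dtype=int)[was + 4]

print(was_to_one_hot(-4))  # category index 0:  [1 0 0 0 0 0 0 0 0 0 0]
print(was_to_one_hot(2))   # category index 6:  [0 0 0 0 0 0 1 0 0 0 0]
```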

Setting up the WAS data to use Categorical Cross Entropy

I set aside 94 samples from the dataset to be used as a validation set, and began training with the remaining 550 samples. I initially trained the data for 512 epochs:

Training the March Madness Neural Network

which resulted in an accuracy close to 98%:

Accuracy after training for 512 epochs

but a plot of the training loss showed that we might be overfitting after around 200–300 epochs or so:

Neural network loss levels out after 200–300 epochs, beyond which we are overfitting

As a result, I decided to retrain the neural network for only 256 epochs, resulting in a still solid accuracy of about 93%:

Accuracy after 256 epochs

This is still a very acceptable accuracy, especially when we consider that picking only high seeds gives us an accuracy of only about 76%.

And… there you have it! A new and improved neural network for March Madness that I’m calling “MadNet” (creative name, right?). In the coming weeks, I’ll be writing about another neural network I’ve created that leverages the much maligned box score in a new and creative way, so stay tuned for new articles on this “BoxNet”!

If you like what you read here and would like to support re-HOOP-PER-rate, we are accepting donations of the cryptocurrency Ether at address: 0xC909ce5d714B3cBD98D498331a45A64E62F786bf
