Convolutional Neural Network for March Madness

reHOOPerate

Feb 9, 2020

In my recent article on updating my March Madness neural network in anticipation of the 2020 college basketball tournament, I mentioned some new approaches that I will be using to fill out my bracket this year. One of these is MetaNet, which I'm primarily using to determine bet sizing on specific NCAA tournament games. In this article, I'll be discussing a convolutional neural network that I'm using to pick March Madness upsets. I'm calling this neural network "BoxNet".

Why BoxNet? Because it uses the much-maligned box score. In our current era of advanced analytics, the box score seems like a quaint and antiquated callback to a bygone era. Daryl Morey, Houston Rockets general manager and rock star of the analytics revolution, even went so far as to suggest that whoever first invented the box score should be shot. As an obvious fan of sports analytics, I was inclined to agree, until I read something interesting last summer that made me think perhaps the box score could still be of use.

Example of a college basketball box score in the middle column

In the book Deep Learning and the Game of Go, the authors explain that one of the ways Google's AlphaGo program was trained was by encoding the state of a Go board as an image. The next move made by an expert human Go player was then used to train the computer on what the next move should be. In short, each state of a Go board over the course of a game was encoded as a separate image. A convolutional neural network, the kind of network ideally suited to machine learning on images (such as text recognition or self-driving cars), was then trained to predict what the next move should be given the current state of the board.
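To make the idea concrete, here's a minimal sketch of what "encoding a board state as an image" can look like (the one-channel encoding and the `encode_board` helper are my illustration, not the book's actual scheme, which uses several feature planes):

```python
import numpy as np

def encode_board(board, size=19):
    """Encode a Go position as a size x size 'image' with one channel:
    1.0 for black stones, -1.0 for white stones, 0.0 for empty points.
    `board` is a dict mapping (row, col) -> "black" or "white"."""
    grid = np.zeros((size, size, 1), dtype=np.float32)
    for (row, col), color in board.items():
        grid[row, col, 0] = 1.0 if color == "black" else -1.0
    return grid

# A toy position: two black stones and one white stone
position = {(3, 3): "black", (15, 15): "black", (3, 15): "white"}
encoded = encode_board(position)
print(encoded.shape)  # (19, 19, 1)
```

Once positions live in this grid form, a convolutional network can be trained on (position, expert move) pairs exactly as if it were classifying images.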

Of course, when I read about how AlphaGo was encoding the state of Go boards like they were images, I immediately thought about what other data could be encoded into a regular grid-like image. Being the college basketball fan that I am, my mind naturally gravitated to the basketball box score. In short, I started wondering if certain ratios between box score totals corresponded to teams performing better or worse during March Madness. Is it better for a team to have a single dominant scorer, two (or three) primary scorers, or for the scoring load to be distributed evenly among all the players? Likewise for rebounding: is it better for a team to have a single incredible player down low who specializes in cleaning up the glass, or is it better to have a more balanced ratio of rebounding among different players? And what about the complicated relationships that are possible between these ratios? Add in the ratios of the player who leads in steals per game to the second, third, fourth best players in steals per game, likewise for the ratio of assists, blocked shots, and turnovers, and suddenly there’s a complex relationship that defies mere human consideration.

That's where the convolutional neural network comes in! Start by normalizing each box score category (points, rebounds, assists, steals, blocks, and turnovers per game) to the team leader in that category (i.e. divide each player's points per game by the points per game of the team's leading scorer, and likewise for rebounds, assists, etc.). What you have left is a table of normalized statistics for each statistical category, a table that looks a lot like an image. And since convolutional neural networks (CNNs) are used to find patterns within an image, I could use a CNN to identify patterns in the ratios between different players in a box score. I could train this CNN on the normalized box scores of past March Madness teams, using each team's over- or underperformance relative to its seed as the label (as explained in my first ever blog post on March Madness neural networks). Whatever patterns the CNN found in the box score could then be used to identify teams with similar box score patterns in this year's March Madness field, and tell us whether those teams are expected to over- or underperform their seed. With box scores featuring so prominently in this neural network, it's only natural that I decided to name it "BoxNet".
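The normalization step itself is just a column-wise division. A quick sketch with made-up numbers (the stats below are invented for illustration):

```python
import numpy as np

# Hypothetical per-game stats for one team's top 8 scorers, one row per player;
# columns: points, rebounds, assists, steals, blocks, turnovers per game.
raw_stats = np.array([
    [22.1, 4.5, 3.2, 1.5, 0.3, 2.8],
    [17.8, 7.9, 1.1, 0.9, 1.2, 1.9],
    [12.4, 3.1, 5.6, 1.8, 0.1, 2.2],
    [ 9.7, 8.8, 0.8, 0.5, 2.0, 1.0],
    [ 8.2, 2.5, 2.9, 1.1, 0.2, 1.4],
    [ 6.5, 4.0, 1.3, 0.7, 0.6, 0.9],
    [ 4.1, 1.9, 0.6, 0.4, 0.1, 0.7],
    [ 2.3, 1.2, 0.4, 0.2, 0.0, 0.5],
])

# Divide each column by its maximum, so the team leader in every
# category lands at exactly 1.0 and everyone else becomes a ratio.
normalized = raw_stats / raw_stats.max(axis=0)
print(normalized[0])  # leading scorer's row; the points entry is 1.0
```

The resulting 8x6 grid of values between 0 and 1 is what the CNN sees, one "image" per team.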

After browsing through a few teams' box scores on Sports Reference's college basketball page, I noticed that most teams have a huge drop-off in their players' playing time, scoring, and other statistics after the eighth player or so. It seems like the "core rotation" for most college basketball teams is 8 players, so I limited my box score data sets to the eight highest scoring players. I read in the list of previous teams (both the school name and the year) from my existing college basketball data set, and used the Python code below to scrape this box score data and normalize it within each statistical category:

Scraping the data
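(The original scraping code appeared as a screenshot; the sketch below shows one way that step could look. The Sports Reference URL pattern and the per-game column labels are assumptions, and the `URL_FIXUPS` dict stands in for the school-renaming logic, using the one example the text gives.)

```python
import pandas as pd

# Some schools use a different name in their Sports Reference URL than
# in listings; "Louisiana" -> "louisiana-lafayette" is one known case.
URL_FIXUPS = {"Louisiana": "louisiana-lafayette"}

# Assumed per-game column labels on Sports Reference team pages
STAT_COLS = ["PTS", "TRB", "AST", "STL", "BLK", "TOV"]

def school_slug(school):
    """Turn a school name into the lowercase, hyphenated URL form."""
    fixed = URL_FIXUPS.get(school, school)
    return fixed.lower().replace(" ", "-")

def scrape_box_stats(school, year):
    """Fetch a team's per-game table and return the normalized 8x6
    stats for its eight highest scorers."""
    url = (f"https://www.sports-reference.com/cbb/schools/"
           f"{school_slug(school)}/{year}.html")
    tables = pd.read_html(url)
    per_game = next(t for t in tables if "PTS" in t.columns)
    top8 = per_game.sort_values("PTS", ascending=False).head(8)
    stats = top8[STAT_COLS].astype(float).to_numpy()
    return stats / stats.max(axis=0)  # normalize each category to its leader
```

Each scraped team contributes one normalized 8x6 array, which can be appended to a running list and later reshaped for training.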

One thing to note is that certain teams on Sports Reference have different names in their URLs compared to how they are listed (for example, "Louisiana" becomes "Louisiana-Lafayette" in the URL), and some of the code above takes care of these contingencies. I parsed the stats to get the relevant information and stored it all in a list called train_list, which I would later reshape into an array.

Now it was time to assemble the convolutional neural network using Keras. Since my data was in an 8x6 format, I used a single convolution layer with a 2x2 kernel size, followed by a 2x2 max pooling layer, which I thought could reveal relationships between more detailed box score patterns. I then used two dense layers, one with ReLU activation and the next with softmax activation, to arrive at the final performance-above-or-below-seeding result. Note that, as in previous March Madness neural networks, I shifted the seed over- or underperformance from a range of -4 to +6 to a range from 0 to 10.

Building the March Madness Convolutional Neural Network
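(The model code was also shown as a screenshot; here's a sketch of the architecture as described. The filter count, dense layer width, and the 11-class output, covering shifted seed performances 0 through 10, are my assumptions; only the 2x2 convolution, 2x2 max pooling, and the two dense layers come from the description above.)

```python
from tensorflow import keras
from tensorflow.keras import layers

# 8 players x 6 stat categories, one channel -- treated like a tiny image
model = keras.Sequential([
    keras.Input(shape=(8, 6, 1)),
    layers.Conv2D(32, kernel_size=(2, 2), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(11, activation="softmax"),  # assumed: one class per shifted
                                             # seed-performance value, 0..10
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 11)
```

The 2x2 convolution over the 8x6 grid lets the network compare adjacent players and adjacent stat categories, which is where the "ratio patterns" discussed earlier would show up.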

After 1024 training steps, the resulting model achieved almost 98% accuracy on box scores of March Madness teams from 2010 to 2019:

Validation accuracy of BoxNet

with overfitting occurring after around 500 steps, at which point the accuracy was leveling off in the high 90s:

BoxNet overfits after ~500 steps
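(Not something described in this write-up, but a standard way to stop training around the point where validation accuracy plateaus is Keras's EarlyStopping callback; the sketch below assumes a compiled `model` and hypothetically named training arrays.)

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation accuracy stops improving, and keep the best
# weights seen rather than the final (overfit) ones.
early_stop = EarlyStopping(monitor="val_accuracy",
                           patience=50,  # steps with no improvement allowed
                           restore_best_weights=True)

# Hypothetical training call, assuming X_train holds the normalized
# 8x6x1 box scores and y_train the shifted seed-performance labels:
# model.fit(X_train, y_train, validation_split=0.2,
#           epochs=1024, callbacks=[early_stop])
```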

It seems like there’s some real predictive power behind BoxNet, and there might be some interesting things we can learn about the ideal March Madness team profiles by using genetic algorithms with randomly generated box score data that’s applied to BoxNet. I’ll cover that in a later article!

In the meantime, I wanted to run one more test: in addition to the six box score fields I'm currently using, I wanted to add shooting percentage data (three point percentage, two point percentage, and free throw percentage). Teams with many players shooting a high percentage in all three categories would most likely do better than teams without, but I still thought training a network with this additional data might give additional insight (or at least, more hints on how to fill out a bracket!). I called this neural network "BoxShotNet", since it uses the same box score as before but now includes data on shooting percentages. Trying this neural network out with the current dataset, we saw similar results, with the validation accuracy reaching close to 99%, but the network starts to overfit earlier, as we would expect.
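Widening the input is just a matter of appending three columns per player. A sketch with placeholder numbers (how the article actually scaled the percentages is not stated; appending them as-is on their natural 0–1 scale is one reasonable choice):

```python
import numpy as np

# Hypothetical: the original 8x6 normalized box score for one team...
box = np.random.rand(8, 6)

# ...plus three shooting-percentage columns (3P%, 2P%, FT%) per player.
# Percentages already live on a 0-1 scale, so they can be appended
# directly; normalizing each to the team leader, as with the counting
# stats, would be the other option.
shooting = np.random.rand(8, 3)

boxshot = np.hstack([box, shooting])  # 8x9 input grid for BoxShotNet
print(boxshot.shape)  # (8, 9)
```

The network itself only needs its input shape changed from (8, 6, 1) to (8, 9, 1) to accept the wider grid.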

BoxShotNet overfits after ~256 steps
