Training a Neural Network to fill out my March Madness Bracket

reHOOPerate
Published in re-HOOP*PER-rate
Mar 15, 2018

Machine Learning Madness!

March Madness has always been my favorite sporting event. It manages to perfectly combine my favorite sport with the surreal chaos usually found at 3 am raves and college house parties. Add in the pageantry (from ridiculous mid-major mascots to One Shining Moment), the inimitable mania of Gus Johnson, and the impossibly unpredictable gambling element, and you get a uniquely insane American sports experience.

As much as I love March Madness, my bracket always seems to end up torn to shreds by the end of the Elite Eight. There are simply too many upsets that come out of nowhere, and it seems like the Tourney just gets more crazy and random every year. This year, I used data from previous March Madness-es and Google’s TensorFlow library to train a neural network to complete my bracket. That way, when my bracket inevitably gets busted, I can blame a computer AI instead of my own poor judgement. Plus, if TensorFlow is good enough to defeat the world champion of Go, it must be good enough to predict whether a bunch of 18 to 22 year old kids can win six basketball games in a row… right?

From past experience, it always seems like winning a March Madness pool comes down to at least one of two things:

  1. Picking tough and resilient (usually defensive-minded) top seeded teams that don’t get upset to make the later rounds and Final Four (i.e. Duke in 2010, Louisville in 2013, Wisconsin in 2015)
  2. Picking a great lower-seeded team to go on a run and make a series of upsets (i.e. Butler in 2010 and 2011, UConn in 2011 and 2014, South Carolina in 2017)

In other words, a good bracket should have a solid mix of top seeded teams that are “un-upsettable”, combined with lower seeded teams that have high “upsetability potential”. I decided to train my machine learning algorithms to use advanced statistics to classify which high seeded teams tend to be “upsettable”, and which low seeded teams have “upset potential”. To do this, I came up with a metric, “Wins Above Seeding”, that measures both how well a team resists being upset and how good a team is at upsetting higher seeds.

In short, Wins Above Seeding (WAS) measures how many more games a team wins than its seeding would suggest. For example, a 7th seed would be expected to win a game against a 10th seed in the first round, but would be expected to lose a game against a 2nd seed in the second round. If that 7th seed makes it past the 2nd seed and advances to the Sweet Sixteen, it would have a WAS of +1. If it won another game and advanced to the Elite Eight, it would have a WAS of +2, and if it made it to the Final Four (as South Carolina did as a 7th seed last year), it would have a WAS of +3. Conversely, if a 1st seeded team loses in the second round, it would have a WAS of -3 (since a top seed is expected to make the Final Four, three wins beyond a second round exit). If I could get data on both how past NCAA teams tended to play and on how many Wins Above Seeding they were able to obtain, a machine learning algorithm should be able to find patterns that correlate with March Madness success.
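To make the bookkeeping concrete, here’s a minimal sketch of the WAS calculation in Python. The seed-to-expected-wins mapping is just my reading of the rule above (a seed is expected to keep winning until it first meets a better seed) written out as code, not anything from the actual data pipeline:

def expected_wins(seed):
    # Expected tournament wins implied by seeding: 1 seeds are expected to reach
    # the Final Four (4 wins), 2 seeds the Elite Eight (3), 3-4 seeds the Sweet
    # Sixteen (2), 5-8 seeds to win one game, and 9-16 seeds to win none.
    if seed == 1:
        return 4
    elif seed == 2:
        return 3
    elif seed <= 4:
        return 2
    elif seed <= 8:
        return 1
    return 0

def wins_above_seeding(seed, tourney_wins):
    # WAS = actual tournament wins minus the wins implied by the seed
    return tourney_wins - expected_wins(seed)

print(wins_above_seeding(7, 4))   # a 7 seed reaching the Final Four (South Carolina 2017): +3
print(wins_above_seeding(1, 1))   # a 1 seed losing in the second round: -3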

Getting the Data

To obtain my training data, I wrote a Perl script to scrape data on advanced statistics from the website Sports-Reference into a simple CSV file. Luckily, Sports Reference marks all teams that advance to the March Madness tournament with a “NCAA” note on the side, so the scraping was relatively painless — check out the code snippet at the bottom to see the Perl script I wrote for this. I decided to use advanced statistics since they are often more revealing than traditional points-rebounds-assists-blocks-steals box score stats, and (more importantly) because they are all rate based. This meant my data would not be affected by apples to oranges comparisons between a slower defense oriented team (think UVA) and a faster paced offense oriented team (think Duke). The 14 advanced statistics I used as parameters for the training (and what they mean) are:

W-L %: The win-loss percentage, or percentage of games the team won during the regular season

SRS: A rating metric that takes into account a team’s average point differential and its strength of schedule

SOS: Strength of schedule, a measure of how strong the opponents the team faced were

Pace: Average number of possessions the team plays per 40 minutes

Offensive Rating: Measures how many points the team scores per 100 possessions

Free Throw Rate: The ratio of a team’s free throw attempts to its field goal attempts, a measure of how aggressive a team is at attacking the basket and drawing fouls

3 Point Attempt Rate: The percentage of a team’s field goal attempts that are three point attempts

True Shooting Percentage: A measure of shooting efficiency that takes into account the different values assigned to three pointers, two point field goals, and free throws

Total Rebound Percentage: The percentage of available rebounds a team grabbed (also the percentage of rebounds they prevented the other team from grabbing)

Assist Percentage: The percentage of the team’s field goals that were assisted

Steal Percentage: Percentage of the opposing team’s possessions that ended with a steal

Block Percentage: Percentage of the opposing team’s two point field goal attempts that were blocked

Effective Field Goal Percentage: A measure of field goal percentage that accounts for the higher number of points assigned to 3 point shots

Turnover Percentage: Number of turnovers the team commits per 100 plays

Offensive Rebound Percentage: The same as Total Rebound Percentage but applied only to offensive rebounds

There was no straightforward way to scrape the Wins Above Seeding data that would be used to train the machine learning algorithms, so I looked through old brackets and manually picked out each team’s WAS score. It was a fun trip through the ghosts of busted brackets past (sigh, I still can’t believe Kansas got Ali Farokhmanesh-ed). For teams that advanced past the Final Four, I continued adding to their WAS score, resulting in the sum of all WAS scores for a given season being more than 1. I did this so my trained model would reflect the additional good performance of any team that advances past the Final Four and on to the championship.

I restricted my data to the past 8 March Madness tournaments from 2010 to 2017, mainly because the rate of tourney upsets greatly increased starting in 2010. Back in 2007, the average seed of teams in the Final Four was 1.5, in 2008 it was 1.0, and in 2009 it was 1.75. In 2010, that average Final Four seeding jumped to 3.25, followed by 6.5, 2.25 (the one year of Anthony Davis-induced relative sanity in 2012), 4.5, 4.5, 2.5, 3.75, and 3.0 in 2017. Clearly, there was a large jump in upset teams making big runs starting circa 2010. I’m not sure why this change happened in that particular year, though one promising theory is that it has to do with John Calipari’s move to Kentucky causing much of the top talent to join the Wildcats or Coach K’s Blue Devils, leading to a dearth of elite talent in the rest of college basketball and top teams that are more easily upset as a result.

Once I had my data ready, I used my trusty Jupyter Notebook to read it into a Pandas dataframe:

import pandas as pd
bball_data=pd.read_csv("/home/liangd/cbb_full.csv")

The scraping script had done its job, and my data was ready for some training:

The csv file read in as a Pandas dataframe

Since the machine learning libraries I’m using prefer working with numpy arrays, I assigned the training features to an array called “XTrain” (the stat columns from W-L% through ORB) and the training labels to an array called “y” (the column labeled OverE). I also used a scaling pipeline to transform the data in these arrays to get them ready for machine learning.
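For reference, that preprocessing step looks roughly like the sketch below. The column slicing assumes the team name is the first column (and gets dropped from the features) and OverE is the last, matching the dataframe shown above, and the scaler is just SciKit-learn’s stock StandardScaler:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Features: the 14 advanced stat columns between the team name and OverE.
# Labels: the OverE column, i.e. each team's Wins Above Seeding.
feature_cols = bball_data.columns[1:-1]
XTrain_raw = bball_data[feature_cols].values
y = bball_data["OverE"].values

# Scaling pipeline: standardize every stat to zero mean and unit variance so no
# single statistic dominates training just because of its units.
scale_pipeline = Pipeline([("std_scaler", StandardScaler())])
XTrain = scale_pipeline.fit_transform(XTrain_raw)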

A Few First Experiments

Before unleashing a neural network, I tried to fit the data using a simpler machine learning technique known as a Random Forest classifier. In this approach, a collection of Decision Trees (flowchart-like series of yes/no splits) is trained on random subsets of the data, and the predictions of these individual decision trees are used to “vote” on a final (ideally correct) prediction. I used the Random Forest classifier provided in the free SciKit-learn Python library:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
forest_clf=RandomForestClassifier(random_state=22)

To see how well the Random Forest classifier would do predicting the results of previous March Madness-es, I used k-fold cross validation, a technique in which the training data is randomly split into k different samples: one sample is held out as test data and the other k-1 samples are used to train the machine learning algorithm. The resulting model is then used to predict the held-out sample, and its accuracy is reported. This process is repeated with each sample taking a turn as the test data. For testing the Random Forest classifier, I chose k=3:

cross_val_score(forest_clf, XTrain, y, cv=3, scoring="accuracy")
>array([0.46285714, 0.46745562, 0.4702381 ])

The Random Forest is able to predict the exact amount by which past March Madness teams win more or fewer games than expected with an accuracy of only 46%-47%, which isn’t particularly great. That being said, for the model to be somewhat useful to us, we just need to know which teams are likely to win more games than expected and which teams are likely to win fewer. To do that, we can train on a simpler prediction: whether a team’s WAS will be greater than 0. We create a Boolean array that has “true” elements for each team that has a WAS greater than 0, and “false” elements for each team that has a WAS equal to 0 or less:

y_gt0=(y>0)
cross_val_score(forest_clf, XTrain, y_gt0, cv=3, scoring="accuracy")
>array([0.73099415, 0.73099415, 0.75294118])

This looks much better — more than 50% accuracy! But keep in mind that about 24% of all March Madness games result in upsets. That means that even if we predict all high seeds (also known as the Completely Boring Way of Picking a Bracket), we would still have 76% accuracy! All of a sudden, our Random Forest Classifier isn’t looking so great — we’d be better off picking all high seeds like some obnoxious Duke fan.

For what it’s worth, when I use the Random Forest classifier to predict this year’s March Madness teams, it outputs only 2 results: it expects Michigan State to win 2 fewer games than it should (that would mean it expects 3rd seeded Michigan State to lose in the first round to 14th seeded Bucknell!) and it expects Duke to win one more game than it should, which would place Duke in the Final Four — maybe it will be a good March for obnoxious Duke fans after all. Then again, the Random Forest is only batting 73% — maybe we can do better. One quick note — I know in a real data science project, right about now the precision/recall tradeoff should be studied in depth and the Confusion Matrix should be examined — but I’m just trying to quickly fill out my bracket here and don’t have time to bother with that, now or in my later experiments.
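For the curious, those predictions come from something like the sketch below, where XT2018 is a scaled numpy array of this year’s advanced stats built the same way as XTrain (it shows up again in the neural network section):

# Fit the forest on the 2010-2017 training data, then predict a WAS value for
# each 2018 tournament team. Nonzero predictions are the teams the forest
# expects to deviate from their seeding.
forest_clf.fit(XTrain, y)
forest_2018 = forest_clf.predict(XT2018)
for i, was in enumerate(forest_2018):
    if was != 0:
        print(i, was)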

Let’s try one more quick experiment before we move to the neural network and TensorFlow. A Stochastic Gradient Descent (SGD) Classifier repeatedly estimates the gradient of the error function and adjusts the coefficients of a linear model in whichever direction reduces the error, until the error function reaches a (local) minimum (those of you who somehow managed to remember more calculus than me may recall that moving against the gradient takes you toward a function’s minimum values). Typing some Python into my Jupyter Notebook gives me:

from sklearn.linear_model import SGDClassifier
sgd_clf=SGDClassifier(random_state=11)
cross_val_score(sgd_clf, XTrain, y_gt0, cv=3, scoring="accuracy")
>array([0.28571429, 0.0295858 , 0.52380952])
Stochastic Gradient Descent finding the minimum in the error function

One of the folds shows less than 3% accuracy — the SGD Classifier is practically useless for our purposes. The chances of a team surviving deep into the tourney clearly don’t have a simple linear dependence on any of the advanced statistics we’re looking at. Let’s see if we can do better with a neural network; at least we know we can’t do much worse!

Neural Netball

To get started, I imported the TensorFlow library into Python and created a Deep Neural Net (DNN) with two hidden layers each with 512 neurons, and a softmax output layer with 10 neurons.

import tensorflow as tf
feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(XTrain)
dnn_clf=tf.contrib.learn.DNNClassifier(hidden_units=[512,512], n_classes=10, feature_columns=feature_columns)
Example of a Deep Neural Network. The one I use to predict March Madness has 14 neurons in the input layer, 10 in the output layer and only 2 hidden layers (of 512 neurons each).

I wanted to use a neural network with only 2 hidden layers so that I wouldn’t have to worry too much about selecting different activation functions. Full disclosure, my background is in GPU architecture, and I don’t fully understand the tradeoffs between say, a hyperbolic tangent activation function and a ReLU function — I’m sure the true machine learning geeks are judging me for this, but what I know is enough to let me play with the data. I chose the 512 neurons in the hidden layers after adjusting in powers of 2 from 128 to 256 to 512 and seeing the accuracy of my model consistently go above 98% once I hit 512. I chose 10 neurons in the output layer to correspond to historical team WAS ranging from +6, which would correspond to a lower seeded team winning the entire tourney (this hasn’t happened yet but is conceivable — especially if Collin Sexton of Alabama lights it up from three this March), to -3, corresponding to a 2nd seed losing in the first round or a first seed losing in the second round (not quite conceivable to me that a first seed would lose in the first round, but I may be proven wrong!).

Now that my neural network was all set up, it was time to start training it. Since TensorFlow’s DNN Classifier expects integer class labels, I took my previous y_gt0 array and converted it from Boolean to integer, and then let the training algorithm go to work. Since this dataset wasn’t particularly large, I could let it run on my Intel Core i5 CPU without distributing the workload to the GPU and still get results in less than 10 minutes.

yi=y_gt0.astype(int)
dnn_clf.fit(x=XTrain, y=yi, batch_size=50, steps=40000)
dnn_clf.evaluate(XTrain, yi)
>{'accuracy': 0.9941406, 'global_step': 120000, 'loss': 0.042456727}

Accuracy greater than 99% — not bad at all! The neural network was clearly better suited to predicting March Madness performance than the Random Forest and Stochastic Gradient Descent classifiers. To determine which teams would make deep runs, I trained the neural network to report on which teams would have a WAS greater than 2:

y_gt2=(y>2)
yi=y_gt2.astype(int)
dnn_clf.fit(x=XTrain, y=yi, batch_size=50, steps=80000)  # re-train on the new labels (fit call implied above; step count inferred from the global_step below)
dnn_clf.evaluate(XTrain, yi)
>{'accuracy': 1.0, 'global_step': 200000, 'loss': 0.0007399367}

That 100% accuracy isn’t THAT impressive when you consider that fewer than 3% of all teams have had a WAS of 3 or more in the past 8 years (the record holder is Philadelphia 76er end-of-the-bench legend Kevin Ollie’s National Championship UConn team in 2014, with a WAS of +5). That being said, let’s find out which underrated teams the model predicts will make a deep run this year. Having already read another csv file with this year’s data into a numpy array called XT2018, I ran it through the neural network:

pred_rslt=list(dnn_clf.predict(XT2018))

I then looped through the resulting pred_rslt array to find the index of the teams expected to have a WAS greater than 2:

[i for i, e in enumerate(pred_rslt) if e != 0]
>[0, 9, 28]

The teams the algorithm expected to do well were the ones at index 0, 9 and 28 in my test data. In order to quickly find which teams the neural network was predicting, I created a list to store this year’s March Madness team names in index order:

bd2018=pd.read_csv("/home/liangd/cbb_2018.csv")
Tbd2018=bd2018["Team"]

Now it was time to see these teams:

Tbd2018[0]
>'Virginia'

Whoa! Virginia is the first overall seed, and yet the neural network still predicts that it will make a deep run. Clearly, the neural network was trained only using previous seasons’ data, and is unaware of De’Andre Hunter’s unfortunate injury. Nonetheless, as a graduate of Mr. Jefferson’s University, I was thrilled by this result (and a bit worried that maybe… just maybe… Google may be feeding data about me into its TensorFlow libraries).

Let’s see the other teams:

Tbd2018[9]
>'Tennessee'
Tbd2018[28]
>'Purdue'

The neural network also likes Tennessee, which is (uncomfortably for me) in the same bracket as UVA, and Purdue, an underrated second seed that nobody seems to be picking to go far. I continued training the neural network to identify teams that correspond to different conditions (e.g. WAS greater than 0, greater than 1, less than 0, less than -1, less than -2). I also re-trained on the same conditions a few times, since the results would change slightly each time I trained the neural network. Each training run resulted in an accuracy greater than 98%.
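The retraining loop itself is nothing fancy; here’s a sketch of one pass over a few of those thresholds (the step count is just illustrative, and Tbd2018 holds this year’s team names as defined above):

# Re-train the classifier for several WAS thresholds and print the 2018 teams
# the model flags at each one. I repeated this a few times per threshold since
# the results varied slightly from run to run.
for threshold in [0, 1, 2]:
    yi = (y > threshold).astype(int)
    dnn_clf.fit(x=XTrain, y=yi, batch_size=50, steps=40000)
    preds = list(dnn_clf.predict(XT2018))
    picks = [Tbd2018[i] for i, e in enumerate(preds) if e != 0]
    print("WAS >", threshold, ":", picks)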

Here are 10 things I noticed in the predictions:

  1. The neural network REALLY likes UVA. It was always consistently a +1 or +2 team and was among the most highly rated teams. There was no bias on my part here in training the neural net (though, once again, I’m starting to question how much Google really knows about me and whether it has customized the TensorFlow libraries I downloaded…)
  2. Purdue was repeatedly predicted as a top team. They boast a VERY high effective field goal percentage (58.1%), one of the highest block rates in the tourney (12.8%), and a high assist rate (59.3%). The teams that do better than their seed tend to combine stingy defense with efficient offense, all while sharing the ball and having multiple players who can come through in the clutch.
  3. Everyone’s upset favorite, Arizona, had the widest range of predicted WAS, consistently ranging between a +1 and a -2. That’s the difference between upsetting UVA and making the Elite Eight, and losing in the first round to Buffalo. It seems oddly appropriate that a team so reliant on a single freshman superstar would end up all over the place in predictions. If Arizona basketball were a stock, I’d be trading it with an options spread right about now.
  4. The neural network really liked two 7 seeds, Nevada and University of Rhode Island. Both teams are among the lowest in turnover percentage, which seems to be a necessity for lower seeded teams that want to make a big run, and like previous March Cinderellas boast a higher than average block rate (see Oregon and South Carolina in last year’s tourney).
  5. There is one team the neural network consistently dislikes, rating it as lower than -2 in almost every prediction: Cincinnati. Cincinnati is also the 2 seed that could face Nevada in the 2nd round. If you’re the gambling type (as I sometimes am), hmm…
  6. Other teams the neural network consistently expects to underperform: Auburn, Wichita State and Texas Tech. All 3 teams combine a lower than average effective field goal percentage with a higher than average turnover rate — not a great recipe for postseason success. Auburn, Wichita State and Texas Tech are also 4, 4, and 3 seeds respectively, positions where teams often don’t make the Sweet 16.
  7. Kansas and Villanova were the neural network’s least favorite 1 seeds, with some predictions having them as low as -3, a loss in the second round. UNC was consistently predicted with a -1, corresponding to a loss in the Sweet 16.
  8. Missouri consistently ranked as -1 or +1, corresponding to an upset of Xavier in the second round or a first round loss. There was no data to reflect this, but Missouri truly seems like the wild card of the tournament to me: with Michael Porter Jr back in action, who knows how far they can go?
  9. Among the lower seeds, 12th seeded South Dakota State was the only team to consistently have a positive WAS in the predictions.
  10. West Virginia and Florida would occasionally pop up in the predictions as teams that could make deep runs with WAS of 2 or more — while rarely showing up as teams that could make a run with a WAS of 0 or 1. These seem like high risk high reward teams that are worth a bet on some Sweet 16 or Elite 8 futures.

In light of all these (and some additional) predictions, my final bracket has Virginia, Gonzaga, Michigan State and Purdue in the Final Four, with Virginia triumphing over Purdue in the National Championship game. Whenever my neural network is unclear on the predictions, I simply choose the higher seeded team (trusting that, for the most part, the NCAA Selection Committee does a pretty good job). My full bracket is shown below, but then I decided to do one last experiment… so keep reading to learn more…

… one last experiment…

Since most websites allow you to submit multiple brackets, I decided to try to complete a bracket based on training the neural network to do multiclass classification; that is, a neural network that not only predicts whether a given team can make it past a certain Wins Above Seeding value, but that predicts the exact WAS value a team will have. After multiple rounds of training and a lot of steps, this neural network was also able to reach an accuracy greater than 99% on the previous 8 years’ training data:

dnn_clf.fit(x=XTrain, y=(y+3), batch_size=50, steps=2000000)
dnn_clf.evaluate(XTrain, (y+3))
>{'accuracy': 0.9921875, 'global_step': 1320000, 'loss': 0.059503354}

Note that this time, I’m training the neural network on labels formed by adding 3 to the original WAS data; this is because the classifier expects labels ranging from 0 to 9 (corresponding to the 10 neurons in the softmax output layer I defined earlier). Once training was done, I shifted the resulting output by -3 to get the data back into the original format.
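Getting the 2018 predictions back on the original WAS scale is then just a matter of undoing that shift; here’s a quick sketch, reusing the XT2018 array and Tbd2018 team names from earlier:

# Predict the shifted class (0 to 9) for each 2018 team, subtract 3 to recover
# the WAS scale, and print each team next to its predicted Wins Above Seeding.
pred_was = [int(p) - 3 for p in dnn_clf.predict(XT2018)]
for team, was in zip(Tbd2018, pred_was):
    print(team, was)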

These predictions were drastically different! UVA was suddenly given a WAS of -3, a second round loss! Vanderbilt also consistently showed up as a second round upset victim, and Lipscomb was predicted to advance 2 rounds past its seeding — a 15 seed making the Sweet 16! The multiclass neural network continued to expect Nevada to make a deep run, predicted an early upset for Cincinnati, foresaw an early out for Arizona, and expected Missouri to pull off a second round upset. It also liked West Virginia and Gonzaga’s chances, and predicted that both Duke and Michigan State would make deep runs (which doesn’t help us much since those two teams would theoretically meet in the Sweet 16).

This (rather different) bracket is shown below, with a Final Four of Duke, Purdue (there they are again!), Michigan, and Kentucky, with Duke triumphing over Michigan in the National Championship game. I ended up entering this bracket in a different Tournament Challenge.

Final Thoughts

The beauty of March Madness is its unpredictability. In the end, we’re watching a bunch of young college kids play basketball while figuring out who they are. I know that when I was their age, I certainly… made some unpredictable immature decisions in the process of figuring myself out (I’m in my thirties and am STILL known to make immature decisions on occasion, like staying up late on a work night to train neural networks to predict the outcomes of games involving throwing an orange ball through a hoop). The unpredictability is all part of the fun, and no amount of TensorFlow and GPU optimization will take that away.

That’s why no matter what the algorithms say, I will continue to cheer for, defend against haters (no, great defense ISN’T boring!), and bet on the University of Virginia Cavaliers, my alma mater. Seeing UVA and the Charlottesville community bravely stand up to racism last year makes it all the more rewarding to cheer them on. In closing, all I want to say is: Wa-hoo-wa Vir-gin-i-a!

PS. Here’s that stats scraping Perl script I promised earlier.

#!/usr/bin/perl
# Scrape advanced stats for NCAA tournament teams from saved Sports-Reference
# season pages (plain-text files named 2010 through 2017) into result.txt as a
# CSV. Tournament teams are marked with an "NCAA" note, which also tells us
# where the team name ends and the stat columns begin.

my $IND = 2017;

while ($IND > 2009) {

    open(my $IFILE, "<", "$IND") or die "Can't open $IND: $!";
    open(my $OFILE, ">>", "result.txt") or die "Can't open result.txt: $!";

    while (<$IFILE>) {

        if ($_ =~ "NCAA") {

            my @my_array = split /\s/, $_;

            # The "NCAA" marker shifts right for two- and three-word team names
            if ($my_array[2] =~ "NCAA") {
                $offset = 2;
                $team_name = "$my_array[1] $IND";
            }
            if ($my_array[3] =~ "NCAA") {
                $offset = 3;
                $team_name = "$my_array[1] $my_array[2] $IND";
            }
            if ($my_array[4] =~ "NCAA") {
                $offset = 4;
                $team_name = "$my_array[1] $my_array[2] $my_array[3] $IND";
            }

            # Each advanced stat sits at a fixed column relative to the offset
            $WL   = $my_array[$offset + 4];   # win-loss percentage
            $SRS  = $my_array[$offset + 5];   # simple rating system
            $SOS  = $my_array[$offset + 6];   # strength of schedule
            $PACE = $my_array[$offset + 16];  # pace
            $ORTG = $my_array[$offset + 17];  # offensive rating
            $FTR  = $my_array[$offset + 18];  # free throw rate
            $TPAR = $my_array[$offset + 19];  # 3 point attempt rate
            $TS   = $my_array[$offset + 20];  # true shooting percentage
            $RR   = $my_array[$offset + 21];  # total rebound percentage
            $AP   = $my_array[$offset + 22];  # assist percentage
            $SR   = $my_array[$offset + 23];  # steal percentage
            $BR   = $my_array[$offset + 24];  # block percentage
            $EF   = $my_array[$offset + 25];  # effective field goal percentage
            $TOV  = $my_array[$offset + 26];  # turnover percentage
            $ORB  = $my_array[$offset + 27];  # offensive rebound percentage

            print $OFILE "$team_name,$WL,$SRS,$SOS,$PACE,$ORTG,$FTR,$TPAR,$TS,$RR,$AP,$SR,$BR,$EF,$TOV,$ORB\n";
        }
    }

    close($IFILE);
    close($OFILE);

    $IND = $IND - 1;
}
