Using Genetic Algorithms to gain insights from MadNet, the multi-class classification neural network for March Madness

reHOOPerate

Jan 27, 2020

Every College Basketball Team is a Chromosome

In my last update, I described MadNet, an updated neural network for March Madness bracket predictions. For an explanation of the original idea behind MadNet, check out my original March Madness neural network blog post here. Given that MadNet analyzes statistical data to output how many games a team is expected to outperform (or underperform) their seeding by, I thought it would be interesting to see if we could use it to learn which statistical traits correspond to teams that do better in March than during the regular season (or conversely, which statistical profiles correspond to teams that do worse in March). However, neural networks are notoriously difficult to interpret and understand. We can show that they have great predictive power, but we don’t know why they make the predictions they do. In fact, this “interpretability” of neural networks is one of the most active areas of research in machine learning today, with important academic papers on the topic published often. But as of now, much of the current research still feels like the blind men and the elephant, with each new finding uncovering some aspect of the problem but no cohesive theory yet available.

In this blog post, I decided to experiment with Genetic Algorithms as a way to interpret results from MadNet. I generated a random population of team profiles, then set aside the profiles that had a Win Above Seeding (WAS) greater than 1:

Using the neural network to find randomly generated teams with Win Above Seeding greater than 1
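
If you want to follow along at home, here’s a minimal sketch of this generate-and-filter step. The predict_was() helper, the 11-field profile, and the population size are my assumptions standing in for the real MadNet pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

N_TEAMS = 100_000  # population size (mirrors the underperformer run below)
N_STATS = 11       # advanced-stat fields per profile (assumed count)

def predict_was(profiles):
    """Stand-in for MadNet: maps team profiles to predicted Wins Above
    Seeding, e.g. an argmax over the classifier's WAS output classes.
    Here it just returns noise so the sketch runs end to end."""
    return rng.normal(0.0, 1.5, size=len(profiles))

# Each "chromosome" is a vector of percentile-ranked advanced stats in [0, 1]
population = rng.random((N_TEAMS, N_STATS))

# Keep only the profiles the network expects to beat their seeding by 1+ games
survivors = population[predict_was(population) > 1]
```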

and I took the resulting 1,464 teams and bred them:

Mixing chromosomes of successful teams

using a random method where each data field had a 5/11 chance of being selected from each parent “chromosome” (really just the randomly generated advanced statistics for the team), and a 1/11 chance of being randomly regenerated in order to avoid local minima. I then ran the population generated from this round of breeding through the neural network, this time retaining the teams that had a Win Above Seeding greater than 2. In this case, 72 out of 732 teams made the cut. I then bred those teams (using the same approach described above, with some randomization to avoid local minima), took the resulting 36 teams, and ran them through the neural network, retaining the teams that had a Win Above Seeding greater than 3. This resulted in a final population of 9 teams, all of which outperformed their seeding by a very large amount, as shown below (note that each column corresponds to the percentile rank of the relevant statistical field, as described in the original March Madness neural network article):

Genetic algorithm generated teams with WAS greater than 3
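
Continuing the sketch above, the breeding step might look something like this (the 5/11, 5/11, and 1/11 odds are from the post; the exact pairing scheme is my guess):

```python
def breed(parent_a, parent_b, rng):
    """Cross two team 'chromosomes' field by field: each stat has a 5/11
    chance of coming from parent A, a 5/11 chance of coming from parent B,
    and a 1/11 chance of being re-randomized to help escape local minima."""
    child = np.empty_like(parent_a)
    for i in range(len(parent_a)):
        roll = rng.random()
        if roll < 5 / 11:
            child[i] = parent_a[i]
        elif roll < 10 / 11:
            child[i] = parent_b[i]
        else:
            child[i] = rng.random()  # mutation: fresh random percentile rank
    return child

rng.shuffle(survivors)  # random pairings: 1,464 parents -> 732 children
children = np.array([breed(survivors[i], survivors[i + 1], rng)
                     for i in range(0, len(survivors) - 1, 2)])
```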

I used k-means clustering to group these 9 teams around two cluster centers. One of these cluster center teams has a lowly 58% win rate (corresponding to a team that is most likely low seeded), but has a top-notch 99th percentile strength of schedule, a three point attempt rate in the 91st percentile, and a steal rate in the 82nd percentile. The team is not good (and is even explicitly bad) at everything else — an indication that lower seeded teams that make a big run to the Final Four are guard heavy, get lots of steals, and shoot lots of threes (even though they don’t make many of them during the regular season, they suddenly rely on a stretch of great outside shooting once March Madness kicks off). The other team has a stellar 97% win rate but a more mediocre strength of schedule, most likely corresponding to a mid-major team that was able to dominate within its own conference. This team had a high rebounding rate and true shooting percentage, meaning that if you play sub-par competition, you had better completely dominate scoring and the glass if you want to advance come March. At the same time, this team didn’t have a steal rate as high as the 58% win rate team, perhaps meaning that guard oriented teams are only upset favorites if they underperformed during the regular season.

Chromosomes can also be good models for basketball teams
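
The clustering step itself is a one-liner with scikit-learn. This sketch feeds it a placeholder array where the real pipeline would pass the 9 GA survivors:

```python
from sklearn.cluster import KMeans

# Placeholder for the 9 surviving profiles (rows of percentile-ranked stats);
# in the real pipeline these come from the final breeding round above
final_teams = rng.random((9, 11))

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(final_teams)

# Each cluster center reads as a representative "centroid team"
for centroid in kmeans.cluster_centers_:
    print(np.round(centroid, 2))
```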

I ran the same genetic algorithm for teams that underperformed their seed, with 573 out of 100,000 teams performing below a -1 WAS in the first round and 17 out of 288 teams performing below a -2 in the second. In the third round, none of the 9 teams performed below -3, but 2 of the 9 still performed below -2:

Worst performing March Madness teams as determined by the genetic algorithm
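
Both directions of the search can share one generational loop. Here’s a sketch reusing the predict_was() and breed() stand-ins from above, flipping the comparison for negative thresholds:

```python
def run_generations(population, thresholds, rng):
    """Filter the population against each WAS threshold in turn, breeding
    the survivors between rounds. Positive thresholds keep overperformers
    (WAS > t); negative thresholds keep underperformers (WAS < t)."""
    survivors = population
    for t in thresholds:
        was = predict_was(survivors)
        survivors = survivors[was < t] if t < 0 else survivors[was > t]
        if t != thresholds[-1]:  # breed between rounds, not after the last
            rng.shuffle(survivors)
            survivors = np.array([breed(survivors[i], survivors[i + 1], rng)
                                  for i in range(0, len(survivors) - 1, 2)])
    return survivors

best = run_generations(population, (1, 2, 3), rng)      # Final Four sleepers
worst = run_generations(population, (-1, -2, -3), rng)  # upset victims
```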

Both of these teams have a solid win percentage and strength of schedule (which makes sense, since these are teams that are victims of huge upsets, and only 1 or 2 seeds suffer upsets with a WAS of -3 or lower). Both teams have a very high assist percentage (see my previous research for more on the history of high seeds with high assist percentages performing poorly) and were also among the bottom sixth in pace. That being said, UVA played at the lowest pace in the country the previous season and still won it all — so take that finding (like all the conclusions drawn from these neural networks) with a huge grain of salt!

More Experiments with other Machine Learning Techniques

My experiments with genetic algorithms got me wondering: what if I simply took the top performing teams (or worst performing teams) out of a randomly generated set (as determined by the neural network) and performed k-means clustering on them to determine representative “centroid teams” to examine?

I first tried this by running 1,000,000 randomly generated teams through the neural network and keeping the 14,023 that had a Win Above Seeding greater than 1. I then performed k-means clustering with 4 clusters on those 14,023 teams, and the resulting 4 cluster centroids had win-loss percentages of 80%, 60%, 77% and 75%. The centroid team with the highest winning percentage (80%) also had the highest strength of schedule, and played at a pace ranked significantly lower than its offensive rating. It also had a higher effective field goal percentage and a significantly lower assist rate than the other teams (again, the top seeds with high assist rates tend to get upset). The centroid team with the lowest winning percentage had a high steal rate (steal percentage again! — a necessity for lower seeded teams with poor records to perform well in the tournament), as well as a high assist percentage (lots of assists doesn’t seem to be a problem for lower seeded teams, only the higher seeds). In fact, all 4 of these teams performed well above the mean on steal rate — and, interestingly, also had poor turnover rates. It seems like teams with high regular season turnover rates may suddenly find themselves focusing more and tightening up their ballhandling and passing in the postseason. Guess that means it’s important not to shy away from teams just because they’re turnover prone once March comes around.
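
Skipping the breeding entirely makes the pipeline even simpler. A sketch, again with predict_was() and random profiles standing in for the real data:

```python
# Generate a much larger random population and filter it in one shot
big_population = rng.random((1_000_000, 11))
top = big_population[predict_was(big_population) > 1]  # the post keeps 14,023

# Four "centroid teams" summarizing the top performers
centroids = KMeans(n_clusters=4, n_init=10, random_state=0).fit(top)
print(np.round(centroids.cluster_centers_, 2))
```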

When I took a look at randomly generated teams with a Win Above Seeding less than -1 (corresponding to a top 4 seed being upset in the first round), 5,819 out of 1,000,000 teams fit the criteria. These teams had both a high rebounding rate and offensive rebounding rate, seemingly indicating that rebounding rate is not a great predictor of March Madness success — and that high seeded teams that rely only on rebounding are most at risk. Perhaps much of rebounding comes down to effort — and every team starts exerting more effort in March?

When I examined randomly generated teams with a WAS greater than 3 (corresponding to a team seeded lower than 8 making the Final Four), the resulting centroid teams — like the teams with a WAS greater than 1 — had a very high turnover rate. It looks like regular season turnover rate REALLY doesn’t mean much come tournament time (maybe this even reflects young underclassmen guards improving their decision making over the course of the season?). It was when I took a look at teams with a WAS lower than -3 (i.e. 1 seeds that really underperform) that I noticed something really interesting:

Cluster centers of teams that really underperform with WAS<-3

All of these teams have a great win-loss record and strength of schedule (as expected of top seeded teams), but they also had egregiously low free throw rates and egregiously low rebounding percentages. Curious, I took a look at recent history and noticed that UVA ranked a frighteningly low 345th out of 351 teams in Free Throw Rate in 2018, when they had the ignominious distinction of being the only 1 seed ever to lose to a 16 seed. The next year, when they won the championship in 2019, their Free Throw Rate ranking had improved to a less extreme 281st out of 351 teams. I decided to take a look at the recent history of every team with a Free Throw Rate ranking of #305 or worse out of 351, and here’s what I found:

2019 —

Wisconsin, 4 seed that lost in a first round upset, #310 in FTR (I even made hundreds of dollars betting against them in this game!)

Michigan, 2 seed that was upset in the Sweet 16, #308 in FTR

2018 —

Kansas, 1 seed that performed to expectations and lost in the Final 4, #328 in FTR

UVA, 1 seed that lost in the 1st round, #345 in FTR

UNC, 2 seed that was upset in the 2nd round, #317 in FTR

Creighton, 8 seed that was upset in the 1st round, #337 in FTR

2017 —

UVA, 5 seed that played to expectations and lost in the 2nd round, #349 in FTR

UCLA, 3 seed that played to expectations and lost in the Sweet 16, #342 in FTR

Iowa State, 5 seed that played to expectations and lost in the 2nd round, #336 in FTR

Creighton, 6 seed that was upset in the 1st round, #335 in FTR

St. Mary’s, 7 seed that played to expectations and lost in the 2nd round, #334 in FTR

Notre Dame, 5 seed that lost in the second round, #328 in FTR

Michigan, 7 seed that went one beyond the expected second round, #305 in FTR

2016 —

Michigan State, 2 seed that was upset in the 1st round, #330 in FTR

Iowa State, 4 seed that performed to expectations and lost in the Sweet 16, #350 in FTR

Out of 15 teams that were at the very bottom in Free Throw Rate, only 1 outperformed expectations, and 7 were upset. Even some of the teams that won barely eked out victories. It seems like we’re finding a useful rule of thumb here: don’t pick teams among the absolute bottom in Free Throw Rate (i.e. ranked #305 or worse out of 351 teams) to advance far in the tourney, and maybe even pick the high seeded teams with abysmal Free Throw Rates to be upset. In fact, it seems like it might be a good idea to take the lower seeded team against the spread whenever they’re playing a higher seeded team from that FTR basement!
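
This screen is easy to automate for a bracket. Here’s a minimal sketch with a hypothetical table of tournament teams; the column names and sample rows are illustrative, not real data:

```python
import pandas as pd

# Hypothetical tournament field with national Free Throw Rate rankings
# (1 = best of 351 teams); rows here are illustrative examples only
field = pd.DataFrame({
    "team": ["UVA", "Wisconsin", "Michigan", "Gonzaga"],
    "seed": [1, 4, 2, 1],
    "ftr_rank": [345, 310, 308, 60],
})

# Rule of thumb from above: flag teams in the FTR basement (#305 or worse)
# as fade candidates, especially the high seeds
fade = field[field["ftr_rank"] >= 305].sort_values("ftr_rank", ascending=False)
print(fade)
```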

Somehow, the moment a team raises their FTR above the absolute basement, they start winning, like UVA in 2019 (281st) or Villanova in 2018 (293rd). Overall, examining these “centroid teams” has shown that teams with a high three point rate tend to both underperform AND overperform a lot, which confirms the intuition that 3 pointers tend to introduce variance to the game. In addition, teams that did better tended to play at a below average to slightly above average pace, and successful teams tended to have a pace ranking that was lower than their offensive rating ranking (in fact, a cool rule of thumb for seeing how a team measures up come March is to take their offensive rating ranking versus other teams that season and divide it by their pace ranking versus other teams that season). All that being said, the low FTR tendency just might be the coolest thing I’ve found from experimenting with applying different random data to the MadNet March Madness Neural Network!
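
And here’s that pace rule of thumb as a tiny helper, assuming both rankings are percentile-style (higher is better for offensive rating, higher is faster for pace), which is my reading of the post:

```python
def march_ratio(off_rating_rank, pace_rank):
    """Offensive rating ranking divided by pace ranking, both relative to
    the rest of that season's teams. Ratios above 1 match the profile of
    the successful centroid teams: efficient offense at a modest pace."""
    return off_rating_rank / pace_rank

# Example: a 90th-percentile offense at a 60th-percentile pace scores 1.5
print(march_ratio(90, 60))
```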
