First, let me start by saying that I know next to nothing about basketball. I’ve played a full game of basketball a total of zero times in my life. My only true exposure is going to some of my sister’s high school games. I barely know the rules of the game, and I certainly don’t follow college or pro basketball.

Despite my complete lack of knowledge, I’m often asked to fill out a March Madness bracket (typically by my more sports-loving family, maybe to help pad the pool…). Historically, I would google a little bit, and then pick basically at random. A few years ago, however, I decided to take a different approach: while I may know nothing about basketball in particular, I do know something about making data-driven predictions. At that time, I was wrapping up my graduate coursework in econometrics, statistics, and machine learning. So I decided to test myself: could I use my quantitative training to predict the NCAA tournament, despite a total lack of domain knowledge?

To be clear, this is not typically an approach I would advocate in other settings: I think the best decisions are data-informed, not blindly data-driven. But in this case, the stakes were low, and I also thought it would be kind of flashy to show how well an algorithm could do, compared to my more basketball-informed friends and family. So I went for it. And the results of this first trial were… not bad! Despite building the whole prediction system over an afternoon, I stayed near the top of the leaderboard for a good chunk of the tournament.

### Simple, Effective… and Flawed

The basic algorithm was simple: I trained a statistical model that would predict the final score difference between any two teams, based on their in-season aggregate statistics, which I downloaded from Sports Reference. That is, given two teams, A and B, I would predict Score A - Score B, for a generic tournament game, based on A and B’s in-season statistics. If Score A - Score B is bigger than zero, the model predicts Team A wins; otherwise, it predicts Team B wins.
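
To make the idea concrete, here is a minimal Python sketch of a score-difference model. It is not the original code: it assumes a single made-up input (the difference between one season statistic for Team A and Team B), a plain least-squares line, and invented training data. The names `GAMES`, `fit_line`, `predict_margin`, and `pick_winner` are hypothetical.

```python
# Hypothetical training data, invented for illustration. Each pair is one past game:
# (Team A season stat minus Team B season stat, Score A minus Score B).
GAMES = [(-6.0, -9), (-3.0, -4), (-1.0, 2), (2.0, 3), (5.0, 8), (7.0, 12)]


def fit_line(pairs):
    """Ordinary least squares for margin = intercept + slope * stat_difference."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_margin(model, stat_a, stat_b):
    """Predicted Score A - Score B for a game between A and B."""
    intercept, slope = model
    return intercept + slope * (stat_a - stat_b)


def pick_winner(model, stat_a, stat_b):
    """Team A is picked when the predicted margin is above zero, otherwise Team B."""
    return "A" if predict_margin(model, stat_a, stat_b) > 0 else "B"


model = fit_line(GAMES)
```

Note that `pick_winner` throws away the size of the predicted margin: a one-point edge and a twenty-point edge produce the same pick.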

With this basic model in hand, I started from the bottom of the bracket, and predicted the winner of each game. Once the first round was predicted, I moved to the next, all the way to the winner. While this system worked surprisingly well, an early upset that year disrupted all its predictions further down in the bracket. This disruption pointed out two key flaws in the model: first, it did not account for uncertainty, and second, its predictions were local.

Uncertainty is fundamental to data-driven decision making. In this case, not accounting for uncertainty means the predictions did not take into account whether a game would be very close, or whether it would be an almost certain win. As long as Score A was higher than Score B, it would predict Team A to win. If we only cared about a single game, this system would work fine. But we care about the whole bracket, and uncertainty at one level should affect predictions in all subsequent rounds.

That brings us to the second flaw of my original model: by focusing locally, on only single games in isolation, the model ignores the fact that all of the games are connected, and that in fact, the most important games are those at the end. Keep in mind, the goal here is to fill out a correct bracket. This can be done by correctly predicting each and every game. However, predicting each and every game is nearly impossible (as I will highlight later). An alternative approach is to consider what we might call a probability distribution over the bracket as a whole, understanding that each level is connected. When combined with modeling uncertainty, we can understand not just how likely it is for one team to beat another in a specific match, but how likely it is for a team to end up in, say, the final four, or the championship.

### Algorithm Revisited

This year, I was invited again to join a bracket pool, so I decided to go back and address those issues with my original method. If you’re curious about the technical details, I’ll include them in a separate post. For now, I’ll describe the basics.

Once again, I based my model on aggregate statistics from in-season play. Specifically, I used data from the “Advanced Statistics” section of Sports Reference from the past five seasons (2012–2016), together with NCAA tournament results from the Kaggle competition. As I said, I don’t really understand basketball enough to know what these statistics mean or how important they might be, but a few examples include the pace factor, which is an estimate of the number of possessions within a 40 minute period, the free throw attempt ratio, and the total rebound percentage.

With this data, I then trained two machine learning models: a regularized generalized linear model for predicting binary outcomes (win/lose) through the R package `glmnet`, and a random forest model from the R package `randomForest`, again for binary outcomes. Both of these models learn relationships between inputs (team statistics) and outcomes (the winner of a tournament game), and can use what they learn to make predictions. Importantly, these models are capable of figuring out which inputs are important, which aren’t, and the exact nature of their relationship to the outcome (positive, negative, strong, weak, etc.). This is important because, again, I know nothing about basketball. Perhaps even more importantly, these models are also capable of describing the uncertainty associated with a prediction: that is, they can say not just which team is most likely to win in a match between A and B, but how probable that outcome is.
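
As a rough illustration of what a regularized model for win/lose outcomes does, here is a small pure-Python sketch. It is a stand-in, not the `glmnet` fit described above: it assumes one made-up input (a statistic difference between the two teams), a ridge-style penalty of `(l2 / 2) * b1**2`, plain gradient descent, and invented data. `GAMES`, `fit_logistic`, and `win_probability` are hypothetical names.

```python
import math

# Hypothetical rows, invented for illustration:
# (stat for Team A minus stat for Team B, 1 if Team A won else 0).
GAMES = [(-6.0, 0), (-4.0, 0), (-2.0, 0), (-1.0, 1),
         (1.0, 0), (2.0, 1), (4.0, 1), (6.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, l2=0.1, lr=0.05, steps=2000):
    """Gradient descent on mean log-loss plus (l2 / 2) * b1**2."""
    b0, b1 = 0.0, 0.0
    n = len(rows)
    for _ in range(steps):
        # Gradient of the mean log-loss; the penalty applies to the slope only.
        g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in rows) / n
        g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in rows) / n + l2 * b1
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1


def win_probability(model, stat_diff):
    """Estimated probability that Team A wins, given the stat difference."""
    b0, b1 = model
    return sigmoid(b0 + b1 * stat_diff)


model = fit_logistic(GAMES)
```

The point of the sketch is the output type: instead of a bare pick, the model returns a number between 0 and 1 that says how confident it is.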

By fusing the results from these two models, I now have a way of simulating the probability that one team beats another in a tournament game. As a hypothetical example, the model predicts that my alma mater, the University of Pennsylvania, would have a mere 4.8% chance of beating this year’s tournament favorite, Villanova, in an NCAA tournament game. This, of course, is purely hypothetical, since Penn is not even in the tournament.
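
The post does not say how the two models' outputs were fused. One simple possibility, shown here purely as an assumption, is a weighted average of the two probability estimates for the same matchup. The function `blend` and the input numbers are hypothetical.

```python
def blend(p_glm, p_forest, weight=0.5):
    """Weighted average of two win-probability estimates for the same matchup."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * p_glm + (1.0 - weight) * p_forest


# Hypothetical model outputs for "Team A beats Team B".
p_a_beats_b = blend(0.30, 0.20)
# The two outcomes are complementary, so the other side follows directly.
p_b_beats_a = 1.0 - p_a_beats_b
```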

With these probabilities in hand, I then simulated the results of the tournament 10,000 times. When simulating the tournament, each game is basically treated as an unfair coin toss, where the probability of seeing heads (i.e. the probability of a given team winning) is based on the model’s probability estimate. When a team’s coin comes up heads, the team advances to the next round in the simulation.

To more concretely understand how this works, I’ll use my Dad’s school, Bucknell, as an example. Bucknell is slated to play West Virginia in the first round. My model predicts that there is a 24% chance of Bucknell winning (sorry, Dad). Thus, for each simulation of the tournament, my computer flips a coin that comes up heads 24% of the time. When it comes up heads, Bucknell advances; when it comes up tails, West Virginia does. It then does a similar simulation for every match in the first round, and then again for all of the teams who have moved to the second round, and so on, until the championship. Because we do this simulation 10,000 times, we can infer how likely a given outcome is based on how many times it occurred across all of the simulations. For example, we can ask how likely it is that Bucknell reaches the Final Four. Again unfortunately for my Dad, Bucknell reaches the Final Four in only 0.7% of the simulations, so we say this event has a 0.7% probability of occurring.
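
The coin-flipping procedure can be sketched in a few lines of Python. This is a toy version under stated assumptions: a four-team bracket, and win probabilities derived from invented "strength" numbers rather than from the fitted models. `STRENGTH`, `win_prob`, `simulate_tournament`, and `champion_frequencies` are hypothetical names.

```python
import random
from collections import Counter

# Hypothetical strengths for a toy 4-team bracket (a stand-in for model output).
STRENGTH = {"A": 4.0, "B": 1.0, "C": 3.0, "D": 2.0}


def win_prob(a, b):
    """Assumed probability that team a beats team b."""
    return STRENGTH[a] / (STRENGTH[a] + STRENGTH[b])


def simulate_tournament(teams, rng):
    """Play one bracket. Returns the winners of each round, first round first."""
    rounds = []
    alive = list(teams)
    while len(alive) > 1:
        winners = []
        for i in range(0, len(alive), 2):
            a, b = alive[i], alive[i + 1]
            # The unfair coin: "heads" (team a advances) with probability win_prob(a, b).
            winners.append(a if rng.random() < win_prob(a, b) else b)
        rounds.append(winners)
        alive = winners
    return rounds


def champion_frequencies(teams, n_sims, seed=0):
    """Share of simulations in which each team wins the whole bracket."""
    rng = random.Random(seed)
    counts = Counter(simulate_tournament(teams, rng)[-1][0] for _ in range(n_sims))
    return {team: counts[team] / n_sims for team in teams}


freqs = champion_frequencies(["A", "B", "C", "D"], n_sims=10_000)
```

The same counting works for any event, such as reaching a given round: count the simulations where it happened and divide by the number of simulations.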

Finally, I can use these simulations to make my predictions. One way you could make predictions is by looking at which bracket occurs the most often. This is similar to what is called the maximum a posteriori, or MAP, estimate of the tournament results. However, this approach does not necessarily take into account that, in many bracket pools, you get more points for predicting the tournament winner than for predicting a winner in round one. Thus, one may want to devise a different approach that implicitly weights the later rounds more highly in choosing a final bracket prediction. In some sense, deciding how to make this decision is equivalent to what machine learners often call choosing a loss function.
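
Here is a small Python sketch of why the choice matters. It assumes a toy four-team bracket, a hypothetical pool that awards 1 point per correct semifinal pick and 4 points for the champion, and an invented set of eight simulated outcomes. With these numbers, the most frequent bracket and the bracket with the best average score are not the same.

```python
from collections import Counter

# Hypothetical simulation output: ((semifinal winners), champion), invented numbers.
SIMS = (
    [(("A", "C"), "C")] * 3
    + [(("A", "C"), "A")] * 2
    + [(("A", "D"), "A")] * 2
    + [(("A", "D"), "D")] * 1
)

# Assumed pool scoring: points per correct semifinal pick, points for the champion.
SEMI_POINTS, CHAMP_POINTS = 1, 4


def score(pick, actual):
    """Points a picked bracket earns against one actual (or simulated) outcome."""
    semis = sum(p == a for p, a in zip(pick[0], actual[0]))
    champ = CHAMP_POINTS if pick[1] == actual[1] else 0
    return SEMI_POINTS * semis + champ


def most_frequent_bracket(sims):
    """The single bracket that shows up most often across the simulations."""
    return Counter(sims).most_common(1)[0][0]


def best_expected_score_bracket(sims):
    """Among simulated brackets, the one with the highest average pool score."""
    candidates = sorted(set(sims))
    return max(candidates, key=lambda pick: sum(score(pick, s) for s in sims))


map_pick = most_frequent_bracket(SIMS)
points_pick = best_expected_score_bracket(SIMS)
```

In this toy data the most common single bracket has C as champion, but A wins the title in more simulations overall, so the score-weighted pick favors A.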

In my case, I made my prediction by first looking at the predicted probabilities of the champion. I then fixed the champion at the most likely value, which also partially fixes the rest of the bracket, as the champion has to win all prior games to become the champion. Then, I looked at the team most likely to end up in the slot opposing the champion in the final game, and again fixed this team, which again partially fixed the rest of the bracket. I kept doing this, moving down through the bracket, until all picks were made.
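
This top-down procedure can be sketched in Python as follows. The sketch assumes each open slot is filled with the team that wins that slot most often across all simulations (the post does not say whether the frequencies were conditioned on earlier picks), and it uses a toy four-team bracket with invented simulation results. `greedy_top_down`, `TEAMS`, and `SIMS` are hypothetical names.

```python
from collections import Counter


def greedy_top_down(teams, sims):
    """Fill a bracket from the final backwards.

    teams: team names in bracket order (length is a power of two).
    sims:  simulations, where sims[k][r][g] is the winner of game g in round r.
    """
    position = {team: i for i, team in enumerate(teams)}
    n_rounds = len(sims[0])
    picks = [[None] * len(sims[0][r]) for r in range(n_rounds)]

    def fill(r, g, forced):
        if forced is None:
            # Open slot: take the team that wins this game most often.
            winner = Counter(s[r][g] for s in sims).most_common(1)[0][0]
        else:
            # Slot already implied by a later-round pick.
            winner = forced
        picks[r][g] = winner
        if r == 0:
            return
        for feeder in (2 * g, 2 * g + 1):
            # The picked winner must also have won the feeder game on its own path.
            on_path = (position[winner] >> r) == feeder
            fill(r - 1, feeder, winner if on_path else None)

    fill(n_rounds - 1, 0, None)
    return picks


# Hypothetical simulations: [[semifinal winners], [champion]].
TEAMS = ["A", "B", "C", "D"]
SIMS = [
    [["A", "C"], ["A"]],
    [["A", "D"], ["A"]],
    [["A", "D"], ["D"]],
    [["B", "C"], ["C"]],
    [["A", "D"], ["A"]],
]
picks = greedy_top_down(TEAMS, SIMS)
```

Fixing the champion first guarantees the bracket is internally consistent: no team is picked to win a game it was never picked to reach.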

### Results

While the outcome is far from certain, the most frequent tournament winner across all the simulations was… Villanova. Specifically, Villanova was the winner of the tournament in 14.59% of the simulations. You may ask, only 14.59%? Such a low number underscores how uncertain this whole thing is: there are many paths to the championship, and many opportunities for teams to be knocked out. Even small probabilities of losing at each stage can lead to a large probability of being eliminated before the finals. This is why no one ever predicts this thing correctly.
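
A quick back-of-the-envelope calculation shows how fast this compounds. The sketch below assumes a hypothetical favorite that wins each of the six games on its path with probability 0.75, and treats the games as independent. Even that team wins the title only about 18% of the time.

```python
def title_probability(per_game_win_probs):
    """Probability of winning every game in the list, assuming independence."""
    prob = 1.0
    for p in per_game_win_probs:
        prob *= p
    return prob


# Hypothetical favorite: 75% to win each of the six rounds of a 64-team bracket.
favorite = title_probability([0.75] * 6)
```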

The top 10 most likely tournament champions, according to the model, are given below, along with their probabilities of winning:

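| Team | Champion prob. |
| --- | --- |
| Villanova | 14.59% |
| Duke | 11.39% |
| North_Carolina | 8.45% |
| Michigan | 8.30% |
| Oregon | 5.45% |
| West_Virginia | 5.13% |
| Southern_Methodist | 5.00% |
| Wichita_State | 4.90% |
| Gonzaga | 4.70% |
| Kentucky | 4.63% |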

Following my strategy of starting at the top of the bracket and moving down, I next looked at which team will most likely oppose Villanova in the finals. Here are the top 5, again with associated probabilities:

| Team | Prob. |
| --- | --- |
| Michigan | 16.49% |
| North_Carolina | 15.59% |
| Oregon | 11.30% |
| Wichita_State | 10.48% |
| Kentucky | 9.62% |

So Michigan is Villanova’s most likely opponent in the finals. Notice two things: these probabilities are a little higher than the ones for the champion, but they are still pretty low. Again, there is a huge amount of uncertainty involved in making it to the finals, with plenty of opportunities to lose further down in the bracket.

As we move to the Final Four, we see the same patterns continue:

| East | F.4 Prob | Midwest | F.4 Prob |
| --- | --- | --- | --- |
| Villanova | 30.17% | Michigan | 27.02% |
| Duke | 26.75% | Oregon | 20.49% |
| Southern_Methodist | 15.88% | Creighton | 13.19% |
| Wisconsin | 10.71% | Kansas | 9.10% |
| Virginia | 4.20% | Iowa_State | 8.79% |

| South | F.4 Prob | West | F.4 Prob |
| --- | --- | --- | --- |
| North_Carolina | 26.13% | Arizona | 21.24% |
| Wichita_State | 18.71% | West_Virginia | 19.62% |
| Kentucky | 17.00% | Gonzaga | 18.16% |
| Cincinnati | 11.93% | Notre_Dame | 14.95% |
| Middle_Tennessee | 7.30% | Florida_State | 5.97% |

The probabilities here are higher than in the finals, since there is less uncertainty as to what happened in previous rounds. Still, things are far from certain. Also note that here, I include the East and Midwest branches, despite the fact that my method has already fixed Villanova and Michigan to win these two, respectively.

Carrying on like this, I made my picks. You can see the full set of predictions on Yahoo. The final round predictions are:

• Winner: Villanova
• Finals: Villanova, Michigan
• Final Four: Villanova, Michigan, North Carolina, Arizona
• Round of 8: Villanova, Michigan, North Carolina, Arizona, Duke, Kansas, Wichita State, West Virginia

I don’t know enough to know whether or not these picks are surprising. What I do know is, there’s a very low probability that they’re all correct. If there’s one thing I’ve learned from this experiment, it’s that predicting highly connected outcomes like this is complicated, and filled with uncertainty. Nonetheless, I’m very curious how these blindly data-driven picks do, when pitted against the human experts in my family. I guess we will see in the days to come!

For the more technical readers, I’m hoping to post a follow-up with a more detailed description of the methodology and with the R code. When that happens, I’ll post a link here.