This year, Nate Silver’s FiveThirtyEight challenged readers to predict NFL game results better than its forecasting algorithm. The result? Less than 2% of the 20,352 readers who participated bested FiveThirtyEight’s Elo algorithm.
I was one of them.
Using a combination of data sources and 30 different projection models, I ranked 190th out of 20,352 participants. FiveThirtyEight’s Elo placed 432nd. Nate Silver himself placed 406th.
Here’s the spreadsheet I used to make my way toward the top of the leaderboard — keep reading to learn how I did it.
How FiveThirtyEight’s NFL Predictions Game Works
FiveThirtyEight’s NFL predictions game is the first of its kind for gamifying NFL results. Unlike most NFL betting markets, fantasy football challenges, or pick’em competitions, FiveThirtyEight asks you to make a probabilistic forecast for each matchup, picking A) the winning team and B) how confident you are that your chosen team will win. FiveThirtyEight explains:
“After each NFL game finishes, you’ll either gain or lose points based on whether you picked the winning team and how confident you were that it would prevail. The higher a win probability you assign a team, the more points you can earn — but also the more you can lose.”
The scoring system is based on Brier scores that reward the accuracy of forecasts and highly punish overconfidence. The game kept track of each player’s points total throughout the season, providing leaderboards for the entire season, each week, the second half of the season, and the postseason, when points were doubled.
My Data Sources
To beat FiveThirtyEight’s forecasts, I incorporated two other data sources into my process: Yahoo’s crowd pick distribution and a composite of Vegas bookmaker odds.
Yahoo’s crowd pick distribution is available as part of its Pro Football Pick’em game. It shows the distribution of picks by the more than 36,000 fans who play each week.
These pick distributions are of course not win probabilities, but I hypothesized that they could be a useful addition to my forecasts as a measure of popular support for a team.
The second data source I used was a composite of Vegas bookmaker odds. Similar to the Yahoo pick distribution, this could act as a measure of the crowd, with potentially even more accuracy given that it represented real bets with actual money on the line.
Instead of using a single bookmaker’s odds, I searched for a source that aggregated multiple bookies, in an attempt to neutralize any single bookmaker’s bias over the course of a season. I was in luck when I found OddsPortal.com, a European betting odds monitoring service. It provides both historical and real-time average odds from 13–14 bookmakers for free.
I converted OddsPortal’s American odds (a.k.a. “moneyline” or “US odds”) to probability percentages with the following formulas, where z is the moneyline:
For favorites (negative moneyline): z/(z−100)
For underdogs (positive moneyline): 100/(z+100)
I rounded the favorite’s probability to a whole percentage and then set the underdog’s probability to 1 minus that, ensuring the two added up to exactly 100% (the raw implied probabilities included the bookies’ cut and would have summed to more than 100% on their own).
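As a sketch in Python rather than spreadsheet formulas (the function names are my own), the conversion and normalization steps look like this:

```python
def moneyline_to_prob(ml):
    """Convert an American moneyline to an implied win probability.
    Favorites carry negative lines (e.g. -150), underdogs positive (e.g. +130)."""
    if ml < 0:
        return ml / (ml - 100)   # favorites: z/(z-100)
    return 100 / (ml + 100)      # underdogs: 100/(z+100)

def game_probs(fav_ml, dog_ml):
    """Round the favorite's probability to a whole percentage, then give the
    underdog the remainder so the pair sums to exactly 100%. Only the
    favorite's line is used, since the raw implied probabilities for both
    sides include the bookies' cut and would sum to more than 100%."""
    fav = round(moneyline_to_prob(fav_ml), 2)
    return fav, round(1 - fav, 2)

print(game_probs(-150, +130))  # (0.6, 0.4)
```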
Benchmarking with the 2016–2017 Season
To get a sense of the historical performance of each data source, I gathered data from the previous 2016–2017 NFL season. This would serve as my benchmark for experimenting with different combinations of data sources into various projection models.
I utilized the Wayback Machine to get historical Yahoo crowd pick distribution data. Luckily, most of the season and part of the postseason was available.
I started copying and pasting the data from each source into this Google spreadsheet. I dug into FiveThirtyEight’s game code on GitHub to replicate the way they calculated points using Brier scores. Here’s how that code converted into equations in my spreadsheet:
=(probability − outcome of game)*(probability − outcome of game)
where outcome of game = 1 for winning team and 0 for losing team.
Game points calculation: 25 − (Brier score × 100), so a perfectly confident correct pick earns 25 points and a perfectly confident wrong pick loses 75.
(My spreadsheet also replicates the rounding tweak that FiveThirtyEight added a few weeks into the season.)
Let’s look at an example: the first game of the 2016–2017 season, a Super Bowl rematch between the Carolina Panthers and the Denver Broncos.
In the top image, you can see that the Yahoo crowd heavily favored Carolina, at an 81% chance of winning the game. FiveThirtyEight’s Elo, having changed little during the offseason since Denver’s Super Bowl win, favored the Broncos at 60%. The Vegas bookies were a near mirror image of Elo, favoring Carolina at 59%. Denver won the game.
In the bottom image, points were calculated in accordance with Denver’s win. FiveThirtyEight’s Elo performed the best, with a Brier of 0.16 (closer to 0 is better) and so it received 9 points. The Yahoo crowd performed the worst, being quite overconfident in its incorrect choice of the Panthers, with a Brier of 0.66 and a loss of 40.6 points. Vegas also lost points, but because it was not as confident in its choice of the Panthers, it only lost 9.8 points.
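Those point totals follow directly from the Brier formula above: each forecast earns 25 minus 100 times its Brier score, which reproduces Elo’s 9, Yahoo’s −40.6, and Vegas’s −9.8. A minimal Python sketch (function names are my own; FiveThirtyEight’s later rounding tweak is omitted):

```python
def brier(prob, outcome):
    """Squared error between the forecast probability for a team
    and the game outcome (1 = that team won, 0 = it lost)."""
    return (prob - outcome) ** 2

def points(prob, outcome):
    """25 minus 100x the Brier score: a coin-flip 50% pick scores 0,
    a fully confident correct pick earns 25, a fully wrong one loses 75."""
    return 25 - 100 * brier(prob, outcome)

# Forecasts expressed as Denver's win probability; Denver won (outcome = 1).
print(round(points(0.60, 1), 1))  # Elo:    9.0
print(round(points(0.19, 1), 1))  # Yahoo: -40.6
print(round(points(0.41, 1), 1))  # Vegas:  -9.8
```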
I calculated the Brier scores and points for all three data sources for 262 games in the 2016–2017 season. (I excluded only the divisional championship games in the playoffs and the Super Bowl, due to data not being available on the Wayback Machine for Yahoo crowd distributions.)
If FiveThirtyEight had launched their game last year, these three data sources on their own merits would have performed as follows:
- Vegas Odds Composite: 880.5 points
- FiveThirtyEight Elo: 800.2 points
- Yahoo Crowd Pick Distribution: 278.5 points
Vegas was more accurate than FiveThirtyEight. Both were far better than Yahoo’s crowd distribution.
I could have decided right then and there to simply use Vegas odds as my guide for this year’s game, but I was curious whether variations and combinations of the three data sources would produce a higher score.
To beat FiveThirtyEight’s Elo forecasts, I experimented with 30 different methods that mixed, matched, and adjusted the three data sources at my disposal, applying each to the data set of 262 games during the 2016–2017 season. To keep things simple, I did not double points for playoff games. Here’s a brief explanation of each model and how it performed:
- FiveThirtyEight Elo (800.2 points)
- Yahoo Crowd Pick Distribution (278.5 points)
- Vegas Odds Composite (880.5 points)
- Model 1: An average between FiveThirtyEight and Yahoo. Despite combining FiveThirtyEight with a data source that performed significantly worse on its own, when averaged together, it performed slightly better (816.0 points vs. FiveThirtyEight’s 800.2).
- Model 2: Midpoint between FiveThirtyEight and 50%. Given that the game penalized overconfidence, perhaps neutering Elo a bit would help it perform better. It did not. (635.2 points)
- Model 3: Where Leader disagrees with FiveThirtyEight, make no pick. This tied into an exercise I did last year, where I used 200 experts and Reddit’s comment ranking algorithm to win my office NFL pick’em pool. I tracked which expert performed the best throughout the course of the season and, for this model, if the leading expert at the time disagreed with FiveThirtyEight, I didn’t make a pick at all. My hypothesis was that two sources I trusted didn’t agree and therefore I shouldn’t have any confidence as to which team to pick to win. (808.8 points)
- Model 4: Where Leader disagrees with FiveThirtyEight, make mirroring pick. This hypothesized that the leading expert was in fact better than FiveThirtyEight at picking winners. Since I didn’t have exact probabilities from these leading experts, when they disagreed I simply mirrored the FiveThirtyEight probability as their own. (772.2 points)
- Models 5–11: Varying degrees of “turning the dial” up on FiveThirtyEight. These models took FiveThirtyEight’s predictions and bumped them up a notch, by 5, 10, 20, 30, 40, 50, or 100%. The idea was that if FiveThirtyEight was so good at picking winners, then why not go all in on its winner? Obviously, the predictions game harshly punishes overconfidence, so these were some of the worst-performing models. (+5%: 756.1 points, +10%: 645.2 points, +20%: 286.7 points, +30%: -245.0 points, +40%: -846.7 points, +50%: -1431.1 points, +100%: -2750.0 points)
- Models 12–13: Bump up to 100% confidence if FiveThirtyEight above X. These models cranked up the dial to 100% in favor of a single team if FiveThirtyEight was already above a certain forecast. For example, if FiveThirtyEight gave a team higher than an 85% chance of winning, I thought I might gain a few extra points by moving the dial up to project that team having a 100% chance of winning. It performed only marginally better. (>85%: 809.1 points; >86%: 805.1 points)
- Models 14–15: Bump up to 100% confidence if Yahoo above X. Similarly, these models looked at the Yahoo crowd distribution: if more than 95% of the crowd favored a particular team, bumping the dial up to 100% in favor of that team garnered a significant edge. Though they didn’t beat Vegas, these models outperformed FiveThirtyEight’s Elo. (>95%: 853.0 points; >96%: 846.2 points)
- Models 16–18: Compare FiveThirtyEight to the crowd; if a significant difference, go with the crowd. The idea here was that FiveThirtyEight’s Elo algorithm doesn’t include some information that the crowd might have, such as injuries, player rest, or other intangibles. Therefore, if I saw a big difference between the Yahoo crowd and FiveThirtyEight, I could presume that there might be something the crowd knows that could put it at an advantage. The sweet spot seemed to be a difference of 32 percentage points — if that gap existed, I could find an edge and go with the crowd. I attempted a few variations on this idea. (If gap, use Yahoo: 899.5 points; if gap, use Yahoo 100%: 805.2 points; if gap, average Yahoo & FiveThirtyEight: 766.1 points.)
- Model 19: Duplicate model; removed.
- Models 20–22: Raw data source averages. This was my first big breakthrough: averaging all three data sources for each game improved the score dramatically. Averaging just Yahoo & Vegas or just FiveThirtyEight & Vegas didn’t score as well, but both were still promising. (Average of all three: 929.5 points; average of Yahoo & Vegas: 837.7 points; average of FiveThirtyEight & Vegas: 907.8 points)
- Models 23–31: Compare FiveThirtyEight to Vegas; if a significant difference, then choose Vegas or an average of Vegas with others; otherwise use an average. This leveraged the learnings from the previous two types of models, this time evaluating the difference between Vegas and FiveThirtyEight instead of Yahoo and FiveThirtyEight. If the difference between Vegas and FiveThirtyEight passed a certain threshold, I figured the market might know something FiveThirtyEight didn’t, and so I used Vegas outright or an average to offset the uncertainty between the various data sources. This type of model performed really well, up to 28% better than FiveThirtyEight’s projections and up to 16% higher than Vegas alone:
The best performing model for the 2016–2017 NFL season was Model 27, which followed this logic:
- If the difference between FiveThirtyEight and Vegas is more than 13 percentage points, then take an average of the two sources.
- If the difference is between 11 and 13 percentage points, then use Vegas.
- If the difference is 10 percentage points or less, then take an average of all three sources (Vegas, FiveThirtyEight, and Yahoo).
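The three rules above translate to a short branch. Here’s a Python sketch (function name my own; probabilities as decimals, with gaps measured in whole percentage points as in my spreadsheet):

```python
def model_27(p538, vegas, yahoo):
    """Model 27: trust Vegas more the further it diverges from Elo.
    All three inputs are win probabilities for the same team."""
    gap = round(abs(p538 - vegas) * 100)  # gap in whole percentage points
    if gap > 13:
        return (p538 + vegas) / 2         # average FiveThirtyEight & Vegas
    if gap >= 11:
        return vegas                      # use Vegas outright
    return (p538 + vegas + yahoo) / 3     # average all three sources
```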
The problem with some of the models in this last category was that they started to become prone to overfitting. Overfitting occurs when a model fits the given data set too closely and may not be grounded sufficiently in principle to perform similarly well on other data sets. For example, for Model 27, there’s no logical reason I can think of why differences of more than 13 percentage points should be averaged between FiveThirtyEight and Vegas but not Yahoo, 11–13 should use Vegas alone, and less than that should use the average of all three. It’s more likely than not that randomness in this particular data set was producing improvements at arbitrary single-digit thresholds.
Nonetheless, the underlying hypothesis of this last type of model remained a promising one: when there’s a significant difference between Vegas and FiveThirtyEight, the market might know something FiveThirtyEight doesn’t.
2017–2018 Season Strategy & Results
Now that I had a sufficient set of models benchmarked against nearly an entire season of football, I could start playing FiveThirtyEight’s game knowing that I had an edge.
Each week throughout the 2017–2018 season, I collected my three sources of data (Yahoo Crowd Pick Distribution, FiveThirtyEight Elo, and Vegas Odds Composite), pasted them into my spreadsheet, and calculated probabilities for each of my 30 models.
For most of the year I used Model 25, with the exception of Week 1, when I used Model 1 (the average of FiveThirtyEight and Yahoo crowd), and Week 8, when I was out of the country and simply copied FiveThirtyEight because I couldn’t gather my usual data.
I chose Model 25 because it ranked 4th among my 30 models for the 2016–2017 season while resting on fundamentals sound enough that I believed it wouldn’t be overly prone to overfitting.
Here’s how Model 25 worked: it first considered whether there was a difference of more than 10 percentage points between the forecasts of FiveThirtyEight and Vegas. If there was a gap, I would use the forecast as dictated by Vegas. If there wasn’t a gap, I would use an average of all three original data sources.
For example, let’s assume FiveThirtyEight gives a team a 55% chance of winning, Vegas gives the same team a 70% chance, and the Yahoo crowd pick distribution favors the same team at 80%. Since there’s more than a 10-percentage-point difference between FiveThirtyEight and Vegas, I would use Vegas’s projection, 70%, as my own. If instead Vegas gave the team a 60% chance of winning, I would average all three for 65%.
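The example above can be written out in a few lines of Python (function name my own):

```python
def model_25(p538, vegas, yahoo):
    """Model 25: if FiveThirtyEight and Vegas disagree by more than
    10 percentage points, defer to Vegas; otherwise average all three."""
    if round(abs(p538 - vegas) * 100) > 10:   # gap in whole percentage points
        return vegas
    return (p538 + vegas + yahoo) / 3

print(model_25(0.55, 0.70, 0.80))            # 15-point gap: use Vegas -> 0.7
print(round(model_25(0.55, 0.60, 0.80), 2))  # 5-point gap: average -> 0.65
```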
How did this method perform? Over the course of the 2017–2018 season, FiveThirtyEight’s Elo garnered 914.6 points, which placed it in the 98th percentile overall at 432nd place.
In contrast, I achieved 996.2 points, placing in the 99th percentile overall at 190th of 20,352 participants:
I placed 87th on the second-chance leaderboard:
I didn’t do so hot in the postseason, placing in the 40th percentile:
My best single week was Week 7, when I placed 53rd overall, in the 99th percentile:
As well as I performed, if I had used Model 25 consistently throughout the entire year, I would have earned just over 100 more points, for a total of 1,099.9 points and 78th place, compared with my actual finish of 190th place with 996.2 points.
Unlike the 2016–2017 season, when various models performed better than any single data source, this season Vegas single-handedly outperformed the rest. It earned 1,230.1 points, which would have landed it in 16th place out of 20,352 participants!
Here’s how all 30 models and 3 data sources would have ranked against each other:
(Note that Models 3 and 4 weren’t performing well enough to justify the time it took to put them together, so I stopped tracking them a few weeks into the season.)
Looking Ahead to Next Season
With this season in the books, I now have a total of 528 games in my database spanning the last two NFL seasons. Interestingly, while Vegas performed only mildly well during the 2016–2017 season, with another season under its belt it has performed 23% better than FiveThirtyEight.
Assuming FiveThirtyEight continues the game next year, I will likely just follow the market and use the Vegas odds composite to make my forecasts. As a rule of thumb, beating the market is challenging. But when I’m competing against individuals who anchor their selections with FiveThirtyEight’s Elo, simply using the market is a powerful tool.
It’s worth noting that the market doesn’t necessarily beat Elo in the long run. Nate Silver has commented that in backtesting Elo against the market, it beat the spread only 51% of the time.
I would be curious to know how other players on the leaderboard beat the market. If you played the FiveThirtyEight NFL predictions game this year and are willing to share your approach, comment below or get in touch (@dglid here and on Twitter).
I’m considering incorporating a few other data sources next year, namely ESPN’s NFL Football Power Index (FPI) and Football Outsiders’ DVOA ratings. Unfortunately, FPI is quite cumbersome to collect week after week, and DVOA doesn’t account for individual game matchups. If you know where to get this data easily and effectively, let me know.
On Developing Probabilistic Forecasting Skills
FiveThirtyEight’s NFL predictions game is one of the best applications I’ve seen for teaching probabilistic forecasting. The game incentivizes careful consideration of all factors and discourages the pundit-like overconfidence of all-or-nothing predictions.
As the season progressed, I became humbled by upsets over favorites I had at 70%, 80%, and even 90%+ likely to win. I became attuned to the pain of losing so many points when being so seemingly sure about the outcome.
If there’s one thing I’ve learned, it’s that a 70% chance of winning isn’t a sure thing at all. If only we all had learned that prior to the 2016 election…
Congrats to Dave Dexter for winning the game, Han Zhang for being at the top of the leaderboard for so many weeks, Neil Paine for making us all think the game was rigged, and Terry Zhang for his persistent pursuit of the 0th percentile.
Thanks to Justin Carstens for the great conversation in the comments section throughout the season.
Finally, many thanks to my good friend Jay Sher for alerting me to something he knew would be up my alley.
Like this post? Check out last year’s post: How I used 200 experts and Reddit’s comment ranking algorithm to win my office NFL pick’em pool. How’d I do this year? Not so hot — I came in 4th place in my pool, a few wins below both Vegas and FiveThirtyEight.