An Intro to Quantitative Modeling for Sports Bettors (in Excel)
The explosive rise and spread of legalized sports betting in the United States has prompted many to develop (or renew) an interest in a more quantitative, data-driven methodology for predicting the outcomes of sporting events.
Intro to Monte Carlo Simulations
Historically, any sportsbook operator generating their own odds (as opposed to just copying bet365 or Pinnacle markets) or posting live projected win probabilities (e.g. ESPN Gamecast) has done so using some form of stochastic simulation, most commonly a Monte Carlo simulation. This is a method for iteratively evaluating a deterministic model using sets of nondeterministic (i.e. random) numbers as inputs.
To make sure we all understand what this means in this particular context, suppose I asked what the probability was of rolling a 1 on a single throw of a six-sided die. You likely know that immediately to be 1/6 or approximately 16.7%. However, for the less quantitatively-oriented, we could:
- Roll a six-sided die, perhaps 100,000 times
- Tally up all the results
- Convert those to frequencies
- And then say that, given our sufficient sample size, we are reasonably certain that the frequency with which each outcome occurred during our trial is indicative or predictive of the frequency with which it will occur in the future
Of course, no one is going to sit around a roll a die 100,000 times, nor could they do so in an independent and identically distributed (iid) fashion. Fortunately, all of your favorite spreadsheet programs and programming languages allow for this to be done many many times, very very quickly. You can see the syntax for one iteration in Excel and Python underneath the graphic above, just for the sake of illustration.
An Example: Yankees vs. Red Sox (June 29, 2019)
How does this apply to sports? Believe it or not, most odds, win probabilities, and score projections are generated by simulating matchups between teams or players in this exact fashion. We can illustrate this using as an example the matchup between the New Yankees and Boston Red Sox that took place at London Stadium in the United Kingdom on June 29th, 2019. If you want to follow along or toy around yourself, you can download the macro-enabled Excel workbook here.
- I first pulled both the Yankees and Red Sox game logs for the regular season until that point from baseball-reference.com
- I then calculated the average number of runs each team had scored per game until that point
- I also included how many runs those teams allowed per game
- I then applied a very common transformation intended to account for the fact that, given any team’s offensive output, the quality of the opponent’s defense will affect the likelihood with which you would expect them to perform at a particular level. We will consider this value our mean (μ) going forward.
- We will, of course, want to include the standard deviation or variability in offensive output. In a more robust model, we might consider applying a similar adjustment to our standard deviation as we did to our average. In theory, the variability in a team’s Runs Scored should impact the variability in their opponent’s expected Runs Against, and the variability in a team’s Runs Against should impact the variability in their opponent’s expected Runs Scored.
- Then, we will use the inverse of the cumulative normal distribution function, which takes 3 parameters:
Congratulations, we have now “simulated” a matchup between the Yankees and Red Sox in which the Yankees won by a score of 10.147 to 7.945.
However, just as with our dice example, we want to run this simulation not once, but, perhaps 10,000 times:
Finally, we can calculate the frequency with which each time won in our simulated matchups and convert or compare those to odds for betting purposes.
A team that wins with a probability of 53.8% would be represented at fair odds of -116 (1.86, for international readers), while their counterpart with a win probability of 46.3% would be represented at fair odds of +116 (2.16). With standard juice applied, this would likely hit the market at Yankees -129, Red Sox +107.
If you’re noticing that the two probabilities add up to 100.1%, that is only because of rounding for the sake of this graphic.
The market on bet365 opened at Yankees -135, Red Sox +114, which is quite close to what our very primitive model has suggested thus far, though there may have been some luck involved. At this point, it is too early to tell but skepticism is often the prudent viewpoint.
Looking Under the Hood
This tactic of first creating a mathematical representation of an event, and then iterating through it over and over is a standard part of any data scientist’s toolkit and is used to make predictions that you might interact with regularly. Weather forecasts and economic projections are two very common examples, but for some it is not entirely intuitive why the method is effective in the first place. If this does not describe you, feel free to scroll down to the section titled Home Field Advantage.
We recall from statistics that normally distributed data sets follow a bell curve, under which a specific proportion of values can be found within a given distance from the mean.
Using only the mean and standard deviation of any normal distribution, we can construct a chart like the above. In the context of the model we’ve built thus far, this chart can be read as the likelihood with which the Yankees (or a team statistically identical to the Yankees) would score any number of runs against the Red Sox (or a team statistically identical to the Red Sox). Since normal distributions are symmetric about the mean, we would say there is a:
- 50% chance of scoring up to 5.142 runs
- 50% + 34.1% = 84.1% of scoring up to 8.143 runs
- 50% — 34.1% = 15.9% chance of scoring up to 2.141 runs
You’ll notice that in the case of the Yankees at this point in 2019, moving two standard deviations to the left of the mean yields a prediction of a negative score (-0.860). We will come back to this momentarily.
When we were building our simulation, we used the inverse of the cumulative normal distribution function along with 3 parameters. The cumulative normal distribution function itself shows, for a given distribution and value of x, the probability of a randomly selected value being less than x. In other words, what percentage of the data falls to the left of x. One final way to internalize this that is relevant here is to imagine choosing a value x along the x-axis and then asking what the probability is that a randomly selected point under the curve will be to the left of that x value.
The inverse of this function does…the inverse. It allows us to input a distribution (parameterized by a mean and standard deviation) and a percentage or probability, and then produces as output the x value for which the supplied percentage of data falls to the left.
The percentage or probability that we are supplying, in this case in Excel, comes from a (pseudo) random number generator that outputs a value greater than or equal to 0 and less 1, notated mathematically as [0,1). When the simulation was run to construct these graphics, that random value produced by Excel for the Yankees was ~0.95234.
One interpretation of this is that, for a bell curve representing a distribution with a mean of 5.142 and standard deviation of 3.001, 95.234% of the values fall to the left of 10.147. Another way of saying this is that the Yankees (or a team statistically identical to the Yankees) will score 10.147 or fewer runs against the Red Sox (or a team statistically identical to the Red Sox) in 95.234% of matchups.
As long as the assumption of a normal distribution holds, iteratively using random values between 0 and 1 to generate simulated scores for each team will give us a good approximation of how the game we parameterized is likely to play out (hopefully). For other sports and events, a normal distribution may not be ideal. For very low-scoring sports like soccer, it may make more sense to use a Poisson distribution or a negative binomial distribution to model the game, rather than a normal distribution.
In this particular case, as was briefly referenced earlier, it is possible to obtain a negative number of runs scored. For the purposes of the model we are building, we will employ logic that converts any negative scores to a score of 0.000. This is a place where we see the art of modeling truly fusing with the science. Perhaps a simulated score of Yankees: -4, Red Sox: -3 is just as informative as a simulated score of Yankees: 5, Red Sox: 6. Or, perhaps one wants to completely remove the entire simulated matchup in the case that either team produces a result that would be impossible in the real world. Stepping back for just a moment, we see this is not the only aspect of our model that leaves room for creativity and also for error. We did not take into account injuries, momentum, weather, stadium dimensions, starting pitchers, and a host of other important data points that may impact our predictions.
Often, one of the best ways to learn to build models is to start with the minimum viable product and then incrementally add to and scale it. In this case, we will first look at incorporating home field advantage as a feature of our model.
Home Field Advantage
During the 2018 MLB regular season the home team won 1,277 games (52.6%), while the away team won 1,149 games (47.4%). Thus, someone building a model amidst the 2019 season may ascribe a 2.6% incremental win probability to a team playing at home, all else being held equal.
This is not a home game for the Red Sox, as it was played in London at a neutral site, but suppose it was. One option would have been to only use Runs Scored and Runs Against data from Red Sox home games and Yankees away games in the first place, so that no further modifications need to be made. In some cases, this would be a viable approach, though we may hesitate to think this is one of them, especially because of the limiting factor it has on an already-small sample size.
Suppose our research revealed that, all else equal, home field advantage can be expected to add 0.32 expected runs to a team’s output, after adjusting for the quality of the opponent’s defense. It is particularly easy to incorporate this into our model by simply adding 0.32 to the Adj. Runs Scored for the team with home field advantage.
This occurs before the value is passed as a parameter to the inverse of the cumulative normal distribution function, (NORM.INV, in Excel parlance), so no further adjustments are needed. The same can be said for any factor that adds or subtracts a specified absolute contribution to or from the expected output. For example, if a particular team scores 0.73 fewer runs in games immediately following a cross-country flight, and the game we are trying to model fits that criteria, it would be very easy to incorporate into the simulation.
In our case, neither team had home field advantage, so we will not use such a factor, but it is still an option that can be toggled into our Excel model.
An Ensemble Approach to Modeling
Not only can we adjust our model for factors like home field advantage, but we can also layer on other methodologies in order to take more of an ensemble approach to modeling, whereby multiple different models and/or datasets are used to predict the same outcome.
For example, Bill James and SABR have published a formula for calculating probabilities of victory in head-to-head MLB matchups, based solely on winning percentages. The derivation is clever and worth reading, and yields the following:
In our case, the Yankees and Red Sox had winning percentages of:
Under this formula, we have:
At this point, we have a win probability for each team from our Monte Carlo simulation…
…which may or may not include a home field advantage as a factor, and another win probability for each time from the SABR formula:
We could simply assign them equal weights such that:
However, we may want to apply any number of other approaches to weighing these two values. Perhaps there is historical research which would be helpful. Alternatively, we could backtest a number of different weights on other matchups.
Leveraging Random Number Generators (RNGs)
We saw the power of using random number generators to approximate future probabilities, but can also use a very similar technique to zero in on the ideal weightings, as long as we have a goal in mind. One such goal would be most closely mirroring the bet365 opening line, which was Yankees -135, Red Sox +114. Without such a goal, how would one even define what it means to associate an “ideal” weight with a given probability?
We need to correct for the vig, or house edge, being charged, because that has nothing to do with the predicted outcome of the game itself and more to do with the risk management and financial model of the operator.
One way to do this would be to remove the house edge from the bet365 line in order to find the fair odds, convert those to probabilities:
…and then find the weightings,
Another option, which we will pursue for the sake of explanation, is to calculate the bet365 house edge…
…and use that factor to add juice to the fair odds produced by our model, and then tweak the weightings so that those fair odds, once adjusted by the bet365 house edge, converge to the actual opening bet365 lines.
We start by using the same random number generator from our Monte Carlo simulation, in Excel this is =RAND(), which we recall produces a random value in the range of [0,1). We will assign a random value to the weighting…
…which will multiply our Monte Carlo-based win probability…
…and then assign 1 minus that value…
…to the weight on our SABR-based win probabilities:
For our first iteration or trial, we will produce one implied winning probability for the Yankees:
…and one for the Red Sox:
Once we generate other random numbers in order to produce probabilities pairs of win probabilities,
…we will further hone the weighting.
For the sake of analytic hygiene, we should quickly check that for any trial n,
If not, we should figure out what has gone wrong before proceeding.
At this point, we want to convert our probabilities to fair odds, and will start with decimal form for the sake of simplicity, as the initial conversion is quite straightforward.
The more difficult, and often less intuitive step concerns adding the house edge, which we will do before converting to American odds.
This would be a perfect point for anyone who doesn’t understand how house edge (also known as “vig”, “vigorish”, or “juice”) is calculated, and how that stands in relation to “overround”, as well as what each can be used for computationally. Anyone interested in a deep dive can pause, and read this article, which will leave you in the perfect place to jump right back in at this exact place.
Accounting for the House Edge
If you just left to read the deep dive (or to do anything else, for that matter) welcome back. We now want to apply the bet365 house edge or vig to our decimal odds, which we will then convert to American odds and compare to their opening market.
We recall the opening odds of Yankees -135, Red Sox +114. We can convert to implied probabilities as follows:
And then sum them to arrive at our overround (𝕆). This is actually as far as we have to go:
Once we divide our fair odds by the overround (𝕆), we will see them reflected as they would have been if bet365 applied their house edge to the probabilities produced by our model.
As a reminder of where we are in this spreadsheet before showing these last four calculations:
At this point, we just need to convert from decimal odds to American. We will designate the American odds produced by the nth iteration as:
As is the case when converting between implied probabilities and American odds, we will need two separate equations here, one for the favorite and one for the underdog. The general forms are as follows:
Thus, we have:
To step back and take stock on what we’ve done thus far:
First, we used our Monte Carlo Simulation and the SABR formula to generate two win probabilities for both of our teams:
Next, we set out to find the weights,
…that, when respectively applied to the relevant winning probabilities, would most closely resemble those implied by bet365’s opening odds,
We set out to iterate through multiple attempts at finding the proper weights. In the first trial, we used a random number generator to produce values for
…to produce the first pair of weighted winning probabilities:
Those were then converted into decimal odds, to which a house edge was applied via our discussion on overround (𝕆), before finally converting to American odds.
At this point, we want to iterate through this process an arbitrarily large number of times, N. The neighborhood of 10⁴ < N < 10⁵ feels roughly correct for this use case, sample size, and number of features, but we will use 10³ in order to keep our file small enough for everyone to download and play with.
We can quickly aggregate some summary statistics:
Since we have defined:
…we know the weights are perfectly correlated and thus only one requires inspection in our analysis.
If we sort our trial table by the American odds, which have had the house edge applied, we can focus on those that yielded odds in a narrower range (i.e. one closer to our desired output). Perhaps, for example, we want to focus on weightings that yielded odds on the Yankees moneyline between -130 and -140, inclusive. When we do this, and recalculate our summary statistics, we see the following:
We quickly notice that values in the range of
…produce the output we are looking for.
Recall that =RAND() produces values greater than or equal to 0 and less than 1. We now know that we are only interested in about 25% of this range, and thus that we must be wasting some of our precious computational power and valuable time.
So, we now want to run through this process again, but rather than using a random number generator that produces weights between 0% and 100%, let’s use one that only produces weights between 69% and 93%, inclusive.
When it comes to integers, this is always easy, not only in Excel but in many common programming languages. In Excel, for example, if we wanted random integers between 69 and 93, we would use the function =RANDBETWEEN(69,93). Unfortunately, for decimal values, we need a slightly more complicated function.
One common method for generating random decimal values between X and Y is to use
…which will return X when RAND()=0, return Y when RAND()=1, and values in between otherwise. Though this syntax is specific to Excel, a similar workaround is generally applicable to most programming languages.
For our purposes, we will again generate 10³ (or 10⁴ or 10⁵) random values, but they will all be between 0.69 and 0.93:
We can continue this process until we feel we have a narrow enough range. In Excel, we can use the Filter function to select only rows where the Yankees moneyline is -135, and also where the Boston moneyline is 114. This quickly compresses our range, and allows us to hone in on our optimal weight.
We quickly see that in fact…
At this point, given the limitations of this highly simplified model, we should be happy with our progress. Next steps would include backtesting, paper trading, sensitivity analysis, and, of course, adding more features.
One other popular analytical technique that only relies on Runs Scored and Runs Against data is called Pythagorean Expectation, which states that:
I have laid the groundwork for adding this feature to the model, but left the hard work to anyone who is interested in playing around.
Automation & Machine Learning
If the process of iteratively generating arrays of random numbers and manually narrowing their range feels onerous and susceptible to human error, you’re not alone. The reason I chose to spend my time writing and ask you to spend your time reading that section, is because this framework and approach is very similar to the way in which Machine Learning Engineers build artificial neural networks, even if substantially oversimplified. Specifically, this begins to resemble gradient descent, the optimization technique that underlies a great deal of the Machine Learning framework.
To understand the power of automating a process like the one we just underwent, especially in the face of increasingly-large data sets, we can use an Excel Add-In called Solver. Solver is not a form of Machine Learning per se, but absolutely is a form of automation that makes use of optimization techniques on which many machine learning engines are built.
If you do not see Solver as an option in your Excel toolbar, you may have to quickly enable it manually by navigating to File > Options > Add-ins and then selecting “Excel Add-ins” next to “Manage” before hitting Go, and then checking the box next to the Solver Add-in
I have created a macro and assigned it to a button that will run Solver for us, but we can do this manually as well. Typically, one navigates to Data > Solver and then sees the following window:
We can set our objective as the American Odds w/ Juice for the New York Yankees, in this workbook located at H3 on the Master worksheet. We will set this to a ValueOf -135, which are the opening American odds in the bet365 market that we were optimizing toward. Finally, we will do this by changing variable cell D9, which is the weight being placed on the Monte Carlo simulation output.
The detail-oriented reader may ask what the deal is with the various solving methods available in the drop-down, currently occupied by “GRG Nonlinear” and that would be a fair curiosity to explore. The Solver Add-In actually comes with three approaches to optimization based on the nature of the problem being solved. Generally, the breakdown of use cases is as follows:
- GRG Nonlinear — Most commonly used for smooth, nonlinear problems; finds local optimum solutions.
- Simplex LP — Typically used for solving linear problems.
- Evolutionary — Used when solving complex, non-smooth, non-linear problems
For our purposes, with this particular data set and model, there will likely be little noticeable difference, and GRG Nonlinear should perform almost identically to Simplex LP. As we scale and build out our model, we may need to reassess this if we wish to continue using solver. Regardless, it should help illustrate the value of beginning to automate the process of optimization itself.
Machine Learning is a framework of which neural networks are a subset. Using them, we can train a model like the one we built, that is able to look across tens of thousands of matchups and factor in hundreds or thousands of different features, and autonomously iterate through the process to find a value for the weights for our two win probability models, and any other variables that don’t only produce the desired output for this event, but for as many as possible within a decision space.
Machine Learning is quickly revolutionizing every aspect of predictive analytics and taking the gaming industry by storm. Not only are quantitative models used for setting odds, but also for risk management, KYC/AML controls, responsible gaming, and a host of other functions. These are all areas in which experience using AI toolkits combined with sports gaming industry knowledge are growing increasingly valuable.
A Note to Fans & Casual Bettors
Over the next few years, a quickly-growing number of sports fans will want analytics and research to play an increasingly-large role in their sports betting, fantasy sports, and other related behaviors. Using a data-driven approach not only allows for the opportunity to identify profitable patterns, but is also the single most powerful tool in remaining disciplined and keeping emotion out of the decision-making process.
What anyone who has already attempted this will tell you is that it is very, very, very difficult. Even in the world of difficult math problems that quantitative analysts are often solving, creating a model that predicts sporting events with greater accuracy than those that have been refined over decades in Vegas, London, and elsewhere, is extremely hard to do. This, among other reasons, is why we should exercise extreme caution when someone is trying to advertise a system under which we pay them for the picks that their model produces. I discussed this issue in detail in an exposé I wrote on a particularly high-profile entity engaged in such a business, which has since been revealed to be a fairly spectacular failure, at least in part because of the difficulty we are discussing here.
A number of tech companies have set out to capitalize on the business opportunity presented by this dynamic, not only offering consumers a more accessible platform for conducting such research, but also providing sportsbook operators with mechanisms for connecting with prospective customers who are much farther down the conversion funnel than your average sports fan.
The coming decade should be a time of incredible growth, investment, and innovation in the US sports gaming market, and analytics are likely to permeate every corner of that trajectory.