Optimisation in Python to Reduce Mean Squared Error

Andy Lee
Published in Analytics Vidhya · Jul 15, 2020 · 10 min read

Revision

Right, what state did we leave our model in?

(Image: LeBron James) Where were we, indeed?

If you recall, we were using bookmaker spread inputs for basketball matches to forecast teams’ winning probabilities. After an intense Docker build that left a carbon footprint ten times larger than Greta Thunberg’s worst enemies combined, we built a basic basketball model based on Wayne Winston’s Mathletics. From there we deduced that our model’s four-match Chinese Basketball League forecast had a mean-squared error (MSE) of 2.13 compared to the actual market output.

I glossed over the MSE metric in my previous post because it was off-topic, but in this post I’m going to go through some of its concepts.

What Is This MSE You Speak Of

As always, Wikipedia puts the definition of MSE into words far more efficiently than I would be able to. MSE is a risk function: it computes the average of the squares of the errors between an estimator’s values and the observed values.

Alright mate, care to explain that in plain English though please?

…so we have our Python Basketball model, which we are using to forecast the win probabilities off the initial spreads. This is our estimator. The win probabilities that we obtained from the betting markets (in our case, Pinnacle Sports) are our observed values, in other words, values which are taken to be true (now, they may or may not be “true” but… let’s not worry too much about that at the moment).

Now stop and think about this for a second. How do we know that our estimator is actually, well, accurate? It’s easy enough to confirm accuracy if a model produces values that are no different from those observed. True, but that’s basically saying that somebody’s built a model that perfectly captures reality. Is that really possible? Pretty unlikely I’d say. Much more realistic would be an acceptance that there are some costs associated with a model- for instance, assumptions that were too simple (or completely wrong). And that’s what a risk function is- it computes this loss of information and presents it to the user as a single value. The MSE is just a special case of this.

Make y=Mx+C Great Again

Graph illustrating observed and expected values

Let’s generalise a little. I was too lazy to derive my own graph from a charting library, so I nicked this example from the excellent Free Code Camp. This is the infamous y=Mx+C. Hopefully you saw this in school, but in case you weren’t paying attention (or in case you dropped out to pursue a career much more successful than some bloke writing about MSE), this is what’s represented in the graph:

  • The purple dots are points on the graph; each point has an x and a y coordinate. These are your observed values.
  • The blue line is the prediction line, covering the estimated values of the model.
  • The red lines between each purple point and the prediction line are the errors. Each error is the distance from the point to its predicted point.

And now let’s have a look at the MSE formula:
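Written out, with x_i as the i-th observed value, x_i′ as the model’s estimate at that point, and n the number of points, it looks like this:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - x_i'\right)^2$$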

uhm…

It seems like we’ve wandered off into a foreign language, which is exactly what that strange “E” is- it’s the Greek capital letter sigma. It represents the sum of a sequence of numbers from the first (i=1) to the last (i=n). At each data point x (these would be the purple dots on the graph, and in our case, the Pinnacle win probabilities for each basketball match), we subtract its corresponding value x’ on our line (which is the model’s estimate of the win probability- note that the graph of our basketball model’s expected values isn’t a straight line, but don’t worry about this), and then square the difference. Summing all of these squares and dividing by the number of data points n gives us the MSE.
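To make that concrete, here’s a quick sketch of the calculation in NumPy, using a handful of made-up observed values and estimates purely for illustration:

import numpy as np

# Made-up observed values (the "purple dots") and model estimates (the line)
observed = np.array([0.80, 0.78, 0.77, 0.76])
estimated = np.array([0.82, 0.75, 0.79, 0.70])

squared_errors = (observed - estimated) ** 2   # (x - x')^2 for each point
mse = squared_errors.sum() / observed.size     # sum the squares, then average over n
print(mse)                                     # roughly 0.0013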

And that’s it for our GCSE refresher!

Alright, we’ve established that MSE computes the loss of information between the actual and estimated values of a model. In an ideal world, our model would perfectly mimic real life, right? I know I said it wasn’t very likely earlier, but… given fair or sound enough assumptions, it is possible to minimise the MSE. We do this by solving for a model’s parameters such that the MSE is as close to zero as possible.

So how do we go about implementing this solver? We won’t have to innit, cuz somebody has already done it for us in Python- which really is excellent.

Cheating

Finally! Time to dust off our code editor and get back to our model. You’ll be glad to know that we don’t need to change our Dockerfile at all, as the python:3 image from Docker Hub that we used in my initial post is sufficient. Thankfully, lower carbon emissions this time.

I’ve also refactored the API call from my last post (there was more boilerplate in there than a 1940s newspaper business). Don’t worry about that though, you’ll always be able to nick my GitHub code and do it better if I’ve written something disgraceful.

Here are the two new imports required for our solver:

  1. NumPy, a Python library which adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  2. optimize, a SciPy sub-package which provides several commonly used optimisation algorithms.

Let’s get straight into the __init__.py file once more and add these two lines to the top:

import numpy as np
from scipy import stats, optimize

We’ve already set up the API with Flask in the previous post, so all we need to do is code up the endpoint and implement the solver.

class Minimize(Resource):
    def solve(self, spread, expected, std_dev):
        spread = np.array(spread)
        expected = np.array(expected)

        def mse(s):
            # Win probability for each match: P(margin > 0.5) under a normal
            # distribution centred on the spread with standard deviation s
            estimated = stats.norm.sf(0.5, spread, scale=s)
            # Mean squared error against the market's win probabilities
            return np.sum(np.power((estimated - expected), 2)) / spread.size

        # Solve for the standard deviation that minimises the MSE,
        # starting the search from the supplied std_dev
        optimal_solution = optimize.minimize(mse, std_dev, method='BFGS')
        return optimal_solution

Whoa whoa, what’s going on here? As per our Flask setup, we’re creating a class called Minimize for this endpoint. But why np.array instead of a normal Python list? Well, NumPy arrays take up significantly less memory than Python lists. NumPy also lets you specify the data type of the array’s contents, which allows further optimisation of the code.
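As a rough illustration of the difference (with the caveat that sys.getsizeof only measures the list object itself, not the float objects it points to), you could poke at it like this:

import sys
import numpy as np

spreads_list = [10.5, 9.5, 10.0, 8.5]
spreads_arr = np.array(spreads_list, dtype=np.float64)  # data type pinned explicitly

# The list stores references to separate Python float objects;
# the array stores the raw 64-bit floats in one contiguous block.
print(sys.getsizeof(spreads_list))  # size of the list container only
print(spreads_arr.nbytes)           # 4 elements * 8 bytes = 32 bytes of data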

Now we’re getting to the bit that matters. Remember our model’s inputs? The first was the list of spreads from Pinnacle (which we can’t change), and the second- the only one we can actually vary- is the standard deviation, which we specified as 12.0. This value of 12.0 actually gave us an MSE of 0.00213 (right, I know I said 2.13 in the original post- I multiplied them by 100 to make things simpler), so we want to know whether it’s possible to make the MSE smaller by changing this value. This is what optimize.minimize does for us- it solves for an optimal standard deviation. We’re using the BFGS algorithm, which I believe is the default when no method is specified.
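If you want to see the solver in action without the Flask plumbing, here’s a minimal standalone sketch using the same four matches (the same maths as the Minimize class above, just run as a script):

import numpy as np
from scipy import stats, optimize

spread = np.array([10.5, 9.5, 10.0, 8.5])
expected = np.array([0.8091, 0.7785, 0.7708, 0.7692])

def mse(s):
    # Same objective as in the Minimize class above
    estimated = stats.norm.sf(0.5, spread, scale=s)
    return np.sum((estimated - expected) ** 2) / spread.size

result = optimize.minimize(mse, 12.0, method='BFGS')
print(result.x[0], result.fun)  # roughly 11.71 and 0.00018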

(Image: heavy industrial machinery) My machine’s memory at work every time I run a Docker build for a Python image

Cool, time to code up the POST request. We’ve defined three inputs that need to go in:

  1. The array of spreads
  2. The array of actual win probabilities
  3. The standard deviation (any floating-point value will do; this is the value that gets optimised)

Add this bit within the Minimize class:

    def post(self):
        parser = reqparse.RequestParser()
        parser.add_argument('spread', action='append', type=float, required=True)
        parser.add_argument('expected', action='append', type=float, required=True)
        parser.add_argument('std_dev', type=float, required=True)
        args = parser.parse_args()
        res = self.solve(**args)
        return {"optimum": res.x[0], "fun": res.fun, "message": res.message}, HTTPStatus.OK


api.add_resource(Minimize, '/minimize')

Everything looks cool except maybe the “return” bit. Have a look at the OptimizeResult object in the SciPy documentation. There are plenty of attributes in the object, but essentially the results that matter to us are:

  1. The optimised standard deviation, which is the solution of the optimization. This is our “optimum”.
  2. Value of the objective function, which is basically the minimised MSE value. This is the “fun”.
  3. The “message” to confirm whether or not the operation terminated successfully.

Great! Let’s POST to the endpoint http://localhost:5000/minimize with the following inputs from our matches as the body:

{"expected":[0.8091, 0.7785, 0.7708, 0.7692],"spread":[10.5, 9.5, 10, 8.5],"std_dev":12.0}

And then let’s have a look at the response…

{"optimum": 11.708226529461706,"fun": 0.00018173060393236452,"message": "Optimization terminated successfully."}

Nice! So we’ve reduced MSE from 0.00213 to 0.00018, which is quite a substantial decrease. In the process, we’ve found the optimal standard deviation for our dataset to be 11.71. Speaking in layman’s terms, it means that the bookmaker spreads that we are using as predictors for the win probabilities work best when we input a standard deviation of 11.71.

So this is great, right? We can now go around grabbing spreads from bookmakers for our model, plugging 11.71 as a standard deviation and deriving the win probabilities from them, making ourselves rich in the process, right?

Stop Right Now, Thank You Very Much

Nah, we can’t really do that. Remember at the start of this post I alluded to a model “perfectly capturing reality”? Well, we have to ask ourselves a few questions about what “reality” means. First off, is the Chinese Basketball League a fair representation of the sport of basketball in general? Even compared to the NBA alone, there are numerous differences in rules and player standards. This immediately casts some doubt on blindly overfitting our model to these data points. And that doesn’t even touch on the fact that, as I mentioned in my first post, we only have 4 data points- and when you actually compare the estimates, the last two are significantly further from the observed values than the first two. It’s impossible to draw any concrete conclusions from such a small sample size!

Then we have another valid question. What makes Pinnacle’s win probabilities more acceptable as our model benchmark than your mate down the pub’s opinions on the Chinese Basketball League- or anybody else’s, for that matter? This is quite a complicated question, and the answer involves a debate that has split mathematicians over the interpretation of probability since the 18th century. In any case… it’s comforting to know that a substantial part of the mathematics community absolutely does not agree with probability being interpreted as a degree of belief, hence rendering most of this work complete heresy!

(Image: Andrew Gelman, American statistician) This bloke does NOT approve of your probability models

There’s one final nail in the coffin to follow, and that’s the shortcomings of MSE itself. Like variance, MSE carries the disadvantage of heavily weighting outliers, a result of the squaring of each term. This effectively adds more weight to larger errors than smaller ones. Think of it along these lines: an error twice as big will be penalised four times as much. An error five times as big will be penalised twenty-five times as much. Not a desirable metric then really, is it?
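You can see that quadratic penalty with a couple of made-up errors:

import numpy as np

errors = np.array([1.0, 2.0, 5.0])  # illustrative errors only
print(errors ** 2)                  # [ 1.  4. 25.] - the big error dominates the average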

Silver Lining

Don’t worry, you haven’t read this far for me to declare that the whole post was a waste (or at least, I hope it wasn’t). First off, MSE does indeed have its faults- but the truth is that there is no perfect error metric. The way you compute your loss of information when estimating a model’s values will depend on several things, most of them out of the scope of this post. If you are indeed interested to find out more about error estimates and the maths and theory behind them, and also which functions to use in specific cases, this post is pretty good.

Secondly, I don’t want to be that guy, but… remember the quote that “all models are wrong, but some are useful”? With regards to the utility of a simple model like this, it really depends what your end goal is. If you want to beat the NBA Vegas money lines and spreads then you’ll be sorely disappointed- millions of dollars from professional gamblers go through every line, much of which represents some pretty shrewd opinion and information. You can’t expect some bloke’s open source Python code on Medium to be better than these punters!

On the other hand, if there are other derived betting markets like outrights or futures which require win probability estimates, I don’t see why you couldn’t use this model (in his book, Winston does exactly that to forecast the San Antonio Spurs vs Cleveland Cavaliers NBA Finals in 2007). If anything, you could just play around with the standard deviation on any particular matchday to see if the market’s win probabilities tally with the bookmaker spreads, which might give you some information as to where bettors and bookmakers are placing themselves in the market.

Ultimately, what it comes down to is this- having a model in any field helps you make better-informed decisions than someone without one. It’s up to you what you make of it, how to improve its estimates- or even how to build a better one. I’m hoping this post has helped a little with that decision process.

Cheers for today!
