Predicting Basketball Results Using Python and Docker

Andy Lee
Published in Analytics Vidhya
8 min read, Jul 3, 2020

Introduction

We’re going to build a basic predictive model in Python which calculates the winning probabilities between two teams in a basketball match. In order to verify its accuracy, we’ll be comparing the model’s output to actual bookmaker odds. This model is going to be a simple REST API which runs inside a Docker container (I mean, it is 2020 after all).

Just in case you don’t already know, here’s an overview of the tech stack for this project, not at all nicked from Wikipedia:

Python is an interpreted, high-level, general-purpose programming language. It’s designed to be easy to read and simple to implement.

Docker is a set of platform as a service (PaaS) products that uses OS-level virtualization to deliver software in packages called containers.

Docker Setup

Having the Docker engine on your machine is a prerequisite for this project, so if you haven’t already done so you’ll have to install and set up Docker. The instructions differ slightly depending on which operating system you’re running, so do give the documentation a good read as you go through the process.

Once you’re done with the process, open your terminal and type

docker --version

and if it returns something other than your Docker version, well… you’ll have to go through the setup process again because something probably went wrong.

After Docker is installed, you’ll want to create your working directory. I’ve called mine ‘python-basketball’ but of course you can call it anything you want. Now in order for Docker to work, we need to create two files in this directory: a Dockerfile (note there is no file extension at the end) and a docker-compose.yml file. I won’t go into too much detail on what these are, but essentially a Dockerfile is a simple text file containing the commands used to assemble an image, whereas Docker Compose is a tool for defining and running multi-container Docker applications.

The Dockerfile setup is pretty simple: we can assemble our image in the same way as the example on the official Python page on Docker Hub, changing only the name of the run script. The file should look like this:

FROM python:3
WORKDIR /usr/src/app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD [ "python", "./run.py" ]

Our docker-compose.yml will look like this:

version: '3.4'
services:
  probability-calculator:
    build: .
    volumes:
      - '.:/usr/src/app'

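Finally, create a run.py file in the same directory with a simple hello world output. The Dockerfile copies requirements.txt during the build, so add an empty file with that name too for now, otherwise the build will fail: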

print("Hello World")

Oldest trick in the book. Anyway, that’s all you actually need to run a Python program in Docker. Go back to your terminal and type

docker-compose up

Et voilà, you should get a lovely “Hello World” printed to your console!

You’ve managed to run an application inside of a Docker container, without the hassle of having to actually install Python (or any of its tools and dependencies), and minus the overheads of managing a virtual machine.

Cool innit?

Let’s Actually Build The Model

OK, let’s get started and actually write some code.

We’re going to be using a few packages to help us in this project:

  1. Flask, a micro-framework to help build web applications.
  2. SciPy, a Python-based ecosystem of open-source software for mathematics, science, and engineering.

Normally we would need to use pip to install these packages manually. However, as we’re using Docker, all we need to do is add them as requirements and they’ll be installed automatically when the Docker image is built. Create a file called requirements.txt in your working directory (if you haven’t already) and add the following lines to it:

Flask==0.12.2
scipy==1.4.1
flask-restful==0.3.6

Now we need to create a folder which represents the package holding our Python code for the model. I’ve called this folder ‘probability_calculator’. So that Python knows this is a package and not just a random folder, we need to create a file called __init__.py inside it. To start off this file, add the following:

#import the SciPy framework
import scipy.stats as sp
#import the Flask framework
from flask import Flask, g
from flask_restful import Resource, Api, reqparse
#create an instance of Flask
app = Flask(__name__)
#create the API
api = Api(app)

The Model

Right, now we’re getting into the meat of the code, which is building the actual model. This model is based on Mathletics by Wayne Winston, who explains that in the NBA the historical standard deviation of match results about a prediction from a rating system is 12 points. In other words, the likelihood of the actual margin between two teams in any given match can be described by a bell-shaped probability distribution centred on the pre-match spread, with a standard deviation of 12 points. Where would we get these spreads? Well, the online gambling boom means we’re able to get them fairly easily from online bookmakers.

Thanks to Covid-19, there are no NBA matches until the end of July. We’ll have to make do with the next best thing: the Chinese Basketball Association! Here are Pinnacle’s prices from Wednesday, 24th June 2020:

Pinnacle Odds for Chinese Basketball Association Matches

We have four matches, all of which have spreads (the “Handicap”). Let’s take the first match as an example. Here we have the Shanghai Sharks and the Liaoning Flying Leopards separated by a spread of 10.5 points, meaning the Flying Leopards need to beat the Sharks by 11 points or more for the bet to win.

The “Money Line” is the market on either team winning outright, and its odds imply a win probability for each side (1 divided by the decimal odds). Given the 10.5 point spread in favour of the Flying Leopards above, Pinnacle have given the Sharks decimal odds of 4.92, which implies a 20.3% chance of winning. The Flying Leopards have decimal odds of 1.161, which translates to an 86.1% chance of winning.

Just a minute, these probabilities sum to about 106.5%. Aren’t probabilities meant to sum to 100%? True, but bookmakers make their money off the overround, which is basically the percentage they build into selections to make it worthwhile for them to lay a bet. To keep things easy, we’re going to use secondary school maths and normalise these percentages by dividing each one by their total. We end up with the following win probabilities for each team:

Shanghai Sharks: 19.09%
Liaoning Flying Leopards: 80.91%
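
If you’d rather not do the normalisation by hand, it can be sketched in a few lines of Python. This is a minimal illustration rather than part of the project code: the function name normalise_odds is my own, and it assumes the bookmaker’s margin is spread proportionally across both selections.

```python
def normalise_odds(decimal_odds):
    """Convert decimal odds to win probabilities that sum to 1.

    Assumes the bookmaker's margin is spread proportionally across
    all selections (a simplification).
    """
    implied = [1.0 / odds for odds in decimal_odds]  # raw implied probabilities
    total = sum(implied)                             # above 1 because of the overround
    return [p / total for p in implied]


# Pinnacle's money line for match 1: Sharks 4.92, Flying Leopards 1.161
sharks, leopards = normalise_odds([4.92, 1.161])
print(f"Shanghai Sharks: {sharks:.2%}")            # Shanghai Sharks: 19.09%
print(f"Liaoning Flying Leopards: {leopards:.2%}")  # Liaoning Flying Leopards: 80.91%
```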

Phew, these Chinese names are a mouthful. Where’s a crisp NBA team name like the “Minnesota Timberwolves” when you need one?

Back To The Code

Now that we know the spread and winning probabilities that the bookmakers have given these two teams, it’s time to get back to work in Python.

We can hard-code the standard deviation to be 12.0 (it’s probably slightly different in the Chinese League than the NBA, but let’s leave that for another day). Next, we need to generate the winning probabilities. These will be based off a normal distribution (also known as a Gaussian Distribution) centred around the mean of the bookmaker’s spread, together with the given standard deviation. Get back to __init__.py and let’s write up the model:

STD_DEV = 12.0

def calculate_probability(spread, std_dev):
    return sp.norm.sf(0.5, spread, std_dev)

This function just returns the probability that a team wins by 1 or more points given the bookmaker spread and standard deviation. You can read up on this in SciPy’s documentation on the normal distribution.
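
As a sanity check on what the survival function is doing, here is a small sketch that reproduces the same calculation using only Python’s standard library (statistics.NormalDist, available from Python 3.8). It isn’t part of the project code, and win_probability is a name I’ve made up. It just shows that sf(0.5, spread, std_dev) is one minus the cumulative probability at 0.5, i.e. the chance that the favourite’s margin lands above half a point.

```python
from statistics import NormalDist

STD_DEV = 12.0


def win_probability(spread, std_dev=STD_DEV):
    """Probability that the final margin exceeds 0.5 points.

    The margin is modelled as Normal(mean=spread, sd=std_dev), so this
    is 1 - CDF(0.5), the same quantity as scipy.stats.norm.sf(0.5, ...).
    """
    margin = NormalDist(mu=spread, sigma=std_dev)
    return 1.0 - margin.cdf(0.5)


favourite = win_probability(10.5)
print(round(favourite, 4))      # 0.7977
print(round(1 - favourite, 4))  # 0.2023
```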

Next up, the actual model request. It’s going to be an HTTP POST request which makes use of flask-restful, an extension of the Flask package which provides tools to build an API quickly. The way it works is this:

  1. User creates a class for each endpoint
  2. User then creates a function for each method to be accepted

With that in mind, here’s the request coded up with a 200 status code and message if successful:

class WinProbability(Resource):
    def post(self):
        parser = reqparse.RequestParser()
        parser.add_argument('spread', required=True, type=str)
        args = parser.parse_args()
        favourite_win = calculate_probability(float(args.spread), STD_DEV)
        outsider_win = 1 - favourite_win
        return {'message': 'POST really was excellent',
                'favouriteProbability': favourite_win,
                'outsiderProbability': outsider_win}, 200

That really is excellent.
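
If you want to exercise the endpoint’s logic without starting Flask, one option is to pull it into a plain function. The sketch below is my own illustration, not code from the project: handle_post is a hypothetical helper, it uses the standard library’s NormalDist in place of SciPy, and the 400 response for a missing or non-numeric spread is an assumption about how you might want errors reported.

```python
import json
from statistics import NormalDist

STD_DEV = 12.0


def handle_post(body):
    """Mimic the POST handler: JSON text in, (payload, status code) out."""
    try:
        spread = float(json.loads(body)["spread"])
    except (ValueError, KeyError, TypeError):
        # Hypothetical error shape; the original resource has no such branch.
        return {"message": "spread must be a number"}, 400

    # Same calculation as calculate_probability, without SciPy.
    favourite_win = 1.0 - NormalDist(spread, STD_DEV).cdf(0.5)
    return {
        "message": "POST really was excellent",
        "favouriteProbability": favourite_win,
        "outsiderProbability": 1.0 - favourite_win,
    }, 200


payload, status = handle_post('{"spread": 10.5}')
print(status, round(payload["favouriteProbability"], 4))
print(handle_post('{"margin": 3}'))
```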

One last bit before we test the model and its response: don’t forget to add this resource at the end of the file:

api.add_resource(WinProbability, '/winProbability')

And now let’s actually remove the “Hello World” child’s play from the run.py file and replace it with our application instead:

from probability_calculator import app

app.run(host='0.0.0.0', port=80, debug=True)

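Finally, let’s make the application reachable on port 5000 of our machine. The app listens on port 80 inside the container, so we map host port 5000 to it by adding this to the end of the probability-calculator service in docker-compose.yml:

    ports:
      - '5000:80'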

And now we’re good to go!

Testing the Model

First off, we’ll need a REST client to test our HTTP requests. I’m using Postman for this but you can use any that you like.

Enter the project directory, fire up the terminal, and type

docker-compose up

to start the app.

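Now, using our REST client, let’s send a POST request to the /winProbability endpoint on port 5000 (http://localhost:5000/winProbability if Docker is running locally), with content-type set to application/json and the body below: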

{"spread": 10.5}

If all goes well, you should get this response:

{"message": "POST really was excellent","favouriteProbability": 0.7976716190363569,"outsiderProbability": 0.20232838096364314}
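
If you would rather script the request than click through a REST client, the call can be sketched with Python’s standard library. This assumes the container is running locally and published on port 5000 as configured earlier; build_request and fetch_probabilities are names I’ve made up for the example.

```python
import json
import urllib.request

# Assumes docker-compose maps host port 5000 to the app, as set up earlier.
URL = "http://localhost:5000/winProbability"


def build_request(spread, url=URL):
    """Build the JSON POST request that the winProbability endpoint expects."""
    body = json.dumps({"spread": spread}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_probabilities(spread):
    """Send the request and decode the JSON response (needs the app running)."""
    with urllib.request.urlopen(build_request(spread)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(10.5)
print(request.get_method(), request.full_url, request.data)
```

With the container up, calling fetch_probabilities(10.5) should give back the same fields as the response shown above.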

But how do we verify that these probabilities are, well, actually useful?

We’re going to compare them against Pinnacle’s match result market from earlier. Going back to the win probabilities we recorded for the Shanghai Sharks and the Liaoning Flying Leopards, we can see that we had the Sharks at 19.09% and the Flying Leopards at 80.91%. That’s not too far away from our model’s 20.2% and 79.8%. Nice! I guess we can start our own bookmaking business now, right?

Not so fast. First of all, we need to quantify how far our model’s outputs actually are from the expected probabilities. There are several ways to do this, but a nice one is the mean squared error (MSE) of the sample data. Without going into too much detail, as it is slightly off-topic: the sum of squared errors (SSE) is exactly what it sounds like, the squares of all the model’s errors when compared to actual data, summed. Divide that sum by the number of data points and you get the MSE. For this example, the squared error would be

Match 1 squared error = (19.09 - 20.23)² = 1.2996

We’ll have to run our model against all four matches to compare the results and obtain the MSE. Now we could write another Python module for this, or we could do what the non-masochists would do by this point and crack out an Excel spreadsheet:

SSE and MSE of Model Outputs

Note we only need to take four data points as the model was only run four times for the favourite’s probability. The outsider’s probability is just the complement, i.e. 1-Probability that favourite wins.
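
For the masochists, the same calculation can be sketched in Python. This is a minimal illustration with function names of my own choosing. Only match 1 is filled in, since its figures are the only ones quoted in the text; you would extend both lists with the other three matches to get the MSE for the full sample. It uses the model’s unrounded output, so the result (about 1.31) differs slightly from the hand-worked figure above, which uses rounded inputs.

```python
def squared_errors(expected, predicted):
    """Squared difference for each (expected, predicted) pair."""
    return [(e - p) ** 2 for e, p in zip(expected, predicted)]


def mean_squared_error(expected, predicted):
    """Sum of squared errors divided by the number of data points."""
    errors = squared_errors(expected, predicted)
    return sum(errors) / len(errors)


# Match 1 only, in percentage points: the bookmaker's normalised probability
# for the outsider versus the model's unrounded output from the response.
bookmaker = [19.09]
model = [0.20232838096364314 * 100]

print(round(sum(squared_errors(bookmaker, model)), 2))  # 1.31
```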

The MSE from the dataset is 2.13, so maybe we’re not quite ready to launch our own sportsbook business just yet.

Conclusion

Phew. We’ve come a long way from setting up Docker on our OS. We now have a basketball model built in Python, running in its own Docker container! That really is excellent.

Now the question is what can be done to improve the model? I can think of a few things already:

  1. Standard deviation of the dataset might not necessarily be 12.0
  2. There’s no way that a sample of 4 matches is enough to justify any model
  3. Initial assumptions might be too simple. Can you really predict basketball results off a Gaussian distribution? Might there be better ways of modelling basketball outcomes?
  4. Code error handling is non-existent (but also, life is short)
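
On the first point, one rough way to see which standard deviation the bookmaker’s prices imply is to invert the model for a single match. The sketch below is my own illustration, not part of the project: implied_std_dev is a hypothetical helper, it uses the standard library’s NormalDist rather than SciPy, and one match is nowhere near enough data to draw a conclusion from.

```python
from statistics import NormalDist


def implied_std_dev(spread, favourite_probability):
    """Standard deviation that makes the model match a given win probability.

    Solves 1 - CDF(0.5) = p for sigma, with the margin ~ Normal(spread, sigma).
    Only meaningful when spread > 0.5 and p > 0.5.
    """
    z = NormalDist().inv_cdf(favourite_probability)  # standard normal quantile
    return (spread - 0.5) / z


# Match 1: spread of 10.5 and a normalised bookmaker probability of 80.91%
sigma = implied_std_dev(10.5, 0.8091)
print(round(sigma, 1))
```

For match 1 this comes out a little under 12, which hints that the hard-coded value is in the right region but not exact.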

Otherwise, I think this is a good attempt at predictive modelling for basketball. If my blog post was rubbish and you would rather just rip my code, feel free to do so at my GitHub repository.

Thanks for today!
