How to predict the NBA with a Machine Learning system written in Python

Francisco Goitia
May 5, 2016 · 2 min read

What sports geek wouldn’t want to build their own system for predicting matches, whether to bet or just out of intellectual curiosity?
Nowadays, advanced statistics are available on websites like basketball-reference, and excellent machine learning libraries exist for just about every programming language. This is not going to be a comprehensive DIY guide. I’m just going to talk about what I found after playing with this for a few months and share some code that should be useful for getting started.

Where do I get started?

Machine Learning works by building models that learn weights and relationships between features from historical data, and then using those models to predict future outcomes. So you need to understand the sport, think about which variables are representative of future performance, build a database that contains this information, and run Machine Learning algorithms on the historical data to assign weights to these variables analytically.

Building your own database

I spent quite a long time building an NBA and NCAA scraper that downloads full seasons (match by match) from basketball-reference. The scraper handles the problems you are likely to stumble upon when loading this data into a relational database, and it lets you uniquely associate each piece of information.
The scraper models each match in a detailed JSON format that captures the more advanced detail of what takes place in a basketball game.
To represent this information in your own database, you will need to define a schema and insert the data. I used SQLAlchemy to write models that can be used to create the database and build an analytical system. It’s all available on my GitHub repo.
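
To give a rough idea of what such a schema can look like, here is a minimal sketch. It is not the schema from my repo: the `Team` and `Match` models, their column names, and the placeholder teams and scores are all made up for illustration, and it assumes SQLAlchemy 1.4 or later with an in-memory SQLite database.

```python
from datetime import date

from sqlalchemy import Column, Date, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()


class Team(Base):
    """Hypothetical team table."""
    __tablename__ = "teams"

    id = Column(Integer, primary_key=True)
    # A unique abbreviation means each scraped team maps to exactly one row.
    abbreviation = Column(String(3), unique=True, nullable=False)


class Match(Base):
    """Hypothetical match table with one row per game."""
    __tablename__ = "matches"

    id = Column(Integer, primary_key=True)
    played_on = Column(Date, nullable=False)
    home_team_id = Column(Integer, ForeignKey("teams.id"), nullable=False)
    away_team_id = Column(Integer, ForeignKey("teams.id"), nullable=False)
    home_points = Column(Integer, nullable=False)
    away_points = Column(Integer, nullable=False)

    # Two foreign keys point at the same table, so each relationship
    # has to say which column it uses.
    home_team = relationship("Team", foreign_keys=[home_team_id])
    away_team = relationship("Team", foreign_keys=[away_team_id])


# In-memory SQLite keeps the sketch self-contained.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
# Placeholder teams and score, not real data.
home, away = Team(abbreviation="AAA"), Team(abbreviation="BBB")
session.add(
    Match(
        played_on=date(2016, 1, 1),
        home_team=home,
        away_team=away,
        home_points=101,
        away_points=95,
    )
)
session.commit()
```

Adding the `Match` also saves both `Team` rows, because SQLAlchemy cascades the save through the relationships.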

Predicting Matches

Scikit-Learn is the way to go for building Machine Learning systems in Python. You will need to figure out which attributes work best for predicting future matches based on historical performance. As said before, understanding the sport allows you to choose more advanced metrics like Dean Oliver’s four factors (shooting, turnovers, rebounding and free throws). In my experience, these work best when combined with human analysis such as Vegas lines.
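
For reference, the four factors can be computed from a team's box score. The sketch below uses the formulas as published by basketball-reference, expressed as fractions rather than percentages. The `BoxScore` class, its field names, and the example numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BoxScore:
    """Team totals for one game (hypothetical field names)."""
    fg: int       # field goals made
    fga: int      # field goal attempts
    fg3: int      # three-pointers made
    ft: int       # free throws made
    fta: int      # free throw attempts
    tov: int      # turnovers
    orb: int      # offensive rebounds
    opp_drb: int  # opponent's defensive rebounds


def four_factors(box: BoxScore) -> dict:
    """Return Dean Oliver's four factors as fractions between 0 and 1."""
    return {
        # Effective field goal %: a made three counts as 1.5 field goals.
        "efg_pct": (box.fg + 0.5 * box.fg3) / box.fga,
        # Turnover rate: turnovers per estimated play.
        "tov_pct": box.tov / (box.fga + 0.44 * box.fta + box.tov),
        # Offensive rebound %: share of available offensive boards collected.
        "orb_pct": box.orb / (box.orb + box.opp_drb),
        # Free throw rate: free throws made per field goal attempt.
        "ft_rate": box.ft / box.fga,
    }


# Made-up box score, for illustration only.
example = BoxScore(fg=40, fga=80, fg3=10, ft=18, fta=24, tov=12, orb=10, opp_drb=30)
factors = four_factors(example)
print(factors)
```

A natural way to turn these into model features is to take the difference between the home and away team's averages over their previous games, but that is a design choice rather than a rule.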


If you build your own machine learning models, you will find that you can correctly predict winners at a rate of around 70%. That is not enough to win money through betting, but it is still better than ESPN experts and a lot of academic papers. You will also learn a lot about the sport, databases, machine learning and Python.
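
One way to measure a hit rate like this is to train on earlier games and score the model on later ones, so it never sees the future. The sketch below shows this with Scikit-Learn on synthetic data. The features, the `true_weights` used to generate them, and the 80/20 split are arbitrary assumptions, and the accuracy it prints says nothing about real NBA games.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a season: one row per game, assumed sorted by date.
# Each column plays the role of a "home minus away" feature difference.
n_games, n_features = 1000, 4
X = rng.normal(size=(n_games, n_features))
true_weights = np.array([1.0, -0.5, 0.3, 0.2])  # arbitrary, demo only
noise = rng.normal(scale=1.5, size=n_games)
y = (X @ true_weights + noise > 0).astype(int)  # 1 = home team wins

# Chronological split: fit on the first 80% of games, test on the rest.
split = int(n_games * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
# Always picking the more frequent outcome is the number to beat.
baseline = max(y_test.mean(), 1 - y_test.mean())

print(f"accuracy: {accuracy:.3f}, majority baseline: {baseline:.3f}")
print("learned weights:", model.coef_[0])
```

With real data, `X` would hold whatever attributes you chose (four factors, Vegas lines and so on), and comparing against the baseline shows how much the model adds over simply picking the more common result.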

Part II
