In the past few months, I took a class in Data Science through General Assembly, a coding academy. We primarily coded in Python, with extensive use of Python libraries designed for mathematical computation & statistical analysis (Numpy, MatPlotLib, Pandas, and others). I wanted to write this article to detail the motivations, methods, & findings of a data science project I created as part this class, in a way that was more accessible to a non-technical audience.
For our final project, we were directed to use the tools we developed over the course to come up with a problem which could be addressed using a machine learning model.
I will go through my workflow throughout this project, starting with stating the problem I set out to address, looking at my method of acquiring the data, moving through how I structured my dataset, and ending with an explanation of the models I created and an interpretation of their results. I’ll conclude with some lessons I learned in the process.
I’m a big fan of NBA basketball. The idea for this project occurred when trying to think of a way to use basketball statistics in a machine-learning context. I initially thought of using box score statistics from previous games to make a prediction as to whether a particular team would win or lose. I am also a big fan of sports betting (in theory, less so in practice), so I thought it would be more interesting to try and make a prediction using box score statistics, with the idea I could use that prediction to inform a bet.
My decision for the ultimate direction of the project was to use NBA box score statistics from previous games to train a logistic regression model. This model would return a prediction as to whether a game’s total score would be Over | Under a point total set by a bookkeeper. The hope was I could use the predictions over the course of an NBA season to inform a long-term betting strategy.
For every NBA game, a sports book will set a point total, which is the book’s prediction of the final score. The sports book challenges bettors to bet either an OVER, if the bettor believes the total final score of the game will be higher than the total set by the book, or UNDER, if the bettor believes the total final score of the game will be lower than the total set by the book. If the game’s total final score is equal to the point total set by the book, that’s called a PUSH, and the book returns the bet. (For a more detailed explanation of the sports betting context behind my project, please see my Technical Report)
After figuring out the outline of my project, my next step was finding a reliable source of data I could work from. After some research, I found two websites that could give me the data I needed: basketball-reference.com, for data on NBA games and statistics, and sportsbookreview.com, for data on the corresponding betting lines. To get the information I needed, I scraped the above websites for box scores for every game for the past 5 NBA seasons, as well as the corresponding point totals set for every game in that period.
Structuring the Data
My goal was to return an Over | Under prediction for an NBA game, based on the box score statistics for the previous games. To do this, I structured my dataset in a way that each individual game was represented as the box scores for the 3 prior games for each team, for a total of 6 prior games. I made a small example below for reference:
A: Team 1
B: Team 2
#: represents number of games back from the current game
Current Game: A vs B
Represented As: 1 A vs O, 2 A vs O, 3 A vs O, 1 B vs O, 2 B vs O, 3 B vs O
By setting up my data up in this format, I hoped to create a model which would pick up on a relationship between the prior 3 games of box score statistics and the total score of an NBA game. In addition, I thought there may be bias on the part of the bookmakers in terms of adjusting the point totals based on the performance of the teams’ previous games; bias that my model could detect.
The process of taking this dataset and drawing insight from it involves creating a model. (See Glossary) The specifics of the model and how I created it are fairly technical, so I’m going to refer people who are interested in those details to the Modeling section of my Technical Report. At the end of the process I had created 2 models, one which returned a prediction on whether an NBA game would be OVER a set line or not, and one which returned a prediction on whether an NBA game would be UNDER a set line or not.
With the predictions from my models, I could analyze how often my model returned correct predictions. My models, making a prediction on all the NBA games in a season, were not significantly better than the average in determining whether a game would be Over | Under. Therefore, I set up what I termed a confidence threshold; my model returned probabilities for the chance of a game being Over | Under, and if the probability of a prediction was above the threshold I set, the prediction was “confident.”
Here is a breakdown of my results, with a confidence threshold of 62%:
- Predicted confidently: 88 games
- Predicted correctly: 52 games
- Accuracy: 59.09%
- Predicted confidently: 96 games
- Predicted correctly: 52 games
- Accuracy: 54.16%
I set the threshold at 62% because that point optimized both prediction accuracy and the # of games that were predicted “confidently”
My model did better in accurately predicting games that were Over as opposed to those that were Under, for the 2018 NBA season.
I have a guess for this phenomenon:
- Basketball point totals have been increasing over seasons. Looking at the past 5 years of NBA games, the average score has increased by ~ 5 points per team.
- My Over model may be picking up on some bias present in the way bookmakers are setting the lines. Perhaps books are seeing teams score high numbers of points, and are adjusting the point totals downward to reflect a belief that the teams will not score as highly as in future games.
I created a basic simulation, combining the predictions for each game in the 2018 NBA season. Starting with $10,000, and making a bet each time my models predicted a confident bet (as above, set at the threshold of > 62% probability), I show how my model performs informing a betting strategy over the course of the season:
The final total for the betting strategy informed by my model’s predictions was $11,880, with an accuracy of 56.52% on the bets that were predicted confidently.
- Working with time series data was more difficult than I expected. Because I chose to represent an NBA game as 6 different sets of statistics, I had to coordinate a large amount of data along a time dimension. One of our instructors said 90% of the work in data science is cleaning and arranging data, and that was certainly true for this project.
- Although my models were successful in returning enough accurate predictions to return a profit, I would not use these model to inform a long-term betting strategy, or I would maybe use the OVER model I created in conjunction with other strategies. My reluctance stems from the limited # of games the models were able to predict confidently. Also the fact that I had to manually target the level at which the models were maximally effective is a caveat to my results.
- I had a good time working with sports betting data because the results were visible, and I could show clear benefit to the work I did.
I recognize the concepts present in the Data Science portions of my project can be difficult to grasp, so I’m breaking down some of these concepts and how they’re used in my project here.
- Data Science — The interdisciplinary practice of using scientific methods, algorithms, and systems to draw insights from data. Combines elements of computer programming, statistics, mathematics, machine learning, as well as domain-specific knowledge.
- Web Scraping — The process of writing code to pull data off of a website. For a look at the python code I used to scrape the data, see my Web Scraping Notebook
- Over | Under — The bar between the word over and under can be read as “Over or Under”
- Model — A model is a system of making a determination about the value of an unknowable data point, using information you have available about the characteristics of other data points.
- Logistic Regression — The process of determining a relationship between a set of data points, where each data point falls into one of two categories. This relationship can then be used to make predictions on the category of additional data points. Can be thought of as predicting classification into one of two categories (in this case, classifying each game as either OVER or UNDER)