Machine Learning in Fantasy Basketball: Data Collection Using BeautifulSoup

Samuel Mohebban
Jun 2, 2020 · 7 min read
Applies to FanDuel/DraftKings/Yahoo

This project is a great one for beginners, as it was the first machine learning project I ever did. Completing it allowed me to grasp the fundamentals of BeautifulSoup, Pandas, NumPy, and Scikit-Learn, all of which are extremely important in data science. Before I go into the code, it is important to first understand the goals of the project, as well as the data required to accomplish them. The architecture of this project is inspired by a Stanford article, which I recommend reading to get a better understanding as you attempt it yourself.

In every machine learning application, there are at least two critical phases: (1) training and (2) testing. The most common way to accomplish both is to first gather data, then split that data into a training set and a test set. During training, the model is given both inputs and outputs, which it uses to form predictions on data that is unlabeled or contains no output. If this sounds confusing, it's okay; I will go over this concept in more depth in the next few parts.
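To make the idea concrete, here is a minimal sketch of a train/test split using Scikit-Learn; the X and y arrays below are placeholder data, not the scraped stats we will build later:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: X holds inputs (features), y holds outputs (labels)
X = [[i] for i in range(100)]
y = [i * 2 for i in range(100)]

# Hold out 20% of the data; the model trains on the rest and is
# evaluated on the unseen 20% to estimate real-world performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```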

When I first started this project, data collection was the hardest part. I searched many websites to see if I could find multiple seasons' worth of NBA scores and was surprised by how complicated it could be. In the article I linked above, the authors mention they scraped their data from ESPN.com, which I could not do because the website has changed completely since the article's publication. I then found other options that offered fixed pricing for labeled CSV files, but paying for my data seemed like the lazy way out. I finally came across https://www.basketball-reference.com, which offers scores dating back more than 20 years!

***Code in this article can be found on my GitHub***

Python Libraries

  • Beautiful Soup (pip install beautifulsoup4)
  • Pandas (pip install pandas)
  • Requests (pip install requests)

When thinking about how much data to scrape, I considered the question: how far back do I have to go to determine whether a player will be good today? Using the Stanford article as a reference, I ended up at two seasons. The reason I chose two is that although many players are stars throughout their NBA careers, daily fantasy basketball is targeted more toward picking the wildcards that are inexpensive and score a lot of points. Moreover, sometimes the most valuable players are those that do not play much at all, so too much data could actually hurt these types of predictions. If you play fantasy basketball then you know what I am talking about; otherwise, this article is probably putting you to sleep.

The code I implement below is meant to scrape a single season, so if you wish to scrape more than one season, just run it again with a different initializing URL.

So that we are on the same page throughout the tutorial, I started my scraping on this URL. As you can see, this page shows a chart of every game and links to the box score for each. At the top of the chart, the page also features each month of the season, where the same box score format is repeated (see pictures below).

Example of season format (link)
Example of box scores for each game (link)

The way we will scrape the website is as follows. First, we will create a list that houses the links for each month of the season. For example, the first link will be the October page, the second will be the November page, and so on. After we have a list of each month's URL, we will iterate through the list and visit each page, so the first visit will be the page pictured above for the month of October. On each month page, we will then iterate through every box score and save the URL in another array. After we have the array of every single box score for the entire season, we will visit each URL and scrape the data for the game. The reason I first save all of the box score URLs in a separate array is that scraping each player's scores for every game across an entire season can take a very long time depending on your computer (25,000+ data points per season). By saving all the links to an array first, you can stop the program if you need to, then start it back up at the same index at which you stopped. You will see what I mean later on.

Before we begin the main script, create a file called mVariables.py and copy the code snippet below into it. This portion is just used for formatting the data so it is cleaner when saved to the CSV.

Variables/Dictionaries for formatting (mVariables.py)
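If the embedded gist does not load for you, here is a minimal sketch of the kind of formatting dictionaries a file like mVariables.py holds; the names and entries below are illustrative assumptions, not the full file:

```python
# mVariables.py -- illustrative sketch, not the full gist

# Month-page suffixes used by basketball-reference season schedules
MONTHS = ["october", "november", "december", "january",
          "february", "march", "april"]

# Map team abbreviations (as they appear in table ids/URLs) to full names
TEAMS = {
    "BOS": "Boston Celtics",
    "GSW": "Golden State Warriors",
    "LAL": "Los Angeles Lakers",
    # ... remaining teams
}

# Header row for the output CSV
CSV_COLUMNS = ["Player", "Team", "Date", "MP", "TS%", "eFG%",
               "3PAr", "FTr", "ORB%", "DRB%", "TRB%", "AST%",
               "STL%", "BLK%", "TOV%", "USG%", "ORtg", "DRtg"]
```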

Now that we have created an mVariables.py file with all the necessary dictionaries, create another file called mScraper.py. At the top of the file, import the following:

Imports/Dependencies (mScraper.py)
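If the gist does not render, the imports boil down to the three libraries listed earlier plus our variables file; the time import is an optional addition here for polite request pacing:

```python
# mScraper.py -- imports
import time                    # optional: pause between requests

import requests                # fetch each page's HTML
import pandas as pd            # build the DataFrame and write the CSV
from bs4 import BeautifulSoup  # parse the HTML

import mVariables              # formatting dictionaries from the previous step
```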

After we have the dependencies imported, it is time to move on to the actual script. To make it easier to follow, I decided to condense the entire process into a single function. Of course, you could separate the function into numerous different ones and apply pool processing, but I want to keep it simple for the sake of the tutorial.

Month Urls (mScraper.py)
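If the gist does not render, here is a sketch of this first step. Note that the gist condenses everything into one function; for readability this sketch breaks the month collection into its own helper, and the div.filter selector is an assumption about basketball-reference's current markup:

```python
BASE = "https://www.basketball-reference.com"

def get_month_links(base_url):
    """Collect the URL for each month page of the season."""
    soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
    month_link_array = []
    # The schedule page lists every month of the season in a filter bar
    for a in soup.select("div.filter a"):
        month_link_array.append(BASE + a["href"])
    return month_link_array
```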

In the snippet above, we created a function that takes a base URL. For our example, this URL will be the October page for the 2018–2019 season. If you wish to scrape another season, just feed it the URL for the first month of that season. The function then creates a month_link_array, which houses the URL for each month; we will iterate through it (code snippet below) to collect each box score URL.

Gather Box Score URLs (mScraper.py)
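A sketch of this step, continuing the helpers above; the data-stat attribute is an assumption about how the schedule table marks its box score cells:

```python
def get_box_score_links(month_link_array):
    """Visit each month page and collect every box score URL."""
    box_link_array = []
    for month_url in month_link_array:
        soup = BeautifulSoup(requests.get(month_url).text, "html.parser")
        # Each game row contains a cell that links to its box score page
        for cell in soup.find_all("td", {"data-stat": "box_score_text"}):
            link = cell.find("a")
            if link is not None:
                box_link_array.append(BASE + link["href"])
    return box_link_array
```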

In the snippet above, we created an array called box_link_array, which houses the links to every box score for the entire season. In the next part, we will iterate through each box score link and scrape the game data from each (this is the part that takes quite a bit of time).

Visiting each box score link and extracting the game data for each team (mScraper.py)
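A sketch of the extraction step; the table-id pattern ("game-advanced") and the row structure are assumptions about the site's current markup, and the row dictionary keys come straight from the table's data-stat attributes:

```python
def scrape_box_score(box_url, rows):
    """Append one row per player from both teams' advanced tables."""
    soup = BeautifulSoup(requests.get(box_url).text, "html.parser")
    date = box_url.split("/")[-1][:8]        # e.g. "20181016" from the URL
    for table in soup.find_all("table"):
        table_id = table.get("id", "")
        if not table_id.endswith("game-advanced"):
            continue                          # skip the basic stat tables
        team = table_id.split("-")[1]         # e.g. "BOS"
        for tr in table.find("tbody").find_all("tr"):
            th = tr.find("th")
            if th is None or "thead" in (tr.get("class") or []):
                continue                      # skip the "Reserves" divider row
            row = {"Player": th.get_text(), "Team": team, "Date": date}
            for td in tr.find_all("td"):
                row[td["data-stat"]] = td.get_text()
            rows.append(row)
```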

Here, we separated each player within a game into a separate array, then added each element of the array (box score) to a Pandas DataFrame. Because we scraped the advanced stats on the website, these are the columns that we focused on:


CSV Columns
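If the image above does not load: basketball-reference's advanced box score table carries, per player, minutes played (MP), true shooting percentage (TS%), effective field goal percentage (eFG%), attempt rates (3PAr, FTr), rebound, assist, steal, block, turnover, and usage percentages (ORB%, DRB%, TRB%, AST%, STL%, BLK%, TOV%, USG%), offensive and defensive ratings (ORtg, DRtg), and box plus/minus (BPM); alongside these we save the player name, team, and game date.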

Once you have that down, your scraper is complete! In the same file, call the season_stats_getter(url) function at the bottom of the page, as shown below. We added an enumeration for loop so that you can track the position in the array in case you need to pause and restart later. I have used this script to successfully scrape numerous seasons, but just in case, I would recommend adding a try and except statement to protect against a hard stop.

Start the script (mScraper.py)
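A sketch of the final wiring, combining the helpers above into season_stats_getter and guarding the loop as recommended; the output filename and the one-second delay are illustrative choices:

```python
def season_stats_getter(base_url):
    month_link_array = get_month_links(base_url)
    box_link_array = get_box_score_links(month_link_array)
    rows = []
    for i, box_url in enumerate(box_link_array):
        print(i, box_url)                 # track position so you can resume
        try:
            scrape_box_score(box_url, rows)
        except Exception as exc:          # protect against a hard stop
            print(f"failed at index {i}: {exc}")
        time.sleep(1)                     # be polite to the server
    pd.DataFrame(rows).to_csv("2018_2019_season.csv", index=False)

if __name__ == "__main__":
    # First month page of the 2018-2019 season
    url = "https://www.basketball-reference.com/leagues/NBA_2019_games-october.html"
    season_stats_getter(url)
```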

In the terminal, navigate to the project directory, type python3 mScraper.py, and press Enter.

Terminal Example
CSV File (Kaggle)

And voilà! You have created a web scraper that can handle any season on https://basketball-reference.com. After you successfully run the program for a single season, you will see that the resulting CSV file can contain 25,000+ data points. So, between two seasons, 50,000 data points should be sufficient for training our ML algorithm.

Although the function is pretty long, when you split it up into smaller parts it is not as complicated as it seems. There are surely many ways to make it faster and more memory efficient — so please comment below and share your optimization practices!

This section of the tutorial was not ML heavy; however, it is one of the most important steps in developing a reliable model. In the next parts of this project, I will use this data to implement regression algorithms to predict which players will score the most fantasy points on a given night. After that, we will explore constraint satisfaction problems (CSPs) to determine a lineup consisting of the best possible combination of players on a given night.

Check out my GitHub to see the code for this project! You can also visit my Kaggle to access the full csv files I scraped from my local computer as well as other data sets I have used in different projects.


Samuel Mohebban

Data Scientist | Senior Machine Learning Engineer | BA George Washington University | MS Stevens Institute of Technology