Intro to Scraping NBA Data with BeautifulSoup

Dan Watson
Hardwood Convergence
11 min read · Jul 21, 2019

Thanks for being patient while we worked on setting up our environment. Now we’re jumping into coding and manipulating data. Here’s the plan for this post:

  • Set up a virtual environment
  • Install ipython, jupyter, beautifulsoup4, and pandas within the virtual environment
  • Learn to scrape, organize, and save web data

Setting Up A Virtual Environment

Why are we doing this? Virtual environments are just isolated environments for python projects. They let you use different versions of modules for different projects without causing conflicts between the versions. Let’s start by installing virtualenv. You can do this by opening terminal and entering:

pip install virtualenv

Now, navigate to the medium_tutorials folder and we’ll create a new virtual environment. We do this by typing:

virtualenv .venv
Creating a virtual environment

Now that the virtual environment is created, we activate it with either of the following commands:

source .venv/bin/activate

or

. .venv/bin/activate

You can see that when we activate the environment, it’ll show (.venv) at the beginning of the command line. Easy peasy.

Entered virtual environment

Install Packages in the Virtual Environment

The packages that you’ve installed, or that came with your python distribution, typically aren’t available in your new virtual environment. Let’s make our new space functional by installing ipython in our virtual environment:

pip install ipython
Installing ipython in our virtual environment

Now we can install Jupyter so we can start using notebooks:

pip install jupyter

With Jupyter installed, we can install a Jupyter kernel in our virtual environment:

ipython kernel install --user --name=.venv

Finally, let’s install BeautifulSoup4, requests, and pandas so we can start scraping data and performing some basic calculations:

pip install beautifulsoup4 requests pandas

Time to Start Coding!


Notebook Setup

I’m going to start by creating a new folder called web_scraping in the project folder. I’ll change into that directory and then start Jupyter Notebook by entering this in the terminal:

jupyter notebook

This will open your browser to a page stating that the notebook list is empty. Let’s change that by clicking “New” in the top right corner and then .venv under Notebook.

Create notebook using virtual environment kernel

Now we have a blank notebook using our virtual environment. Let’s do some quick documentation before we write our code. First, we’ll click on the name “Untitled” at the top of the screen to rename our notebook. We’ll call this web_scraping_tutorial.

Renaming our notebook

Next, I always like to make the top cell of the notebook a quick description of what the notebook does. We can write this in Markdown so that it formats nicely. To get into Markdown mode, you can select “Markdown” from the dropdown menu next to the word “Code”, or select the cell and hit control+m, m.

If you’ve never worked with markdown, it allows you to easily format text using basic commands. Instead of spending time on it, read here if you want to learn the basic syntax. Otherwise, just follow my lead and you’ll be fine.

I entered some initial description text. You can see the markdown version on the left and the resulting formatting on the right:

Markdown text and the resulting output
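
If the screenshot doesn’t come through for you, a description cell can be as simple as this (write whatever fits your project; mine is just a couple of lines):

# Web Scraping Tutorial
Scrapes James Harden's 2018-19 game logs from Basketball Reference,
loads them into a pandas DataFrame, and saves the result as a csv.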

Importing Modules

Now let’s import the modules we plan to use. We’ll import requests to pull html, BeautifulSoup4 for parsing, pandas so we can manipulate data tables, and os to manipulate directories when saving data.

If you’re not familiar with python basics or importing modules, now is definitely the time to go back and do some tutorials. I linked these in previous posts, but I highly recommend Sentdex’s Youtube tutorials and treehouse.

Anyway, here’s what the imports should look like:
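If you’re reading along without the screenshots, these four imports are all we need (from bs4 import BeautifulSoup is the standard import for the beautifulsoup4 package):

import os

import requests
import pandas as pd
from bs4 import BeautifulSoup
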

Modules imported and ready to scrape!

Requesting Data from Basketball Reference

We need to understand the basics of html to scrape data off the web. If you are somewhat familiar with html, then you’re good to go. If you haven’t seen html before or need a quick refresher, this 12-minute video should teach you enough to understand what we’re doing:

HTML refresher by Jake Wright

Now we just need to find a page that we want to pull data from. Let’s go with James Harden’s 2018–19 game logs on basketball reference. Click on that link and copy the url. We’ll need this for our code. In your next cell, type the following code:

url = 'https://www.basketball-reference.com/players/h/hardeja01/gamelog/2019/'
page = requests.get(url)
page

This code saves our url text as a variable named url. We then pass that url into the requests library’s get function and save the response as page. Calling page on the last line gives us the server response. It should print <Response [200]>, which is a successful server response.

You don’t have to follow along with these next few parts, but if you check out the screenshot below, you’ll see that when I call page.content, it returns a messy string of data starting with b’. Using the type function in python, we see that it’s returning bytes, and those bytes are very messy. Let’s start using BeautifulSoup to clean this up and make those bytes useful.
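
If you do want to poke at it yourself, the checks look something like this (output truncated here for sanity):

print(type(page.content))   # <class 'bytes'>
print(page.content[:200])   # the first 200 bytes of raw html, messy!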

We have data and it’s messy!

Cleanup with BeautifulSoup

We already imported BeautifulSoup in our code, so now we just have to pass our page content to it. We’ll do that with the following code:

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

Here, we’re passing our messy page.content data into BeautifulSoup and using the html parser. We save this output as soup. We then just prettify our soup and print the output. You can always check out the raw soup variable as well. It’s the same data as the prettified version, but just less pretty.

We have the content parsed in BeautifulSoup, but how do we know where to find the data we need? We can scroll through the html text, or we can do a simple cmd-f to find a specific value from the game logs page in the html. Harden played 34:43 in the first game of the season; that’s pretty unique, so let’s search for that value in the html.

Searching for values is easier than scrolling forever

The first thing we notice is that the stat is contained in a table data (td) tag. You’ll also see a data-stat attribute with a value of “mp”. Looking at the next td tag, you see another data-stat of “fg”. If you search through the rest of these tags, you’ll see that they correspond to the columns of the online table.
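
You can verify this from the notebook, too. Here’s a quick sketch; the exact tag contents will vary, but the first “mp” cell should match the minutes we just searched for:

mp_cell = soup.find('td', {'data-stat': 'mp'})   # first td whose data-stat is 'mp'
print(mp_cell)                                   # the full tag, e.g. <td ... data-stat="mp">34:43</td>
print(mp_cell.getText())                         # just the text: '34:43'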

Ok, fair warning: the next piece of script isn’t the most optimized, but what I’m going to do is create a list of all the data-stat values we want to pull. A better way to do this could be to actually pull those elements from the html, but we’re not worrying about that right now. Once we have the stats in a list, we’re going to use a nested list comprehension to pull all of James Harden’s stats. Here’s the code:

stats = ['game_season', 'date_game', 'age', 'team_id', 'game_location', 'opp_id', 'game_result',
         'gs', 'mp', 'fg', 'fga', 'fg_pct', 'fg3', 'fg3a', 'fg3_pct', 'ft', 'fta', 'ft_pct',
         'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts', 'game_score', 'plus_minus']
stats_list = [[td.getText() for td in soup.findAll('td', {'data-stat': stat})] for stat in stats]

Let’s break down that stats_list variable so we’re all clear on what’s happening there. When you learn loops in python, they typically look something like this:

list_in = [1,2,3]
list_out = []
for i in list_in:
    list_out.append(i+1)

With a list comprehension, we can perform this same action while keeping our code a bit cleaner and shorter:

list_in = [1,2,3]
list_out = [i + 1 for i in list_in]

It’s cleaner because it tells us the action right away, and it’s typically faster. As a general rule of thumb, use a list comprehension instead of a for loop when it’s simple to understand. If you need more than one nested list comprehension or advanced logic, use the slower loop with cleaner code. Anyway, with our code above, we’ll start with the inner list comprehension:

[td.getText() for td in soup.findAll('td', {'data-stat': stat})]

This code says: search our soup for all table data elements whose data-stat attribute equals our stat value, and get the text of each one. Since the stat value isn’t defined on its own, this is where the outer list comprehension comes into play. We’ll use pseudocode here:

[[inner list] for stat in stats]

We know the inner code gets the table data value where data-stat equals stat. The outer code just says to run that process for every stat in our list of stats. After running this code, we can check out our stats_list variable and see the following:

stats_list output
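
If you can’t see the output, the shape is the important part: stats_list is a list of lists, one inner list per stat, in the same order as stats. A couple of illustrative checks (the exact values come from the page):

len(stats_list)     # 29, one inner list per entry in stats
stats_list[8][:2]   # the first couple of 'mp' values, e.g. ['34:43', ...]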

From Nested List to a Pandas DataFrame

With Pandas we can easily put our data into a DataFrame, which makes it easy to view and manipulate the data.

Let’s start by trying to create a DataFrame directly from stats_list and view the first 5 rows:

pd.DataFrame(stats_list).head(5)
Initial test load into a DataFrame

You can immediately tell that this data doesn’t look quite right. In fact, we see that we have 82 columns. The data isn’t oriented correctly for our purposes. Let’s transpose the data and check the first 10 rows.

pd.DataFrame(stats_list).T.head(10)
Transposed test load into a DataFrame

At first glance this looks great, but check out rows 4–6. You’ll see there is no data loaded in the 0 column. When we compare this against the game logs on basketball-reference, we see that Harden was inactive for these games. Despite Harden being inactive, our DataFrame is showing stats.

The issue here stems from the html structure of the data. When a player is inactive, basketball-reference only sends over the value “Inactive” once, and it spans across all of the columns after the result column. We need to shift around some data and recreate our DataFrames.
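
You can sanity-check this from the lists we already pulled; the two groups of stats end up with different lengths (the exact numbers depend on how many games Harden missed):

len(stats_list[0])   # 'game_season': one td for every game row, active or not
len(stats_list[8])   # 'mp': these tds only exist for games Harden actually played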

Fixing Our Data Pull and Finalizing the DataFrame

If you look at the game logs, you’ll notice that the data is complete for the game (g) through result columns, regardless of whether or not the player was active. The columns from game starts through plus-minus are only filled in if the player was active. Let’s re-pull the data into separate lists based on this differentiator.

stats_left = [[td.getText() for td in soup.findAll('td', {'data-stat': stat})] for stat in stats[:7]]
stats_right = [[td.getText() for td in soup.findAll('td', {'data-stat': stat})] for stat in stats[7:]]

This is the same code we used above for our stats_list variable, but instead of looping through every stat, the stats_left variable looks at the first 7 stats and stats_right loops through the rest.

Now we’ll put the left values into a pandas DataFrame and view the top 5 rows to make sure it worked correctly.

df_left = pd.DataFrame(stats_left).T
df_left.columns = stats[:7]
df_left.head(5)

That looks good. Now we just need to write a quick piece of code to insert blanks when Harden was inactive. Let’s try this:

for i in range(len(df_left)):
    if df_left['game_season'][i] == "":
        [stats_right[x].insert(i, '') for x in range(len(stats_right))]

Let’s break this down to make sure it’s clear. This is definitely a time when forcing everything into one big list comprehension would have made our code unnecessarily hard to read, so we used a regular loop. The first line just takes the length of df_left and passes that into a range function. We know the length is 82 due to an 82-game season. So for i in range 0–81, we look at the game_season column of df_left and see if the i-th value is blank. If it is, we run another list comprehension. We’ll look at this one by itself to make sure it is clear:

[stats_right[x].insert(i, '') for x in range(len(stats_right))]

Remember that stats_right is a nested list. So we’re going to insert a blank value at index i (i.e. the same index value where df_left['game_season'] is blank) in every stat list in stats_right.
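
If list.insert is new to you, here’s the behavior in isolation:

row = ['10', '12', '9']
row.insert(1, '')    # insert a blank at index 1, shifting the rest right
row                  # ['10', '', '12', '9']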

Now, let’s put these values for the right side of the data in their own DataFrame and then we’ll combine the two DataFrames:

df_right = pd.DataFrame(stats_right).T
df_right.columns = stats[7:]
df = pd.concat([df_left, df_right], axis=1)
Completed DataFrame with valid data. BOOYEAH!

Now we have a working DataFrame! Let’s do a simple test to make sure the numbers make sense. Let’s just calculate Harden’s average points per game:

df[df.game_season!='']['pts'].astype(int).mean().round(1)

This is a bit funky. Since all the values were read in as strings, we have to change the points to integers. However, we can’t change blanks to integers, so we only choose values from games where Harden played. Then we take the mean of the pts column and round it to one decimal place. The value: 36.1.
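
The same calculation, unpacked into steps, might be easier to read:

played = df[df.game_season != '']   # keep only rows where Harden actually played
pts = played['pts'].astype(int)     # the scraped values are strings, so cast them
pts.mean().round(1)                 # 36.1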

Saving the DataFrame

Our last step is going to be saving this DataFrame. First, let’s make a new folder in our web_scraping folder called game_logs. We’ll navigate to that folder and then save our DataFrame as a csv there.

os.chdir('./game_logs')
df.to_csv('2018_19_Harden_James.csv', index=False)

The first line just changes directory, using a relative path from our current location, into the game_logs folder. The next line calls the to_csv() method on our DataFrame. We name the output file and tell pandas not to include a column for the index.
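
One caveat: os.chdir will fail if game_logs doesn’t exist yet. If you’d rather create the folder from code instead of by hand (a small variation, not what the original notebook does), something like this works:

os.makedirs('./game_logs', exist_ok=True)                       # create the folder if it doesn't exist yet
df.to_csv('./game_logs/2018_19_Harden_James.csv', index=False)  # save without changing directory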

Don’t want to write your own version of this code? Pick up this version on github here.

What’s Next

We covered a lot of ground in this tutorial. We learned how to set up a virtual environment, scrape a website, manipulate lists with list comprehensions, store data in a dataframe, and save that data in a csv. Next, we’ll work with Harden’s game log to calculate some metrics and create basic plots. After we get a handle on getting information from a small data set, we’ll move on to finding faster and easier ways to access more data and expand our analyses.

Things are going to pick up from here as we’ll be spending more time with the actual basketball data and less time setting up our environment. Looking forward to it, hope you are too!

Thanks for checking this out. If you learned anything from this tutorial, please smash that clap button and share with a friend! 👏👏👏
