Unmasking Basketball Using Python
Part 1: Scraping with Beautiful Soup
In this blog series, we leverage powerful Python libraries to help uncover some hidden statistical truths in basketball. The first step in any data-driven approach is to identify and collect the data needed.
Luckily for us, Basketball-Reference.com hosts pages of basketball data that can be easily scraped. The process in this walkthrough can be applied to any number of their pages, but for this case, we plan on scraping seasonal statistics of multiple rookie classes.
The Objectives
- Identify the Data Source
- Download the Page
- Identify Important Page Elements
- Pre-Clean and Extract
- Archive
The Tools
- Requests package
- Beautiful Soup package
- Pandas package
Though we will inevitably be working with a multitude of specialized libraries throughout this project, the above packages will suffice for now.
Identifying the Data Source
Basketball-Reference.com hosts hundreds of curated pages on basketball statistics that range from seasonal averages of typical box score categories like points, rebounds and shooting percentages, all the way down to the play-by-play action of each game played in the last 20 or so years. One can easily lose their way in this statistical tsunami without a clear goal for what exactly to look for.
The goal here in this post is simple: get rookie data that will help in assessing a young player’s true value and potential.
The following link is one such page. It lists all the relevant statistics of rookies in a particular season.
https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html
In order to accumulate enough data to make solid statistical inferences on players, one year of data won’t cut it. Dozens of years’ worth of data need to be collected to help filter through the noise and come to a conclusion on a player’s future potential.
If an action can be manually repeated, it makes itself a great candidate for automation. In this case, the number in the URL above corresponds to the respective year of that rookie class. Powered with that knowledge, let’s start putting together our first lines of code.
import requests
from bs4 import BeautifulSoup
import pandas as pd
years = list(range(1990, 2017))
url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"
In creating the two variables referenced above, our thought process is as follows.
- The appropriate packages are imported
- The years list variable specifies the range of the desired years: 1990 through 2016 (range stops one short of its second argument, 2017)
- url_base stores the unformatted string template of the target URL
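As a quick sanity check on the two variables, the short sketch below prints the first and last season in years along with the first formatted URL. It only reuses the names from the snippet above and is meant as an illustration, not as part of the scraper itself.

```python
# Rebuild the two variables from the snippet above.
years = list(range(1990, 2017))
url_base = "https://www.basketball-reference.com/leagues/NBA_{}_rookies-season-stats.html"

# range() excludes its stop value, so the last season is 2016, not 2017.
first_year, last_year = years[0], years[-1]
print(first_year, last_year, len(years))

# str.format() drops the year into the {} placeholder.
first_url = url_base.format(first_year)
print(first_url)
```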
Downloading the Page Data
In scraping web pages, it’s imperative to remove as much overhead as possible. Seeing as the site stores all their information on the HTML front end, the page can be easily downloaded and locally stored in its entirety.
#loop iterates through years
for year in years:
    url = url_base.format(year)
    data = requests.get(url)
    #each page year is locally stored as an HTML file
    #utf-8 is set explicitly so player names with accented characters are written safely
    with open("notebooks/Rookies/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text)
- The for loop iterates through the list variable years
- The curly braces found within the url_base string allow the format method to substitute in the currently iterated year
  - For example, in its first iteration, the url value will be https://www.basketball-reference.com/leagues/NBA_1990_rookies-season-stats.html
  - On its second iteration, the subsequent year would be referenced instead: https://www.basketball-reference.com/leagues/NBA_1991_rookies-season-stats.html
- The data variable holds the response returned by the requests.get() function for the currently iterated url string value
- The requests.get() function uses the newly formatted URL string to retrieve the page in question
- The subsequent with open() opens a file for reading and writing (w+), writes the page data from our requests.get response (data.text) to it, and so locally stores the newly created HTML files
Why download the page and store it locally?
To avoid a common growing pain in site scraping, we store these pages as local HTML files. See, when you visit a page, the server hosting it honors your request and sends the appropriate data back to your browser. But having one specific client ask for the same information over and over puts undue strain on the server. The server admin is well within their rights to block these persistent requests for the sake of being able to optimally provide this service to others online.
By downloading these HTML files on your local machine, you avoid two things:
- Having to wait longer than usual to collect the same data
- Being blocked from visiting the page, halting data collection altogether
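One way to put this into practice is to skip any year that is already on disk and to pause between the requests that remain. The sketch below is only an illustration: download_pages, its fetch parameter, and the three-second delay_seconds default are hypothetical choices made for this example, not anything the site documents.

```python
import time
from pathlib import Path


def download_pages(years, url_base, out_dir, fetch, delay_seconds=3.0):
    """Save one HTML file per year, skipping years that are already on disk.

    fetch is any callable that takes a URL and returns the page text.
    delay_seconds is an assumed pause between requests, not a documented limit.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for year in years:
        target = out_path / "{}.html".format(year)
        if target.exists():
            # Already cached locally: no need to hit the server again.
            continue
        if downloaded:
            # Pause before every request after the first one in this run.
            time.sleep(delay_seconds)
        target.write_text(fetch(url_base.format(year)), encoding="utf-8")
        downloaded.append(year)
    return downloaded
```

With the variables from earlier, fetch could be something like lambda url: requests.get(url).text, so a call might look like download_pages(years, url_base, "notebooks/Rookies", fetch). Keeping the network call behind a parameter also makes the function easy to try out with a stand-in before pointing it at the real site.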
Identifying Important Page Elements
There are a few advantages in scraping data from HTML formats. One of the most important is how element properties can be uniquely identified. If they can be uniquely identified, they can be scraped using Beautiful Soup. But to do that, one must know how to inspect these HTML elements themselves.
How to Inspect
We’ll need to dive deeper into the inner workings of this document, but I promise I won’t make this an exercise on learning HTML.
If you know how to inspect HTML objects, feel free to jump ahead. Otherwise, please follow along on how to inspect page elements.
Option 1: Developer Tools
- Click on the three vertical dots on Chrome’s top menu bar
- Choose “More tools”
- Select “Developer tools”
Option 2: Menu Select
- Right-click on the web page
- Choose “Inspect” to access the Developer tools panel
Inspecting the Page
Since all of these pages are stored locally, we can view them either by opening them from the file system in our browser of choice, or by continuing to build our code with the following snippet.
with open("notebooks/Rookies/2000.html", encoding="utf-8") as f:
    page = f.read()
Below is the loaded page with Developer Tools docked to the right. Notice how hovering the mouse cursor over the HTML line containing the ID rookies highlights the table element on the page?
All the desired data of this page is housed in that table element. Before hastily sucking up all of this data as is, now is the best time to consider whether everything on this table is worth collecting.
Pre-Clean
Pre-cleaning might not be a frequent word in your vocabulary, but for those of you who see yourselves scraping data on a regular basis, it should be. If you want to avoid the frustration of wasted hours of progress on a data collection project, it’s best to first separate the wheat from the chaff.
For instance, take note of the three elements boxed in red.
One row serves as the “main” table header. The other two rows are duplicate instances of the same artifacts found at the top. This pattern repeats every 20th row.
Further inspection reveals that all of these rows have the same tr (table row) HTML tag. What distinguishes these elements from one another is their class names.
- Main Header Row
  - Class = over_header
- Repeat Header Rows
  - Class = over_header thead
- Statistics Category Row
  - Class = thead
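The same inspection can be done in code by tallying the class attribute of every row in the table. The sketch below runs on a small, invented HTML sample (sample_html) that only mimics the header layout described above; the real pages are much larger and their markup may differ.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Invented sample markup that mimics the header layout described above.
sample_html = """
<table id="rookies">
  <thead>
    <tr class="over_header"><th colspan="2">Totals</th></tr>
    <tr><th>Player</th><th>PTS</th></tr>
  </thead>
  <tbody>
    <tr><td>Player A</td><td>100</td></tr>
    <tr class="over_header thead"><th colspan="2">Totals</th></tr>
    <tr class="thead"><th>Player</th><th>PTS</th></tr>
    <tr><td>Player B</td><td>200</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find(id="rookies")

# Count how often each class attribute value appears on a table row.
# Rows without a class attribute are tallied under "(none)".
class_counts = Counter(
    " ".join(tr.get("class", [])) or "(none)" for tr in table.find_all("tr")
)
print(class_counts)
```

Running a tally like this against one of the saved pages is a quick way to confirm which class names are worth filtering before writing the cleanup loop.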
#array to house list of dataframes
dfs = []
#unnecessary table rows to be removed
classes = ["over_header", "over_header thead", "thead"]
- dfs will be used later on to house several dataframes
- The classes array object will hold all of the unwanted table row elements' class names
Since these elements provide no statistical value, they should be completely omitted rather than simply “skipped over” in our parse. That is to say, they are permanently removed from any future consideration.
The decompose method serves to remove such unwanted elements from a page. As per the official Beautiful Soup documentation:
decompose()
Tag.decompose() removes a tag from the tree, then completely destroys it and its contents.
Below is a snippet of code where we apply the decompose method with the help of for loops.
#for loop to iterate through the years
for year in years:
    with open("notebooks/Rookies/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    #for loop cleans up unnecessary table headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class":i}):
            tr.decompose()
- The first for loop iterates through the values of our years list object
- The with statement opens each locally stored HTML file, and its contents are read into the page variable
- An HTML parser is initialized by instantiating the BeautifulSoup class and passing in both the page string object and html.parser
- The second for loop iterates through the values in the classes array
- The third for loop uses Beautiful Soup’s find_all method to identify elements that have both tr tags and class names matching those in classes
- tr.decompose serves to remove each of the identified table row elements from the page entirely
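To see the effect of this loop in isolation, the sketch below applies the same cleanup to a small, invented HTML sample (sample_html) instead of a saved page. The sample and the counting variables are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Invented sample markup; the real pages are far larger.
sample_html = """
<table id="rookies">
  <thead>
    <tr class="over_header"><th colspan="2">Totals</th></tr>
    <tr><th>Player</th><th>PTS</th></tr>
  </thead>
  <tbody>
    <tr><td>Player A</td><td>100</td></tr>
    <tr class="over_header thead"><th colspan="2">Totals</th></tr>
    <tr class="thead"><th>Player</th><th>PTS</th></tr>
    <tr><td>Player B</td><td>200</td></tr>
  </tbody>
</table>
"""

classes = ["over_header", "over_header thead", "thead"]
soup = BeautifulSoup(sample_html, "html.parser")

rows_before = len(soup.find_all("tr"))
for class_name in classes:
    # find_all returns a list, so removing rows while looping over it is safe.
    for tr in soup.find_all("tr", {"class": class_name}):
        tr.decompose()
rows_after = len(soup.find_all("tr"))

# Only the column header row and the two player rows should remain.
remaining_text = [tr.get_text(" ", strip=True) for tr in soup.find_all("tr")]
print(rows_before, rows_after, remaining_text)
```

In this sample, searching for the single class name over_header also matches the row whose class attribute is over_header thead, because Beautiful Soup compares against each individual class on a tag. Listing all three names is therefore redundant here, though harmless.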
Let’s look to build on this by extracting the data we do want.
Extracting the Data
We can finally start working on the part of the code that actually extracts data from the table. Remember that the table with all of the relevant data has the unique HTML ID rookies. The following additions to our code will serve to parse the data of this table.
#the years we wish to parse for
years = list(range(1990, 2017))
#array to house list of dataframes
dfs = []
#unnecessary table headers to be removed
classes = ["over_header","over_header thead", "thead"]
for year in years:
    #download step, commented out because the pages are already stored locally
    #url = url_base.format(year)
    #data = requests.get(url)
    ##page is saved as an html and placed in Rookies folder
    #with open("notebooks/Rookies/{}.html".format(year), "w+") as f:
        #f.write(data.text)
    #with open("notebooks/Rookies/2000.html") as f:
    with open("notebooks/Rookies/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    #for loop cleans up unnecessary table headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class":i}):
            tr.decompose()
    ### Scraping Block ###
    #identifies, scrapes, and loads rookie tables into one dataframe
    rookie_table = soup.find(id="rookies")
    rookies = pd.read_html(str(rookie_table))[0]
    rookies["Year"] = year
    dfs.append(rookies)
#new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)
For what follows ### Scraping Block ###:
- rookie_table uses Beautiful Soup's find method to identify the HTML element with ID rookies
- Seeing that Pandas can read HTML tables, the rookie table is loaded into Pandas using the read_html method, passing rookie_table into it as a string
- Tacking [0] on to the end turns rookies from a list of data frames into a single data frame
- A “Year” column is added to the end of the rookies data frame
- dfs.append(rookies) collects the tables of every rookie year into a list of data frames, in the order they were iterated
- The Pandas function concat is used to combine the list of data frames into one single data frame: all_rookies
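The last three steps can be tried out without any HTML at all. The sketch below uses two tiny, made-up data frames in place of parsed rookie tables; the player names and point totals are placeholders, and ignore_index=True is an optional extra that the code above does not use.

```python
import pandas as pd

# Two tiny, made-up tables standing in for two parsed rookie classes.
rookies_1990 = pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": [100, 200]})
rookies_1991 = pd.DataFrame({"Player": ["Player C"], "PTS": [300]})

dfs = []
for year, rookies in [(1990, rookies_1990), (1991, rookies_1991)]:
    rookies["Year"] = year  # the new column is appended as the last column
    dfs.append(rookies)

# ignore_index=True renumbers the rows 0..n-1 instead of
# repeating each table's own index in the combined frame.
all_rookies = pd.concat(dfs, ignore_index=True)
print(all_rookies)
```

Without ignore_index=True, every yearly table keeps its own row numbers starting at 0, so the combined index contains duplicates. Whether that matters depends on how the data is used later.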
Archiving
Our final step involves taking all of this useful, clean information and archiving it in CSV format, which is easy for both humans and machines to read. Tacking this line on to the end of our code makes it easy to come back and reference the data collected.
#a single dataframe archived as a local CSV using Pandas
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")
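One detail worth knowing is that to_csv does not create missing folders, and by default it also writes the row index as an extra column. The sketch below shows one possible way to handle both, using a made-up data frame and a temporary folder so that no real files are touched; the index=False choice is an assumption about what is wanted in the archive.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the combined dataframe.
all_rookies = pd.DataFrame({"Player": ["Player A", "Player B"], "Year": [1990, 1991]})

# A temporary folder is used here so the sketch does not touch real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    archive_dir = os.path.join(tmp_dir, "archive")
    os.makedirs(archive_dir, exist_ok=True)  # to_csv does not create folders
    csv_path = os.path.join(archive_dir, "NBA_Rookies_1990-2016.csv")

    # index=False leaves the row index out of the file.
    all_rookies.to_csv(csv_path, index=False)

    # Reading the file back confirms the round trip.
    restored = pd.read_csv(csv_path)

print(restored)
```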
Final Product
import requests
import pandas as pd
from bs4 import BeautifulSoup
years = list(range(1990, 2017))
dfs = []
classes = ["over_header","over_header thead", "thead"] #unnecessary table headers to be removed
for year in years:
    with open("C:/Users/cisco/OneDrive/python/Projects/Basketball/Rookies/{}.html".format(year), encoding = "utf-8") as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    #second for loop cleans up unnecessary table headers from reappearing in rows
    for i in classes:
        for tr in soup.find_all("tr", {"class":i}):
            tr.decompose()
    #identifies, scrapes, and loads rookie tables into one dataframe
    table_rookies = soup.find(id="rookies")
    rookies = pd.read_html(str(table_rookies))[0]
    rookies["Year"] = year
    dfs.append(rookies)
#new variable turns list of dataframes into single dataframe
all_rookies = pd.concat(dfs)
#dataframe archived as local CSV
all_rookies.to_csv("archive/NBA_Rookies_1990-2016.csv")
Closing
Again, the process followed in this walkthrough should apply to almost every other page on Basketball-Reference.com. There are five simple steps worth taking in each instance.
- Identify the Page URL
- Download the Page
- Identify the Elements
- Pre-Clean and Extract
- Archive
Following these five steps will help guarantee a quick and successful scraping experience.
Next up in this series, we will actually use this data to gain insight into future player potential. So follow me to keep upcoming installments on your radar!