Web-Scraping NBA All-Star Data

Amanda Piter
Published in Analytics Vidhya
7 min read · Feb 23, 2021

Learn how to scrape, clean, and merge data frames in R using rvest, janitor, and hablar packages.

Photo by Markus Spiske on Unsplash

Following the announcement of the 2021 NBA All-Star starters and leading up to the March 7th NBA All-Star game, you might be interested in exploring statistics for previous All-Stars.

This article will walk through the steps for using the rvest package in R to scrape datasets from Basketball-Reference.com and then using the janitor package to clean the data before using it for your analysis.

After pulling and cleaning the data we will combine the following datasets for further analysis:

  1. All-Star Players: Lists all the NBA All-Stars and their average stats across all All-Star Games.
  2. O’Neal: Lists Shaquille O’Neal’s average stats by season.

Why Web Scraping?

While a single dataset could easily be downloaded from Basketball-Reference, the true value of web scraping is the ability to update datasets as more information becomes available.

For example, a previous article used player stats for NBA big man Nikola Vucevic, which Basketball-Reference updates after each game he plays. With a web-scraping method, updating the data frame in the code is simple: re-run the code and the additional games from the website are included.

In contrast, by downloading data from Basketball-Reference as a CSV and then importing it into R, each time Vucevic plays a new game the dataset would need to be re-downloaded and imported into R. This becomes increasingly cumbersome if connecting multiple datasets with regular updates.

Installing Packages

In RStudio, start by loading the following packages (install any that are missing with install.packages() first):

library(ggplot2) #version 3.3.3 Graphics 
library(hablar) #version 0.3.0 Converts Data Types
library(janitor) #version 2.1.0 Data Cleaning
library(rvest) #version 0.3.6 Web Scraping
library(tidyverse) #version 1.3.0 Allows for Piping

Web Scraping with rvest

The first step in web scraping is identifying a web page that holds your dataset and assigning its URL to a variable. Here we use a dataset where each row represents a single player and their average stats across all NBA All-Star games. The data comes from the All-Star career stats page on Basketball-Reference.

In R we store the page URL in a variable, read the page, and then create the AllStarPlayers data frame from the first table on the site.

#NBA All-Star Career Stats by Player
url <- "https://www.basketball-reference.com/allstar/NBA-allstar-career-stats.html"
#Read the html page (read_html() needs only the URL)
pageobj <- read_html(url)
#Here, we indicate that this is the table we want to extract,
#then transform it to a data table
AllStarPlayers <- pageobj %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill = TRUE)

Notice for All-Star Player stats there is a single data table on the website we specified. Therefore, as shown above we referenced the page object as [[1]] indicating it is the first (in this case only) data table on the site. Other web pages may have multiple data sets on a single page. For example, typically player game logs are the eighth data table on the corresponding Basketball-Reference pages. In that case, the page object would be [[8]].

There are more sophisticated methods to determine which table on a page holds the dataset, but most of the time the easiest method is simply to count.
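As one hedged alternative to counting, the tables on a page can be listed and then selected by an attribute such as id. The sketch below parses a small inline HTML string so it runs offline; the ids totals and per_game and all values are invented for illustration and are not taken from Basketball-Reference.

```r
library(rvest)

# Hypothetical stand-in page with two tables (ids and values are invented)
page <- read_html('<html><body>
  <table id="totals">
    <tr><th>Player</th><th>G</th></tr>
    <tr><td>Player A</td><td>3</td></tr>
  </table>
  <table id="per_game">
    <tr><th>Player</th><th>PTS</th></tr>
    <tr><td>Player A</td><td>12.5</td></tr>
  </table>
</body></html>')

# List every table node and look at its id attribute
tables <- html_nodes(page, "table")
table_count <- length(tables)
table_ids <- html_attr(tables, "id")

# Select one table by CSS id instead of by position
per_game <- html_table(html_node(page, "table#per_game"))

print(table_ids)
print(per_game)
```

Selecting by id keeps working if the site adds or reorders tables, whereas a position such as [[8]] would silently point at a different table.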

Data Cleaning: Updating Variable Names

While highly accurate, data sourced from Basketball-Reference needs a bit of cleaning before we can use it in our model. For this dataset, we need to update the variable names and remove subheadings.

The janitor package offers two functions that help with the cleaning process:

  • row_to_names() specifies the row in your dataset to use as column names.
  • clean_names() converts column names to a standard naming convention. The default is snake_case but you can specify others.
#Convert Row One to Variable Column Names 
AllStarPlayers <- row_to_names(AllStarPlayers, 1, remove_row = TRUE, remove_rows_above = FALSE)
#Converts Column names to follow tidyverse style guide
AllStarPlayers <- clean_names(AllStarPlayers)
names(AllStarPlayers)

The updated variable names after using row_to_names() and clean_names() functions are below:

Image Created By Author in RStudio
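To see the two janitor functions in isolation, here is a small sketch on a made-up data frame whose first row holds the real headers. The raw object and its values are hypothetical; the case argument is how clean_names() accepts a convention other than snake_case.

```r
library(janitor)

# Hypothetical raw table: the real headers sit in row 1, as after scraping
raw <- data.frame(
  X1 = c("Player", "Player A", "Player B"),
  X2 = c("Games Started", "3", "5"),
  stringsAsFactors = FALSE
)

# Promote row 1 to column names and drop it from the data
named <- row_to_names(raw, row_number = 1)

# Default output is snake_case; other conventions are requested via `case`
snake <- clean_names(named)
camel <- clean_names(named, case = "lower_camel")

print(names(snake))
print(names(camel))
```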

Later on, we will scrape Shaquille O’Neal’s player stats to compare his average performance by season with his average performance in All-Star games. While “ast” and “ast2” follow tidy data principles, more descriptive variable names will prove helpful when merging the AllStarPlayers data with what we will call the ONeal dataset.

After clean_names(), the assist columns in the two datasets mean the following:

  • AllStarPlayers ast: the total number of assists across all games
  • AllStarPlayers ast2: the average number of assists per game
  • ONeal ast: the average number of assists per game

By creating more descriptive variable names in AllStarPlayers we can avoid accidentally comparing “ast” in both AllStarPlayers and ONeal data. In reality, we should be comparing what is currently labeled as “ast2” and “ast”.

#More Descriptive Names, for the 4 Variables with Duplicate Names 
names(AllStarPlayers)[names(AllStarPlayers) == "mp"] <- "mins_total"
names(AllStarPlayers)[names(AllStarPlayers) == "pts"] <- "pts_total"
names(AllStarPlayers)[names(AllStarPlayers) == "trb"] <- "trb_total"
names(AllStarPlayers)[names(AllStarPlayers) == "ast"] <- "ast_total"
#Renaming Additional Columns
names(AllStarPlayers)[names(AllStarPlayers) == "mp2"] <- "mins_per_game"
names(AllStarPlayers)[names(AllStarPlayers) == "pts2"] <- "pts_per_game"
names(AllStarPlayers)[names(AllStarPlayers) == "trb2"] <- "trb_per_game"
names(AllStarPlayers)[names(AllStarPlayers) == "ast2"] <- "ast_per_game"
names(AllStarPlayers)
Image Created By Author in RStudio
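The same kind of renaming can be written more compactly with dplyr's rename(), which takes new_name = old_name pairs. This is only a sketch on a hypothetical two-column data frame, not the scraped table.

```r
library(dplyr)

# Hypothetical stand-in with one total column and one per-game column
stats <- data.frame(ast = c(10L, 4L), ast2 = c(2.5, 1.0))

# rename() takes new_name = old_name pairs
stats <- rename(stats,
  ast_total    = ast,
  ast_per_game = ast2
)

print(names(stats))
```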

Data Cleaning: Remove Subheadings

The AllStarPlayers dataset has multiple rows that repeat the variable heading names. This is a useful feature when scrolling through large datasets online, as there is always a row displayed on the screen to reference.

In an R data frame, these subheadings leave rows of data as character values and prevent us from using numeric and integer class types. Eliminating these rows will allow us to convert data types and perform calculations.

#Janitor Function, That Shows Rows with Duplicate Values
get_dupes(AllStarPlayers)

Image Created By Author in RStudio

#Removes Rows with subheadings
AllStarPlayers <- AllStarPlayers[!(AllStarPlayers$player == "Player"),]
#Removes Rows without values for player variable
AllStarPlayers <- AllStarPlayers[!(AllStarPlayers$player == ""),]
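The effect of this filtering is easier to see on a tiny example. The sketch below uses a hypothetical four-row data frame with one repeated header row and one blank row, and base R only.

```r
# Hypothetical scraped rows: one repeated header row and one blank spacer row
scraped <- data.frame(
  player = c("Player A", "Player", "Player B", ""),
  g      = c("3", "G", "5", ""),
  stringsAsFactors = FALSE
)

# Keep only rows that hold real player data
is_subheading <- scraped$player == "Player"
is_blank      <- scraped$player == ""
players <- scraped[!(is_subheading | is_blank), ]

# With the header text gone, the column converts without producing NAs
players$g <- as.integer(players$g)

print(players)
```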

Create New Variables

Data scraped with the rvest package are imported as character values. Before using mathematical operators to create new variables, the data types of the current variables need to be converted. Here we convert the specified columns to integer and numeric data types.

#Use hablar Package to Convert Data Types
names(AllStarPlayers)
#Counts and totals become integers; percentages and per-game averages stay numeric
AllStarPlayers <- AllStarPlayers %>% convert(
  int("g", "gs", "mins_total", "fg", "fga", "x3p", "x3pa", "x2p",
      "x2pa", "ft", "fta", "orb", "drb", "trb_total", "ast_total",
      "stl", "blk", "tov", "pf"),
  num("fg_percent", "x3p_percent", "x2p_percent", "ft_percent",
      "mins_per_game", "pts_per_game", "trb_per_game", "ast_per_game"))
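As a minimal sketch of what convert() does, the example below applies it to a hypothetical two-column data frame of character values and then checks the resulting classes. The raw_stats object and its values are invented.

```r
library(hablar)

# Hypothetical character columns, as they arrive after scraping
raw_stats <- data.frame(
  g          = c("3", "5"),
  fg_percent = c(".512", ".447"),
  stringsAsFactors = FALSE
)

# int() and num() name the columns to convert to each type
typed <- convert(raw_stats, int(g), num(fg_percent))

# Inspect the class of each column after conversion
col_classes <- sapply(typed, class)
print(col_classes)
```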

Most of the stats in AllStarPlayers are calculated as totals across all All-Star games. Other data sets, especially those for individual players, calculate the stats as per-game averages. Having variables on the same scale allows us to merge and compare data.

Converting the totals to per-game averages can be done using traditional math operators. Below is an example of calculating the average Field Goals, Field Goal Attempts, and 3-Pointers per game.

#Create New Variables Per Game
AllStarPlayers$fg_per_game <- AllStarPlayers$fg / AllStarPlayers$g
AllStarPlayers$fga_per_game <- AllStarPlayers$fga / AllStarPlayers$g
AllStarPlayers$x3p_per_game <- AllStarPlayers$x3p / AllStarPlayers$g
#Rename Existing Variables
names(AllStarPlayers)[names(AllStarPlayers) == "fg"] <- "fg_total"
names(AllStarPlayers)[names(AllStarPlayers) == "fga"] <- "fga_total"
names(AllStarPlayers)[names(AllStarPlayers) == "x3p"] <- "x3p_total"
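When many totals need the same treatment, the division can be applied to several columns at once. The sketch below uses dplyr's across() on a hypothetical data frame with invented values; it is an alternative to the line-by-line approach, not the method used in this article.

```r
library(dplyr)

# Hypothetical totals for two players (values are invented)
totals <- data.frame(g = c(4L, 10L), fg = c(20L, 45L), fga = c(40L, 100L))

# Divide each listed total by games played and add "<name>_per_game" columns
per_game <- mutate(
  totals,
  across(c(fg, fga), ~ .x / g, .names = "{.col}_per_game")
)

print(per_game)
```

For the first player, 20 field goals over 4 games works out to 5 per game, which is the same arithmetic as fg / g above.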

Create & Clean O’Neal Data Frame

Similar steps can be followed to scrape and clean the O’Neal player stats table. Since it is a different dataset a couple of slightly different cleaning steps might be necessary. For example, the bottom rows will need to be removed, and a new player column created.

#Remove Rows 22 to 29
ONeal <- ONeal[-c(22:29),]
#Add Player column to ONeal
ONeal$player <- "Shaquille O'Neal"
#Add Season Column to AllStarPlayers
AllStarPlayers$season <- "AllStar"

Prepare Data to Merge

Once both data sets, AllStarPlayers and ONeal, are scraped and cleaned we can merge them together for further analysis. To combine files, they will need to have the same variables and those variables will need to have the same data types.

Instead of individually comparing the files we can use the janitor function compare_df_cols_same().

#Compare AllStarPlayers to ONeal
compare_df_cols_same(AllStarPlayers, ONeal, bind_method = "rbind")
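To illustrate what the comparison reports, here is a sketch on two hypothetical data frames in which one column has a different type and another column exists in only one frame. The companion function compare_df_cols() returns the per-column detail.

```r
library(janitor)

# Hypothetical frames: `g` differs in type and `season` is missing from one
all_star <- data.frame(player = "Player A", g = 3L,
                       stringsAsFactors = FALSE)
seasons  <- data.frame(player = "Player A", g = "3", season = "2000-01",
                       stringsAsFactors = FALSE)

# Per-column class report, restricted to columns flagged as mismatched
report <- compare_df_cols(all_star, seasons, return = "mismatch")
print(report)

# Single TRUE/FALSE answer on whether the frames are compatible for rbind()
can_rbind <- compare_df_cols_same(all_star, seasons,
                                  bind_method = "rbind", verbose = FALSE)
print(can_rbind)
```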

Based on the output we can then clean up the files by removing variables not found in both data sets and updating data types as needed.

#Remove Variables from AllStarPlayers not in ONeal Data
AllStarPlayers <- AllStarPlayers %>%
select(-ast_total, -blk_total, -drb_total, -fg_total, -fga_total,
-ft_total, -fta_total, -orb_total, -pf_total, -stl_total,
-tov_total, -x2p_total, -x2pa_total, -x3pa_total,
-x3p_total, -mins_total, -pts_total, -trb_total)

With compatible files, we use rbind() to merge the two sets together and filter() to select the data for Shaquille O’Neal.

#Combine AllStarPlayers and ONeal Datasets
AllStarONeal <- rbind(AllStarPlayers, ONeal)
#Subset to ONeal Games
AllStarONeal <- filter(AllStarONeal, player == "Shaquille O'Neal")
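The combine-then-filter step can be tried on small stand-in data first. In the sketch below both data frames, the player names, and all values are hypothetical; the only requirement carried over from the article is that the two frames share the same columns and types.

```r
library(dplyr)

# Hypothetical frames with identical columns and types (values are invented)
all_star <- data.frame(player = c("Player A", "Player B"),
                       season = "AllStar",
                       pts_per_game = c(16.0, 9.0),
                       stringsAsFactors = FALSE)
by_season <- data.frame(player = "Player A",
                        season = c("1999-00", "2000-01"),
                        pts_per_game = c(20.0, 22.5),
                        stringsAsFactors = FALSE)

# Stack the rows, then keep a single player
combined <- rbind(all_star, by_season)
one_player <- filter(combined, player == "Player A")

print(one_player)
```

The result holds one All-Star row and two season rows for the same player, which is the shape needed to compare All-Star and regular-season averages.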

Conclusion

Images Created by Author in RStudio

Equipped with a cleaned and merged dataset, we are now prepared to tackle numerous projects in R. From data visualizations using ggplot, like the graphs above, to building machine learning models, all projects start with good data. Below is a summary of the steps we used today:

  1. Use rvest to scrape an online dataset. In this case, the AllStarPlayers and ONeal datasets.
  2. Use janitor & hablar packages to clean data. Specifically the row_to_names(), clean_names() and get_dupes() functions.
  3. Combine two or more datasets using rbind() after using the compare_df_cols_same() function.

I encourage you to use the steps discussed above to combine multiple datasets in innovative ways. While the All-Star player stats are a great place to start, insights become more powerful when combined with metrics from multiple sources.

I look forward to seeing where your analysis takes you.

Amanda Piter
Analytics Vidhya

MBA Candidate at the University of Florida with a focus in Data Analytics & Strategy.