Photo by Asya Vlasova from Pexels

How to Scrape (NHL.com) Dynamic Data in R Using rvest and RSelenium

NHL.com is a great reference and database, except for one thing: you can only download records 100 rows at a time. Unacceptable. Let’s look at how to speed things up in R. Cue The Social Network intro…

Christian Lee
Published in Hockey Stats
5 min read · Dec 5, 2020


R packages we will use:

  1. RSelenium
  2. rvest
  3. dplyr (recommended)

We will be using RSelenium to simulate a human clicking the “next” button. This also requires the Java Development Kit (JDK) to work. Note that if the table you are interested in is static, then the task is simpler and all you need is rvest. That will be a post for another time.
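To give a rough sense of that simpler case, here is a minimal sketch using rvest alone; the URL below is just a placeholder, not a real NHL endpoint.

library(rvest)
## minimal sketch for a static table -- no Selenium needed
## ("https://example.com/some-static-table" is a placeholder URL)
static_tables = read_html("https://example.com/some-static-table") %>%
  html_table() # parses every <table> element on the page into a data frame
static_df = static_tables[[1]] # keep the first (or only) table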

What we are interested in

Let’s say we want to analyze Corsi and its relationship with team data across an entire season. In order to do this, we need to scrape a dataset of shot attempts meeting the following requirements:

  • 2018–2019 (full 82 game regular season)
  • Game stats for all teams
  • Only keep home team stats (eliminates duplicate games for downstream analyses)

It should look something like this:

What we want to extract

Here we see the top 10 entries after sorting by SAT in descending order. The first 10 rows of our scraped data frame should match these exactly. There are 1271 total records, which should also be the number of rows in our scraped data frame.

Top 10 entries from nhl.com
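Each of the requirements above maps onto a query parameter in the NHL stats URL we will navigate to below: homeRoad=H keeps home-team rows, gameType=2 restricts to the regular season, and dateFrom/dateTo bound the 2018–2019 schedule. As a rough sketch of how the address breaks down (the parameter values are taken from the full URL used in the next section):

## sketch: the query parameters behind the stats URL used below
params = c(
  report     = "summaryshooting", # shot-attempt (SAT) report
  reportType = "game",            # one row per team per game
  dateFrom   = "2018-10-02",      # start of the 2018-2019 regular season window
  dateTo     = "2019-04-06",      # end of the 2018-2019 regular season window
  gameType   = "2",               # 2 = regular season
  homeRoad   = "H",               # home-team rows only
  sort       = "satTotal",        # sort by total shot attempts
  page       = "0",               # first page (0-indexed)
  pageSize   = "50"               # 50 rows per page
)
url = paste0("http://www.nhl.com/stats/teams?aggregate=0&filter=gamesPlayed,gte,1&",
             paste(names(params), params, sep = "=", collapse = "&"))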

Let the fun begin

First, load the R packages.

library(RSelenium)
library(rvest)
library(dplyr)

Next, we will define the URL and start a Selenium server and browser that will navigate to the address we specified. This is the most finicky part, and any issues are typically related to the Chrome version. I recommend trying the chromever value closest to your installed Chrome version, which updates frequently and automatically. In some cases, I have used Firefox in place of Chrome and that worked just fine.

url = "http://www.nhl.com/stats/teams?aggregate=0&report=summaryshooting&reportType=game&dateFrom=2018-10-02&dateTo=2019-04-06&gameType=2&homeRoad=H&filter=gamesPlayed,gte,1&sort=satTotal&page=0&pageSize=50"
rD = rsDriver(port=4444L, browser="chrome", chromever="87.0.4280.88") #specify chrome version
remDr = rD[["client"]]
remDr$navigate(url) #this will open a chrome window
src = remDr$getPageSource()[[1]] #select everything for now
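If rsDriver keeps failing on a chromever mismatch, the Firefox fallback mentioned above looks roughly like this (a sketch, assuming geckodriver is available; port 4445 is just an arbitrary choice to avoid clashing with a server already running on 4444):

## alternative sketch: drive Firefox instead of Chrome
## (assumes geckodriver is installed; 4445L avoids colliding with the server above)
rD = rsDriver(port=4445L, browser="firefox")
remDr = rD[["client"]]
remDr$navigate(url)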

Initial scrape

Here we extract the first set of data from page 1 (page 0 in the URL). We are using dplyr so we don’t need to define unnecessary intermediate variables. The code looks messy but is actually quite simple. The only tricky part is using the correct xpath in line 2. Fortunately, SelectorGadget is your friend with a catchy name. Designating this path in xml_nodes will allow us to select the React table we want to extract from.

From there, all we do is scrape the text and create a matrix with the number of columns set to whatever is in the table of interest. In our case, 19 columns.

df = read_html(src) %>%
  xml_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "rt-td", " " ))]') %>%
  xml_text() %>%
  matrix(., ncol=19, byrow = T) %>%
  data.frame(., stringsAsFactors = F)
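If you are unsure the table really has 19 columns, one optional sanity check is to count the column-header cells before reshaping. A quick sketch, reusing the same column-header selector that appears commented out in the cleanup code later in this post:

## optional check: count the React table's column headers before hard-coding ncol=19
n_cols = read_html(src) %>%
  xml_nodes(css="div[role='columnheader']") %>%
  length()
n_cols # should print 19 for this report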

If we take a look at the head of df, we see that it matches the first 6 rows from the table above. Great, almost there.

head(df)

Loops of scrapes

Our table of interest has 26 pages if we view 50 rows at a time. Now, we will just repeat the step from above for the other 25 pages (we start the loop at 2 because page 1 was scraped in the previous step). This way we can simply append to the existing data frame rather than define an empty one, an approach informed by a Stack Overflow post.

In this loop, we will find the “next” arrow/button on the webpage and “click” it using RSelenium’s clickElement() function. You should see the data update in the open Chrome window.

for (i in 2:26) {
  pages = remDr$findElement(using = "css selector", ".-next") #we are selecting the next button
  pages$clickElement()

  ## wait 3 seconds to load (can reduce this)
  Sys.sleep(3)

  src = remDr$getPageSource()[[1]]
  temp = read_html(src) %>%
    xml_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "rt-td", " " ))]') %>%
    xml_text() %>%
    matrix(., ncol=19, byrow = T) %>%
    data.frame(., stringsAsFactors = F)

  ## bind new data
  df = df %>% bind_rows(temp)
}
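As an optional refactor, the parsing pipeline could be wrapped in a small helper so the initial scrape and the loop share a single definition. This is only a sketch, and the function name scrape_current_page is made up for illustration:

## optional refactor (sketch): share one parsing pipeline between the first
## scrape and the loop; scrape_current_page is a made-up helper name
scrape_current_page = function(driver, n_cols = 19) {
  driver$getPageSource()[[1]] %>%
    read_html() %>%
    xml_nodes(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "rt-td", " " ))]') %>%
    xml_text() %>%
    matrix(ncol = n_cols, byrow = TRUE) %>%
    data.frame(stringsAsFactors = FALSE)
}

df = scrape_current_page(remDr)
for (i in 2:26) {
  remDr$findElement(using = "css selector", ".-next")$clickElement()
  Sys.sleep(3) # give the React table time to re-render
  df = bind_rows(df, scrape_current_page(remDr))
}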

The hard part is over. Now we just need to double-check that we did everything correctly, clean up and save. There will be some blank rows at the end of the data frame if the total number of records is not an exact multiple of the page size (50). One quick indication that we were successful is to check that the number of games for each team is 41 (home games).

## remove empty rows and keep the first 12 columns
df_cleaned = df[nchar(df$X3) > 1, 1:12] #the blank rows have a space
dim(df_cleaned)
[1] 1271 12
## double check it worked by counting the number of HOME games played by each team (41)
unique(table(df_cleaned$X2)) == 41
[1] TRUE
## add column names
colnames(df_cleaned) = c("index", "team", "game", "gp",
                         "shots", "sat_for", "sat_against", "sat",
                         "sat_tied", "sat_ahead", "sat_behind", "sat_close")
## or, extract the headers directly from the table
#header = read_html(src) %>%
#  xml_nodes(css="div[role='columnheader']") %>%
#  xml_text()
## save file
save(df_cleaned, file = "~/Documents/hockey-stats/data/1204_nhl_home_sat_stats_2018-2019.rsav")
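One last note before the analysis: xml_text() returns everything as character, so you will likely want to convert the stat columns to numeric before computing anything with them. A quick sketch using the column names assigned above (stat_cols is just an illustrative variable name):

## convert the stat columns from character to numeric (xml_text() gives strings)
stat_cols = c("gp", "shots", "sat_for", "sat_against", "sat",
              "sat_tied", "sat_ahead", "sat_behind", "sat_close")
df_cleaned[stat_cols] = lapply(df_cleaned[stat_cols], as.numeric)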

Lastly, shut the server down.


remDr$close()
rD$server$stop()
gc()

Code/Github

Nicer-looking code is available here.

Summary

We successfully and quickly extracted a full set of React tables from nhl.com without having to manually download and combine several tables of stats. Stay tuned for the upcoming analysis of the data!


Christian Lee
Hockey Stats

Medical student. Computational biologist. Sport stats enthusiast.