
Two Approaches to Scrape Data From CapFriendly Using RSelenium and rvest

There are many ways to scrape data, but some are simpler and more robust than others. Let’s look at two ways to extract data from a website that uses pagination, the splitting of data across pages.

Christian Lee · Published in Hockey Stats
4 min read · Dec 26, 2020


I previously wrote an article on scraping data from nhl.com/stats. That involved emulating clicks on the “next” button of a webpage to reach new data records. CapFriendly has a different and more complicated configuration, so here I will demonstrate two ways to modify our previous script to extract salary information.

For this tutorial, we are interested in NHL player cap hits from the 2019–2020 season. CapFriendly contains this information, but the site shows a maximum of 50 rows per page, spanning 31 pages. Instead of going page by page, manually downloading the tables, and piecing them together, we will automate the process. Below is a screenshot of the first 10 rows.
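As a sanity check on that page count: with at most 50 rows per page, the number of pages is the ceiling of the player count divided by 50 (the 1,509-row total comes from the final data frame we build below):

```r
## At most 50 rows per page; 1509 active players for the 2019-2020 season
rows_per_page = 50
n_players = 1509
ceiling(n_players / rows_per_page)
## [1] 31
```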

Extracting the first set of records

This first part is the same for both methods. We begin by loading the required packages.

## load the required packages
library(RSelenium)
library(rvest)
library(dplyr)

Now we specify the URL, start the Selenium server, and instruct it to navigate to the page. In our case, this means a new Chrome window will open. You may need to specify a different Chrome version to match the one installed on your machine.

url = "https://www.capfriendly.com/browse/active/2020?hide=age,handed,expiry-status"
rD = rsDriver(port = 4444L, browser = "chrome", chromever = "87.0.4280.88")
remDr = rD[['client']]
remDr$navigate(url)
src = remDr$getPageSource()[[1]]

We begin by scraping the page 1 entries, which also builds the data frame (df) that we can bind additional data to.

df = read_html(src) %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  data.frame(., stringsAsFactors = F) %>%
  mutate_all(as.character)
dim(df)
[1] 50 20

As expected, our data frame is 50 rows by 20 columns. We converted all of the columns to type character to eliminate downstream binding errors (this isn’t always a necessary step).
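To see why the character conversion matters, here is a toy sketch (made-up values, not real CapFriendly rows): bind_rows() refuses to combine a column that arrives as character from one page and numeric from another, and converting everything to character up front sidesteps that.

```r
library(dplyr)

## Two toy "pages" where the same column comes back as different types
page1 = data.frame(PLAYER = "Player A", CAP.HIT = "$925,000",
                   stringsAsFactors = FALSE)  # parsed as character
page2 = data.frame(PLAYER = "Player B", CAP.HIT = 925000,
                   stringsAsFactors = FALSE)  # parsed as numeric

## bind_rows(page1, page2) would error: <character> vs <double>.
## Converting everything to character first makes binding safe:
combined = bind_rows(mutate_all(page1, as.character),
                     mutate_all(page2, as.character))
dim(combined)
## [1] 2 2
```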

Method 1: modified rvest CSS selector

This is a little tricky (trickier than method 2) because there is no dedicated “next” button, as there typically is on other websites. Instead, we have to click the number of the desired page. This is further complicated because the number of buttons that appear is not fixed.

For example, page 1 shows:

and page 5 shows:

This impacts the child pagination selector we will use to advance the page. Taking a closer look, when we are on page i for i = 1–3, the pagination strip shows i + 3 buttons, while pages 4–31 always show 7. To advance from page to page, we will set the CSS selector to the (i + 1)th child for pages 1–3 and to the 5th child for pages 4–31.
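That page-to-button mapping can be captured in a small helper that mirrors the if/else in the loop below (the function name is mine, not part of the site or the original script):

```r
## nth-child index of the button that advances from page i to page i + 1
next_child = function(i) {
  if (i <= 3) i + 1 else 5
}

sapply(1:6, next_child)
## [1] 2 3 4 5 5 5
## pages 1-3 click the 2nd, 3rd, 4th child; pages 4+ always click the 5th
```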

CSS selector

The css_select string in the code below was determined by inspecting (right click → Inspect) one of the page buttons and copying its selector. This is shown in the image below:

for (i in 1:30) {
  cat(i, " ")
  if (i <= 3) {
    num = i + 1
    css_select = paste0('#pagin > div > div:nth-child(1) > a:nth-child(', num, ')')
  } else {
    css_select = "#pagin > div > div:nth-child(1) > a:nth-child(5)"
  }

  ## click the page button
  pages = remDr$findElement(using = "css selector", css_select)
  pages$clickElement()
  Sys.sleep(3) # allow the page to load

  ## extract the table
  src = remDr$getPageSource()[[1]]
  temp = read_html(src) %>%
    html_nodes("table") %>%
    html_table(fill = TRUE) %>%
    data.frame(., stringsAsFactors = F) %>%
    mutate_all(as.character)

  ## bind the new rows
  df = df %>% bind_rows(temp)
  df = data.frame(df, stringsAsFactors = F)
}
dim(df)
[1] 1509 20

Method 2: URL for loop

For this approach, we simply change the page index in the URL and extract the HTML table. This is much simpler than the first method and is less likely to break.

for (i in 2:31) {
  cat(i, " ")
  Sys.sleep(2) # allow the page to load

  ## set the URL with the page index
  url = paste0("https://www.capfriendly.com/browse/active/2020?hide=age,handed,expiry-status&pg=", i)
  remDr$navigate(url)
  src = remDr$getPageSource()[[1]]

  ## extract the table
  temp = read_html(src) %>%
    html_nodes("table") %>%
    html_table(fill = TRUE) %>%
    data.frame(., stringsAsFactors = F) %>%
    mutate_all(as.character)

  ## bind the new rows
  df = df %>% bind_rows(temp)
}
dim(df)
[1] 1509 20

It is worth noting that the two extracted data frames contain the same information; however, some records are ordered differently within ties on secondary columns. This is nothing to be concerned about, since the data itself is identical and can easily be reordered. Also, there are several instances where the same player has multiple entries, often as a result of different cap hit values.
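One quick way to confirm the two results match up to row order is to sort both frames on the same key and compare. A sketch on toy data (column names and values are made up; for the real frames, sort on the scraped player column):

```r
library(dplyr)

## Toy stand-ins for the two scraped frames: same rows, different order
m1 = data.frame(PLAYER = c("Player A", "Player B"),
                CAP.HIT = c("$1", "$2"), stringsAsFactors = FALSE)
m2 = data.frame(PLAYER = c("Player B", "Player A"),
                CAP.HIT = c("$2", "$1"), stringsAsFactors = FALSE)

## After sorting on the same key, the contents agree
s1 = arrange(m1, PLAYER)
s2 = arrange(m2, PLAYER)
all(s1$PLAYER == s2$PLAYER) && all(s1$CAP.HIT == s2$CAP.HIT)
## [1] TRUE
```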

Conclusion

Here, we covered two methods to extract data in R from a website with pagination. Without a dedicated “next” button, looping through URLs seems like the way to go.

Code

All code available here.


Christian Lee · Hockey Stats
Medical student. Computational biologist. Sport stats enthusiast.