Scraping web data with R & the Tidyverse

Published in

EPFL Extension School

16 min readApr 16, 2019

Being an employee of the EPFL Extension School has its perks.

For the past month, I’ve been following one of our courses, “50 Things You Need to Know About Data”, an introductory course designed to teach the basics of the tools and technologies needed to work with data in the digital age.

Specifically, the course teaches how to store, structure, clean, visualize, and analyze data using the R programming language — and it provides a broad survey of topics like machine learning, data privacy, and other topics as well. Having finally made my way through the technical portion of the course, I thought that writing a data article about my favorite show, Twin Peaks, would be the perfect opportunity to put some of what I learned in the course to the test.

Before we begin, I want to properly represent my technical background. As a course developer for the Extension School, I create courses about programming for web applications. My area of expertise is the JavaScript programming language, and it’s been several years since I worked as a business analyst using MS Excel spreadsheet software. A month ago I wrote my first line of R code, and this course is the only instruction that I pursued to learn R. But I do happen to share an office with the creator of the course, Xavier, whose time and enthusiasm I owe to much of my progress.

But despite my more advanced technical background, this course is for beginners. If you’ve worked with basic spreadsheets before, you’ll find the material very approachable. And sure; it helped to have Xavier sitting at a desk across the room from me, but if you take the course you’ll get to chat with him for a half-hour each week about any course topic that you’d like to discuss.

This article won’t showcase all of what I learned in the course. The course teaches how to work with data from many different sources, including databases, .csv files, MS Excel spreadsheets, Rest APIs, and more. The course also takes-on a more corporate focus in its choice of data. For example; AirBnB rentals for the entire state of Texas is one of the data sources that you get to play-around with if you take the course.

But I’m a web guy, so this article focuses on a particularly crafty data source: web scraping.

You remember Twin Peaks, don’t you? The show about owls, cherry pie, deep black coffee, and the murder-mystery of Laura Palmer?

This show is a favorite of mine, and I think that if you ever make the time to watch it, you’ll enjoy it too.

But don’t take it from me. Take it from IMDB, the Internet Movie Database, where the lowest rating for any episode of the original series was 7.5 out-of 10.

If you spend-enough time around me, you’re bound to get bitten by the Twin Peaks bug too, and this year I finally managed to convince my colleague to watch the show too. As the creator of the course, he was aware of my progress, and challenged me to put my skills to the test by creating a couple of charts about the show.

Here’s how the conversation went:

Twin Peaks, for those of you who don’t know, is an all-time classic that originally aired from 1989 until 1991. Most critics see Twin Peaks as a show that was ahead of it’s time; pioneering a genre that would later bring viewers Breaking Bad, Stranger Things, another other modern favorites.

In the tweet, Katie Segretti, author of DataChips.com, shares two charts that she made about one of her favorite shows; Broad City.

A close-up of the first chart shows the popularity of each episode plotted for all five seasons of Broad City. A similar chart about the popularity of each episode of Twin Peaks would require the title of each episode, and its IMDB rating too. So my challenge was simple: find the same IMDB data about Twin Peaks, and use my knowledge of R create the two visualizations.

Since IMDB doesn’t have a public API or database that we can query to get the information that we need, we’ll have to scrape it using R. Scraping is the process of extracting targeted data from HTML pages, so that we can use that data in our work. This is one of several ways to source data that is taught in the course, and we’ll use the scraped data to plot visualizations like the one above.

We’ll begin with the Twin Peaks Wiki. Notice the blue and yellow boxes in the image below, which indicate some of the elements to be scraped. The Wiki doesn’t include the IMDB rating, but it does include other information which will be necessary to create the charts.

Using the Web Inspector tool in Firefox, we can view each of these elements as HTML. This helps us to understand the order of these elements inside the HTML document — information which will be used to scrape more precisely and efficiently. Below you can see that the node of the HTML document that contains the name of the director of episode 9, David Lynch, is contained within a list-item with the class "pi-data-value".

Using R to scrape the director for episode 9 is quick work. In fact; it only takes a few lines of code. Remember that the director’s name is contained within an "li" tag that has the class "pi-data-value". The script below uses this selector to determine which node of text to scrape from the page. In this case, the director’s name is contained within an "li" with the class "pi-data-value", which itself is nested inside of a "div" with a "data-source" attribute of "Director".

library(tidyverse)

library(rvest)

read_html("https://twinpeaks.fandom.com/wiki/Episode_9") %>%
  html_node("[data-source='Director'] .pi-data-value") %>%
  html_text()

Notice that here, we’re not passing an argument to the read_html function. That’s because any function call that follows the %>% operator receives the returned value of the previous function as it’s first parameter.

The title of that episode can be scraped the same way, using a slightly different selector that targets the title instead of the director.

read_html("https://twinpeaks.fandom.com/wiki/Episode_9") %>%
  html_node("[data-source='Series name'] + section .pi-data-value") %>%
  html_text()

This episode was originally titled “Episode 9”, which is why the text above has the word “or” in it. We don’t need to include that word in our visualization though, so let’s take it out using one additional function.

read_html("https://twinpeaks.fandom.com/wiki/Episode_9") %>%
  html_node("[data-source='Series name'] + section .pi-data-value") %>%
  html_text() %>%
  str_remove("or: ")

Easy, isn’t it? Damn easy, some might say.

Let’s take this opportunity to re-factor the code to use a function. This function will receive an episode URL, and it will return the title for that episode. We’ll test it out with episode number 16.

get_title <- function(episode_url) {
  read_html(episode_url) %>%
  html_node("[data-source='Series name'] + section .pi-data-value") %>%
  html_text() %>%
  str_remove("or: ")
}get_title("https://twinpeaks.fandom.com/wiki/Episode_16")

Functions help us to organize our code into re-usable modules. Let’s create a similar function that we can use to generate an episode’s URL from it’s sequence number. For example; if we wanted the URL for episode 22, we could pass it to this function as an argument, and the function would return to us the full URL for episode 22.

get_episode_url <- function(episode_number) {
  str_c("https://twinpeaks.fandom.com/wiki/Episode_",
  episode_number,
  .sep = "")
}get_episode_url(22)

When we call the function and pass a number to it as an argument, it returns the full URL to us, as expected.

We use this function to map-through all 29 episodes that aired between 1989 and 1991 using the map function. In case you’ve been using R for a while, map is similar to lapply. This is what it looks like to map-through a vector of the numbers 1 to 29. We provide both the vector of numbers, and the get_episode_url function to map as arguments, and it returns to us a list of all 29 URLs.

episode_urls <- map(1:29, get_episode_url)episode_urls %>%
  unlist()

We can now even map-through the episode URLs, passing each URL to the get_title function. map_chr works just like a regular map, except that it will always return a character.

episode_titles <- episode_urls %>%
  map_chr(get_title)episode_titles

Something isn’t quite right about our list of episode titles. First of all, any true fan will notice instantly that there are actually 30 episodes in the original series, not 29! It turns out that the Twin Peaks Wiki uses a different URL pattern for the pilot episode. The script below scrapes the title from the pilot episode’s page. Notice that the URL ends in "/Pilot" instead of "/Episode_0", or "/Episode_1" as might be expected, given the pattern that is used for the subsequent 29 episodes.

get_title("https://twinpeaks.fandom.com/wiki/Pilot")

This isn’t a problem though. We can simply prepend the episode titles with the pilot’s title like this:

episode_titles <- episode_titles %>%
  prepend(get_title("https://twinpeaks.fandom.com/wiki/Pilot"))episode_titles

Notice that the first item is now the title of the Pilot episode, and the list contains 30 episodes in total

We now have the titles for all 30 episodes, including the pilot. It’s time to get the IMDB ratings for each episode. These are found on the IMDB pages about Twin Peaks, which you can see a preview of below. Notice also the green box in the top-left corner. IMDB splits the episode information into two pages, so once again we’ll have to account for multiple URLs in our script.

The URLs for IMDB aren’t as pretty as the Twin Peaks Wiki ones. Below we create another function, get_season_url that returns the IMDB url for each season when mapped to the numbers 1 and 2. Pay attention to the part of the URL that reads season=1 or season=2.

get_season_url <- function(season_number) {
  str_c("https://www.imdb.com/title/tt0098936/episodes?season=",
  season_number,
  "&ref_=ttep_ep_sn_nx",
  .sep = "")
}season_urls <- map(1:2, get_season_url)season_urls

The ratings are contained inside of a span tag with the class "ipl-rating-star__rating". We can get the IMDB ratings using the same code that we used to get the titles for all episodes from the Twin Peaks Wiki. This will return to us a list of the episode ratings for each season. The "as.numeric" function at the end of the script simply ensures that we’re returning the ratings in a number format, rather than as text.

get_episode_ratings <- function(season_url) {
  read_html(season_url) %>%
  html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>%
  html_text() %>%
  as.numeric()
}ratings <- map(season_urls, get_episode_ratings)ratings

This is already far more helpful. But let’s go one step further, and convert the ratings list of two items ("Season 1" and "Season 2") into a “tibble”, the data-frame of choice in the Tidyverse.

Using the enframe function, ratings gets converted to a tibble with two columns: season and ratings.

ratings <- ratings %>%
  enframe(name = "season", value = "ratings")ratings

The data-frame above contains a column for season, and one for ratings. There are only two rows, but each rating is a vector of multiple ratings; the first containing 8, and the second 22. Together these represent all 30 episodes, and we can view the whole set easily via the unnest function, which will expand or “un-nest” the vectors for us.

ratings <- ratings %>%
  unnest()ratings

Finally, we can combine these two sets of values — the titles and the ratings — into their own tibble, which we can then use to make our first visualization.

twin_peaks_episodes <- tibble(title = episode_titles,
                              season = ratings$season,
                              rating = ratings$ratings)twin_peaks_episodes

But there’s a problem with the table.

It comes back to the funny way that episodes of Twin Peaks didn’t originally have titles. When the penultimate episode aired on June 10th, 1991, it was titled “Episode 29”. Later, the creators of the show, David Lynch and Mark Frost called it “The Night of Decision”, but the fans began referring to the episode as “Miss Twin Peaks” — the name of the beauty pageant that the episode surrounds. Our scraping script picked-up both of these titles, and even reference to a footnote about the story of two names.

We can view episode 29 in isolation using the slice function, which will return a tibble containing just the row that we asked for — 29, in this case.

twin_peaks_episodes %>%
  slice(29)

Fixing this programmatically shouldn’t take much work. To keep things simple, we’ll just replace the double-title with the title “Miss Twin Peaks”.

twin_peaks_episodes <- twin_peaks_episodes %>%
  mutate(
    title = if_else(
      str_detect(title, "The Night of Decision"),
      "Miss Twin Peaks",
      title
    )
  )twin_peaks_episodes %>%
  slice(29)

The twin_peaks_episodes tibble is now ready to go! The code below plots our ratings for each episode and generates a clean-looking chart.

library(hrbrthemes)plot_episode_popularity <- function(episode_table) {
  ggplot(episode_table,
         aes(x = forcats::fct_reorder(title, episode),
             y = rating)) +
  geom_point(aes(color = season),
             shape = 15,
             size = 4) +
  geom_text(aes(label = title),
            angle = 270,
            size = 3,
            hjust = 0,
            nudge_y = - 0.2) +
  labs(x = "Episodes",
       y = "IMDB Rating",
       title = "Twin Peaks episodes ranked by IMDB") +
  theme_ipsum_rc(grid="Y") +
  theme(axis.text.x=element_blank(),
        panel.grid.major = element_line(linetype = 4),
        legend.title = element_blank()) +
  ylim(0, NA)
}twin_peaks_episodes %>%
  mutate(episode = row_number()) %>%
  plot_episode_popularity()

Now that we’ve successfully reproduced the first chart from Katie’s tweet, let’s have a look at the second chart, which visualizes the centrality of different characters to the show’s plot, based on the number of times they are mentioned in the episode summaries on IMDB.

In the case of Twin Peaks, we actually have the credits for each episode easily available to us on the Twin Peaks Wiki. We’ll simply need to scrape the cast data, and then filter-out any irrelevant information, like credits for stunt-doubles, etc.

You can see where we’ll be scraping this data from in the GIF below. Each cast member exists as a hyperlink inside-of an "li" element. This information will be used to select the names of the cast members when we scrape the data.

Below you’ll see a similar functions that we used to scrape the titles from the Twin Peaks Wiki, only modified to select the episode cast and characters information. We’ll also use the exact same episode_urls variable from above. The amount of nodes being scraped here quite large, so I’m outputting the cast for episode 21 only to show the raw scraped data.

get_cast <- function(episode_url) {
  read_html(episode_url) %>%
  html_nodes(".WikiaArticle > div > ul > li") %>%
  html_text()
}cast <- map(c("https://twinpeaks.fandom.com/wiki/Pilot", episode_urls), get_cast)cast[21]

The credits variable contains all of the scraped credits data from all 30 episode pages (including the pilot). But it contains a lot of irrelevant data too. Have a look at the credits data for episode 21 above, we can see that it includes credits for voice actors, non-appearing cast members ("credit only"), and even some notes about the episode as well.

We can remove some of the white-space and formatting from the credits using a few simple functions. Notice that the list below is just slightly cleaner than the one above. This is because we removed all of the "\n" patterns from the strings.

clean_cast <- function(episode_cast) {
  str_squish(episode_cast) %>%
  str_trim() %>%
  str_remove_all("\\[([^\\]])\\]") %>% # remove citations
  str_replace_all("'", '"') # convert single-quotes to double
}cast <- map(cast, clean_cast)cast[21]

Removing the voice-only credits, and other records that we aren’t interested in is also quite simple. The function below will keep the records that we want, and remove any of the ones that we don’t. Each episode’s cast list gets mapped-to the filter_credits function, and the output is a much tidyer list of actual cast members who appeared in the episode.

filter_credits <- function(episode_credits) {
  keep(episode_credits, str_detect(episode_credits, " as ")) %>%
    discard(str_length(.) > 50) %>% # remove any actual text that was scraped
    discard(str_detect(., "credit only")) %>% # remove non-appearing cast members
    discard(str_detect(., "performer")) %>% # remove unknown actors
    discard(str_detect(., "voice")) %>% # remove voice-only cast members
    discard(str_detect(., "deleted scene")) %>%
    discard(str_detect(., "archive footage")) %>%
    discard(str_detect(., "citation needed")) %>%
    discard(str_detect(., "Invitation To Love")) # remove soap opera credits
}cast <- map(cast, clean_cast) %>%
  map(filter_credits)cast[21]

Much nicer. Now we have a tidy list of the cast members for each episode. But still; what we really need are just the characters that appear in each episode. Thankfully, this is quick-work too. When mapped to a string, the str_extract function will remove whatever pattern we pass to it.

characters <- map(cast, str_extract, pattern = "(?<= as ).*")characters[21]

twin_peaks_episodes <- twin_peaks_episodes %>%
  mutate(characters)twin_peaks_episodes

How simple was that! The new column contains rows that look like <chr [33]>. This means that the row contains a list of 33 strings, or “characters” as they’re called in R. No pun intended.

In order to find the total number of appearances for each character throughout the show, we need to first “un-nest” the column of characters in each episode. The output below will show you what I mean by “un-nesting”. The unnest function is going to expand our tibble from just 30 rows to a whopping 955 rows! It does this by creating a row for every single character appearance. In a way, this changes the nature of our tibble from a tibble of Twin Peaks episodes, to a tibble of all character appearances throughout seasons 1 and 2 of Twin Peaks.

all_character_appearances <- twin_peaks_episodes %>%
  unnest()all_character_appearances

The tibble above contains all appearances for every character in the show between seasons 1 and 2. By applying the filter function to filter for a particular character, we can quickly find all of episodes that a character appears in.

all_character_appearances %>%
  filter(characters == "Betty Briggs")

It’s time now to summarize the character appearances. We want a list containing each character, as well as the total number of episodes that they appeared in. Using the group_by function, we can group-together all of the episodes that each character appeared-in. For Betty Briggs, this would mean that we’re grouping-together the 7 appearances that you can see in the tibble above.

group_by is different from filter though, because group_by will always return the entire data set that was passed to it. But it enables us to use another very useful function: summarize.

We’ll use summarize to count the number of episodes that each character appears in, and to return a tibble of all 189 characters and the number of appearances for each.

appearances_per_character <- all_character_appearances %>%
  group_by(characters) %>%
  summarise(total_appearances = n()) %>%
  arrange(desc(total_appearances))appearances_per_character

Let’s quickly try applying the same “Betty Briggs” filter to the appearances_per_character tibble. This time, it will return a single row, but the value of the total_appearances column will be 7; the number of rows that was returned the last time that we filtered for Betty Briggs.

It’s almost time to plot our chart.

But before we do that, let’s filter the list of characters down to a smaller, more relevant list. It isn’t very interesting to know about minor characters who only appeared in one or two episodes. In fact, let’s filter-out the bottom third of all characters, keeping only those who appear in more than 10 episodes. Using the filter function, we can refine the number of characters to plot from 189 to just 29. This should make for a far more readable chart.

frequently_appearing_characters <- appearances_per_character %>%
  filter(total_appearances > 10)frequently_appearing_characters

Finally; using many of the same functions that we used to produce the IMDB rankings chart, we can plot a horizontal column chart of the characters according to how many episodes they appear in.

frequently_appearing_characters %>%
  ggplot(aes(x = fct_reorder(characters,
                             total_appearances),
             y = total_appearances)) +
  geom_col(mapping = aes(fill = characters)) +
  coord_flip() +
  labs(y = "Total appearances",
       title = "Twin Peaks characters by their appearances") +
  theme_ipsum_rc(grid="X") +
  theme(legend.position="none",
        axis.title.y = element_blank())

In this post, we scraped HTML data from several web pages to compile data about the cult classic TV series Twin Peaks, and then we created two charts from our scraped data.

But before you get the wrong impression, and think that “50 Things You Need to Know About Data” is a course about web scraping — I need to be clear that this is just one topic among many that you’ll study if you follow the course.

As a newcomer to data science and data visualization, I found this course inspiring. And as a web developer, it was refreshingly fun. I’d recommend it to anyone who is interested in working with data, but who hasn’t yet wrangled data programmatically.

Special thanks to Katie Segretti, whose charts were the inspiration for this post, and who graciously agreed to let me use her visualizations here as well.

About the EPFL Extension School

The EPFL Extension School teaches digital skills in data science, machine learning, web application development, UI Development with React.js, and more. Please visit our website, reach out to us by e-mail or Twitter, and most importantly: go get those digital skills!

Scraping web data with R & the Tidyverse

Written by Harry Anderson