Web Scraping — Part 2

Kafaru Simmie · Published in Analytics Vidhya · Dec 12, 2019

Simultaneously scraping multiple web pages with R

Source: Pixabay

This short tutorial covers how to scrape multiple pages of a website in one go. It assumes you can use the SelectorGadget CSS selector tool in Google Chrome. If not, see the first part here.

Some web pages contain large chunks of data (or text), e.g. a comment section, and the content usually spills over to the next page. Pagination may also be used for sorting. Whatever the case may be, scraping each of these pages individually would be very tedious. For this reason, I thought of creating a function that easily scrapes multiple pages.

I will be scraping the lyrics of an artist (Travis Greene). Since artists usually have quite a number of songs in their portfolios, it would be extremely tasking and strenuous to scrape each page individually. So let's find out how to make scraping these pages easy.

# Let's load the required packages
library(rvest)
library(tidyverse)
library(stringr)

Now we read in the URL with rvest and find the CSS selector containing the list of songs.

SongLyrics.com: Travis Greene

We locate the CSS selector for the list of songs using SelectorGadget.

# First we extract the song URLs from the central page
page_url <- read_html("http://www.songlyrics.com/travis-greene-lyrics/") %>% # Read the central page listing the songs
  html_nodes("#colone-container .tracklist a") %>% # Select the link node for each listed song
  html_attr("href") # Extract each song page's URL from the href attribute

page_url
Lyrics URLs

The image above shows the result of the code: the URL (link) of every song listed on the central page.
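Depending on the site, html_attr("href") can return relative paths rather than full URLs. Below is a minimal sketch of how you might inspect the result and, if needed, convert relative links to absolute ones with xml2::url_absolute() (xml2 is installed alongside rvest; the base URL here is my assumption of the site root):

head(page_url, 3) # Peek at the first few extracted links
length(page_url) # How many song pages did we find?

# If any links came back relative (no leading "http"), make them absolute
page_url <- xml2::url_absolute(page_url, "http://www.songlyrics.com/")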

Next we extract the titles of all the songs.

page_title <- read_html("http://www.songlyrics.com/travis-greene-lyrics/") %>%
  html_nodes("#colone-container .tracklist a") %>%
  html_text() # The link text is the song title

page_title
Song titles

Having obtained the song titles, we next create a function that reads each individual URL and extracts the lyrics via a CSS selector on each page. Don't worry, I've made the steps as easy as possible.

  1. Find the CSS selector of the lyrics from one of the URLs obtained above. Just click through to one of the songs and grab the selector.
Lyrics CSS selector

We can see that #songLyricsDiv is the CSS selector containing the lyrics.

2. Now let's create the function that visits a page and reads the contents of that CSS selector (in this case, the lyrics).

# Given a song URL, return a one-column tibble containing the page's lyrics
read_lyrics <- function(url) {
  tibble(
    lyrics = url %>%
      read_html() %>% # Visit the song page
      html_nodes("#songLyricsDiv") %>% # Select the lyrics container
      html_text() # Extract the lyrics as plain text
  )
}
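Before applying the function to every song, it may help to sanity-check it on a single page, assuming the page_url vector from earlier:

# Quick trial on the first song URL; should return a one-row tibble of lyrics
read_lyrics(page_url[1])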

Remember, the above is just a function definition; we still have to apply it across all the pages. To apply:

page_url[1:9] %>% # Take just the first nine URLs as a trial
  set_names(1:9) %>% # Name each URL by its position in the list
  map_dfr(read_lyrics, .id = "position") %>% # Scrape each page; .id records the position
  bind_cols(title = page_title[1:9]) # Bind the titles of the first nine songs

On the third line of code above, map_dfr() (from purrr, part of the tidyverse) passes each individual URL into the function we defined earlier.

For each URL passed in, the function visits that page, reads the CSS selector defined in the function, and returns the scraped lyrics. It does this iteratively, and map_dfr() row-binds the results into a single tibble.
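If map_dfr() feels opaque, the sketch below is a rough base-R loop doing the same job; it is for intuition only, and the variable names are mine:

# Roughly what the map_dfr() pipeline does, written as a loop
results <- list()
for (i in 1:9) {
  one_page <- read_lyrics(page_url[i]) # Scrape one song page
  one_page$position <- as.character(i) # What .id = "position" adds
  results[[i]] <- one_page
}
lyrics_df <- bind_rows(results) # map_dfr() then stacks the rows like this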

In the last line of the pipeline, since we already have the song titles stored in page_title, we take the first nine and column-bind them with the scraped lyrics, matching the nine URLs we used.

Lyrics and Titles

The above image should be your result.
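A side note on design: bind_cols() assumes the lyrics come back in exactly the same order as the titles. An arguably safer pattern, sketched below, is to name the URL vector with the titles themselves so that .id carries the title directly:

# Alternative: let .id carry the song title instead of a position
page_url[1:9] %>%
  set_names(page_title[1:9]) %>% # Name each URL with its song title
  map_dfr(read_lyrics, .id = "title") # .id now records the title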

Conclusion

We have briefly explored how to scrape multiple pages. Of course, different websites may require you to tweak the code above a bit, but it sure does get the job done.
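One common tweak, sketched here under my own assumptions: if the site is slow or a page is missing, you can wrap the scraper in purrr::possibly() so a single bad page does not crash the whole run, and pause between requests with Sys.sleep() (the one-second delay is an arbitrary choice):

# A more defensive wrapper around read_lyrics
safe_read_lyrics <- possibly(
  function(url) {
    Sys.sleep(1) # Be polite: pause between requests
    read_lyrics(url)
  },
  otherwise = tibble(lyrics = NA_character_) # Returned if a page fails
)

page_url[1:9] %>%
  set_names(page_title[1:9]) %>%
  map_dfr(safe_read_lyrics, .id = "title")

Find the full script for this tutorial here.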
