Scraping Web Pages to Improve Search Engine Results

Today, I found an issue on some pages that needed to be addressed to improve their search engine results on Google. Namely, the titles for these pages were missing, causing them to show up oddly in a Google search. I wasn’t sure how many pages were affected, but I needed to get that information to the SEO team for resolution, so I created an R script that navigates to each web page, scrapes the HTML, extracts the needed elements from the page, and stores the results in a data frame.

There is an awesome R package called rvest, created by Hadley Wickham, which I find essential for web scraping in R. With this package, I am able to iterate through thousands of pages with just a few lines of code.
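Just to show how compact that can be, here is a minimal sketch that pulls the title text from a single page (the URL is only a placeholder):

library(rvest)

# Read one page and pull the text of its <title> tag
read_html("https://example.com") %>% html_nodes("title") %>% html_text()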

Below is my script. I had all of my URLs in a CSV file, so I read the file in using the readr package (another Hadley Wickham package). Then, a for loop iterates through all of the page URLs, scrapes the HTML, extracts the title tag, and stores the extracted text in a new column named PAGE_TITLE. Finally, I write the data frame out to a CSV file.

library(readr)
library(rvest)
library(dplyr)

# CSV containing the page URLs in its first column
allPages = read_csv(...)
allPages$PAGE_TITLE = ""

for (i in 1:nrow(allPages)) {

  # Pull the URL out of the first column as a plain character string
  webAddy = allPages[[i, 1]]

  # Read the page and extract the text of its <title> tag
  pgTitle = read_html(webAddy) %>% html_nodes("title") %>% html_text()

  # A page with no <title> comes back zero-length; record those as NA
  allPages$PAGE_TITLE[i] = if (length(pgTitle) == 0) NA_character_ else pgTitle[1]
}

write.csv(allPages, "allPages-PageTitles.csv", row.names = F)
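
With the titles collected, a quick filter on the data frame flags the pages that need attention. This is just a sketch using dplyr (loaded above); the column name matches the script, but the output file name is arbitrary:

# Keep only the rows where no title was found
missingTitles = allPages %>% filter(is.na(PAGE_TITLE) | PAGE_TITLE == "")

write.csv(missingTitles, "pages-missing-titles.csv", row.names = F)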

In a little over 30 minutes, I was able to scrape 2,146 pages and identify those without a title. Luckily, only a small percentage of pages were missing one. This sure beats the manual process of going through each page to find the problems. I’ll probably end up restructuring this script to extract the meta descriptions from each page and running it again, since I’m now finding that meta descriptions are also not being set when the pages are created.
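
For that follow-up run, the only change inside the loop would be the rvest selector. A minimal sketch of the extraction step, assuming the description lives in a standard <meta name="description"> tag:

# Pull the content attribute of the <meta name="description"> tag, if present
pgDesc = read_html(webAddy) %>%
  html_nodes("meta[name='description']") %>%
  html_attr("content")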


Originally published at what do the data say?.
