Web Scraping in R

Data Extraction and Data Analysis using rvest and dplyr

Subhan khaliq
Analytics Vidhya
Dec 10, 2020


What is web scraping?

Scraping means extracting something. In computer science, web scraping means extracting data from a website. The data is collected and then exported into a format that is more useful to the user.

Now we are going to scrape data from the Amazon website: the prices of laptops from two companies, HP and Dell. First, load the libraries that will help us with scraping:

library(xml2)
library(rvest)
library(dplyr)

If you have not installed these packages yet, install them first with the following commands:

install.packages("xml2")
install.packages("rvest")
install.packages("dplyr")

The xml2 library reads HTML text and extracts components from its nodes.

rvest helps us scrape information from web pages.

The dplyr library helps with data manipulation, e.g., sorting data.

Now, go to the page you want to scrape. I have used a Chrome extension, “SelectorGadget”, to inspect elements. Other browsers have their own extensions for this; for example, Mozilla Firefox has one named “ScrapeMate Beta”. You can add an extension of this type to your browser to select data on any web page.

Step 1

Copy the link of the web page you want to scrape and store it in a variable. Then read the HTML page.
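As a rough sketch of this step, the code below stores a link in a variable and parses a page with read_html. The URL and the variable names (url, sample_html, page) are assumptions made for illustration, not the ones used in the original screenshots. A small inline HTML string stands in for the downloaded page so the example runs without a network connection.

```r
library(xml2)
library(rvest)

# Hypothetical URL; replace it with the page you want to scrape.
url <- "https://www.example.com/laptops"

# For a live page you would call: page <- read_html(url)
# Here an inline HTML string stands in for the downloaded page.
sample_html <- '<html><body>
  <div class="item"><span class="name">Laptop A</span><span class="price">$500</span></div>
  <div class="item"><span class="name">Laptop B</span><span class="price">$1,250</span></div>
</body></html>'

page <- read_html(sample_html)

# The result is a parsed document that later steps can query.
class(page)
```

Keep in mind that a live site may block automated requests or change its layout, so the call on a real URL can fail even when the code is correct.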

Step 2

Since I want to scrape the price and name of each item, I select the price and the name of one item on the page (you can start with either one). The extension then generates a CSS selector that matches the prices or names of all the items on that page. Two separate selectors are generated, one for the price and the other for the name.

As you can see, I have selected the price of one item, which is shown in green, and all the other prices are selected automatically. The selector is also generated; you can see it at the bottom right.
The other selector is generated in the same way by selecting the name of an item.
Now it’s time to write some code.

In this code, the “%>%” operator passes the value on its left to the function on its right as its first argument. For example,
x <- c(8, 99, 78, 88)
x %>% plot()
OR
plot(x)
Both do the same thing. The prices and names are stored in the variables you can see in the image.
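Since the screenshot of the code is not reproduced here, the following is a minimal sketch of what this step might look like. The CSS selectors ".name" and ".price" and the inline HTML are assumptions for illustration; on a real page you would paste in the selectors that SelectorGadget generates.

```r
library(xml2)
library(rvest)

# Inline stand-in for a downloaded page (assumed structure).
page <- read_html('<html><body>
  <div class="item"><span class="name">Laptop A</span><span class="price">$500</span></div>
  <div class="item"><span class="name">Laptop B</span><span class="price">$1,250</span></div>
</body></html>')

# Select every node that matches the selector, then keep only its text.
item_names <- page %>% html_nodes(".name") %>% html_text()
item_prices <- page %>% html_nodes(".price") %>% html_text()

# Strip currency symbols and thousands separators so prices become numbers.
price_values <- as.numeric(gsub("[^0-9.]", "", item_prices))

item_names
price_values
```

In rvest 1.0.0 and later, html_elements is the newer name for html_nodes; both are used the same way here.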

Step 3

Create a data frame and write it to a CSV file so that we can use it for visualization. The CSV file also makes the information easier to read.

This technique covers only one web page. Next, let's see how to scrape multiple pages.
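This step could be sketched as follows. The vectors laptop_names and laptop_prices hold illustrative values rather than scraped data, and the file name dell_laptops.csv is an assumption. The file is written to a temporary directory so the example does not touch your working directory.

```r
# Illustrative values standing in for the scraped names and prices.
laptop_names <- c("Laptop A", "Laptop B")
laptop_prices <- c(500, 1250)

# Combine the two vectors into one table, one row per laptop.
dell <- data.frame(name = laptop_names, price = laptop_prices,
                   stringsAsFactors = FALSE)

# Write the table to a CSV file without the row-number column.
out_file <- file.path(tempdir(), "dell_laptops.csv")
write.csv(dell, out_file, row.names = FALSE)

# Read the file back to check what was saved.
saved <- read.csv(out_file, stringsAsFactors = FALSE)
saved
```

Note that data.frame expects the two vectors to have the same length, so if some items on the page have a name but no price, the vectors need to be aligned first.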

Scraping of Multiple pages at a Time

If you have already scraped one page, this is not a difficult task. We use the same piece of code with minor changes. First, look at the link and find which part of it changes when you click Next to go to the next page.

For example, the link is www.dummy.com/titles/movies

Now, when you click Next, the link will look like www.dummy.com/titles/movies?page=1, or it will change in some similar way.

First, I have created a for loop so that we can easily move through the pages. In the link, I have put a variable named page_no, so the link changes as the loop iterates. If I only create a data frame inside the loop, it will contain just the data of the last page, because it is overwritten on every iteration and keeps only the latest value. In this case, rbind solves the problem by appending each page's rows to the rows collected so far. It is also good practice to replace NA (Not Available) with 0, so that you have no problems later when you apply operations to the data.
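The loop described above might look roughly like the sketch below. The base URL, the selectors, and the helper fake_page are all assumptions for illustration; fake_page builds a tiny page for each page number so the example runs offline, and on a real site you would call read_html on the pasted link instead.

```r
library(xml2)
library(rvest)

# Hypothetical base URL; the page number is appended inside the loop.
base_url <- "https://www.example.com/laptops?page="

# Hypothetical stand-in for read_html(paste0(base_url, page_no)).
# Page 2 contains one item with an empty price.
fake_page <- function(page_no) {
  second_price <- if (page_no == 2) "" else paste0("$", 700 + page_no)
  read_html(paste0(
    '<html><body>',
    '<div><span class="name">Laptop ', page_no, 'A</span>',
    '<span class="price">$', 500 + page_no, '</span></div>',
    '<div><span class="name">Laptop ', page_no, 'B</span>',
    '<span class="price">', second_price, '</span></div>',
    '</body></html>'
  ))
}

all_laptops <- data.frame()
for (page_no in 1:3) {
  page <- fake_page(page_no)  # live: read_html(paste0(base_url, page_no))
  item_names <- page %>% html_nodes(".name") %>% html_text()
  item_prices <- page %>% html_nodes(".price") %>% html_text()
  item_prices <- as.numeric(gsub("[^0-9.]", "", item_prices))  # "" becomes NA
  page_data <- data.frame(name = item_names, price = item_prices,
                          stringsAsFactors = FALSE)
  # Append this page's rows instead of overwriting the earlier pages.
  all_laptops <- rbind(all_laptops, page_data)
}

# Replace missing prices with 0, as suggested in the text.
all_laptops$price[is.na(all_laptops$price)] <- 0
all_laptops
```

One design point to consider: zeros pull an average down, so if the goal is an average price it may be safer to keep the NA values and call mean with na.rm = TRUE instead.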

Here you can see the CSV file.

Repeat the same process for HP laptop prices and names.

Analyze the Data

Now we have data containing the laptop prices of both companies, Dell and HP. We are going to analyze it by plotting it, so that we can find the average price of Core i7 laptops for each company.
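Before plotting, the averages can also be computed directly with dplyr. The sketch below uses illustrative prices, not scraped data, and the data frame and column names (dell, hp, brand, avg_price) are assumptions for the example.

```r
library(dplyr)

# Illustrative values only, not scraped data.
dell <- data.frame(name = c("Dell A", "Dell B", "Dell C"),
                   price = c(700, 900, 1100))
hp <- data.frame(name = c("HP A", "HP B", "HP C"),
                 price = c(650, 850, 1050))

# Label each table with its brand and stack them into one table.
laptops <- bind_rows(
  mutate(dell, brand = "Dell"),
  mutate(hp, brand = "HP")
)

# Average price per brand, highest first.
avg_prices <- laptops %>%
  group_by(brand) %>%
  summarise(avg_price = mean(price)) %>%
  arrange(desc(avg_price))

avg_prices
```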

Dell Laptops Average Prices
HP Laptops Average Prices

The pictures make it easy to see the typical price of Core i7 laptops from both companies. The black line in each plot marks the central price (in a plot drawn with R's boxplot function, this line is the median rather than the mean).

My purpose in this blog is not to tell you what the average laptop prices are today. I only want to show that data is easy to analyze after scraping, so that even a non-specialist can understand it, and that the R language helps us do so. You can use the boxplot function in R to plot this type of data. You can also use many other plots, such as barplot, pie, and hist (histogram). It all depends on what type of data you are working with.

I hope this blog helps you scrape data with R and analyze it.
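A plot of this kind could be produced roughly as follows. The prices are illustrative values, not the scraped data, and the variable and file names are assumptions. Because boxplot draws the median as the thick line in each box, the mean is added here as a separate marker. The plot is written to a temporary PDF file so the sketch also runs in a non-interactive session.

```r
# Illustrative values only, not scraped data.
dell_prices <- c(650, 720, 799, 850, 910, 1050, 1200)
hp_prices <- c(600, 699, 780, 820, 899, 990, 1150)

avg_dell <- mean(dell_prices)
avg_hp <- mean(hp_prices)

# Send the plot to a temporary PDF file.
plot_file <- file.path(tempdir(), "laptop_prices.pdf")
pdf(plot_file)

# One box per company; the thick line inside each box is the median.
boxplot(list(Dell = dell_prices, HP = hp_prices),
        main = "Core i7 laptop prices", ylab = "Price")

# Mark the mean of each group with a cross.
points(1:2, c(avg_dell, avg_hp), pch = 4)

invisible(dev.off())
```

In an interactive session you can drop the pdf and dev.off calls to see the plot in the plotting window.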

Here you can find the complete code
