How to Web Scrape with RStudio

FAISAL ARDIANSYAH
3 min readDec 19, 2018


Did You Know About Scraping?

Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer or to a database in table (spreadsheet) format.

Data displayed by most websites can only be viewed using a web browser. They do not offer the functionality to save a copy of this data for personal use. The only option then is to manually copy and paste the data — a very tedious job which can take many hours or sometimes days to complete. Web Scraping is the technique of automating this process, so that instead of manually copying the data from websites, the Web Scraping software performs the same task in a fraction of the time. (Source: http://www.webharvy.com/)

Now that we know what web scraping is, let's start scraping with RStudio. This time we will scrape the TripAdvisor website. TripAdvisor, Inc. is an American travel and restaurant website that shows hotel and restaurant reviews, accommodation, and other travel-related content. It also includes interactive travel forums. TripAdvisor was an early adopter of user-generated content.

Let's try it.

You can use any website, or follow along with the same one as me: https://www.tripadvisor.com/Airline_Review-d8729079-Reviews-Cheap-Flights-Garuda-Indonesia#REVIEWS. Here we use web scraping to collect reviews of the airline Garuda Indonesia. FYI, Garuda Indonesia is Indonesia's national airline.

First, try this:

library(xml2)
library(rvest)
url<-read_html("https://www.tripadvisor.com/Airline_Review-d8729079-Reviews-Cheap-Flights-Garuda-Indonesia#REVIEWS")

Why do we load xml2 and rvest? xml2 parses HTML and XML documents into a structure that R can navigate; rvest's read_html() is built on top of it. rvest is a package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr, so you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
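To see the pipeline style in action before touching a live site, here is a minimal, self-contained sketch that parses an inline HTML fragment instead of a URL (the fragment and its class names are invented for illustration):

```r
library(xml2)
library(rvest)

# A tiny HTML fragment standing in for a real page (hypothetical data)
page <- read_html('<div class="review"><span class="quote">Great flight</span></div>
                   <div class="review"><span class="quote">Comfortable seats</span></div>')

# Pipe the parsed document through node selection, then text extraction
quotes <- page %>%
  html_nodes(".review .quote") %>%
  html_text()

quotes
# [1] "Great flight"      "Comfortable seats"
```

The same three-step shape (parse, select nodes, extract text or attributes) is all the scraping code below does, just with TripAdvisor's selectors.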

npages<-url%>%
html_nodes(".pageNum")%>%
html_attr(name="data-page-number")%>%
tail(.,1)%>%
as.numeric()
npages

The code above reads the last page number shown in the pagination bar, i.e. how many pages of reviews the site has.

a<-0:(npages-1)
a
b<-10
b
res<-numeric(length = length(a))
res
for (i in seq_along(a)) {
res[i]<-a[i]*b
}
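The loop above just multiplies each page index by 10, because TripAdvisor offsets its review pages in steps of ten. The same offsets can be produced in one vectorised step (a stylistic alternative, not a change to the method; `npages` is hard-coded here so the snippet runs offline):

```r
npages <- 5  # stand-in value; in the article this comes from the scraped page count

# Vectorised equivalent of the for loop: offsets 0, 10, 20, ...
res <- (0:(npages - 1)) * 10
res
# [1]  0 10 20 30 40
```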

This loop handles sites whose content is split across several pages: for each offset i in res, it builds the corresponding page URL, downloads and parses it, and extracts the review nodes.

tableout<-data.frame()

for (i in res) {
cat(".")

url<-paste("https://www.tripadvisor.com/Airline_Review-d8729079-Reviews-Cheap-Flights-or",i,"-Garuda-Indonesia#REVIEWS",sep = "")

reviews<-url%>%
read_html()%>%
html_nodes("#REVIEWS .innerBubble")

id<-reviews%>%
html_node(".quote a")%>%
html_attr("id")

quote<-reviews%>%
html_node(".quote span")%>%
html_text()

rating<-reviews%>%
html_node(".rating .ui_bubble_rating")%>%
html_attr("class")%>%
gsub("ui_bubble_rating bubble_","", .)%>%
as.integer() / 10

date<-reviews%>%
html_node(".innerBubble, .ratingDate")%>%
html_text()

review<-reviews%>%
html_node(".entry .partial_entry")%>%
html_text()

flights<-reviews%>%
html_node(".categoryLabel")%>%
html_text()

reviewnospace<-gsub("\n","",review)
temp.tableout<-data.frame(id,quote,rating,date,reviewnospace,flights)
tableout<-rbind(tableout,temp.tableout)
}

Tableout of scraping results

In this case, the CSS selectors (found with the SelectorGadget browser tool) used to target each element are:
#REVIEWS .innerBubble; .quote a; .quote span;
.rating .ui_bubble_rating;
.innerBubble, .ratingDate;
.entry .partial_entry; .categoryLabel

From these I get the id, quote, rating, date, review, and flights fields from passengers' airline reviews, as shown in the picture.
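To make the mapping from selector to field concrete, here is a self-contained sketch using a made-up review fragment (the class names are taken from the article, but the HTML itself is invented, so the values are purely illustrative):

```r
library(xml2)
library(rvest)

# Invented fragment mimicking the review markup targeted by the selectors above
snippet <- read_html('
  <div id="REVIEWS">
    <div class="innerBubble">
      <div class="quote"><a id="rn123"><span>Smooth trip</span></a></div>
      <div class="rating"><span class="ui_bubble_rating bubble_40"></span></div>
      <div class="entry"><p class="partial_entry">Friendly crew and on-time departure.</p></div>
    </div>
  </div>')

reviews <- snippet %>% html_nodes("#REVIEWS .innerBubble")

id     <- reviews %>% html_node(".quote a") %>% html_attr("id")       # "rn123"
quote  <- reviews %>% html_node(".quote span") %>% html_text()        # "Smooth trip"
rating <- reviews %>% html_node(".rating .ui_bubble_rating") %>%
  html_attr("class") %>%
  gsub("ui_bubble_rating bubble_", "", .) %>%
  as.integer() / 10                                                   # 4
review <- reviews %>% html_node(".entry .partial_entry") %>% html_text()
```

Each selector pulls one field per review node, which is exactly how the loop above assembles its data frame row by row.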

View(tableout)
write.csv(tableout,"E://BIML//scrap.csv")
save.image()

The syntax above saves the reviews to your computer; you can choose the directory, file name, and file format.


That's all on web scraping for now. Watch for my next article on another topic. Cheerio.
