Have you ever searched for a flight, found a good price, run the same search the next day and found that the price had doubled? Flight prices are highly dynamic, and if you travel regularly, trying to understand how and why they vary can drive you crazy.
Maddening as this can be, I find the dynamic really interesting, so I wanted to track the flight prices for some routes that interest me over a period of time. Running the same searches every day and entering the data manually did not appeal to me, so I investigated how I could build my own web scraper to do it automatically. In this post I will share how I did it.
1. Prerequisites
To follow along you need some basic R knowledge and a basic understanding of HTML. That's it. My go-to tool for data analysis is R, so I wanted to do the scraping with R as well. I used the following R libraries:
library(RSelenium) # Control a browser (e.g. Firefox) with R code
library(rvest)     # Scrape data from HTML
library(stringr)   # Handle strings
library(tidyr)     # Data cleaning
library(dplyr)     # Data cleaning
- RSelenium makes it possible to control and navigate a browser session from within R, using R code
- rvest is used to collect the data (HTML and XML) from a web page
- tidyr and dplyr are great for data cleaning and preparation, and stringr helps when handling strings
2. Find a suitable webpage to scrape
Some web pages are hard to scrape because of frequently changing HTML structure, use of CAPTCHAs, and so on. I wanted to scrape one of the OTAs (online travel agencies) that collect prices from several airlines, and Expedia turned out to be fairly straightforward to scrape.
3. Breakdown of the URL
Now, have a look at the URL. To make the scraping dynamic (so we can loop over several dates, destinations, etc.), we must be able to build the URL string programmatically. The orange-colored parts are the ones that can be replaced with variables: date, origin and destination, and finally carrier (which can also be left out). Note that origin and destination contain some extra text in Swedish (alla flygplatser = all airports); depending on your language settings this will look slightly different.

I created a text file with the strings to use as origin and destination.
The date variables can be created directly in R, relative to the current date.
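As a sketch, the URL strings can be assembled with base R. The URL template below is an assumption based on the structure of the search URL in my browser; the query parameters may differ depending on your locale and on changes Expedia makes to its site, so adjust them to match your own address bar:

```r
# Build search URLs for the next 30 departure dates.
# NOTE: the URL template is illustrative, not an official API.
origin      <- "Berlin (BER - alla flygplatser)"
destination <- "Stockholm (STO - alla flygplatser)"
dates       <- format(Sys.Date() + 1:30, "%d.%m.%Y")

urls <- paste0(
  "https://www.expedia.se/Flights-Search?trip=oneway",
  "&leg1=from:", URLencode(origin, reserved = TRUE),
  ",to:",        URLencode(destination, reserved = TRUE),
  ",departure:", dates, "TANYT",
  "&mode=search"
)

head(urls, 2)
```

Looping over a vector of origins and destinations read from the text file then gives one URL per route and date.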

4. Investigate the webpage to scrape
Now it is time to look at which parts of the page should be collected. Mark the part you want to scrape, right-click and select "Inspect element". Information about the element is shown. For example, if we want the price, just hover over it and you'll see that the information is found in the element 'span.full-bold'.
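In simplified form, the price element looks roughly like this (a sketch of what the inspector shows; the real markup carries more attributes and may have changed since I wrote this):

```html
<!-- simplified sketch of a price element on the search results page -->
<span class="full-bold">1 174 kr</span>
```

The CSS selector 'span.full-bold' used later in the R code matches exactly this combination of tag name and class.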

5. Write the R-Code
The URL and the web page elements are now known, so it is time to translate this into R code. The code snippet below starts a browser session from within R and navigates to the URL. I pause for 15 seconds so the page has time to load before getting the page source.
# Start server
driver <- rsDriver(browser = "firefox")
remDr <- driver[["client"]]

# Go to url and wait 15 seconds for the page to load
remDr$navigate(url)
Sys.sleep(15)

# Get page source
page <- remDr$getPageSource()
The information from the page source is collected with read_html and html_nodes, resulting in the vector of prices below.
# Read html
text_xml <- read_html(page[[1]])

# Get prices
prices <-
  text_xml %>%
  html_nodes('span.full-bold') %>%
  html_text()

> head(prices)
.
1 1 174 kr
2 1 174 kr
3 1 174 kr
4 1 382 kr
5 2 022 kr
6 2 022 kr
Other information (airline, departure time, etc.) can be collected with the same logic. This step might be more complicated depending on the structure of the web page.
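The scraped prices are strings like "1 174 kr", so before any analysis they need to be converted to numbers. A minimal sketch with base R, using sample values taken from the output above:

```r
prices <- c("1 174 kr", "1 382 kr", "2 022 kr")

# Strip everything except digits (spaces, possible non-breaking
# spaces, and the "kr" suffix), then convert to numeric
price_num <- as.numeric(gsub("[^0-9]", "", prices))

price_num
# [1] 1174 1382 2022
```

Using a "keep only digits" pattern is more robust here than removing specific characters, since the thousands separator may be a regular space or a non-breaking space depending on the page encoding.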
Combined with the variables used in the URL, my final data set looks like this:
> glimpse(df)
Observations: 42
Variables: 8
$ price <fct> 1 927 kr, 2 250 kr, 2 435 kr, 2 435 kr, 2 435 kr, 2 940 kr, 2 940 kr, 3 542 kr, 3 542 kr,...
$ airline <chr> "Air Baltic", "Air Baltic", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS", "SAS"...
$ Origin <chr> "BER", "BER", "BER", "BER", "BER", "BER", "BER", "BER", "BER", "BER", "BER", "BER", "BER"...
$ Destination <chr> "STO", "STO", "STO", "STO", "STO", "STO", "STO", "STO", "STO", "STO", "STO", "STO", "STO"...
$ DepartureDate <date> 2019-11-03, 2019-11-03, 2019-11-03, 2019-11-03, 2019-11-03, 2019-11-03, 2019-11-03, 2019...
$ ReturnDate <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ ScrapeDate <date> 2019-10-29, 2019-10-29, 2019-10-29, 2019-10-29, 2019-10-29, 2019-10-29, 2019-10-29, 2019...
$ nbr_trials <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
If you let this script run for a couple of days, you can put the resulting data frames together and analyze the variations in flight prices.
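As a sketch of that last step, assuming each daily run saves its result as a CSV file in a folder (the folder layout, file names and values below are made up for illustration):

```r
# Combine one CSV per scrape run into a single data frame.
# For the example, write two small fake runs to a temporary folder.
dir <- tempfile("scrapes")
dir.create(dir)
write.csv(data.frame(price = c(1174, 1382), ScrapeDate = "2019-10-29"),
          file.path(dir, "2019-10-29.csv"), row.names = FALSE)
write.csv(data.frame(price = c(1250, 1402), ScrapeDate = "2019-10-30"),
          file.path(dir, "2019-10-30.csv"), row.names = FALSE)

# Read all files and stack them (dplyr::bind_rows works just as well)
files  <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
df_all <- do.call(rbind, lapply(files, read.csv))

nrow(df_all)
# [1] 4
```

With ScrapeDate in the combined data you can then group by route and departure date and plot how the price develops as the departure approaches.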
Google sheet with my current data:
https://docs.google.com/spreadsheets/d/12lzKj1t0g1maoO6FwX8c8lmehD95mt-7fb7zJ3VrFiU/edit#gid=2026283149
Github-project: https://github.com/msjoelin/flight-scraping
