Using Python to Find the Cheapest Air Tickets and their Timings

Akshit Behera
Published in Analytics Vidhya · Dec 25, 2020


A handy guide on using web scraping to find the cheapest airline tickets for specific departure times over a period of days.

Photo by Cam DiCecca on Unsplash

Travel aggregators have done a great job over the years of creating common platforms where one can choose from a plethora of flight and hotel options. This has given users more flexibility: they can now book their preferred timings as well as their preferred airlines. The user interfaces of these sites have also become much friendlier, letting customers sort and filter by personal preference.

Source: www.cxodaily.com

While studying such sites over the last few months, planning post-COVID trips, I noticed one critical feature missing from most of them. Websites generally offer to sort the options by price (lowest to highest) for a given day, and some even display the cheapest fare for each of the coming days in the date-selection panel. However, in times like COVID-19, when many states have night curfews, and from a safety point of view as well, one often wants the cheapest option within a preferred time slot. No website currently offers a way to check for the cheapest flights over the next few days within a particular time slot. This feature matters in the current times and could be of great benefit to customers wanting to finalize travel itineraries. So, until such a feature is offered on the travel websites, we will try to build it in Python, using Selenium.

The Scenario

Suppose one wants to fly from Delhi to Mumbai, is flexible with travel dates over about a week, but is particular about departing between 3 pm and 6 pm.

Approach

WARNING! First and foremost, if you are also planning to build this tool, please refer to the robots.txt of the respective website and check what is allowed to be extracted. If the website does not allow scraping of what you need, please email the web administrator before proceeding.
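For a quick programmatic check as well, Python's built-in urllib.robotparser can read the file for you. A minimal sketch (the path queried is just an example, and a True result is no substitute for reading the site's terms of use):

from urllib import robotparser

# Point the parser at the site's robots.txt and fetch it
rp = robotparser.RobotFileParser()
rp.set_url("https://www.makemytrip.com/robots.txt")
rp.read()

# True only if the given user agent may fetch this path
print(rp.can_fetch("*", "https://www.makemytrip.com/flight/search"))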

For our analysis, we will select one of the biggest online travel aggregators in India, MakeMyTrip. MakeMyTrip is a NASDAQ-listed company providing online travel services including flight tickets, domestic and international holiday packages, hotel reservations, and rail and bus tickets. To scrape the website, we first import the necessary libraries:

import selenium
from selenium import webdriver as wb
import pandas as pd
import numpy as np
import datetime
import time
from datetime import date
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

Set up the working directory to the path where chromedriver is saved and then launch chromedriver as follows:

%cd "PATH WHERE YOUR CHROMEDRIVER IS SAVED"
driver = wb.Chrome("YOUR PATH\\chromedriver.exe")
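Passing the executable path directly like this works on Selenium 3, which this post uses. If you are on Selenium 4 or later, the path moves into a Service object instead; a minimal equivalent sketch (the path is still a placeholder):

from selenium import webdriver as wb
from selenium.webdriver.chrome.service import Service

# Selenium 4+ style: wrap the chromedriver path in a Service object
service = Service("YOUR PATH\\chromedriver.exe")
driver = wb.Chrome(service=service)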

Since we want to check flight prices for a week, we have picked the week of 4th to 10th January 2021. We set up our day, month and year variables.

# Day, month and year as zero-padded strings for 4-10 January 2021
day = [str(d).zfill(2) for d in range(4, 11)]
month = ['01'] * 7
year = ['2021'] * 7
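If typing the date parts out feels error-prone, the same three lists can also be generated from a pandas date range; a small sketch:

# The same zero-padded strings, generated from a date range
dates = pd.date_range("2021-01-04", "2021-01-10")
day = [d.strftime("%d") for d in dates]
month = [d.strftime("%m") for d in dates]
year = [d.strftime("%Y") for d in dates]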

Now, in order to extract information from the website, we first need to study the URL. The URL for a flight search from Delhi to Mumbai on 4th January 2021 looks as follows:

https://www.makemytrip.com/flight/search?tripType=O&itinerary=DEL-BOM-04/01/2021&paxType=A-1_C-0_I-0&cabinClass=E&sTime=1608889546521&forwardFlowRequired=true&mpo=&intl=false

The above URL clearly indicates where we need to input the day, month and year variables created above, so we will call the website as follows:

for a, b, c in zip(day, month, year):
    driver.get("https://www.makemytrip.com/flight/search?tripType=O&itinerary=DEL-BOM-{}/{}/{}&paxType=A-1_C-0_I-0&cabinClass=E&sTime=1597828876664&forwardFlowRequired=true".format(a, b, c))

A major challenge we face (from a crawling perspective) on websites like Instagram and Facebook is infinite scrolling, i.e., content keeps loading only as you scroll down. Our code would otherwise extract only the portion visible by default, risking a loss of data. MakeMyTrip behaves similarly: not all listed flights load until you scroll down, and we will have to account for this in our code.

# Keep scrolling to the bottom until the page height stops growing,
# i.e., until all flights for the day have loaded
lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match = False
while not match:
    lastCount = lenOfPage
    time.sleep(1)
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount == lenOfPage:
        match = True
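As a side note, the WebDriverWait, expected_conditions and By imports above are never actually used in the snippets that follow; they could replace the fixed sleeps with an explicit wait. A sketch, assuming the airline-name element scraped below is a reliable signal that results have rendered:

# Wait up to 30 seconds for the first result card to appear,
# instead of sleeping for a fixed interval (XPath reused from
# the scraping code below)
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_element_located(
    (By.XPATH, "//div[@class='pull-left airways-info-sect']")))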

Now we are ready to extract the flight name, price, source, destination, duration, departure time, arrival time and date for all the flights during our preferred period and save the information to an empty dataframe called ‘Flight_Prices’.

# Creating an empty dataframe called 'Flight_Prices'
Flight_Prices = pd.DataFrame()

# Iterating through the search-results page for each date in our week
for a, b, c in zip(day, month, year):
    driver.get("https://www.makemytrip.com/flight/search?tripType=O&itinerary=DEL-BOM-{}/{}/{}&paxType=A-1_C-0_I-0&cabinClass=E&sTime=1597828876664&forwardFlowRequired=true".format(a, b, c))
    time.sleep(15)

    # Scrolling until the page height stops changing, so that all
    # flights for the day are loaded
    lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    match = False
    while not match:
        lastCount = lenOfPage
        time.sleep(1)
        lenOfPage = driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        if lastCount == lenOfPage:
            match = True

    time.sleep(60)

    # Extracting all the airline names using XPath
    FlightName_elements = driver.find_elements_by_xpath("//div[@class='pull-left airways-info-sect']")
    FlightName_elements = [x.text for x in FlightName_elements]
    FlightName = [x.split('\n')[0] for x in FlightName_elements]
    FlightName = pd.Series(FlightName)

    # Extracting all the prices using XPath
    Price_elements = driver.find_elements_by_xpath("//span[@class='actual-price']")
    Price = [x.text for x in Price_elements]
    Price = [i for i in Price if i]  # dropping empty strings
    Price = pd.Series(Price)

    # Extracting all the source-city details using XPath
    Fromcity_elements = driver.find_elements_by_xpath("//p[@class='dept-city']")
    Fromcity = [x.text for x in Fromcity_elements]
    Fromcity = pd.Series(Fromcity)

    # Extracting all the destination-city details using XPath
    Tocity_elements = driver.find_elements_by_xpath("//p[@class='arrival-city']")
    Tocity = [x.text for x in Tocity_elements]
    Tocity = pd.Series(Tocity)

    # Extracting all the duration details using XPath
    Duration_elements = driver.find_elements_by_xpath("//p[@class='fli-duration']")
    Duration = [x.text for x in Duration_elements]
    Duration = pd.Series(Duration)

    # Extracting all the departure-time details using XPath
    Deptime_elements = driver.find_elements_by_xpath("//div[@class='dept-time']")
    Deptime = [x.text for x in Deptime_elements]
    Deptime = pd.Series(Deptime)

    # Extracting all the arrival-time details using XPath
    Arrtime_elements = driver.find_elements_by_xpath("//p[@class='reaching-time append_bottom3']")
    Arrtime = [x.text for x in Arrtime_elements]
    Arrtime = [x.split("+", 1)[0] for x in Arrtime]  # dropping the '+ 1 day' suffix
    Arrtime = pd.Series(Arrtime)

    # Extracting the date from the active tab of the date panel
    Date_elements = driver.find_elements_by_xpath("//div[@class='item blue_active']")
    Date_elements = [x.text for x in Date_elements]
    x = [x.split(',', 1)[1] for x in Date_elements]
    Date = [i.split('\n', 1)[0] for i in x]
    Date = pd.Series(Date)

    # Combining all the series into a dataframe called 'df'
    df = pd.DataFrame({'Date': Date, "Airline": FlightName, "From City": Fromcity, "To City": Tocity, "Departure Time": Deptime, "Arrival Time": Arrtime, "Flight Duration": Duration, "Price": Price})

    # Appending the results from every page to the empty dataframe
    # created earlier, 'Flight_Prices'
    Flight_Prices = Flight_Prices.append(df)

# Forward-filling the date (only one row per page carries it) and
# keeping just the digits of the price
Flight_Prices[Flight_Prices.Date == ""] = np.NaN
Flight_Prices.Date = Flight_Prices.Date.fillna(method='ffill')
Flight_Prices.Price = Flight_Prices.Price.str.replace(",", "").str.extract(r'(\d+)')
Flight_Prices
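Since the scrape itself takes several minutes, it is worth persisting the results so the analysis can be re-run without hitting the website again; for example (the filename is arbitrary):

# Save the raw scrape to disk for later analysis
Flight_Prices.to_csv("flight_prices_del_bom.csv", index=False)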

We end up with a beautiful dataframe of 788 rows of flight options and their associated features. Before moving on to the critical step, we have to clean the data up a bit. First, we need to ensure that our 'Price' column is numeric:

Flight_Prices['Price'] = pd.to_numeric(Flight_Prices['Price'])

We then keep only the options with departure times between 3 pm and 6 pm.

# Zero-padded 24-hour 'HH:MM' strings sort lexicographically, so a
# plain string comparison picks out the 15:00-18:00 window
Flight_Prices = Flight_Prices[(Flight_Prices['Departure Time'] >= '15:00') & (Flight_Prices['Departure Time'] <= '18:00')]
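If you prefer not to rely on string ordering, the same filter can be written by parsing the times first; a sketch, assuming the scraped values are plain 'HH:MM' strings:

# Parse 'HH:MM' strings into time objects and filter on those
dep = pd.to_datetime(Flight_Prices['Departure Time'], format='%H:%M').dt.time
Flight_Prices = Flight_Prices[(dep >= datetime.time(15, 0)) & (dep <= datetime.time(18, 0))]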

Finally, we group by date to find the cheapest flight on each date of the week, and get the desired result.

# idxmin gives the row index of the cheapest flight for each date
cheapest = Flight_Prices.loc[Flight_Prices.groupby('Date')['Price'].idxmin()]
cheapest = cheapest.drop_duplicates('Date')
cheapest

Plotting the same on a line chart.
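A minimal sketch of how the chart can be drawn with matplotlib, using the column names from the dataframe above:

import matplotlib.pyplot as plt

# Cheapest price per date within the 3 pm - 6 pm window
cheapest_per_day = Flight_Prices.groupby('Date')['Price'].min()
cheapest_per_day.plot(marker='o')
plt.xlabel('Date')
plt.ylabel('Cheapest price (INR)')
plt.title('Cheapest DEL-BOM fare, 3 pm to 6 pm slot')
plt.show()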

This exercise helps us make a decision. We see that the cheapest flight in our preferred time slot is available on 10th January. However, it appears to be a connecting flight, as the flight duration is 17.5 hours. Hence, we would go with the next best option, available on 9th January, which is almost 21% cheaper than the cheapest flights on all the other dates of the week.

Conclusion

We saw during this exercise that Python can indeed be very useful for building small hacks that aid day-to-day decision making. The process can be as simple as scraping data from relevant websites, cleaning it as per our needs, and visualizing it to enable data-driven decisions. This post outlines the main code used in the process; the full code can be found here.
