How to Scrape Flight Prices with Python using Selenium
In this post, you will learn how to scrape Expedia flight prices with Selenium and ChromeDriver.
This article has been written by Roee Freund and Offir Inbar
Background
Stuck at home during the COVID-19 outbreak? Do you love traveling and data science?
You can combine both of them to buy cheap flights. This tutorial will teach you how to create a web scraper that finds flight prices for any destination you wish.
In this tutorial, you will scrape Expedia, one of the biggest online travel agencies (OTAs) in the world. It’s owned and operated by Expedia Group, ranked first on the list of top-earning travel companies.
If you want to see how the COVID-19 outbreak influenced flight prices, take a look at our Medium post.
It’s important to note that web scraping is against most websites’ terms of service, so your IP address may be banned from the website.
Environment setup
For this scraper, make sure you are using Python 3.5 or newer. Confirm that your environment contains the following packages and drivers:
- Selenium package: a popular web browser automation tool
- ChromeDriver: lets Selenium open a Chrome browser and perform tasks as a human user would.
- pandas package
- datetime module (part of the Python standard library)
This TDS post is a great introduction to Selenium.
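As a quick sanity check of the setup above, the following sketch reports which of the listed packages cannot be imported and whether a chromedriver executable is on the PATH. The helper name missing_packages is ours, not part of any library, and recent Selenium releases can also download a matching driver on their own, so a missing chromedriver is not necessarily a problem.

```python
import shutil
import sys
from importlib.util import find_spec

# Third-party packages the tutorial relies on; datetime ships with Python.
REQUIRED_PACKAGES = ["selenium", "pandas"]


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    if sys.version_info < (3, 5):
        print("Python 3.5 or newer is required, found", sys.version.split()[0])

    missing = missing_packages()
    if missing:
        print("Missing packages, try: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")

    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which("chromedriver") is None:
        print("chromedriver was not found on the PATH.")
```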
Scraping Strategy
Before getting into the code, let’s briefly describe the scraping strategy:
- Insert into a CSV file the exact routes and dates you want to scrape. You can add as many routes as you want, but it’s important to use these column names. The scraper works only for round trips.
- Run the full code.
- The output for each flight is a CSV file. Its file name will be the date and time that the scraping was performed.
- The scraper automatically places all flights of the same route in the matching folder (named after the route).
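To make the first step concrete, here is a minimal sketch of reading and validating a routes file. The column names (origin, destination, departure_date, return_date), the ISO date format, and the sample row are assumptions made for illustration; use the column names the scraper actually expects.

```python
import csv
import io
from datetime import datetime

# Hypothetical column names; the real routes file may use different ones.
REQUIRED_COLUMNS = ("origin", "destination", "departure_date", "return_date")


def parse_day(text):
    """Parse a YYYY-MM-DD string (assumed format) into a date."""
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()


def load_routes(csv_file):
    """Read round-trip routes from an open CSV file and validate each row."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("routes file is missing columns: " + ", ".join(missing))

    routes = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        depart = parse_day(row["departure_date"])
        back = parse_day(row["return_date"])
        if back < depart:
            raise ValueError("line {}: return date is before departure".format(line_no))
        routes.append({
            "origin": row["origin"].strip(),
            "destination": row["destination"].strip(),
            "departure_date": depart,
            "return_date": back,
        })
    return routes


# Made-up sample data standing in for a routes file on disk.
sample = io.StringIO(
    "origin,destination,departure_date,return_date\n"
    "Athens,Abu Dhabi,2020-06-01,2020-06-08\n"
)
routes = load_routes(sample)
print(routes[0]["origin"], "->", routes[0]["destination"])
```

With a real file, the same function can be called as load_routes(open("routes.csv", newline="")), where routes.csv is a placeholder name.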
Sounds complicated… It’s not! Let’s go over an example.
These are scraped routes as defined in the CSV file (the scraper creates those folders automatically):
Here you can see the multiple scraped dates for the Athens-Abu Dhabi route:
The screenshot below shows a single CSV file for each scraping sample of the Athens-Abu Dhabi route. Each file name represents the date and time at which the scraper was executed.
We hope the process is clear!
Output Features
The scraper output will include these features:
- Departure time
- Arrival time
- Airline
- Duration (flight duration)
- Stops (number of stops)
- Layovers
- Price
- Airplane type
- Departure Coach
- Arrival Coach
- Departure airport name
- Arrival airport name
- The exact time the scraping was performed
If it’s not a direct flight, the scraper will give you additional data (airport, airline name, etc.) for each connection.
Code
In this section, we will go over the main parts of the code. You can find the full code here.
First, as usual, import the relevant libraries, define the Chrome driver, and set the round-trip type.
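A minimal sketch of this setup step, assuming Selenium 4 (older releases take the driver path differently). The names TRIP_TYPE and make_driver are ours, and the window size and headless flag are illustrative choices rather than settings taken from the original code.

```python
TRIP_TYPE = "roundtrip"  # the scraper handles round trips only


def make_driver(chromedriver_path=None, headless=False):
    """Start Chrome through ChromeDriver and return the Selenium driver."""
    # Imported inside the function so the module itself loads even on a
    # machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("--window-size=1920,1080")
    if headless:
        options.add_argument("--headless=new")

    # Without an explicit path, Selenium looks for (or fetches) a driver itself.
    if chromedriver_path:
        service = Service(executable_path=chromedriver_path)
    else:
        service = Service()
    return webdriver.Chrome(service=service, options=options)


if __name__ == "__main__":
    driver = make_driver()
    try:
        driver.get("https://www.expedia.com/")
        print(driver.title)
    finally:
        driver.quit()  # always close the browser, even after an error
```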
In the next step, you will define a few functions that use Selenium tools to find various features on the webpage. The function names indicate their roles.
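The helpers below are a sketch of what such functions might look like, not the authors' originals. They rely only on find_elements(by, value), which both the Selenium driver and individual elements provide, and they return a default instead of raising when a feature is missing from a flight card. The names first_text and all_texts are hypothetical.

```python
def first_text(parent, by, value, default=None):
    """Return the stripped text of the first matching element, or default."""
    matches = parent.find_elements(by, value)  # empty list when nothing matches
    if not matches:
        return default
    return matches[0].text.strip()


def all_texts(parent, by, value):
    """Return the non-empty, stripped text of every matching element."""
    texts = (element.text.strip() for element in parent.find_elements(by, value))
    return [text for text in texts if text]
```

A call could look like first_text(card, By.CSS_SELECTOR, ".price", default="n/a"), where By comes from selenium.webdriver.common.by and ".price" is a placeholder selector; the real selectors have to be read from the page's HTML and tend to change over time.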
Each row (flight) in the routes CSV file goes through the following process:
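As a rough sketch of that per-row loop (not the authors' exact code), the function below calls a scraping callback for every route and keeps going when a single route fails. The names process_routes and scrape_one, and the optional pause between routes, are assumptions for illustration.

```python
import time


def process_routes(routes, scrape_one, pause_seconds=0.0):
    """Call scrape_one(route) for every route; collect results and failures."""
    results, failures = [], []
    for route in routes:
        try:
            results.append((route, scrape_one(route)))
        except Exception as exc:  # one broken route should not stop the run
            failures.append((route, exc))
        if pause_seconds:
            time.sleep(pause_seconds)  # wait before moving on to the next route
    return results, failures
```

Here scrape_one stands for whatever function opens the search page for one route and returns its flights; reporting the failures list at the end makes it easy to re-run only the routes that did not work.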
It’s time to collect the data from the web and insert it into the pandas DataFrame.
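One way to organize this step is to build one dictionary per flight and create the DataFrame once at the end. The sketch below assumes snake_case column names derived from the output features listed earlier; the names COLUMNS, flight_row, and rows_to_frame and the timestamp format are our own choices.

```python
from datetime import datetime

# Assumed column names, following the output features listed earlier.
COLUMNS = [
    "departure_time", "arrival_time", "airline", "duration", "stops",
    "layovers", "price", "airplane_type", "departure_coach", "arrival_coach",
    "departure_airport", "arrival_airport", "scraped_at",
]


def flight_row(fields, scraped_at):
    """Build one row: every column present, values not scraped set to None."""
    row = {column: fields.get(column) for column in COLUMNS}
    row["scraped_at"] = scraped_at.strftime("%Y-%m-%d %H:%M:%S")
    return row


def rows_to_frame(rows):
    """Turn the collected rows into a DataFrame with a fixed column order."""
    import pandas as pd  # only needed for this final step
    return pd.DataFrame(rows, columns=COLUMNS)
```

In use, each scraped flight card becomes a call like rows.append(flight_row(fields, datetime.now())), and rows_to_frame(rows) is called once all flights of a search have been collected.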
Finally, export the data to a CSV file directly in the desired folder.
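A minimal sketch of the export step, assuming the folder is named after the route and the file after the scrape time. The exact naming patterns ("Origin - Destination" and YYYY-MM-DD_HH-MM-SS) and the function name export_flights are assumptions; frame is expected to be a pandas DataFrame.

```python
from datetime import datetime
from pathlib import Path


def export_flights(frame, base_dir, origin, destination, scraped_at=None):
    """Write frame to <base_dir>/<origin> - <destination>/<timestamp>.csv."""
    scraped_at = scraped_at or datetime.now()

    # One folder per route, created on first use.
    folder = Path(base_dir) / "{} - {}".format(origin, destination)
    folder.mkdir(parents=True, exist_ok=True)

    # The file name records when the scrape ran; no colons, so it is a
    # valid file name on Windows as well.
    path = folder / (scraped_at.strftime("%Y-%m-%d_%H-%M-%S") + ".csv")
    frame.to_csv(str(path), index=False)  # index=False leaves out the row index
    return path
```

For example, export_flights(df, "output", "Athens", "Abu Dhabi") would place a new timestamped file in the folder "output/Athens - Abu Dhabi" on every run.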
Conclusion
In summary, you learned how to scrape Expedia flight prices with the Selenium package. After understanding the basics, you will be able to build your own scraping tool for any other website.
Personally, as enthusiastic travelers, we love to use data science to find great deals to amazing destinations around the globe.
Don’t forget to connect with Roee and Offir on LinkedIn if you have any questions, comments or concerns.
Good luck!
References
https://en.wikipedia.org/wiki/Expedia
https://www.scrapehero.com/scrape-flight-schedules-and-prices-from-expedia/