Find the best AirPods deal on Prime Day via web scraping, step by step in 5 minutes: learn how to scrape an Amazon page
How do you find the best deal on Amazon via simple web scraping? This article uses the Amazon product search page for Apple AirPods as a guide to show you how to scrape data from the web in less than 5 minutes.
I’m writing this piece the night before the 2020 Amazon Prime Day. Like everyone else, I’m anxiously waiting and hoping for a significant discount on a lot of must-haves. How do we effectively monitor price changes and get the most up-to-date information on the products we are waiting for?
Let’s build a simple Python web-scraping script to help us do that. (The code is available in my GitHub repository.)
Step I: Open Amazon and search for an item of interest.
In this case, I need to buy a new pair of AirPods. Copy the URL from the browser.
Step II: Import packages in Jupyter Notebook or other Python IDE
First, import the Requests and BeautifulSoup libraries into the workspace. The Requests library helps us request HTML data from the web server. BeautifulSoup is a powerful library that lets us clean up the raw HTML and locate specific items within it.
I am also importing Pandas and NumPy for data manipulation.
Then copy the URL from the browser and paste it into the requests.get() method. This pulls the HTML data from the Amazon.com web server.
If you wonder what the HTML data looks like, you can print it with r.text.
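As a rough sketch of this step (the search URL and the User-Agent header below are my assumptions, not taken from the article's screenshots — Amazon often rejects requests that don't look like they come from a browser):

```python
import requests

# Hypothetical Amazon search URL for AirPods
url = "https://www.amazon.com/s?k=airpods"

# Browser-like User-Agent header; Amazon tends to block the default one
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

try:
    r = requests.get(url, headers=headers, timeout=10)
    html = r.text  # raw HTML of the search results page
except requests.RequestException:
    html = ""  # fall back gracefully if the request cannot be made
```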
Very messy data. We need the BeautifulSoup library to strip some of those tags. Let’s initiate a BeautifulSoup object in the code below.
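A minimal sketch of creating the soup object. Since I can't reproduce the live Amazon response here, I parse a tiny hand-written snippet with the same kind of tag; in the real script you would pass r.text instead:

```python
from bs4 import BeautifulSoup

# Stand-in for r.text, so the example is self-contained
sample_html = "<html><body><span class='a-offscreen'>$124.00</span></body></html>"

# "html.parser" is Python's built-in parser; lxml also works if installed
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.get_text())  # tags stripped away: $124.00
```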
Step III: Inspect the page to find all relevant data tags on the webpage
Use Ctrl + Shift + I to inspect the title of any product on the page.
The highlighter will help you find the <div class="…"> element. Copy the class name and paste it into the soup.find_all() method. This method will find all the product data on the page.
You can use the prettify() method to view the markup with cleaner indentation. Here, I’m looking at the second item on the page using index [1].
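A sketch of this step. The class names below mimic what the inspector typically shows on an Amazon results page, but they are assumptions, not copied from a live page, so treat them as placeholders for whatever your inspector highlights:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Amazon results page
sample_html = """
<div class="s-result-item"><span class="a-text-normal">AirPods Pro</span></div>
<div class="s-result-item"><span class="a-text-normal">AirPods with Charging Case</span></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every tag on the page with the given class
listings = soup.find_all("div", class_="s-result-item")
print(len(listings))           # 2
print(listings[1].prettify())  # indented view of the second listing
```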
Next, let’s scrape the discount price and other data.
Here I would like to scrape the discounted price. The highlighter shows that it belongs to the tag:
<span class="a-offscreen">$124.00</span>
All we need to do is copy the class name into the select_one() method. We can print out the text using the code below.
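For instance, taking the <span class="a-offscreen"> tag above as a stand-alone snippet, a select_one() call might look like this (select_one() takes a CSS selector, where a leading dot means "class name"):

```python
from bs4 import BeautifulSoup

# The price tag from the inspector, as a self-contained snippet
price_html = '<span class="a-offscreen">$124.00</span>'
soup = BeautifulSoup(price_html, "html.parser")

# CSS selector: a <span> whose class is "a-offscreen"
price = soup.select_one("span.a-offscreen").text
print(price)  # $124.00
```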
We do this for all the fields of interest: Product Name, Discount Price, Market Price, Rating, and Number of Reviews.
Step IV: Collect price and other data for ALL product listings on the page
Finally, we can iterate through all the product listings on the page with a simple for loop. All this does is go through each listing and grab the information we are interested in.
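A sketch of the loop, again over a hand-written stand-in page (the class names and prices are assumptions). Note the guard against missing tags, since some listings lack a discounted price:

```python
from bs4 import BeautifulSoup

sample_html = """
<div class="s-result-item">
  <span class="a-text-normal">AirPods Pro</span>
  <span class="a-offscreen">$199.00</span>
</div>
<div class="s-result-item">
  <span class="a-text-normal">AirPods with Wireless Charging Case</span>
  <span class="a-offscreen">$126.19</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

records = []
for item in soup.find_all("div", class_="s-result-item"):
    name = item.select_one("span.a-text-normal")
    price = item.select_one("span.a-offscreen")
    # select_one() returns None when the tag is absent, so guard each field
    records.append({
        "name": name.text if name else None,
        "price": price.text if price else None,
    })

print(records)
```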
Step V: Put everything together and find the best deal
In the end, I would like to create a Pandas DataFrame to clean and visualize our data, convert each column to its correct type, and handle any null values. Then we can find the best deal, i.e. the one with the most generous discount.
Discount = Market Price - Current Discount Price
Here, I do some data engineering to create a new discount column and clean up the data. Finally, I sort the listings by the discount amount.
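This step might look like the sketch below. The records are hypothetical (I chose prices consistent with the discounts reported later in this article, not scraped from a live page); the key moves are stripping the dollar sign so the prices become numeric, computing the discount column, and sorting:

```python
import pandas as pd

# Hypothetical scraped records; prices arrive as strings from the page
records = [
    {"name": "AirPods Pro",
     "market_price": "$249.00", "discount_price": "$199.00"},
    {"name": "AirPods with Wireless Charging Case",
     "market_price": "$199.00", "discount_price": "$146.20"},
]
df = pd.DataFrame(records)

# Strip the dollar sign and cast to float so we can do arithmetic
for col in ["market_price", "discount_price"]:
    df[col] = df[col].str.replace("$", "", regex=False).astype(float)

# Discount = Market Price - Current Discount Price
df["discount"] = df["market_price"] - df["discount_price"]

# The best deal sits at the top after sorting by discount
df = df.sort_values("discount", ascending=False).reset_index(drop=True)
print(df)
```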
Here is the final result:
So which one is the best deal based on the discount?
We can see that the AirPods with a wireless charging case currently have the highest discount, at $52.80. The second-best deal is the AirPods Pro, with a discount of $50.
Step VI: Conclusion
In this article, we looked at using the BeautifulSoup and Requests libraries to scrape Amazon.com for AirPods.
- We opened the URL of interest
- We imported the packages in a Jupyter notebook
- Then, we inspected the page to find all relevant data tags on the webpage
- After that, we collected the price and other data for ALL product listings on the page
- Finally, we did some data engineering and found the best deal based on the discount. The top 2 deals are the AirPods with a wireless charging case and the AirPods Pro.
The code and a more detailed analysis can be found in my GitHub repository: link.