Tutorial: Amazon price tracker using Python and MongoDB (Part 1)
A two-part tutorial on how to create an Amazon price tracker.
Recently there was an Amazon sale, and I wanted to buy a product that I had been watching for a long time. But just as I was ready to buy it, I noticed that the price had increased, and I wondered whether the sale was even legitimate. So I figured that by creating this price tracker app, I would not only improve my fluency in Python but also have my very own home-brewed app to track Amazon prices.
While I have been programming for a long time, I have only recently picked up Python, and it has been a delight so far. If any of you Python experts find my code to be not very “Pythonic”, my apologies; I will keep learning :).
This tutorial assumes that you have at least a basic knowledge of Python, and that you have Python and MongoDB installed on your system.
Note that this tutorial is meant to demonstrate how to create an Amazon price tracker, not to teach programming.
So, without further ado let us begin part 1 of this tutorial.
This tutorial is outdated as of 2022. I will be creating a newer version very soon.
Step 1: Creating files and folder for the project
- Open whichever directory you like and create a folder; name it amazon_price_tracker or anything else you want.
- Now open the folder and create two files: scraper.py and db.py.
That’s all for the first step. Now open a terminal in the project’s directory and head to the next step.
Step 2 (Optional): Creating a virtual environment with virtualenv
This is an optional step to isolate the packages being installed. You can find more about virtualenv here.
Run this to create an environment.
$ virtualenv ENV
And run this to activate the environment.
$ source ENV/bin/activate
If you want to deactivate the environment then simply run the following.
$ deactivate
Now, activate the environment if you haven’t already and head to step 3.
Step 3: Installing the required packages.
- Run this command to install requests (a library for making HTTP requests):
$ pip install requests
- Run this command to install BeautifulSoup4 (a library for scraping information from web pages):
$ pip install bs4
- Run this command to install html5lib (a modern HTML5 parser):
$ pip install html5lib
- Run this command to install pymongo (a driver to access MongoDB):
$ pip install pymongo
Step 4: Starting to code the extract_url(URL) function
Now, open scraper.py and import a few of the packages that we previously installed.
import requests
from bs4 import BeautifulSoup
Now, let us create a function extract_url(URL) to shorten the URL and to verify whether it is a valid www.amazon.in URL or not.
This function takes a long Amazon India product URL such as:
and converts it to the shorter URL https://www.amazon.in/dp/B07HGJJ58K, which is more manageable. If the URL is not a valid www.amazon.in URL, it returns None instead.
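The original code for this function was embedded as a gist, so here is a minimal sketch of how extract_url might look. The regex, and the assumption that Amazon product IDs (ASINs) are ten alphanumeric characters following /dp/ or /gp/product/ in the path, are mine, not the author's exact implementation.

```python
import re


def extract_url(url):
    """Shorten an amazon.in product URL, or return None if it is invalid."""
    if url is None or "www.amazon.in" not in url:
        return None
    # ASINs are assumed here to be 10 alphanumeric characters after
    # /dp/ or /gp/product/ in the URL path.
    match = re.search(r"/(?:dp|gp/product)/(\w{10})", url)
    if match is None:
        return None
    return "https://www.amazon.in/dp/" + match.group(1)
```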
Step 5: What we need for the next function
For the next function, Google “my user agent”, copy your user agent, and assign it to a variable named headers.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
Before we create a function to scrape details from the page, let us visit an Amazon product page like this one and find the elements that have the name and the price of the product. We will need each element’s id to be able to find it when we extract the data.
Once the page is rendered, right-click on the name of the product and click “Inspect”, which will show the element that has the name of the product.
We can see that the <span> element with id=“productTitle” holds the name. Hold on to this id; we will use it later to scrape the name of the product.
We will do the same for the price, now right-click on the price and click on inspect.
The <span> element with id=“priceblock_dealprice” has the price that we need. But this product is on sale, so its id is different from the normal id, which is id=“priceblock_ourprice”.
Step 6: Creating the price converter function
If you look closely, the <span> element has the price, but it also contains unwanted pieces such as the ₹ rupee symbol, blank spaces, comma separators and a decimal point. We only want the integer portion of the price, so we will create a price converter function that removes the unwanted characters and gives us the price as an integer.
Let us name this function get_converted_price(price). With some simple string manipulation, this function will give us the converted price as an integer.
UPDATE: As mentioned by @oorjahalt, we can simply use a regex to extract the price.
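The original function was embedded as a gist; the sketch below reconstructs it from the description above (strip the ₹ symbol and spaces, remove the commas, cut at the decimal point), along with the regex alternative mentioned in the update. The variable names such as stripped_price and find_dot are my guesses at the original code, not a verbatim copy.

```python
import re


def get_converted_price(price):
    """Strip the ₹ symbol, spaces, commas and paise; return the price as an int."""
    stripped_price = price.strip("₹ ,")               # remove currency symbol and spaces
    replaced_price = stripped_price.replace(",", "")  # drop thousands separators
    find_dot = replaced_price.find(".")               # locate the decimal point, if any
    if find_dot != -1:
        replaced_price = replaced_price[:find_dot]    # keep only the integer part
    return int(replaced_price)


def get_converted_price_regex(price):
    """Regex alternative: grab the first run of digits and commas."""
    return int(re.search(r"[\d,]+", price).group().replace(",", ""))
```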
NOTE: While this tracker is meant for www.amazon.in, it may very well be used for www.amazon.com or other similar websites with some very minor changes. To make it compatible with the global version of Amazon, simply do the following:
- Change the ₹ symbol to $:
stripped_price = price.strip("$ ,")
- We can skip find_dot and to_convert_price entirely and just do this:
converted_price = float(replace_price)
In this case, however, we will be converting the price to a float type.
- And change www.amazon.in to www.amazon.com in the extract_url(URL) function.
This would make the tracker compatible with www.amazon.com.
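Putting those changes together, a hypothetical US variant of the converter might look like the sketch below. The function name get_converted_price_us is mine, purely for illustration.

```python
def get_converted_price_us(price):
    """US variant: strip the dollar sign, spaces and commas; keep cents as a float."""
    stripped_price = price.strip("$ ,")               # $ instead of ₹
    replaced_price = stripped_price.replace(",", "")  # drop thousands separators
    # No integer truncation: dollars and cents fit naturally in a float.
    return float(replaced_price)
```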
Now, having buckled up, we can finally proceed to creating the scraper function.
Step 7: Onto the details scraper function
OK, so let us create the function that will extract the details of the product, such as its name and price, and return a dictionary containing the name, the price and the URL of the product. We will name this function get_product_details(URL).
The first two variables for this function are headers and details: headers will contain your user agent, and details is a dictionary that will hold the details of the product.
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}
details = {"name": "", "price": 0, "deal": True, "url": ""}
Another variable, _url, will hold the extracted URL for us, and we will check whether the URL is valid. An invalid URL makes extract_url(URL) return None; in that case we will set details to None and return it at the end, so that we know something is wrong with the URL.
_url = extract_url(url)
if _url is None:
details = None
Now, we come to the else part. This has four variables: page, soup, title and price.
The page variable will hold the requested product page.
The soup variable will hold the parsed HTML; with this we can do lots of things, such as finding an element by its id and extracting its text, which is exactly what we will do. You can find more about BeautifulSoup’s other functions here.
The title variable, as the name suggests, will hold the element that has the title of the product.
The price variable will hold the element that has the price of the product.
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html5lib")
title = soup.find(id="productTitle")
price = soup.find(id="priceblock_dealprice")
Now that we have the elements for the title and the price, we will do some checks.
Let us begin with the price. As mentioned earlier, the id of the price element can be either id=”priceblock_dealprice” during deals or id=”priceblock_ourprice” on normal days.
if price is None:
price = soup.find(id="priceblock_ourprice")
details["deal"] = False
Since we first check whether there is a deal price, this code will change price from the deal price to the normal price and also set details[“deal”] to False if there is no deal price. This is done so that we know the price is a normal one.
Now, if even then we don’t get the price, something is wrong with the page: maybe the product is out of stock, maybe it has not been released yet, or some other possibility. The following code checks whether the title and the price exist.
if title is not None and price is not None:
details["name"] = title.get_text().strip()
details["price"] = get_converted_price(price.get_text())
details["url"] = _url
If the product has both a price and a title, we will store them.
details["name"] = title.get_text().strip()
This stores the name of the product, but we have to strip any unwanted leading and trailing spaces from the title. The strip() function removes any leading and trailing whitespace.
details["price"] = get_converted_price(price.get_text())
This stores the price of the product, using the get_converted_price(price) function that we created earlier to get the price as an integer.
details["url"] = _url
This stores the extracted URL.
else:
details = None
return details
We will set details to None if the price or the title doesn’t exist.
Finally, the function is complete, and here is the complete code.
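The original complete code was embedded as a gist, so below is a consolidated sketch assembling the pieces from the steps above. The bodies of extract_url and get_converted_price are my own reconstructions based on the descriptions in this article (the ASIN regex and the string stripping are assumptions), not the author's exact code.

```python
import re

import requests
from bs4 import BeautifulSoup

# NOTE: this user agent is the example used throughout the tutorial;
# replace it with your own (Google "my user agent").
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"
}


def extract_url(url):
    """Shorten an amazon.in product URL, or return None if it is invalid."""
    if url is None or "www.amazon.in" not in url:
        return None
    match = re.search(r"/(?:dp|gp/product)/(\w{10})", url)
    if match is None:
        return None
    return "https://www.amazon.in/dp/" + match.group(1)


def get_converted_price(price):
    """Strip the ₹ symbol, spaces, commas and paise; return the price as an int."""
    stripped_price = price.strip("₹ ,")
    replaced_price = stripped_price.replace(",", "")
    find_dot = replaced_price.find(".")
    if find_dot != -1:
        replaced_price = replaced_price[:find_dot]
    return int(replaced_price)


def get_product_details(url):
    """Scrape the name, price and deal status of an amazon.in product."""
    details = {"name": "", "price": 0, "deal": True, "url": ""}
    _url = extract_url(url)
    if _url is None:
        details = None
    else:
        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html5lib")
        title = soup.find(id="productTitle")
        price = soup.find(id="priceblock_dealprice")
        if price is None:
            # No deal price on the page, so fall back to the regular price.
            price = soup.find(id="priceblock_ourprice")
            details["deal"] = False
        if title is not None and price is not None:
            details["name"] = title.get_text().strip()
            details["price"] = get_converted_price(price.get_text())
            details["url"] = _url
        else:
            # Title or price missing: out of stock, unreleased, or a bad page.
            details = None
    return details
```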
Note: This code does not work for books, since book pages use a different productid, but you can tweak the code to make it work for books as well.
Step 8: Let us run scraper.py
At the end of the file add the following.
print(get_product_details("Insert an Amazon URL"))
Open a terminal where you have your scraper.py file and run it like so:
$ python3 scraper.py
If you have done everything correctly you should get an output like this.
{'name': 'Nokia 8.1 (Iron, 4GB RAM, 64GB Storage)', 'price': 19999, 'deal': False, 'url': 'https://www.amazon.in/dp/B077Q42J32'}
And voilà, we have completed Part 1 of the Amazon price tracker tutorial.
I will see you next week with the follow-up part 2 where we will explore MongoDB using PyMongo to store our data.
This is my first article on Medium, and my first attempt at blogging in general, and I am excited to share my thoughts, experiments and stories here in the future. I hope you’ll like it.
Find the complete source code for this article below.
Follow the link below for part 2.