Web Scraping E-commerce sites using Selenium & Python

A detailed guide to scraping amazon.in using Python

Chayan Bhattacharya
Analytics Vidhya
11 min read · Aug 18, 2020


Photo by Mark König on Unsplash

Introduction

This article is a guide to end-to-end scraping of data from the amazon.in website, starting with product links from the listing pages and moving on to the data on each individual product page. I have explained every step as simply as possible, with the intention that anyone can reuse the logic here to scrape other e-commerce sites, as these are much trickier to scrape than traditional HTML sites. Please read a website's Terms & Conditions carefully to check whether you can legally use its data. This article is for educational purposes only.

What is Web scraping?

Web scraping is the process of extracting data from websites. Unlike the traditional approach of copying and pasting, web scraping can be automated with a programming language like Python: you define some parameters and retrieve the data in far less time.

Features of Selenium

Selenium makes it very easy to interact with dynamic websites thanks to its bucket of useful features. It helps in identifying and extracting elements from websites with functions such as:

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

Just by adding an extra “s” to “element” (e.g. find_elements_by_class_name), we can extract a list of matching elements instead of a single one. There are other features too, but we will mostly be using these.
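As a quick illustration, here is a minimal sketch of the singular vs. plural forms, assuming a driver object named wbD (we create one later in the article):

from selenium import webdriver as wb

wbD = wb.Chrome('chromedriver.exe')

# Singular: returns the first match, raises NoSuchElementException if nothing matches
element = wbD.find_element_by_id('productTitle')

# Plural: returns a list of all matches (an empty list if nothing matches)
elements = wbD.find_elements_by_class_name('a-size-mini')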

Prerequisites

  • Knowledge of Python
  • Basic knowledge of HTML (helpful, though not strictly necessary)

Installation

  1. Anaconda: Download and install it from this link: https://www.anaconda.com/. We will be using Jupyter Notebook for writing the code.
  2. Chromedriver — Webdriver for Chrome: Download it from this link: https://chromedriver.chromium.org/downloads. No installation is needed; just copy the file into the folder where we will create the Python file. But before downloading, confirm that the driver's version matches that of the Chrome browser installed.
Chrome browser version
ChromeDriver download page

3. Selenium: Install Selenium by opening the Anaconda prompt, typing the code below, and pressing Enter

pip install selenium

Alternatively, you can open the MS command prompt, type the code below, and press Enter

python -m pip install selenium
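To verify that the installation worked, you can print the installed version from a Python prompt (the exact version string on your machine will differ):

import selenium

print(selenium.__version__)  # e.g. 3.141.0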

Downloads

Here is the link to my GitHub repository: https://github.com/chayb/web-scraping. I would recommend downloading the files and following along with the article.

Importing the libraries

We start by importing the following libraries

import selenium
from selenium import webdriver as wb
import pandas as pd
import time

Explanation:

  1. Selenium is used for browser automation and helps in locating web elements in the website code
  2. Pandas is a data analysis and manipulation tool which will be used for saving the extracted data in a DataFrame
  3. The time library serves several purposes, but here we will use it to delay code execution

Starting up the Browser

First, we will be scraping product links of Smart TVs from the listing pages of Amazon.in. Then we will scrape the product data from each of the product pages.

Our starting URL is https://www.amazon.in/s?bbn=1389396031&rh=n%3A976419031%2Cn%3A%21976420031%2Cn%3A1389375031%2Cn%3A1389396031%2Cn%3A15747864031&dc&fst=as%3Aoff&qid=1596287247&rnid=1389396031&ref=lp_1389396031_nr_n_1

So let’s open our Chrome browser and the designated starting URL by running this code

wbD = wb.Chrome('chromedriver.exe')
wbD.get('https://www.amazon.in/s?bbn=1389396031&rh=n%3A976419031%2Cn%3A%21976420031%2Cn%3A1389375031%2Cn%3A1389396031%2Cn%3A15747864031&dc&fst=as%3Aoff&qid=1596287247&rnid=1389396031&ref=lp_1389396031_nr_n_1')

The browser should open to a webpage similar to this. (The webpage may vary, as Amazon changes its product listings over time.)

amazon.in product listing

Exploring and locating Elements

Now it's time to explore the elements to find the links to the products.

Product Links Exploration

Generally, the names of the products are clickable and point to their respective product pages. So, right-click anywhere on the page and click Inspect to open the developer tools.

Inspecting the elements

Now we need to right-click the product name to highlight the element in the developer tools window. We are looking for a URL, which will be in the form href="url link". These links point to their respective product pages. As we hover the mouse over the code, we see the title get highlighted. We have also found the URL for the product page (see the image below).

Extract elements using Class_name or Id

Now, we cannot directly extract this link. We need to find a “class” or “id” attribute that acts as a container for similar product links. We can see that the href is inside an <a> tag just below an <h2> tag with the class name “a-size-mini a-spacing-none a-color-base s-line-clamp-2”. The indentation of the <a> tag shows that it is a child of the <h2> tag element.


We can use this class name to try to extract the links. The class name contains a lot of whitespace, so we can use just the part before the first whitespace, i.e. “a-size-mini”. You can also use the complete name by replacing each whitespace with a period (.), in which case you would use find_element_by_css_selector instead of find_element_by_class_name.

class attribute for product name
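For the latter option, the call would look like this (a sketch; below, we stick with the shorter “a-size-mini” form):

# Full compound class name as a CSS selector, whitespace replaced with periods
products = wbD.find_elements_by_css_selector(
    '.a-size-mini.a-spacing-none.a-color-base.s-line-clamp-2')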

Let us check whether we can extract the links by class name. (Tip: as soon as you type wbD.find, you can press the Tab key to get a list of the available functions.)

selenium in action

Selenium has a function called “find_elements_by_class_name”. Using it, we will extract all the matching elements in the source code and store them inside the “productInfoList” variable. We will then check whether the number of elements extracted equals the number of product listings on the page.

productInfoList = wbD.find_elements_by_class_name('a-size-mini')
len(productInfoList)
Output: 30

len() is used to check the number of elements extracted by class name and stored in “productInfoList”. The output is 30, although the page shows 24 product listings.

Note: This output can vary from time to time, as Amazon modifies its listings. Also, if you are using an antivirus like Kaspersky, it may block the sponsored ads from view by default, so the output number may vary. There is another antivirus-related issue, which I will discuss later.

This means that some of the extra elements extracted either do not contain any links or contain sponsored links. We need to verify that. We saw that the href is within an <a> tag (see the image above). Since “productInfoList” is a list containing many elements, we can access each one by index. Here we take index 0. To extract the data inside, including the href property, we can first use

pp2=productInfoList[0].find_element_by_tag_name('a')

After executing the command, we got an error:

StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
(Session info: chrome=84.0.4147.105)

which indicates that there is no <a> tag inside and hence no link. We tried indexes 0 to 4 and observed that index 2 has an <a> tag inside while the rest don't.

productInfoList[2].find_elements_by_tag_name('a')

Output:
[<selenium.webdriver.remote.webelement.WebElement (session="64b5966ffc556de741895f1a68da5ff3", element="ff23ee89-d025-422d-830b-e297a15bbe23")>]

But whether it holds a product link can be confirmed by printing the text inside, which should be the name of the TV. If it returns a blank, the element is redundant.

productInfoList[2].text

Output: ''

The output is empty, which confirms that the element is not useful.

Note: Remember the issue I mentioned earlier: it has been observed that if you are not using an antivirus or adblocker, some of the above elements contain the text “Sponsored” instead of an empty string. So later on, while capturing the links, we need to ignore such elements as well.

Index 5 gave a link which is not in the product list, which indicates it is a sponsored link. We will scrape these links too; they can be removed from the Excel sheet at the end if needed. Next, let us try index 6.

pp2=productInfoList[6].find_element_by_tag_name('a')
pp2.get_property('href')
Output: 'https://www.amazon.in/Mi-inches-Ready-Android-Black/dp/B084872DQY/ref=sr_1_3?dchild=1&fst=as%3Aoff&qid=1596287247&rnid=1389396031&s=electronics&sr=1-3'

No error. This gives the link for the first product on the page, i.e. the Mi TV 4A. Hence, index numbers 6 to 29 (indexing starts from 0) are our product links, which accounts for the 24 products listed on the page. Now let us move on to the next part: looping through the pages to extract all the product links.

Pagination

We need to locate the next button on the first page and inspect the element by right-clicking as we did previously.

We can see the next button has an href link and sits inside a class called ‘a-last’. Hence, as before, let us extract the element by class name. But we do not need the href; instead, we will click the button with the help of Selenium.

wbD.find_element_by_class_name('a-last').click()

After executing the code, the website moves to the next page, so we are on the right track. We generally write such code inside a while loop so that when the last page is reached, clicking the next button throws an error and the loop terminates (using try: & except:). But here the case is different, as I will explain next. So let us now write the complete code to extract the links and move to the next page

listOflinks = []
condition = True
while condition:
    time.sleep(3)
    productInfoList = wbD.find_elements_by_class_name('a-size-mini')
    for el in productInfoList:
        if el.text != "" and el.text != "Sponsored":
            pp2 = el.find_element_by_tag_name('a')
            listOflinks.append(pp2.get_property('href'))
    try:
        wbD.find_element_by_class_name('a-last').find_element_by_tag_name('a').get_property('href')
        wbD.find_element_by_class_name('a-last').click()
    except:
        condition = False

Explanation:

  1. time.sleep(3) delays execution of the next line of code by 3 seconds. This is done because the webpage sometimes takes a while to load, and if the code starts searching for elements before the page has loaded, it will throw an error. You can adjust the time as required (a more robust alternative using explicit waits is sketched after the next paragraph). Also, if you make too many requests in a short time, the website might block your IP address.
  2. The for loop stores each href link in the list “listOflinks”
  3. try: & except:: The code inside try: keeps executing until it throws an error. “wbD.find_element_by_class_name(‘a-last’).click()” clicks the next button. Once all the pages have been clicked through, clicking the next button again throws an error, which sets the condition to False and breaks the loop

Here we have encountered a special case.

If we click the next button right after all the pages have been scraped, it would normally throw an error, but here it does not, so an infinite loop is created. We will use a workaround: we noticed that on the last page, the next button has no href link. Hence, inside try:, we first check whether there is any href link inside the next button; if no link is found, an error is thrown, the code inside except: executes, and we break out of the loop.
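A side note on the time.sleep(3) call above: a more robust alternative is Selenium's explicit waits, which pause only until the elements actually appear. Here is a minimal sketch using the same class name as before:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the product containers to appear,
# then continue immediately once they do
productInfoList = WebDriverWait(wbD, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'a-size-mini'))
)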

Below is the complete code snippet to extract all the product links from the listing pages.
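Assembled from the pieces above (the exact file is in the GitHub repository linked earlier), it looks like this:

import time
from selenium import webdriver as wb

# Open the browser at the starting URL
wbD = wb.Chrome('chromedriver.exe')
wbD.get('https://www.amazon.in/s?bbn=1389396031&rh=n%3A976419031%2Cn%3A%21976420031%2Cn%3A1389375031%2Cn%3A1389396031%2Cn%3A15747864031&dc&fst=as%3Aoff&qid=1596287247&rnid=1389396031&ref=lp_1389396031_nr_n_1')

listOflinks = []
condition = True
while condition:
    time.sleep(3)  # give the page time to load
    productInfoList = wbD.find_elements_by_class_name('a-size-mini')
    for el in productInfoList:
        # Skip empty containers and "Sponsored" placeholders
        if el.text != "" and el.text != "Sponsored":
            pp2 = el.find_element_by_tag_name('a')
            listOflinks.append(pp2.get_property('href'))
    try:
        # On the last page the next button has no href:
        # this lookup throws an error and ends the loop
        wbD.find_element_by_class_name('a-last').find_element_by_tag_name('a').get_property('href')
        wbD.find_element_by_class_name('a-last').click()
    except:
        condition = False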

Extracting individual product data

Next, let us explore an individual product page and inspect the elements we want to scrape. Here we will scrape the data using XPath. (Read more about XPath here: https://www.guru99.com/xpath-selenium.html)

We import the tqdm library to show a progress bar while scraping

from tqdm import tqdm

Let's say we want to scrape:

  • SKU (Product Name)
  • Price
  • Category
  • Brand
  • Model

So let us open a product page, say the first one. Open the developer tools with Ctrl+Shift+I. Then right-click the product name and click Inspect. We can see the code get highlighted. Right-click the code and copy the XPath (see the image below).

To extract the product name, paste it within the quotes of the find_element_by_xpath function

sku = wbD.find_element_by_xpath('//*[@id="productTitle"]').text

Similarly, we will do the same for Price & Category.

Scraping Price data
Scraping category data

Extract the category name

category = wbD.find_element_by_xpath('//*[@id="wayfinding-breadcrumbs_feature_div"]/ul/li[7]/span/a').text

In the case of price, we need to copy the XPath, which is

//*[@id="priceblock_ourprice"]

But it has been observed that in some cases the price is given by a different XPath, which is

//*[@id="priceblock_dealprice"]

and in some cases no price data is given at all, as the product is unavailable. Hence we have three cases for extracting the price data, which can be handled as follows:

try:
    try:
        price = wbD.find_element_by_xpath('//*[@id="priceblock_ourprice"]').text
    except:
        price = wbD.find_element_by_xpath('//*[@id="priceblock_dealprice"]').text
except:
    price = ""

Explanation:

First, “priceblock_ourprice” is checked; if that fails, “priceblock_dealprice” is checked; and if both fail, the variable price is left empty.

Next, we need to scroll down to the table where the product details are given.

Product Information

In order to extract the Brand & Model information, we need to explore these elements too. As we inspect, we observe that each piece of information, such as Brand, Model, etc., is stored inside <tr> tags. These <tr> tags all sit under a class attribute called “pdTab”.

Exploring the elements in class pdTab

Hence we need to extract all the <tr> elements from inside the element with class name “pdTab”.

pp = wbD.find_element_by_class_name('pdTab')
pp1 = pp.find_elements_by_tag_name('tr')

Now we use a for loop to check, inside each <tr> tag, whether the text of the class="label" element matches the field we want; if it does, we store the text inside the corresponding class="value" element.

class attribute inside <tr> tags

for el in range(len(pp1)-1):
    if (pp1[el].find_element_by_class_name("label").text) == 'Brand':
        brand = pp1[el].find_element_by_class_name("value").text
    if (pp1[el].find_element_by_class_name("label").text) == 'Model':
        model = pp1[el].find_element_by_class_name("value").text

Next, we store the scraped fields inside a Python dictionary temp and append it to a list alldetails.

All of this code is placed inside a for loop that iterates over the product links we scraped earlier, as sketched below.
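Here is a minimal sketch of that loop, assembled from the fragments above, with per-row guards added since product pages vary in layout (the dictionary keys are my own naming; the exact version is in the GitHub repository):

alldetails = []
for link in tqdm(listOflinks):
    wbD.get(link)
    time.sleep(3)  # let the product page load

    # Guard each lookup, since some fields are missing on some pages
    try:
        sku = wbD.find_element_by_xpath('//*[@id="productTitle"]').text
    except:
        sku = ""
    try:
        category = wbD.find_element_by_xpath(
            '//*[@id="wayfinding-breadcrumbs_feature_div"]/ul/li[7]/span/a').text
    except:
        category = ""
    try:
        try:
            price = wbD.find_element_by_xpath('//*[@id="priceblock_ourprice"]').text
        except:
            price = wbD.find_element_by_xpath('//*[@id="priceblock_dealprice"]').text
    except:
        price = ""

    brand, model = "", ""
    try:
        pp = wbD.find_element_by_class_name('pdTab')
        pp1 = pp.find_elements_by_tag_name('tr')
        for row in pp1:
            try:
                label = row.find_element_by_class_name("label").text
                value = row.find_element_by_class_name("value").text
            except:
                continue  # row without a label/value pair
            if label == 'Brand':
                brand = value
            if label == 'Model':
                model = value
    except:
        pass  # no product details table on this page

    temp = {'SKU': sku, 'Price': price, 'Category': category,
            'Brand': brand, 'Model': model}
    alldetails.append(temp)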

After running the code, we can print the list as a DataFrame.

pd.DataFrame(alldetails)

Output:

DataFrame of scraped data

We can export the DataFrame as a .csv file.

data = pd.DataFrame(alldetails)
data.to_csv('Amazon_tv.csv')

The complete code is in the GitHub repository linked at the beginning of this article.

Conclusion

I hope I have succeeded in my aim of teaching you to scrape amazon.in, and that you can use this knowledge to scrape other e-commerce sites.

Thank you for reading. Happy coding :)
