Web Scraping implementation on Amazon India website
This project is focused on searching for and extracting data about any item from Amazon India’s e-commerce website.
Website — https://www.amazon.in/
This project uses Selenium and BeautifulSoup to do the web scraping. I will display only one type of product and extract details for just 5 attributes: Name, Price, Rating, Number of reviews and URL of a product. The code can be modified according to the use case and the website being scraped. Once the data has been extracted, a .csv file is generated which the user can use for shortlisting a preferred model or for analytics on the data.
This project also includes some analysis on the dataset we have extracted, which covers MSI laptops. The analysis is done using Pandas, Matplotlib and Seaborn.
Let’s get started with the project —
We will be using Jupyter Notebook for this project.
First we need to install the packages required for the project:
- pip install selenium
- pip install bs4
Other basic packages are pre-installed for Jupyter Notebook.
Now, depending on which browser you use, you need to install its driver (a .exe file). Refer to this link. This driver enables Selenium to control the browser and access the website’s data.
I’ll be using Microsoft Edge for extraction, so I’ll install msedgedriver.exe from the link.
Now we’ll code the scraping function. Follow these steps:
Import Packages
Import the necessary packages
import csv
import os
from selenium import webdriver
from bs4 import BeautifulSoup
Web Driver
Now we need to specify the downloaded driver’s executable path, i.e., “location/msedgedriver.exe”, so that we can use it. This automatically launches the browser with an empty page.
driver = webdriver.Edge(executable_path="D:/Downloads/Edge Downloads/edgedriver_win64/msedgedriver.exe")
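A note in passing: if you use a different browser the setup is analogous, and newer Selenium versions (4.x) replaced the executable_path argument with a Service object. A sketch of both variants with placeholder paths, shown commented out so they don’t clash with the driver created above:
# Chrome, Selenium 3 style (placeholder path):
# driver = webdriver.Chrome(executable_path="path/to/chromedriver.exe")
# Edge, Selenium 4 style (assumes selenium>=4 is installed):
# from selenium.webdriver.edge.service import Service
# driver = webdriver.Edge(service=Service("path/to/msedgedriver.exe"))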
Enter URL of Amazon India
The .get() function takes a URL as its argument and opens the site.
url = "https://www.amazon.in/"
driver.get(url)
Generate URL for search item
We need to provide the name of the item to be searched and combine it with the URL.
We’ll use the search_term variable for the item name and create a function that inserts it into the URL.
def get_url(search_term):
    """Generate url with search term"""
    template = "https://www.amazon.in/s?k={}&ref=nb_sb_noss_2"
    search_term = search_term.replace(" ", "+")
    return template.format(search_term)
We replace spaces with “+” in search_term because URLs cannot contain spaces; multi-word input is joined with this sign instead.
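As an aside, Python’s standard library can do this escaping for us; a minimal sketch using urllib.parse.quote_plus (an alternative to the manual replace, not what this project uses):
from urllib.parse import quote_plus

# quote_plus swaps spaces for '+' and also escapes other unsafe characters
print(quote_plus("MSI laptops"))  # MSI+laptops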
We can see the URL for search_term = “MSI laptops”:
url = get_url('MSI laptops')
print(url)
Now we need to open this URL in browser.
driver.get(url)
Extract data
Now we need to extract all the HTML present in the page source. We could also view it by right-clicking on the site and selecting View page source in the menu, but copying all the code manually is not efficient, so we will use BeautifulSoup for this purpose.
soup = BeautifulSoup(driver.page_source, 'html.parser')
We are interested only in the results related to our search_term, and after analyzing the page source the tag below turns out to be suitable for extracting the relevant data.
<div data-component-type="s-search-result">
We will extract all the data which has this tag.
data_extracted = soup.find_all('div', {'data-component-type': 's-search-result'})
The above code extracts the data from the first page only; we’ll loop over pages and extract data from each one further in the code.
The length of data_extracted equals the number of products on the first page. However, this data may contain products with no price, rating or review count mentioned, which can lead to errors that we’ll handle later.
len(data_extracted)
Data prototype
We need a basic idea of the tags used for extracting any specific detail of a product, so we create a prototype for reference.
item_prototype = data_extracted[0]
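Before writing the extractor, it helps to eyeball the markup of one record; BeautifulSoup’s prettify() prints it with indentation (truncated here for readability):
# Inspect the first 1000 characters of the record's HTML
print(item_prototype.prettify()[:1000])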
Earlier, the data_extracted list held the HTML of every product separately; now we only need specific details like price and ratings, which can lead us to a conclusion about a product. Therefore we will create an extract_record() function to help with that.
def extract_record(item_prototype):
    """Extract and return data from a single record"""
    # name and url
    atag = item_prototype.h2.a
    name = atag.text.strip()
    new_url = url[:-1] + atag.get('href')
    # price
    price_parent = item_prototype.find('span', 'a-price')
    price = price_parent.find('span', 'a-offscreen').text
    # rating and review_count
    rating = item_prototype.i.text
    review_count = item_prototype.find('span', {'class': 'a-size-base'}).text
    res = (name, price, rating, review_count, new_url)
    return res
The extract_record() function above works only if every variable can actually be assigned a value, but some products have no price or reviews. So we need to add some error handling around these variables.
Error Handling
def extract_record(item_prototype):
    """Extract and return data from a single record"""
    # name and url
    atag = item_prototype.h2.a
    name = atag.text.strip()
    new_url = "http://www.amazon.in" + atag.get('href')
    try:
        # price
        price_parent = item_prototype.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    try:
        # rating and review_count
        rating = item_prototype.i.text
        review_count = item_prototype.find('span', {'class': 'a-size-base'}).text
    except AttributeError:
        rating = ''
        review_count = 0
    res = (name, price, rating, review_count, new_url)
    return res
Now we need to collect the details of every laptop into one container.
The loop below iterates over every product and appends its data to the records list, which becomes a list of tuples.
records = []
for item in data_extracted:
    record = extract_record(item)
    if record:
        records.append(record)
        print(record[0])
Navigating through pages
We need to navigate through all the pages to get the full data on any searched item.
https://www.amazon.in/s?k=laptop+bags&page=2&qid=1627206041&ref=sr_pg_1
In the link above we can see a page query in the URL; we will use that query to navigate through the pages. Each query parameter is appended to the URL with “&”.
def get_url(search_term):
    """Generate url with search term"""
    template = "https://www.amazon.in/s?k={}&ref=nb_sb_noss_2"
    search_term = search_term.replace(" ", "+")
    # Add search term to url
    url = template.format(search_term)
    # Add page query placeholder
    url += '&page={}'
    return url
After running the above function the URL will look something like:
https://www.amazon.in/s?k=laptop+bags&ref=nb_sb_noss_2&page={}
Here we can pass any page number into the “{}” placeholder.
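For instance, looping over the placeholder yields one URL per results page (a short illustration):
url = get_url('MSI laptops')
for page in range(1, 4):
    print(url.format(page))
# https://www.amazon.in/s?k=MSI+laptops&ref=nb_sb_noss_2&page=1 ... and so on for pages 2 and 3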
Combining the code together
Below is a compilation of the above functions and assignments in the necessary order. One can simply copy this code and run it on their system, provided the packages are installed.
import pandas as pd
import numpy as np
import os
from selenium import webdriver
from bs4 import BeautifulSoup


def get_url(search_term):
    """Generate url with search term"""
    template = "https://www.amazon.in/s?k={}&ref=nb_sb_noss_2"
    search_term = search_term.replace(" ", "+")
    # Add search term to url
    url = template.format(search_term)
    # Add page query to the placeholder
    url += '&page={}'
    return url


def extract_record(item):
    """Extract and return data from a single record"""
    # name and url
    atag = item.h2.a
    name = atag.text.strip()
    new_url = "http://www.amazon.in" + atag.get('href')
    try:
        # price
        price_parent = item.find('span', 'a-price')
        price = price_parent.find('span', 'a-offscreen').text
    except AttributeError:
        return
    try:
        # rating and review_count
        rating = item.i.text
        review_count = item.find('span', {'class': 'a-size-base'}).text
    except AttributeError:
        rating = ''
        review_count = 0
    res = [name, price, rating, review_count, new_url]
    return res


def driverFunction(search_term, file_name):  # file_name should have a .csv ending
    DRIVER = webdriver.Edge(executable_path="D:/Downloads/Edge Downloads/edgedriver_win64/msedgedriver.exe")
    DATA = []
    URL = get_url(search_term)
    for page in range(1, 21):
        DRIVER.get(URL.format(page))
        SOUP = BeautifulSoup(DRIVER.page_source, 'html.parser')
        RESULTS = SOUP.find_all('div', {'data-component-type': 's-search-result'})
        for item in RESULTS:
            RES = extract_record(item)
            if RES:
                DATA.append(RES)
    DRIVER.close()
    df = pd.DataFrame(DATA, columns=['Name', 'Price', 'Ratings', 'Review_Count', 'URL'])
    df.to_csv(file_name, index=False)
The driverFunction() call will export a .csv file (here, Amazon_MSI_Laptops.csv) which can be used to look for a suitable product and for future analysis.
if __name__ == "__main__":
    driverFunction("msi laptop", "Amazon_MSI_Laptops.csv")
Let’s take this project to the next step
Now that we have developed a way to scrape data, we can perform some analysis and visual representation on any data we want.
Let’s look into some MSI Laptops on Amazon India -
main("msi laptop", "Amazon_MSI_Laptops.csv")
NOTE:
— On Amazon, most of the important technical details are mentioned in the product name itself.
— Prices may vary from the time of data extraction.
ex — MSI GF75 17.3" FHD 120Hz Thin Gaming Laptop, 10th Gen Intel Core i5–10300H, Backlight Keyboard, HDMI, Wi-Fi 6, Webcam, Amazon Alexa, USB-C, GeForce GTX 1650, Windows 10 (32GB RAM|512GB PCIe SSD)
This product’s name also includes the screen refresh rate, processor name, graphics card, OS, storage and RAM details, so the name alone is sufficient for initial screening.
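Since these specs live in the name string, they can later be pulled out with a regular expression if needed; a sketch (assuming names follow patterns like “32GB RAM|512GB PCIe SSD”, which not every listing will):
import re

name = 'MSI GF75 Thin Gaming Laptop (32GB RAM|512GB PCIe SSD)'
# Capture the number before "GB RAM" and before "... SSD"; search returns None when absent
ram = re.search(r'(\d+)\s*GB\s*RAM', name)
ssd = re.search(r'(\d+)\s*GB[\w\s]*SSD', name)
print(ram.group(1) if ram else 'n/a', ssd.group(1) if ssd else 'n/a')  # 32 512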
Reading and exploring data
First, import the matplotlib and seaborn packages for data visualization:
import matplotlib.pyplot as plt
import seaborn as sns
Read —
col = ['Name', 'Price', 'Ratings', 'Review_Count', 'URL']
MSI = pd.read_csv('Amazon_MSI_Laptops.csv')
MSI.columns = col
MSI.head()
We should get some basic idea about the data that we have collected -
print("Number of laptops searched:", len(MSI))
print("Info on columns of MSI:\n")
print(MSI.info())
Laptops from other companies have also been scraped because of sponsored results and advertisements, so we need to remove those along with all other unwanted data in our dataset.
Cleaning the dataset
Before going any further with the dataset, we’ll remove the laptops that are not from MSI.
fromMSI = [True if 'msi' in MSI.loc[row,'Name'].lower() else False for row in MSI.index]
MSI = MSI.loc[fromMSI]
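The same filter can also be written with pandas string methods; an equivalent one-liner:
# Keep only rows whose Name contains 'msi' (case-insensitive)
MSI = MSI[MSI['Name'].str.lower().str.contains('msi')]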
Some laptops appear more than once in the dataset. We need to remove these duplicates to improve accuracy.
MSI.drop_duplicates('Name', inplace=True)
We have seen that Price, Ratings and Review_Count are in string format; we’ll convert them later.
First, we need to check for null values.
print("Number of Null values in each column:\n")
print(MSI.isnull().sum())
From the above observation we can see that 24 laptops have no rating. Since a missing rating may concern some people, we’ll set the rating of such laptops to 0 and also change the column’s datatype to float.
MSI['Ratings'] = MSI['Ratings'].str[:3].astype(float)
MSI.rename({'Ratings': 'Ratings_5'}, axis=1, inplace=True)
MSI['Ratings_5'] = MSI['Ratings_5'].fillna(0)
print(MSI['Ratings_5'].head())
Remove rows with null values
MSI_clean = MSI.dropna()
MSI_clean.isnull().sum()
All the rows with null values have been removed. Now we can perform some visualizations and analysis.
Creating processor column -
Let’s see how many laptops have an Intel processor and how many have an AMD one. Since there is no column mentioning the processor names specifically, we’ll have to create it.
We are concerned only with the latest Intel processors, i.e., i3, i5, i7 and i9. AMD names its processors like Ryzen 7 or r7, so we’ll use a different extraction approach for each of the two companies.
intel = [name for name in MSI_clean['Name'] for gen in ['i3', 'i5', 'i7', 'i9'] if gen in name.lower()]
amd = [name for name in MSI_clean['Name'] for gen_name in ['r3', 'r5', 'r7', 'r9', 'ryzen'] if gen_name in name.lower()]

print('Number of Laptops with Intel processor:', len(intel))
print('Number of Laptops with AMD processor:', len(amd))
print('Number of Laptops without processor mentioned: ', len(MSI_clean) - len(intel) - len(amd))
After removing rows, the index values are still the same as before; we need to reset them, otherwise it will be difficult to access data in the dataset.
MSI_clean.index = [*range(len(MSI_clean))]
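Pandas also has a built-in for this; an equivalent, more idiomatic call:
# drop=True discards the old index instead of keeping it as a new column
MSI_clean.reset_index(drop=True, inplace=True)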
Now we need to create a column with every laptop’s processor name.
processor = []
for row in MSI_clean.index:
    if MSI_clean.loc[row, 'Name'] in intel:
        processor.append('Intel')
    elif MSI_clean.loc[row, 'Name'] in amd:
        processor.append('AMD')
    else:
        processor.append(np.nan)

MSI_clean['Processor'] = processor
Let’s check if the processor column has been added in the dataset or not.
MSI_clean.head()
Some laptops may not have mentioned the processor -
MSI_clean.isnull().sum()
We can see that some laptops had no processor name mentioned anywhere, so we remove them from our dataset as they are of no use to us.
MSI_clean = MSI_clean.dropna()
MSI_clean.isnull().sum()
Number of laptops remaining in dataset -
print("Number of laptops left in dataset:",len(MSI_clean))
Now we just need to convert Price and Review_Count to numerical format.
Price —
MSI_clean['Price'] = MSI_clean['Price'].str.replace(",", "").str[1:].astype(float)
MSI_clean.rename({'Price':'Price_Rs'}, axis=1, inplace=True)
print(MSI_clean['Price_Rs'].head())
Reviews —
MSI_clean['Review_Count'] = MSI_clean['Review_Count'].str.replace("More Buying Choices", '0')
MSI_clean['Review_Count'] = MSI_clean['Review_Count'].astype(float)
MSI_clean['Review_Count'].head()
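If Review_Count ever contains other non-numeric strings, or comma-separated counts like “1,234”, a more defensive alternative to the two lines above is pd.to_numeric; a sketch (meant to replace, not follow, the conversion above):
# errors='coerce' turns anything unparseable into NaN, which we then treat as 0 reviews
MSI_clean['Review_Count'] = pd.to_numeric(
    MSI_clean['Review_Count'].str.replace(",", ""), errors='coerce'
).fillna(0)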
Visualization -
We’ll plot a barplot to see how many laptops have an Intel processor and how many have an AMD one.
val = MSI_clean['Processor'].value_counts()
# Use val.index for the labels so bars and labels can't get mismatched
sns.barplot(x=val.index, y=val.values)
From the graph above we can assume that either MSI doesn’t produce many laptops with AMD processors or most of the AMD-based models are out of stock.
Now, let’s see the distribution of laptops on the basis of rating and price.
freq = []
rates = ['<1', '1-2', '2-3', '3-4', '4<']
for rate in range(1, 6):
    count = 0
    for rating in MSI_clean['Ratings_5']:
        if rate - 1 <= rating < rate:
            count += 1
    freq.append(count)

explode = (0.1, 0.1, 0.1, 0.1, 0.1)
fig1, ax1 = plt.subplots(figsize=(12, 10))
ax1.pie(freq, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90, radius=0.9)
ax1.legend(rates)
ax1.axis('equal')
plt.title("Distribution of percentage of laptops based on ratings")
plt.show()
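The same bucket counts can be computed in one line with NumPy; a nearly equivalent sketch (note that np.histogram’s last bin is closed, so it would also count a perfect 5.0 rating, which the strict < comparison in the loop above misses):
# freq[i] counts ratings in [i, i+1); the final bin [4, 5] includes 5.0
freq, _ = np.histogram(MSI_clean['Ratings_5'], bins=[0, 1, 2, 3, 4, 5])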
We can observe that 50% of the laptops have ratings in the 0–1 star range (largely the unrated laptops we set to 0), 43.8% in the 3–5 star range and 6.26% in the 1–2 star range. So we can say that current customers are satisfied with most of the rated products available.
priceFreq = []
priceRange = ['<50k', '50k-70k', '70k-100k', '100k-150k', '150k<']

price_50k_below = MSI_clean.loc[MSI_clean.loc[:, 'Price_Rs'] < 50000]
priceFreq.append(len(price_50k_below))

price_50k_to_70k = MSI_clean.loc[(50000 <= MSI_clean.loc[:, 'Price_Rs']) & (MSI_clean.loc[:, 'Price_Rs'] < 70000)]
priceFreq.append(len(price_50k_to_70k))

price_70k_to_100k = MSI_clean.loc[(70000 <= MSI_clean.loc[:, 'Price_Rs']) & (MSI_clean.loc[:, 'Price_Rs'] < 100000)]
priceFreq.append(len(price_70k_to_100k))

price_100k_to_150k = MSI_clean.loc[(100000 <= MSI_clean.loc[:, 'Price_Rs']) & (MSI_clean.loc[:, 'Price_Rs'] < 150000)]
priceFreq.append(len(price_100k_to_150k))

price_150k_above = MSI_clean.loc[150000 <= MSI_clean.loc[:, 'Price_Rs']]
priceFreq.append(len(price_150k_above))

fig1, ax1 = plt.subplots(figsize=(12, 10))
ax1.pie(priceFreq, explode=explode, autopct='%1.1f%%', shadow=True, startangle=90, radius = 0.9)
ax1.legend(priceRange)
ax1.axis('equal')
plt.title("Distribution of percentage of laptops based on price range")
plt.show()
We can see that most of the laptops (63.7%) are in the mid-to-high price range, i.e., above Rs. 70k, and none of the laptops are priced below Rs. 50k.
Let’s create a function that returns the list of laptops in whatever price range the user inputs -
def getLaptops(start=0, end=0):
    l = min(start, end)
    r = max(start, end)
    Laptops = MSI_clean.loc[(MSI_clean.loc[:, 'Price_Rs'] >= l) & (MSI_clean.loc[:, 'Price_Rs'] <= r)]
    return Laptops
Print the returned list -
Laptops_in_given_range = pd.DataFrame(getLaptops(20000, 60000))
Laptops_in_given_range.index = [*range(len(Laptops_in_given_range))]
Laptops_in_given_range

# We could also print the list returned from the function directly, but it won't look as elegant:
# print(getLaptops(20000, 60000))
Now let’s look at the minimum and maximum of every attribute of interest.
Price —
Most expensive:
mostExpensive = MSI_clean.loc[MSI_clean.loc[:, 'Price_Rs'] == max(MSI_clean.loc[:, 'Price_Rs'])]
if len(mostExpensive) > 1:
    for r in mostExpensive.index:
        print(r, mostExpensive.loc[r, "URL"])
else:
    print(*mostExpensive["URL"].values)
mostExpensive
Cheapest:
cheapest = MSI_clean.loc[MSI_clean.loc[:, 'Price_Rs'] == min(MSI_clean.loc[:, 'Price_Rs'])]
if len(cheapest) > 1:
    for r in cheapest.index:
        print(r, cheapest.loc[r, "URL"])
else:
    print(*cheapest["URL"].values)
cheapest
Rating —
Highest rated:
highestRated = MSI_clean.loc[MSI_clean.loc[:, 'Ratings_5'] == max(MSI_clean.loc[:, 'Ratings_5'])]
if len(highestRated) > 1:
    for r in highestRated.index:
        print(r, highestRated.loc[r, "URL"])
else:
    print(*highestRated["URL"].values)
highestRated
Least rated:
leastRated = MSI_clean.loc[MSI_clean.loc[:, 'Ratings_5'] == min(MSI_clean.loc[:, 'Ratings_5'])]
if len(leastRated) > 1:
    for r in leastRated.index:
        print(r, leastRated.loc[r, "URL"])
else:
    print(*leastRated["URL"].values)
leastRated
Reviews —
Most Reviewed:
mostReviewed = MSI_clean.loc[MSI_clean.loc[:, 'Review_Count'] == max(MSI_clean.loc[:, 'Review_Count'])]
if len(mostReviewed) > 1:
    for r in mostReviewed.index:
        print(r, mostReviewed.loc[r, "URL"])
else:
    print(*mostReviewed["URL"].values)
mostReviewed
Least Reviewed:
leastReviewed = MSI_clean.loc[MSI_clean.loc[:, 'Review_Count'] == min(MSI_clean.loc[:, 'Review_Count'])]
if len(leastReviewed) > 1:
    for r in leastReviewed.index:
        print(r, leastReviewed.loc[r, "URL"])
else:
    print(*leastReviewed["URL"].values)
leastReviewed
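Since these six lookups all share the same shape, they could be factored into one helper; a minimal sketch (extreme_rows is a name introduced here for illustration):
def extreme_rows(df, column, mode=max):
    """Return the rows where `column` hits its max (or min) and print their URLs."""
    target = mode(df[column])
    rows = df.loc[df[column] == target]
    for r in rows.index:
        print(r, rows.loc[r, "URL"])
    return rows

# e.g. extreme_rows(MSI_clean, 'Price_Rs', max) reproduces mostExpensive,
# and extreme_rows(MSI_clean, 'Ratings_5', min) reproduces leastRated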
Conclusion
Using this code we can extract a .csv file with the Name, Price, Rating, Review count and URL of any product from Amazon India.
We can use this .csv file to create a DataFrame and visualize it or print specific data as per our needs. For some products, additional modification of the code may be needed to get the desired data, as we did for MSI laptops. Based on the graphs and tables printed in this project, one can come to a probable decision about any product.
For this project we can conclude that most of the MSI laptops are in the medium-to-high price range and most of them use Intel processors. Almost 50%, i.e., 16 laptops, have no ratings or reviews. The cheapest and most expensive laptops cost Rs. 53,990 (rating = 3.3 stars, 7 reviews) and Rs. 2,99,999 (rating = 0, 0 reviews) respectively.
The most reviewed, and arguably the most preferable, model is the MSI Bravo 15 Ryzen 7 4800H: price = Rs. 75,990, rating = 4.2 stars, 53 reviews.