How to scrape data from an ecommerce website using Python — BeautifulSoup

Francis Kihiko
3 min read · Jan 23, 2023


In this article, I will show you step by step how you can extract the data you need from an ecommerce website using a Python library called BeautifulSoup.

Photo by Christopher Gower on Unsplash

What is web scraping?

Web scraping, or data extraction, is the process of extracting data from websites. This data can be in the form of text, images, and other information, and it can be used for a variety of purposes such as market research, price comparison, sentiment analysis, and more.

One of the critical benefits of web scraping is that it automates data collection, which can save a significant amount of time and resources. It also makes it possible to collect large amounts of data that would be difficult or impossible to obtain manually.

In this article, I will show you how to scrape data from an ecommerce website using a Python library called BeautifulSoup and export the data to an Excel file.

Requirements

BeautifulSoup — a Python library that allows developers to parse HTML and XML documents, making it easy to extract and manipulate data from websites.
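As a quick illustration of what that means in practice, here is a minimal sketch with made-up HTML (not the Jumia markup): BeautifulSoup turns the raw text into objects you can search by tag and class.

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet, only for illustration
html = '<article class="card"><h3 class="name">Phone X</h3><div class="prc">KSh 10,000</div></article>'

soup = BeautifulSoup(html, 'lxml')
print(soup.find('h3', class_='name').text)   # Phone X
print(soup.find('div', class_='prc').text)   # KSh 10,000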

Pandas — a Python library that provides a data structure called a DataFrame. It allows you to export extracted data to a variety of file formats such as CSV, Excel, and SQL databases.
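For example, a DataFrame can be built directly from plain Python lists and written out in one line (a minimal sketch with made-up values):

import pandas as pd

# Made-up sample data, only for illustration
names = ['Phone X', 'Tablet Y']
prices = ['KSh 10,000', 'KSh 25,000']

df = pd.DataFrame({'Product Name': names, 'Current Price': prices})
df.to_csv('sample.csv', index=False)   # or df.to_excel('sample.xlsx', index=False)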

HTML — basic knowledge of HTML and the structure of a web page is required.

STEP 1: Import the required libraries

import requests                  # fetches the web pages
from bs4 import BeautifulSoup    # parses the HTML
import pandas as pd              # holds and exports the extracted data

STEP 2: Select the page

In this step, we select the web page that we want to extract data from. In this project, we will scrape data from an ecommerce website called Jumia.

Phone and tablet category

STEP 3: Seek Permission

In the previous step, we identified the web page that we want to scrape. The next step is to send a request to the hosting server and confirm that it is accepted (the site's robots.txt file and terms of use tell you which pages you are allowed to crawl).

url = requests.get("https://www.jumia.co.ke/mlp-black-friday/phones-tablets/?page=1")
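Before parsing anything, it is worth confirming that the server actually returned the page. A status code of 200 means the request was accepted; a minimal check you can add to the code above:

# stop early if the server did not return the page successfully
if url.status_code != 200:
    raise SystemExit(f"Request failed with status code {url.status_code}")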

Once the server has returned the page, we need to parse the HTML using the lxml parser to make it easy to work with.

soup = BeautifulSoup(url.text, 'lxml')

STEP 4: Inspect Elements

elements

As we can see from the picture above, the product details are located under an article tag with the class "prd _fb col c-prd". Once we have found the tag and class of each field, the next thing to do is create empty lists and a for loop that fills them (looping over multiple pages is handled in the full code at the end of the article):

# Empty lists that will hold each field
img = []
mall_store = []
name = []
current_price = []
old_price = []
discount = []
rating = []

product = soup.find_all('article', class_="prd _fb col c-prd")

for item in product:
    # the image URL may sit in 'data-src' instead of 'src' if the site lazy-loads images
    imgs = item.find("img", class_="img").get("src")
    mall_stores = item.find("div", class_="bdg _mall _xs")  # official-store badge, None if absent
    names = item.find("h3", class_="name").text
    current_prices = item.find("div", class_="prc").text
    old_prices = item.find("div", class_="old").text  # assumes every product shows an old price
    discounts = item.find("div", class_="bdg _dsct _sm").string  # assumes a discount badge is present
    ratings = item.find("div", class_="rev").find("div", class_="stars _s").text
    img.append(imgs)
    mall_store.append(mall_stores)
    name.append(names)
    current_price.append(current_prices)
    old_price.append(old_prices)
    discount.append(discounts)
    rating.append(ratings)
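Note that not every product card necessarily contains all of these elements; for example, an item without a discount has no badge div, and calling .text on a missing element raises an AttributeError. A small optional helper (a sketch, with a hypothetical name get_text) keeps the loop from crashing on such items:

def get_text(item, tag, class_name):
    # Return the stripped text of the first matching tag, or None if the tag is missing
    element = item.find(tag, class_=class_name)
    return element.text.strip() if element else None

# Use it for fields that may be absent, for example:
# old_prices = get_text(item, "div", "old")
# discounts = get_text(item, "div", "bdg _dsct _sm")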

STEP 5: Create a DataFrame to hold all extracted data

df = pd.DataFrame({'Product Name': name, 'Current Price': current_price, 'Old Price': old_price,
                   'Discount': discount, 'Rating': rating})
df.head(20)  # preview the first 20 rows
The first 20 rows

STEP 6: Export to Excel

df.to_excel('jumia_data_new_000.xlsx', index=False)  # writing Excel files requires the openpyxl package

If you prefer a plain CSV file instead, use df.to_csv() as shown in the full code below.

Full code

import requests
from bs4 import BeautifulSoup
import pandas as pd

page = 1
img = []
mall_store = []
name = []
current_price = []
old_price = []
discount = []
rating = []
isHaveNextPage = True

while isHaveNextPage:
    # request the current page (note the {page} placeholder so each loop fetches a new page)
    url = requests.get(f"https://www.jumia.co.ke/mlp-black-friday/phones-tablets/?page={page}")
    soup = BeautifulSoup(url.text, 'lxml')
    product = soup.find_all('article', class_="prd _fb col c-prd")

    for item in product:
        imgs = item.find("img", class_="img").get("src")
        mall_stores = item.find("div", class_="bdg _mall _xs")  # official-store badge, None if absent
        names = item.find("h3", class_="name").text
        current_prices = item.find("div", class_="prc").text
        old_prices = item.find("div", class_="old").text
        discounts = item.find("div", class_="bdg _dsct _sm").string
        ratings = item.find("div", class_="rev").find("div", class_="stars _s").text
        img.append(imgs)
        mall_store.append(mall_stores)
        name.append(names)
        current_price.append(current_prices)
        old_price.append(old_prices)
        discount.append(discounts)
        rating.append(ratings)

    # print("page---", page, "---page")
    if page == 50:  # stop after 50 pages
        isHaveNextPage = False
    page += 1

df = pd.DataFrame({'Product Name': name, 'Current Price': current_price, 'Old Price': old_price,
                   'Discount': discount, 'Rating': rating})
df.head(20)
df.to_csv('jumia_data_new_000.csv', index=False, encoding='utf-8')
