Learn Web Scraping by Example: Scraping Real Data from an E-commerce Site

Isaac Oresanya
14 min read · Sep 17, 2023

Have you ever wanted to buy something online but felt overwhelmed by the number of choices and options? How do you find the best deal for your budget and preferences? You could spend hours browsing through different websites and comparing different products. Or you could use web scraping to do the work for you. Web scraping is a technique that allows you to extract data from websites and store it in a structured format. For example, imagine you want to buy a new laptop. Wouldn’t it be easier if you could have all the data you need in one place? A table that shows you the price, brand, model, processor, memory, storage, and rating of each laptop at a glance. This way, you can easily compare different laptops and find the one that suits you best. This is what web scraping can do for you.
In this article, I will show you how to use web scraping to get laptop information from Jumia, an e-commerce website in Nigeria. We will use Python as our programming language and BeautifulSoup as our web scraping library. By the end of this article, you will have a basic understanding of web scraping and how to apply it to your own projects.

Understanding Web Scraping Basics

HTML (Hypertext Markup Language) is the standard language for creating web pages. It uses a hierarchical structure of tags and attributes to define the content and layout of a web page.

HTML element

Tags:

HTML uses various tags to define different elements on the page. Tags are enclosed in angle brackets, and most have opening and closing tags to encapsulate content. Some common tags include:
  • <h1>, <h2>, <h3>, ... <h6>: Headings.
  • <p>: Paragraphs.
  • <a href="URL">: Links.
  • <ul>: Unordered lists.
  • <div>: Division or container.

Attributes:

Tags can have attributes that provide additional information or settings. For example:
  • href: Specifies the URL for a link.
  • class: Assigns a CSS class for styling.
  • id: Provides a unique identifier for scripting or styling.

Here’s an example of a simple HTML document:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Div with Class, ID, and Anchor</title>
</head>
<body>
  <div class="my-div" id="unique-div">
    <p>This is a div tag with a class and an id.</p>
    <a href="https://www.example.com">Visit Example.com</a>
  </div>
</body>
</html>

The <div> tag has a class of "my-div" and an id of "unique-div".
The <a> element (which creates links) has an href attribute set to "https://www.example.com", making it a link to example.com.

P.S.: You don't have to remember all of HTML. Just knowing the basic structure is enough to get started.

BeautifulSoup & Requests libraries

BeautifulSoup and Requests are two Python libraries that are useful for web scraping. Requests lets you send HTTP requests to web servers, just like when you type an address into your browser. Its get() function takes the URL of the page you want to scrape and returns a Response object that contains the page’s HTML content. You can assign this content to a variable and use BeautifulSoup to parse it.

Parsing means analyzing the structure and meaning of the HTML code and converting it into a Python object that you can manipulate. With BeautifulSoup, you can navigate through the elements of the HTML document using various methods and attributes, such as find(), find_all(), select(), name, and text. You can also extract information from the elements, such as their text content, attributes, or links. This way, you can scrape data from web pages and use it for your own purposes.
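Here is a minimal sketch of that workflow, using example.com as a stand-in page:

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse its HTML
response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Navigate the parsed document
print(soup.title.text)       # the page’s title text
first_link = soup.find("a")  # the first <a> element on the page
print(first_link["href"])    # the value of its href attribute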

Scraping jumia.com.ng

Okay, enough theory; let’s get to writing code.
Launch your web browser and go to the web page you want to scrape. In our case, we’re interested in scraping laptop information, so let’s go to the laptop page.

Now, our next step is to inspect the page. Most web browsers offer this feature. Inspecting a webpage means looking at its source code, images, styles, scripts, and other elements that make up the webpage.

In Google Chrome, you can right-click anywhere on the webpage and choose "Inspect" from the menu. Alternatively, press Ctrl+Shift+I (or Command+Option+I on macOS) on your keyboard.

Now, back to Jupyter Notebook, which is a web application that lets you create documents containing Python code, visualizations, and text. We need to import some essential libraries for our project:

Requests: This library allows us to send HTTP requests to web servers.
BeautifulSoup: It helps us parse the HTML of the webpage.
Pandas: We’ll use Pandas to save our final results in a structured format, like a dataframe (a table), and also to export our data to a CSV file for further analysis.

Importing python libraries
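In code, these imports look like this:

# Import the libraries for requesting, parsing, and organizing data
import requests
from bs4 import BeautifulSoup
import pandas as pd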

If you are using Jupyter Notebook, you can use the conda command to install BeautifulSoup from the Anaconda channel. For example, you can run the following command in your terminal or in a code cell:

conda install -c anaconda beautifulsoup4

If you are using another IDE, such as VSCode, you might need to use the pip command to install BeautifulSoup from PyPI. For example, you can run the following command in your terminal or in a terminal window within VSCode:

pip install beautifulsoup4

Setting up and sending a web scraping request

This step involves defining the URL of the web page you want to scrape, defining the user agent to mimic a web browser, setting the headers with the user agent, and sending an HTTP GET request to the URL with the headers. This step is necessary to access the web page content and avoid getting blocked by anti-scraping measures.

# Define the URL to scrape
url = "https://www.jumia.com.ng/catalog/?q=laptop"

# Define the user agent to mimic a web browser
user_agent = "Mozilla/5.0 (X11; CrOS x86_64 10066.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 testscraper (+your email)"

# Set headers with the user agent
headers = {"User-Agent": user_agent}

  • The URL is the web page’s address on the internet.
  • The user agent is a string that tells the website about your browser and operating system. It helps websites identify what kind of device you’re using.
  • When scraping jumia.com.ng, it’s important to follow their terms of service. This often includes identifying your web scraping bot and providing a contact email.
  • Headers are extra details sent with your request to the web server. They contain information like the type of data your request is expecting or the type of browser you’re using.

Moving forward, we can send our HTTP GET request by passing the URL as a positional argument and the headers as a keyword argument, and assigning the result to a variable “response”. Printing the response shows us the status code, a number that indicates whether our request was successful or not.
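In code:

# Send the GET request with our URL and headers
response = requests.get(url, headers=headers)
print(response)  # e.g. <Response [200]>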

HTTP request was successful

  • A status code of 200 means OK: our request was satisfied and the web server sent back the content of the web page we requested.
  • Other status codes, such as 403, 500, or 404, indicate an error or a problem with our request or the web server.
  • For example, a status code of 403 means Forbidden: we do not have permission to access the web page or resource we requested. A status code of 404 means Not Found: the web page or resource we requested does not exist on the web server.

Therefore, it is important to check the status code of our response before proceeding with our web scraping process, as shown below.
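A simple sketch of that check:

# Stop early if the request was not successful
if response.status_code != 200:
    raise RuntimeError(f"Request failed with status code {response.status_code}")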

Parsing the web page content using BeautifulSoup

This step involves creating a BeautifulSoup object from the HTML content of the web page, using the lxml parser, and assigning it to a variable "soup".

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")
content = soup.prettify()
print(content)
HTML code after parsing with BeautifulSoup

The lxml parser is a fast and reliable parser that can handle different types of HTML documents. You can also use other parsers, such as html.parser, depending on your preference. The prettify() method returns a string that contains the formatted HTML content of the web page, with proper indentation and line breaks. This makes the HTML code more readable and easier to understand.

Notice that the parsed HTML content closely mirrors what you see in the inspect tool. BeautifulSoup preserves the original page structure, which makes precise data extraction possible.

HTML code on the Inspect tool

With that done, we can use the find(), find_all(), select(), name, text, and attrs methods and attributes to locate and extract elements from the HTML document.

Using BeautifulSoup’s "find" Method

Suppose we are looking for the first element that satisfies our search condition; we can use BeautifulSoup’s "find" method for this. It searches the document from top to bottom and returns the first element that satisfies the condition. The "find" method takes one positional argument and as many keyword arguments as you need.

To print the name of the first laptop on the Jumia page, we make use of our inspection tool. By examining the HTML code while simultaneously viewing the webpage, we can pinpoint the specific section of the page that our code is targeting.

With this understanding, we return to the webpage and scroll through the HTML lines to locate the relevant section containing the name.

The line selected highlights the name of the laptop

Next, we identify the associated HTML tag and the unique attribute that distinguishes it, which, in this case, is the class. The tag is “h3” and the class is “name”. These are the criteria that we will use to search for the element using BeautifulSoup.

Note: Use 'class_' instead of 'class' as the attribute name. This avoids conflicts with Python’s 'class' keyword, ensuring smooth data parsing and preventing naming issues.

# Find and print the element of the first item’s name
first_item = soup.find("h3", class_="name")
print(first_item)

This returns the name of the first item on the Jumia page, along with the HTML tag and attributes, which is:

Printing the name of a laptop

To extract the name without the tag and attributes, we can use the "text" attribute, which returns the text content of an element without any HTML tags or attributes. Applying it to the element that we found extracts the name of the first item on the Jumia page:
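# Extract just the text content of the element
print(first_item.text)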

However, if we want to find all items that satisfy our search condition, we can use BeautifulSoup’s "find_all" method instead. The "find_all" method performs a similar function to "find", but with some differences:

  • The "find_all" method scans the whole document and, whenever an element matches the search criteria, appends it to a list.
  • The "find_all" method returns a list of all elements that match the criteria, unlike the "find" method, which returns only the first matching element.
  • The "find_all" method also takes one positional argument and as many keyword arguments as you need.

Scraping Product Data from Jumia

To create a DataFrame in Python using the “pd.DataFrame” constructor from the pandas library, we typically use a dictionary whose values are lists as the data source. Here’s an example:

# Create a DataFrame from a dictionary
data = {'Column1': [value1, value2, value3],
        'Column2': [value4, value5, value6]}
df = pd.DataFrame(data)

In our specific case, we aim to scrape information about laptops from a webpage, including the product names, new prices, old prices, and ratings. To achieve this, we first gather this data into separate lists. Then, we can use these lists as the values in our dictionary to create a DataFrame. This allows us to organize and analyze the data efficiently. If you’re curious, let’s explore it.

To begin, we need to initialize empty lists to store the scraped data:

# Initialize empty lists to store scraped data
Name = []
New_Price = []
Old_Price = []
Rating = []

Scrape data using the find_all method:

We’ll use the inspect tool to get the tag and class name for the Laptop’s name, new price, old price, and rating. However, there are two sets of laptops on the Jumia web page: the regular laptops and the sponsored laptops.

The arrow points at the sponsored laptop to show the two categories of products on the page

These two sets have different classes, so we need to be careful not to mix them up when using BeautifulSoup’s “find” or “find_all” methods. We can use the class_ argument to specify the exact class name that we want to search for. We want to focus on the regular laptops for this project.

Find and Extract Product Information from Items

Each laptop on the Jumia web page is displayed in a rectangular box that contains its details, such as name, price, and rating. The HTML code for each box has the same tag name and class name, which are "article" and "prd _fb col c-prd", respectively.

The line selected in the inspect tool highlights the part of the page it corresponds to
Each of these lines represents one item’s box on the page

We can use the BeautifulSoup’s "find_all" method to find all the elements that match these criteria, and store them in a list called "items".

# Find all laptop boxes
items = soup.find_all("article", class_="prd _fb col c-prd")

This will return a list of all the laptop boxes that have the class attribute equal to `"prd _fb col c-prd"`. We can then loop through this list and extract the information that we want from each laptop, such as name, new price, old price, and rating.

Looping through the Laptop Boxes and Extracting the Information

Now that we have a list of laptop boxes in the “items” list, we can loop through each box and extract the information that we want, such as name, new price, old price, and rating. However, we need to be careful because some laptops do not have an old price or a rating, so we need to check whether the old-price or rating element exists before we extract its text. If it does not exist, we can use an empty string as a placeholder. This is important because we need to have the same number of values in each list to create a DataFrame later. Here is how we can do this:

for item in items:
    # Find and append the name
    name = item.find("h3", class_="name").text
    Name.append(name)

    # Find and append the price
    price = item.find("div", class_="prc").text
    New_Price.append(price)

    # Check if old price exists, otherwise append an empty string
    old_price_elem = item.find("div", class_="old")
    if old_price_elem:
        old_price = old_price_elem.text
    else:
        old_price = ""
    Old_Price.append(old_price)

    # Check if rating exists, otherwise append an empty string
    rating_elem = item.find("div", class_="stars _s")
    if rating_elem:
        rating = rating_elem.text
    else:
        rating = ""
    Rating.append(rating)

The code loops through each item in the "items" list and performs the following steps:
- It finds the name of the product by looking for an "<h3>" element with the class attribute "name" and extracts its text content. It appends this name to the "Name" list.
- It finds the current price of the product by looking for a "<div>" element with the class attribute "prc" and extracts its text content. It appends this price to the "New_Price" list.
- It checks if there is an old price for the product by looking for a "<div>" element with the class attribute "old". If it exists, it extracts its text content. Otherwise, it assigns an empty string. It appends this old price or empty string to the "Old_Price" list.
- It checks if there is a rating for the product by looking for a "<div>" element with the class attribute "stars _s". If it exists, it extracts its text content. Otherwise, it assigns an empty string. It appends this rating or empty string to the "Rating" list.
Next, we’ll organize the data neatly into a table format using a DataFrame from the pandas library. Here’s how we can do it:

# Create a DataFrame from the scraped data
product_table = pd.DataFrame({
    "product_name": Name,
    "new_price": New_Price,
    "old_price": Old_Price,
    "rating": Rating,
})
Printing the dataframe with the data

This DataFrame consists of 40 rows, one for each laptop, and four columns: product_name, new_price, old_price, and rating. This structured table will serve as a solid foundation for further analysis.
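To export the table for further analysis, as mentioned earlier, you can use pandas’ to_csv() method; the filename here is just an example:

# Export the DataFrame to a CSV file
product_table.to_csv("jumia_laptops.csv", index=False)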

If you want to see the full source code for the snippets used in this article, you can check out my GitHub repository here: https://github.com/TheDataIsaac/The-Data-Diaries-with-Isaac/tree/main/Python/Web%20Scraping%20-%20jumia.com.ng

Best Practices when scraping data from the internet

  • Web scraping is a powerful and useful technique, but it should be done responsibly and respectfully. Some websites prohibit any commercial use of their services unless you obtain written permission. Therefore, it’s necessary to follow the website’s "robots.txt" file, which provides guidelines on what can and cannot be scraped. Disobeying those guidelines may result in legal issues.
    You can access the "robots.txt" file of any website by appending "/robots.txt" to its domain. For example: "https://www.jumia.com.ng/robots.txt".
Jumia's "robots.txt" file permits web scraping under specific conditions
  • Sending too many requests too quickly is unethical and risky, as it can harm the website’s performance, security, and quality by imposing excessive load, bypassing protection measures, or altering the content. Implement rate limits and delays between requests to avoid overloading the server. Using Python’s time module, you can set breaks between each set of requests you send, as shown in the sketch after this list.
  • With the help of additional functions, it is also possible to scrape websites that require users to log in to access content. However, scraping websites that require user authentication can be a violation of the website’s terms of service, and you should also be mindful of the privacy of the website’s users. If you are scraping user data, you should make sure that you have permission to do so.
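Here is a minimal sketch of such a delay between requests; the result-page URLs are hypothetical, and the headers reuse the identification pattern from earlier:

import time

import requests

headers = {"User-Agent": "testscraper (+your email)"}

# Hypothetical list of result pages to fetch
urls = [f"https://www.jumia.com.ng/catalog/?q=laptop&page={n}" for n in range(1, 4)]

for url in urls:
    response = requests.get(url, headers=headers)
    # ... parse the response with BeautifulSoup here ...
    time.sleep(2)  # pause for two seconds before the next request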

Conclusion

Web scraping is a valuable technique that can simplify the process of gathering information from websites, as demonstrated in this article. By using Python and libraries like BeautifulSoup, you can extract data from web pages, structure it, and use it for various purposes. We learned the basics of HTML tags and attributes, how to send HTTP requests, and how to parse web page content.

In the case of scraping Jumia’s website for laptop information, we saw how to locate and extract product names, prices, and ratings efficiently. This data was then organized into a DataFrame for further analysis.
However, it’s important to emphasize responsible web scraping practices. Always respect a website’s terms of service and guidelines, use rate limits to avoid overloading servers, and consider the privacy implications when scraping user data or websites requiring authentication.
