Introduction to Data Scraping in Python

Ahmad Maulana Malik Fattah
Data Engineering Indonesia
4 min read · Sep 13, 2021
Photo by James Harrison on Unsplash

Python is a programming language that can be used in many kinds of software products. One of Python’s uses is data scraping. Data scraping is a technique for extracting data from another program’s output, for example, the rendered view of a website. As we know, a web page is structured in HTML.

Data scraping can extract the information contained in the HTML tags. One use of data scraping is to build a website whose content comes from other websites. No, I’m not talking about plagiarism or piracy. Let’s look at a website called Inkuiri.com so you don’t misunderstand.

Inkuiri.com is a website that aggregates information on products sold by many e-commerce platforms operating in Indonesia. It scrapes Bukalapak, Shopee, Tokopedia, etc., gathers the data, and then republishes it on the Inkuiri.com website. The purpose is to let people find the product they want across a wider range of sellers and at more competitive prices. In this case, the data scraping that Inkuiri.com did can also be called web scraping.

Inkuiri, a one-portal marketplace that uses data scraping

As of 2023, Inkuiri seems to have ceased operation. You can still take a look at the website using the Wayback Machine.

The most popular data scraping tools for Python are the BeautifulSoup library and the Scrapy framework. In this article, I will demonstrate how to do data scraping using BeautifulSoup. Let’s get to it!

You can find the complete code covered in this article in this GitHub repository.

What Will We Scrape?

Take a look at https://www.jadwalsholat.org/. The website provides the Muslim prayer schedule for Indonesia. Let’s try to scrape the website and present the prayer schedule for a certain day and city.

jadwalsholat.org

Project Setup

First, open your terminal, create a directory as our project home base, and move into it.

mkdir learn-scraping/
cd learn-scraping/

Next, make a virtual environment by using the virtualenv command, then activate it.

virtualenv venv
source venv/bin/activate

Okay, now let’s install the libraries we need to scrape data. The first library is called requests, which we will use to make HTTP requests to the targeted website. The other one is the beautifulsoup4 package.

pip install requests beautifulsoup4
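If you want to make sure both packages really landed in the active virtual environment, here is a small optional check. It is only a sketch using the standard library; the helper name installed_version is my own and not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # These are the distribution names used in the pip command.
    for name in ("requests", "beautifulsoup4"):
        print(name, installed_version(name) or "NOT INSTALLED")
```

If either line prints NOT INSTALLED, check that the virtual environment is activated before running pip install again.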

Analyzing The Target

Before we write the code to scrape the website, we have to work out which data we need to gather:

  1. Name of the city and its timezone.
  2. Current date and prayer schedule.
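One way to pin down these two items before touching any HTML is to write them as a small data structure. This is only a sketch: the class name PrayerSchedule, its field names, and the sample values are my own placeholders, not something taken from the website.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PrayerSchedule:
    """The data we want to end up with after scraping one page."""

    city_and_timezone: str  # item 1: name of the city and its timezone
    month_year: str         # month and year the schedule belongs to
    day: str                # day of the month for the current date
    # item 2: prayer name -> time of day; filled in once the page is parsed
    times: Dict[str, str] = field(default_factory=dict)


# Made-up values, just to show the shape of the data.
example = PrayerSchedule(
    city_and_timezone="Karawang, GMT +7",
    month_year="September 2021",
    day="13",
    times={"Maghrib": "17:50"},
)
print(example)
```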

Next, look at the HTML side using the browser’s ‘inspect element’ tool. In this step, we identify the HTML selectors that we want to extract values from.

The name of the city and its timezone seem to be in an h1 tag with the class h1_edit.

Name of the city in HTML tag.

We also find that the current date’s prayer schedule is in tr and td tags.

Current date prayer schedule HTML tag.

You may have noticed that the month and year of the date are not provided in the tr tag. Well, it seems that the month and year are under the city’s name.

Month and Year HTML tag.

The month and year are provided in an h2 tag with the h2_edit class.
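To see how these selectors translate into BeautifulSoup calls, here is a minimal sketch that runs on a hand-written HTML snippet instead of the live page. The snippet only imitates the structure described above. Its values are made up, and the table_highlight class on the current-date row is an assumption, so check the real class name with ‘inspect element’.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described in the article.
# It is NOT copied from the site: the values and the "table_highlight"
# class on the row are assumptions for illustration.
SAMPLE_HTML = """
<h1 class="h1_edit">Karawang, GMT +7</h1>
<h2 class="h2_edit">September 2021</h2>
<table>
  <tr class="table_highlight"><td><b>13</b></td><td>04:20</td><td>04:30</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# City name and timezone: h1 tag with class h1_edit.
city = soup.find("h1", class_="h1_edit").get_text(strip=True)

# Month and year: h2 tag with class h2_edit.
month_year = soup.find("h2", class_="h2_edit").get_text(strip=True)

# Current-date row: one tr tag, with one td tag per column.
row = soup.find("tr", class_="table_highlight")
cells = [td.get_text(strip=True) for td in row.find_all("td")]

print(city)
print(month_year)
print(cells)
```

Note that find returns None when nothing matches, so code that runs against the real page should check for that before calling get_text.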

We have already identified the data in the HTML selector. So, next…

Let’s Build the Code

First, let’s build the project directory structure. It will look like this:

learn-scraping
├── app.py
└── scraper
    ├── __init__.py
    └── scraper.py

Inside the scraper package, we have a file named scraper.py. This file is the module that we use to scrape the website.

Define a class named Scraper in this module.
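A minimal sketch of what such a class could look like is below. It is not necessarily identical to the code in the repository: the method names fetch, parse, and scrape, the returned dictionary keys, and the table_highlight class for the current-date row are all my assumptions.

```python
from typing import Dict, List, Optional

import requests
from bs4 import BeautifulSoup


class Scraper:
    """Fetch a schedule page and pull out the pieces identified earlier."""

    def __init__(self, url: str, timeout: float = 10.0) -> None:
        self.url = url
        self.timeout = timeout

    def fetch(self) -> str:
        """Download the page and return its HTML, raising on HTTP errors."""
        response = requests.get(self.url, timeout=self.timeout)
        response.raise_for_status()
        return response.text

    @staticmethod
    def _text(tag) -> Optional[str]:
        """Return the stripped text of a tag, or None if the tag was not found."""
        return tag.get_text(strip=True) if tag is not None else None

    def parse(self, html: str) -> Dict[str, object]:
        """Extract city, month/year, and the current-date row from HTML."""
        soup = BeautifulSoup(html, "html.parser")
        # Assumption: the current-date row carries class "table_highlight".
        row = soup.find("tr", class_="table_highlight")
        cells: List[str] = (
            [td.get_text(strip=True) for td in row.find_all("td")] if row else []
        )
        return {
            "city": self._text(soup.find("h1", class_="h1_edit")),
            "month_year": self._text(soup.find("h2", class_="h2_edit")),
            "today": cells,
        }

    def scrape(self) -> Dict[str, object]:
        """Fetch the page, then parse it."""
        return self.parse(self.fetch())
```

Keeping fetch and parse separate means parse can be tried on a saved HTML file without hitting the website every time.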

Next, let’s build the main application that runs the module, in a file called app.py.

Take a look at the website’s URL. There, I set a parameter id with the value 94 (https://jadwalsholat.org/adzan/monthly.php?id=94). 94 is the id of the city ‘Karawang’. You can change the parameter’s value as you like.
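If you plan to switch cities often, the URL can be built from the id instead of being typed by hand. A small sketch using only the standard library follows; the names BASE_URL and build_monthly_url are my own.

```python
from urllib.parse import urlencode

# Monthly schedule page, taken from the URL shown above.
BASE_URL = "https://jadwalsholat.org/adzan/monthly.php"


def build_monthly_url(city_id: int) -> str:
    """Return the monthly schedule URL for the given city id."""
    return f"{BASE_URL}?{urlencode({'id': city_id})}"


print(build_monthly_url(94))  # 94 is the id for Karawang
```

In app.py, the resulting string is what you would hand to the scraper as the page to fetch.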

Anyway, we have built our first data-scraping app!

Now, Run It!

Open your terminal, and execute python app.py. Below is my result.

Run the main application.

Alright, now we know what data scraping is and have had some hands-on practice doing it. But this is just a small demonstration; there is much more you can explore in data scraping! If you want to learn more about the beautifulsoup4 library, you can read the official documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Thank you for reading this article. I hope you like it. Any suggestions to improve my writing are welcome!

Ahmad Maulana Malik Fattah
Data Engineering Indonesia

Data Engineer || Love to work with data, both in engineering and analytics parts || s.id/who-is-ammfat