Introduction to Data Scraping in Python
Python is a programming language used in many kinds of software products. One of its uses is data scraping. Data scraping is a technique for extracting data from another program’s output, for example, the rendered view of a website. As we know, a web page is structured in HTML.
Data scraping can extract the information contained in the HTML tags. One use of data scraping is to build a website whose contents come from another website. No, I’m not talking about plagiarism or piracy. Let’s look at a website called Inkuiri.com so you don’t misunderstand.
Inkuiri.com is a website that republishes information on products sold by many e-commerce sites operating in Indonesia. They scrape data from Bukalapak, Shopee, Tokopedia, etc., gather it, and then republish it on the Inkuiri.com website. The purpose is to let people find the product they want across a wider range of sellers and at more competitive prices. In this case, the data scraping that Inkuiri.com did could also be called web scraping.
As of 2023, Inkuiri seems to have ceased operations. You can still take a look at the website using the Wayback Machine.
Two of the most popular data scraping tools for Python are the BeautifulSoup library and the Scrapy framework. In this article, I will demonstrate how to do data scraping using BeautifulSoup. Let’s get to it!
You can find the complete code covered in this article in this GitHub repository.
What Will We Scrape?
Take a look at https://www.jadwalsholat.org/. The website provides the Muslim prayer schedule for Indonesia. Let’s try to scrape the website and retrieve the prayer schedule for a certain day or city.
Project Setup
First, open your terminal, create a directory to serve as the project’s home base, and move into it.
mkdir learn-scraping/
cd learn-scraping/
Next, create a virtual environment with the virtualenv command, then activate it.
virtualenv venv
source venv/bin/activate
Okay, now let’s install the libraries we need to scrape data. The first library is called requests, which we will use to make HTTP requests to the target website. The other one is the beautifulsoup4 package.
pip install requests beautifulsoup4
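A quick way to confirm that both packages landed in the active virtual environment is to import them and print their versions. This is only a sanity check, not part of the scraper itself.

```python
# Confirm that both libraries are importable from the active environment.
import bs4
import requests

print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
```

If either import fails, the virtual environment is probably not activated in the current shell.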
Analyzing The Target
Before we write code to scrape the website, we have to work out which data we need to gather.
- Name of the city and its timezone.
- Current date and prayer schedule.
Next, look at the page from the HTML side using the browser’s ‘inspect element’ tool. In this step, we identify the HTML selectors that we want to extract values from.
The name of the city and its timezone seem to be in the h1 tag with the class h1_edit.
And we found that the current date’s prayer schedule is in tr and td tags.
You may have noticed that the month and year of the date are not provided in the tr tag. Well, it seems that the month and year are under the city’s name.
The month and year are provided in the h2 tag with the h2_edit class.
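To make the three selectors concrete, here is a minimal sketch that applies them to a small, hypothetical HTML snippet. Only the tag and class names (h1 with h1_edit, h2 with h2_edit, and tr/td) come from the inspection above; the text values and the table layout are invented placeholders, so the real page may need extra filtering.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the tag and class names come from the article;
# the text values are invented placeholders.
SAMPLE_HTML = """
<h1 class="h1_edit">Example City, GMT +7</h1>
<h2 class="h2_edit">January 2023</h2>
<table>
  <tr><td>01</td><td>04:30</td><td>12:00</td></tr>
  <tr><td>02</td><td>04:31</td><td>12:01</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# City name and timezone: the h1 tag with class h1_edit.
city_and_timezone = soup.find("h1", class_="h1_edit").get_text(strip=True)

# Month and year: the h2 tag with class h2_edit.
month_and_year = soup.find("h2", class_="h2_edit").get_text(strip=True)

# Schedule rows: the text of every td cell, grouped by tr row.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

print(city_and_timezone)
print(month_and_year)
print(rows[0])
```

On the live page, find() returns None when a tag is missing, so production code should check for that before calling get_text().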
We have now identified the HTML selectors for the data we need. So, next…
Let’s Build the Code
First, let’s build the project directory structure. It will look like this:
learn-scraping
├── app.py
└── scraper
    ├── __init__.py
    └── scraper.py
Inside the scraper package, we have a file named scraper.py. This file is the module that we use to scrape the website.
Define a class in this module named Scraper.
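The original listing for this class is not reproduced here, so the following is only a sketch of what such a class could look like. The method names (fetch, parse, scrape), the timeout parameter, and the shape of the returned dictionary are my own assumptions; only the tag and class names come from the analysis step.

```python
import requests
from bs4 import BeautifulSoup


class Scraper:
    """Sketch of a scraper for one schedule page (assumed interface)."""

    def __init__(self, url: str, timeout: float = 10.0) -> None:
        self.url = url
        self.timeout = timeout

    def fetch(self) -> str:
        """Download the page and return its HTML, raising on HTTP errors."""
        response = requests.get(self.url, timeout=self.timeout)
        response.raise_for_status()
        return response.text

    @staticmethod
    def parse(html: str) -> dict:
        """Extract the city heading, the month/year heading, and table rows."""
        soup = BeautifulSoup(html, "html.parser")
        city = soup.find("h1", class_="h1_edit")
        period = soup.find("h2", class_="h2_edit")

        # Keep every row that has td cells. Picking out the row for the
        # current date is left to the caller, because how that row is
        # marked on the page is not covered in the analysis above.
        rows = []
        for tr in soup.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append(cells)

        return {
            "city": city.get_text(strip=True) if city else None,
            "period": period.get_text(strip=True) if period else None,
            "rows": rows,
        }

    def scrape(self) -> dict:
        """Fetch the page and parse it in one call."""
        return self.parse(self.fetch())
```

Keeping parse separate from fetch means the parsing logic can be tried on a saved HTML string without touching the network.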
Next, let’s build the main application that executes the module. It lives in the file called app.py.
Take a look at the website’s URL. There, I set a parameter id with the value 94 (https://jadwalsholat.org/adzan/monthly.php?id=94). 94 is the id for the city named ‘Karawang’. You can change that parameter’s value as you like.
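If you plan to switch cities often, one option is to build the URL from the id value instead of editing the string by hand. The sketch below assumes nothing beyond the URL shown above; build_url and KARAWANG_ID are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://jadwalsholat.org/adzan/monthly.php"
KARAWANG_ID = 94  # the article states that id=94 corresponds to Karawang


def build_url(city_id: int = KARAWANG_ID) -> str:
    """Return the monthly schedule URL for the given city id."""
    return f"{BASE_URL}?{urlencode({'id': city_id})}"


print(build_url())  # https://jadwalsholat.org/adzan/monthly.php?id=94
```

The resulting string can then be handed to whatever code performs the request.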
Anyway, we have built our first data-scraping app!
Now, Run It!
Open your terminal and execute python app.py. Below is my result.
Alright, now we know what data scraping is and have had some hands-on practice doing it. But this is just a small demonstration; there is much more you can explore in data scraping! If you want to learn more about the beautifulsoup4 library, you can read the official documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Thank you for reading this article. I hope you liked it. Any suggestions to improve my writing are welcome!