Web scraping using Python for automation

Hansel
Bina Nusantara IT Division
3 min read · Jun 28, 2022

Web scraping is the process of collecting data from a website’s HTML. Gathering lots of data from the web by hand can be tedious. If the data I want to gather is relatively simple, I personally prefer writing a script to automate the work. If we can assign simple tasks to bots, why not?

Photo by Possessed Photography on Unsplash

I chose to write it in Python because it’s quick to write and easy to run. After installing Python and your favorite IDE, there are a few packages you need to install to get things going. They can be installed with pip, the package manager that usually comes bundled with Python. To get started, run these commands in your command prompt:

pip install requests
pip install beautifulsoup4

After installing those libraries, you can write scripts that use requests to fetch the raw HTML of a website, and bs4 (Beautiful Soup) to parse that HTML into a structure that is easier to read and easier to filter afterwards. Here is a simple script that gets all of the HTML from the Wikipedia page for the Python programming language:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# Fetch the page, then parse the raw HTML into a searchable soup object.
html = requests.get(BASE_URL)
soup = BeautifulSoup(html.content, 'html.parser')
print(soup)
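
Printing soup directly dumps the HTML as-is; soup.prettify() returns an indented copy, which is what makes the output easier on the eyes. Here is a slightly more defensive sketch of the same fetch (the ten-second timeout and the error check are arbitrary choices, not something the original requires):

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"

html = requests.get(BASE_URL, timeout=10)  # don't hang forever on a slow server
html.raise_for_status()                    # stop early on 4xx/5xx responses

soup = BeautifulSoup(html.content, 'html.parser')
print(soup.prettify())  # indented, human-readable HTML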

We can then narrow things down to specific information by looking at the HTML tags that contain it. For example, let’s say we want the data from the page’s table of contents. One easy way to find the right tags is the browser’s inspect feature: press F12 to open the developer tools, then Ctrl+Shift+C, and click on the container that stores the data you want to collect.

Photo by Author

We can see that Wikipedia uses a <div> with the id “toc” to store the table of contents, and <li> tags with the class “toctext” to store each element of the table. With that information, we can filter the soup variable we created earlier, as such:

Photo by Author
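
A minimal sketch of that filtering step, assuming the “toc” id and “toctext” class described above (Wikipedia’s markup may have changed since this was written):

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"
soup = BeautifulSoup(requests.get(BASE_URL).content, 'html.parser')

# Narrow the search to the table-of-contents container first...
toc = soup.find("div", id="toc")

# ...then print the text of every "toctext" element inside it.
if toc is not None:  # the container may be absent if the markup has changed
    for entry in toc.find_all(class_="toctext"):
        print(entry.get_text())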

Voilà, we just gathered the entire table of contents. The data is still pretty raw, and of course we could add more logic and use several other Python libraries to make it more presentable, but with this basic knowledge we can build a script around whatever information we want to get from a website.
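
As a small example of that last step (a sketch; the toc.txt filename is just an illustration), the headings can be collected into a plain Python list and written to a file, one per line:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Python_(programming_language)"
soup = BeautifulSoup(requests.get(BASE_URL).content, 'html.parser')
toc = soup.find("div", id="toc")

# Turn the raw tags into a plain list of heading strings.
headings = [entry.get_text() for entry in toc.find_all(class_="toctext")] if toc else []

# Save one heading per line, ready for further processing.
with open("toc.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(headings))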
