Web Scraping 101: Beautiful Soup

Okza Pradhana
Developer Student Club Universitas Brawijaya
6 min read · Feb 10, 2020
Source: https://www.digitalocean.com/community/tutorials/how-to-work-with-web-data-using-requests-and-beautiful-soup-with-python-3

Hello World! If you're reading this article, it means you've come to my first post, wohoo! This time I want to share about web scraping using Beautiful Soup in a simple way 😆 So without any more chitchat, let's get to the main part!

So what actually is web scraping?

Imagine that you want to send some information from websites to a user through your chatbot. When the user types something to your chatbot, e.g. "what is the latest film?", your chatbot will collect information from the web, e.g. Netflix, to get the latest released films. Let's say it consists of film title, film duration, and genre. Those results could be served to the user in the form of a carousel. Yep! You just did web scraping. The definition of web scraping according to Wikipedia is:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Why use web scraping?

Every time we're about to do something, we should ask ourselves "why?" Alright, here are the reasons why we should use web scraping:

  1. It's useful for end-to-end testing
    Have you heard of Selenium? It's a tool for automating web browsers, usually for testing purposes, but it's not limited to that. Perhaps in the next article we will discuss Selenium for web scraping.
  2. We can build datasets to train our models!
    Sounds crazy? But yeah! Say you want to build a machine learning or deep learning model that needs data for the training phase, but you can't get the data because the website doesn't provide an API (Application Programming Interface). You may consider using web scraping to collect the data automatically rather than doing it manually.
  3. Business purposes
    Your boss asks you to collect data from a competitor's website. Your company needs to compare its data with competitors' to derive business insights. Then you may consider using web scraping to get the data.

Keep in mind that some websites may forbid us from scraping some of their pages. I suggest checking their robots.txt or asking them directly.

Those are the reasons why we should do web scraping, based on my experience. Do you have any other reasons? Let's share and discuss them in the comments!

Alright, that's enough explanation about web scraping. Let's move into a more practical session, a.k.a. Beautiful Soup!

What is Beautiful Soup?

Wait?! Do you mean this “soup”?

Source: https://www.bbc.co.uk/food/recipes/chickensoup_1918

No, no. It's not the soup in the image above. It's a web scraping library in Python. With Beautiful Soup, we can get data from HTML or any markup language. Beautiful Soup helps you get particular content from websites, clean the HTML, and serve the information to you. You need Python and pip if you want to go hands-on with this tool. I assume that you have both installed on your machine.

If you don't have Python yet, you can install it from here; the docs explain how to install Python on Linux, Mac, and Windows, so no need to worry! If you haven't installed pip yet either, you may check this website, which explains how to install pip on various operating systems.

Have you installed them? Alright, first of all, we have to install Beautiful Soup using this pip command:

pip install beautifulsoup4

Also, we need to install lxml, a Python library for parsing HTML and XML files, with this command:

pip install lxml

When we work with web pages, we need the requests library to send HTTP requests to websites. You can install it with:

pip install requests

I would suggest you use virtualenv (virtual environment); it allows you to have different packages/libraries for each project you work on. It's like having local packages, so you won't break global packages and cause conflicts with other projects. Perhaps we could discuss it on Medium next time.

In this section, I want to create a simple csv file consisting of team names and their regions, with details as follows:

+----------------+--------------------------+
| region         | team name                |
+----------------+--------------------------+
| North America  | [Chaos, Demon]           |
| South America  | [beastcoast, Infamous]   |
| Europe         | [Alliance, Nigma]        |
| China          | [Newbee, PSG LGD]        |
| .....          | [..., ..., ...]          |
+----------------+--------------------------+

There are several ways to create csv files, but I would like to use Pandas, as it already provides a function called to_csv() to save a dataframe into a csv file. This is the command if you want to install Pandas into your environment:

pip install pandas

Okay, time to write some code! In this section, we will scrape some content from this website, as I love playing Dota and watching its tournaments, including The International 😆 (don't ask my MMR 😢 you may check my latest plays on Dotabuff).

First of all, import all the libraries that we installed before to scrape the page.
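The embedded gist isn't shown on this page, but a minimal version of the imports looks like this:

```python
# The three libraries installed earlier: requests sends the HTTP
# request, BeautifulSoup parses the HTML, and pandas builds the csv
# file at the end.
import requests
from bs4 import BeautifulSoup
import pandas as pd
```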

The next step is to send an HTTP request to Liquipedia using the requests library and parse the response with lxml.

This will output the HTML structure from Liquipedia in a pretty, indented form by using prettify().
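A sketch of that step, assuming the Liquipedia Dota 2 teams portal as the target page (the exact URL isn't given in the article, so treat it as a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL -- the article scrapes a Liquipedia Dota 2 teams page.
URL = "https://liquipedia.net/dota2/Portal:Teams"

def fetch_soup(url):
    """Send an HTTP GET request and parse the response with lxml."""
    response = requests.get(url, headers={"User-Agent": "bs4-tutorial"})
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")

# Uncomment to run against the live site:
# soup = fetch_soup(URL)
# print(soup.prettify())  # the whole document, nicely indented
```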

To create the region column, I need to scrape the region list first by writing this code.

In Beautiful Soup, there is a function called find_all that returns all HTML elements matching the parameters passed to it. Inside find_all we pass div and class_='panel-heading' to get the elements, then loop using a list comprehension to get each region's text.
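Since the gist itself isn't reproduced here, this step can be sketched with a snippet that mimics Liquipedia's markup (the class name panel-heading comes from the article; the sample HTML is made up):

```python
from bs4 import BeautifulSoup

# Made-up HTML that mirrors the structure the article describes.
html = """
<div class="panel-heading">North America</div>
<div class="panel-body"> Chaos Esports Club  Demon Slayers </div>
<div class="panel-heading">Europe</div>
<div class="panel-body"> Alliance  Nigma </div>
"""
soup = BeautifulSoup(html, "lxml")

# find_all returns every <div class="panel-heading">; the list
# comprehension keeps only the visible text of each element.
regions = [heading.text for heading in soup.find_all("div", class_="panel-heading")]
print(regions)  # ['North America', 'Europe']
```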

Next, we want to get the team names in each region. I did this by writing this code.

find_all('div', class_='panel-body') gets all the text inside those tags. But the result will look like this:

 Chaos Esports Club  Demon Slayers  Evil Geniuses  Fighting PandaS 

There is a space at the start and end of the text, and two spaces between each team. To deal with this, I call strip() to remove the leading and trailing spaces, then split('  ') to break the string on the double spaces, which results in a list of teams (I admit this is a bit of a hacky approach).

[Chaos Esports Club, Demon Slayers, Evil Geniuses, Fighting PandaS]

Then I loop in a list comprehension to get all the teams in each region (the result will be a 2-dimensional list).
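Putting the whole step together on a made-up snippet with the same markup (class name panel-body from the article; the double-space separation is as described above):

```python
from bs4 import BeautifulSoup

# Made-up HTML with double-space-separated team names in each
# panel-body, as the article describes.
html = """
<div class="panel-body"> Chaos Esports Club  Demon Slayers </div>
<div class="panel-body"> Alliance  Nigma </div>
"""
soup = BeautifulSoup(html, "lxml")

# strip() drops the leading/trailing space, split("  ") breaks on the
# double space; the comprehension yields one list of teams per region,
# i.e. a 2-dimensional list.
teams = [
    body.text.strip().split("  ")
    for body in soup.find_all("div", class_="panel-body")
]
print(teams)  # [['Chaos Esports Club', 'Demon Slayers'], ['Alliance', 'Nigma']]
```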

Finally, we dive into the last part of this section. Time to wrap it all up!

Before we save it to a csv file, we should create the DataFrame first. I do this by writing this code:

To create a DataFrame using Pandas, simply call the DataFrame class with a dictionary parameter, with keys as the column names and values as the column contents. And the last thing: save it to csv using to_csv(filename).
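A sketch of that final step; the column names follow the table at the top, and the sample values are placeholders standing in for the scraped lists:

```python
import pandas as pd

# Placeholder values standing in for the scraped lists.
regions = ["North America", "Europe"]
teams = [["Chaos Esports Club", "Demon Slayers"], ["Alliance", "Nigma"]]

# Dictionary keys become column names, values become column contents.
df = pd.DataFrame({"region": regions, "team name": teams})

# index=False keeps pandas' row index out of the file.
df.to_csv("teams.csv", index=False)
```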

That's it for my first post, which explains web scraping using Beautiful Soup. If you want to practice more with this library, you may check the docs. Hit clap if you like this article or find it useful 😄

I very much welcome feedback from you; just comment on this post if you have any.

Source: https://memeshappen.com/meme/buddy-the-elf/i-love-feedback-feedback-s-my-favorite-71704

Thank you, and see you in the next matchmaking 🎮 (a.k.a. article)!


Okza Pradhana
Developer Student Club Universitas Brawijaya

Someone who loves Data Science, Data Engineering, NLP, and Machine Learning. Also front-end things.