Web Scraping 101: Beautiful Soup
Hello World! If you're reading this article, it means you've come to my first post, woohoo! This time I want to share about web scraping using Beautiful Soup in a simple way. So without any more chitchat, let's go to the main part!
So what actually is web scraping?
Imagine that you want to send some information from websites to users through your chatbot. When a user types something to your chatbot, e.g. "what is the latest film?", your chatbot will collect information from a website, e.g. Netflix, to get the latest released films. Let's say it consists of film title, film duration, and genre. Those results could be served to the user in the form of a carousel. Yep! You are actually doing web scraping. The definition of web scraping according to Wikipedia is:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Why use web scraping?
Every time we are going to do something, we should ask ourselves "why". Alright, here are the reasons why we should use web scraping:
- It's useful for end-to-end testing
Have you heard about Selenium? It's a tool for automating web browsers, usually for testing purposes, but it's not limited to that. Perhaps in the next article we will discuss more about Selenium for web scraping purposes.
- We could build datasets to train our model!
Sounds crazy? But, yeah! If you want to build a machine learning or deep learning model which needs data for the training phase, but you can't get the data because the website doesn't provide an API (Application Programming Interface), you may consider using web scraping to collect it automatically rather than doing it manually.
- Business purposes
Your boss asks you to collect data from a competitor's website, because your company needs to compare its data with competitors' to derive business insights. Then you may consider using web scraping to get the data.
Note that some websites may forbid us from scraping some of their pages. I suggest checking their robots.txt or asking them directly.
Those are the reasons to use web scraping, based on my experience. Are there any other reasons in your opinion? Let's share and discuss them in the comments!
Alright, that's some explanation about web scraping; let's move into a more practical session, a.k.a. Beautiful Soup!
What is Beautiful Soup?
Wait?! Do you mean soup, the food?
No, no. It's not that kind of soup. It's a web scraping library for Python. With Beautiful Soup, we can get data out of HTML or any other markup language. Beautiful Soup helps you get particular content from websites, clean up the HTML, and serve the information to you. You need Python and pip if you want to get hands-on with this tool. I assume that you have both installed on your machine.
If you don't have Python yet, you can install it from here; the docs explain how to install Python on Linux, Mac, and Windows, so no need to worry!
If you haven't installed pip yet either, you may check this website, which explains how to install pip on various operating systems.
Have you installed them? Alright, first of all, we have to install Beautiful Soup using this pip command:
pip install beautifulsoup4
Also, we need to install lxml, which is a Python library for parsing HTML and XML files, with this command:
pip install lxml
When we work with web pages, we also need the requests library to send HTTP requests to websites. You can install it with:
pip install requests
I would suggest you use virtualenv (virtual environment); this allows you to have different packages/libraries for each project you work on. It's like having local packages, so they won't break global packages and cause conflicts with other projects. Perhaps we could discuss it on Medium next time.
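If you want to try it right away, the quick version looks like this (the environment name is just an example):

pip install virtualenv
virtualenv scraping-env
source scraping-env/bin/activate  # on Windows: scraping-env\Scripts\activate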
In this section, I want to create a simple csv file consisting of team names and their regions, with details as follows:
+----------------+--------------------------+
| region         | team name                |
+----------------+--------------------------+
| North America  | [Chaos, Demon]           |
| South America  | [beastcoast, Infamous]   |
| Europe         | [Alliance, Nigma]        |
| China          | [Newbee, PSG LGD]        |
| .....          | [... ,.... ,....]        |
+----------------+--------------------------+
There are several ways to create csv files, but I would like to use Pandas as it already provides a function called to_csv() to save a dataframe into csv. This is the command if you want to install Pandas in your environment:
pip install pandas
Okay, time to write some code! In this section, we will scrape some content from this website, as I love to play Dota and watch some of its tournaments, including The International (don't ask my MMR 😢; you may check my latest plays on dotabuff).
First of all, import all the libraries that we installed before to scrape the page.
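Since we installed beautifulsoup4, requests, and pandas, the imports look like this (lxml doesn't need its own import; we only pass its name to Beautiful Soup as the parser):

from bs4 import BeautifulSoup
import requests
import pandas as pd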
The next step is to send an HTTP request to Liquipedia using the requests library and parse the response with lxml.
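Here is a sketch of that step (I'm assuming the Dota 2 teams portal as the URL; swap in whichever page you want to scrape):

# send an HTTP GET request to Liquipedia
url = 'https://liquipedia.net/dota2/Portal:Teams'
page = requests.get(url)

# parse the raw HTML of the response with the lxml parser
soup = BeautifulSoup(page.text, 'lxml')
print(soup.prettify())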
This will output the HTML structure from Liquipedia in a pretty form by using prettify().
To create the region column, I need to scrape the region list first.
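As a sketch (the variable name is mine):

# grab every region header, e.g. 'North America', 'Europe'
regions = [heading.text for heading in soup.find_all('div', class_='panel-heading')]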
In Beautiful Soup, there is a function called find_all to get all HTML elements matching the parameters passed into it. Inside the find_all function we pass div and class_='panel-heading' to get the elements, then loop using a list comprehension to get each region's text.
Next, we want to get the team names in each region. I did this with the code below.
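A sketch of that code:

# get the raw text of each region's panel body, strip the outer spaces,
# then split on the double spaces that separate the team names
team_names = [body.text.strip().split('  ')
              for body in soup.find_all('div', class_='panel-body')]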
find_all('div', class_='panel-body') gets all the text inside those tags. But the raw text will look like this:
Chaos Esports Club  Demon Slayers  Evil Geniuses  Fighting PandaS
There is a space at the start and end of the text, and two spaces between each team. To deal with this, I call strip() to remove the leading and trailing spaces, then split('  ') (double space) to split the teams on the double space between each, which results in a list (I guess this is a bad approach).
[Chaos Esports Club, Demon Slayers, Evil Geniuses, Fighting PandaS]
Then I use a list comprehension to get all the teams in each region (the result will be a 2-dimensional list).
Finally, we dive into the last part of this section. Time to wrap it all up!
Before we save it to a csv file, we should create the DataFrame first. I do this with the code below.
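Something along these lines (the filename is just an example):

# keys become column names, values become the columns
df = pd.DataFrame({'region': regions, 'team name': team_names})

# save the dataframe; index=False leaves out pandas' row index
df.to_csv('teams.csv', index=False)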
To create a DataFrame using Pandas, simply call the DataFrame class with a dictionary parameter, where each key is a column name and each value is that column's values. And the last thing: save it to csv using to_csv(filename).
That's it for my first post, which explains web scraping using Beautiful Soup. If you want to practice more with this library, you may check the docs. Hit clap if you like this article or find it useful!
Feedback from you is very welcome; just comment on this post if you have any.
Thank you and see you in the next matchmaking 🎮 (a.k.a. article)!