Web Scraping using Beautiful Soup in Python
Hello there. This is my first article, and I am going to talk about web scraping in Python. There are mainly two ways to extract data from a website:
- Using the website's API (if one exists)
- Accessing the HTML of the webpage and extracting useful information
Basically, there are two important tools for accessing HTML content: Beautiful Soup and Selenium.
Get started with Beautiful Soup
Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser so you can easily navigate and search the parse tree, and it commonly saves programmers hours or even days of work. That's why it is beautiful. Note that Beautiful Soup only parses markup; it does not drive a headless browser (for pages that require JavaScript, Selenium is the usual choice).
Installing Beautiful Soup
If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:
$ apt-get install python-bs4 (for Python 2)
$ apt-get install python3-bs4 (for Python 3)
Beautiful Soup 4 is published through PyPI, so if you can't install it with the system packager, you can install it with easy_install or pip.
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
Along with Beautiful Soup, you also need to install your favorite parser, whether for XML or HTML, as well as the requests library:
$ pip install lxml
$ pip install html5lib
$ pip install requests
Accessing the HTML content from webpage
import requests
r = requests.get("https://www.google.com/")
print(r.content)
Let's understand the above code:
1. The first line imports the requests library, which we use to request web pages. The first thing we'll need to do to scrape a web page is to download it, and we can download pages using the Python requests library.
2. The second line sends a request to the given URL. When we visit a web page, our web browser makes a request to a web server. This request is called a GET request, since we're getting files from the server. The server then sends back files that tell our browser how to render the page for us.
3. The third line prints the HTML content of the web page.
Parsing the HTML Content
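The parsing step might look like the sketch below. To keep it runnable without a network connection, a small inline document stands in for r.content from the previous section, and Python's built-in html.parser is used in place of html5lib (which the article installed earlier and which works the same way here):

```python
from bs4 import BeautifulSoup

# Stand-in for r.content from the previous section, so this snippet
# runs without fetching a live page.
html = "<html><body><p>Hello, <a href='https://example.com'>world</a>!</p></body></html>"

# Pass the raw HTML and the parser name; in the article's flow this
# would be BeautifulSoup(r.content, "html5lib").
soup = BeautifulSoup(html, "html.parser")

# prettify() renders the parse tree with one tag per line, indented.
print(soup.prettify())
```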
From the above code:
- r.content: the raw HTML content returned by the server.
- html5lib: specifies the HTML parser that we want to use.
- soup.prettify(): gives a visual representation of the parse tree created from the HTML content.
Navigate and Search through parse tree
Now we want to extract useful information from the parse tree. To do that, we have to navigate through the tree and find a particular node. The soup object contains all the data in a structured format, so we can extract it programmatically. Let's take an example.
Suppose you want to find all hyperlink ('a') tags on the Google page. We can find them from the soup object using the soup.find_all(tagname) method.
Output:
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en
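As a runnable sketch of the same find_all call, applied to a small inline page rather than the live Google page (the two links here are just illustrative stand-ins for the output above):

```python
from bs4 import BeautifulSoup

# A tiny inline page stands in for the Google page fetched earlier.
html = """<html><body>
<a href="https://www.google.co.in/imghp?hl=en&amp;tab=wi">Images</a>
<a href="https://maps.google.co.in/maps?hl=en&amp;tab=wl">Maps</a>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# find_all("a") returns a list of every matching tag;
# tag.get("href") reads the link target from each one.
for link in soup.find_all("a"):
    print(link.get("href"))
```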
Searching through CSS Selector
You can also search for elements using CSS selectors. These selectors are how the CSS language allows developers to specify which HTML tags to style. Here are some examples:
- p a: finds all a tags inside of a p tag.
- body p a: finds all a tags inside of a p tag inside of a body tag.
- html body: finds all body tags inside of an html tag.
- p.outer-text: finds all p tags with a class of outer-text.
- p#first: finds all p tags with an id of first.
- body p.outer-text: finds any p tags with a class of outer-text inside of a body tag.
BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:
soup.select("div p")
Output:
[<p class="inner-text first-item" id="first">
First paragraph.
</p>, <p class="inner-text">
Second paragraph.
</p>]
Note that the select method above returns a list of tag objects, just like find_all (whereas find returns only the first match).
Explore page structure through Chrome DevTools
The first thing we'll need to do is inspect the page using Chrome DevTools. You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like the one you see below. Make sure the Elements panel is highlighted:
Now suppose we want to find the rating of a historical place in India. In the soup content, we can find the markup below:
<div>
<span style="margin-right:5px" class="oqSTJd" aria-hidden="true">4.6</span>
</div>
So from that, we can understand that we have to find the span tag with the "oqSTJd" class, which is a child of a div tag.
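A minimal sketch of that lookup, run against an inline copy of the markup so it works offline. Keep in mind that the class name "oqSTJd" was read off in DevTools and may change whenever Google updates the page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the fragment found in DevTools above.
html = """<div>
<span style="margin-right:5px" class="oqSTJd" aria-hidden="true">4.6</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# The attrs parameter lets find_all match on attributes such as class or id.
spans = soup.find_all("span", attrs={"class": "oqSTJd"})
print(spans[0].text)
```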
The soup.find_all() method accepts an attrs parameter, which can match on either a class or an id. Now we can combine this data into a DataFrame.
Combining our data into a Pandas Dataframe
import pandas as pd
# placeName and rating are the lists collected while scraping above,
# e.g. placeName = ["india gate", "Kutub Minar"] and rating = [4.6, 4.5]
place = pd.DataFrame({
    "Name": placeName,
    "Rating": rating
})
print(place)
Output:
Name Rating
0 india gate 4.6
1 Kutub Minar 4.5
From the data frame, you can perform exploratory data analysis and apply machine learning algorithms.
You should now have a good understanding of how to scrape web pages and extract data. A good next step is to pick a website, scrape some interesting data, and analyze it. You can also refer to the official documentation.