Web Scraping using Beautiful Soup in Python

Avina Vekariya
DSC DDU
5 min read · Apr 9, 2020

Hello there. This is my first article, and I am going to talk about web scraping in Python. There are mainly two ways to extract data from a website:

  • Using Website API(if exists)
  • Access the HTML of the webpage and extract useful information.

There are two popular tools for accessing HTML content: Beautiful Soup and Selenium.

Get started with Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your favorite parser and gives you simple, idiomatic ways to navigate and search the parse tree. It commonly saves programmers hours or days of work. That’s why it is beautiful.

Installing Beautiful Soup

If you’re using a recent version of Debian or Ubuntu Linux, you can install Beautiful Soup with the system package manager:

$ apt-get install python-bs4   (for Python 2)
$ apt-get install python3-bs4  (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can’t install it with the system package manager, you can install it with easy_install or pip.

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

Along with Beautiful Soup, you also need to install your favorite parser (lxml for XML, html5lib for HTML), as well as the requests library for downloading pages:

$ pip install lxml
$ pip install html5lib
$ pip install requests

Accessing the HTML content from webpage

import requests
r = requests.get("https://www.google.com/")
print(r.content)

Let’s understand the above code:

1. The first line imports the requests library. The first thing we need to do to scrape a web page is to download it, and we can do that with the Python requests library.

2. The second line sends a request to the given URL. When we visit a web page, our web browser makes a request to a web server. This is called a GET request, since we’re getting files from the server. The server then sends back the files that tell our browser how to render the page.

3. The third line prints the HTML content of the web page.

Parsing the HTML Content

The parsing step introduces a few new pieces:
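A minimal sketch of this step, continuing from the requests example above:

```python
import requests
from bs4 import BeautifulSoup

# Download the page as before
r = requests.get("https://www.google.com/")

# Build a parse tree from the raw HTML using the html5lib parser
soup = BeautifulSoup(r.content, "html5lib")

# Print a nicely indented view of the parse tree
print(soup.prettify())
```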

r.content: the raw HTML content of the downloaded page.

html5lib: specifies the HTML parser that we want to use.

soup.prettify(): gives a visual representation of the parse tree created from the HTML content.

Navigate and Search through parse tree

Now we want to extract useful information from the parse tree. To do that, we have to navigate through the tree and find particular nodes. The soup object contains all the data in a structured format, so we can extract it programmatically. Let’s take an example.

Suppose you want to find all the hyperlink tags (‘a’) on the Google homepage. We can find them from the soup object using the soup.find_all(tagname) method.

Extract all the “a” tags:
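A sketch of that extraction, assuming the soup object built from the Google homepage as in the previous snippets:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.google.com/")
soup = BeautifulSoup(r.content, "html5lib")

# find_all returns a list of every matching tag in the parse tree
for link in soup.find_all("a"):
    # each <a> tag's destination is stored in its href attribute
    print(link.get("href"))
```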

Output:

https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.co.in/intl/en/about/products?tab=wh
http://www.google.co.in/history/optout?hl=en
/preferences?hl=en

Searching through CSS Selector

You can also search for elements using CSS selectors. These selectors are how the CSS language lets developers specify which HTML tags to style. Here are some examples:

  • p a — finds all a tags inside of a p tag.
  • body p a — finds all a tags inside of a p tag inside of a body tag.
  • html body — finds all body tags inside of an html tag.
  • p.outer-text — finds all p tags with a class of outer-text.
  • p#first — finds all p tags with an id of first.
  • body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

BeautifulSoup objects support searching a page via CSS selectors using the select method. We can use CSS selectors to find all the p tags in our page that are inside of a div like this:

soup.select("div p")

Output:

[<p class="inner-text first-item" id="first">
First paragraph.
</p>, <p class="inner-text">
Second paragraph.
</p>]

Note that the select method above returns a list of BeautifulSoup objects, just like find and find_all.

Exploring page structure through Chrome DevTools

The first thing we’ll need to do is inspect the page using Chrome Devtools.

You can start the developer tools in Chrome by clicking View -> Developer -> Developer Tools. You should end up with a panel at the bottom of the browser like what you see below. Make sure the Elements panel is highlighted:

Chrome Developer Tools

Now suppose you want to find the rating of a historical place in India. Inspecting the page, we find HTML like the snippet below.

<div>
<span style="margin-right:5px" class="oqSTJd" aria-hidden="true">4.6</span>
</div>

So from that, we can see that we have to find the span tag with class “oqSTJd”, which is a child of the div tag.

Code for extracting the rating of a place:
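A sketch of that extraction; the class name oqSTJd comes from the inspected HTML above, and the inline sample markup stands in for the page content you would download with requests:

```python
from bs4 import BeautifulSoup

# Sample markup from the inspected page; in practice `html` would be
# r.content downloaded with requests as before
html = '''
<div>
<span style="margin-right:5px" class="oqSTJd" aria-hidden="true">4.6</span>
</div>
'''
soup = BeautifulSoup(html, "html5lib")

# attrs narrows find_all to span tags whose class is "oqSTJd"
rating = [span.text for span in soup.find_all("span", attrs={"class": "oqSTJd"})]
print(rating)  # ['4.6']
```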

In the above code, the soup.find_all() method takes an attrs parameter, which can contain either a class or an id. Now we can combine this data into a DataFrame.

Combining our data into a Pandas Dataframe

import pandas as pd

# placeName and rating are the lists collected in the scraping steps above
place = pd.DataFrame({
    "Name": placeName,
    "Rating": rating
})
print(place)

Output:

          Name  Rating
0   india gate     4.6
1  Kutub Minar     4.5

So from the DataFrame, you can perform exploratory data analysis and apply machine learning algorithms.

You should now have a good understanding of how to scrape web pages and extract data. A good next step is to pick a website, scrape some interesting data, and analyze it. You can also refer to the official documentation.
