Web Scraping using BeautifulSoup

Sameena Fathima M
Published in Analytics Vidhya, Jul 31, 2020 (3 min read)

In today’s world, with so many websites available, analysing the data they contain is difficult without an efficient method. This is where web scraping comes into play. Web scraping is the process of automating the extraction of data from websites quickly and efficiently. BeautifulSoup is one of several Python libraries available for web scraping.

BeautifulSoup is a library for parsing HTML and XML documents. It builds a parse tree from the document, which lets us navigate it and extract and analyse data.

Installing Beautiful Soup

BeautifulSoup can be installed using the Python package manager pip.

pip install BeautifulSoup4

It can also be installed using conda, the Anaconda package manager.

conda install beautifulsoup4
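
One quick way to confirm that the installation worked is to import the package and print its version. This check is a small sketch and not part of the original steps; note that the package is installed as beautifulsoup4 but imported as bs4.

```python
# Sanity check after installation: the package installs as "beautifulsoup4"
# but is imported under the name "bs4".
import bs4

installed_version = bs4.__version__  # version string of the installed package
print("Beautiful Soup version:", installed_version)
```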

Let’s look at an example that scrapes Wikipedia to extract the names of the states in India.

To access the HTML content of a webpage:
1. Import the requests and BeautifulSoup libraries.
2. Provide the URL of the website to scrape.
3. Get the HTML data by performing an HTTP request to the specified URL, and store the response in an object.

from bs4 import BeautifulSoup
import requests
url='https://en.wikipedia.org/wiki/States_and_union_territories_of_India'
response = requests.get(url)

Next, create a BeautifulSoup object by passing the document to be parsed and the name of the parser as parameters. BeautifulSoup supports parsers such as html.parser, lxml and html5lib. The html.parser option is included in Python’s standard library, so it needs no extra installation, while lxml and html5lib must be installed separately. The document is parsed as HTML unless an XML parser is explicitly requested.

soup = BeautifulSoup(response.text,'html.parser')

prettify() is a method that lets us view how the tags are nested in the document.
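
As a rough illustration of what prettify() does, the sketch below runs it on a tiny hand-written document rather than the full Wikipedia page. The names sample_html and sample_soup are hypothetical and used only for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document, used here instead of the live page
# so the nesting is easy to read.
sample_html = "<html><body><p>Hello <b>world</b></p></body></html>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as a string, with each tag on its own
# line and indented according to how deeply it is nested.
pretty = sample_soup.prettify()
print(pretty)
```

On the real page, the same call would be print(soup.prettify()), though the output is very long.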

Now let’s inspect the page’s HTML. The data we need for the names of the states is in the table with the class wikitable sortable plainrowheaders.
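
Retrieving that table might look like the sketch below. To keep it runnable without a network connection, it uses a small hand-written HTML fragment (sample_html, a hypothetical stand-in that only mimics the structure described here); the live page's markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure described in the article.
sample_html = """
<table class="wikitable"><tr><td>Some other table</td></tr></table>
<table class="wikitable sortable plainrowheaders">
  <tr><th scope="col">State</th><th scope="col">Capital</th></tr>
  <tr><th scope="row"><a href="/wiki/Kerala" title="Kerala">Kerala</a></th>
      <td>Thiruvananthapuram</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Passing the full class string matches the class attribute exactly,
# so the classes must appear in this order in the HTML.
table = sample_soup.find("table", class_="wikitable sortable plainrowheaders")
print(table["class"])
```

With the real page, the same find() call would be made on the soup object created earlier.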

Now, we want to extract only the names of the states in the State column. Each name is in a <th> element with the scope attribute set to “row” inside the retrieved table. Finally, we can print the names of the states, which are found in the title attribute of the <a> tag within each of those cells.
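
The extraction step could be sketched as follows. Again, sample_html is a hypothetical two-row fragment that stands in for the Wikipedia table, and state_names is a name chosen just for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure described in the article.
sample_html = """
<table class="wikitable sortable plainrowheaders">
  <tr><th scope="col">State</th><th scope="col">Capital</th></tr>
  <tr><th scope="row"><a href="/wiki/Kerala" title="Kerala">Kerala</a></th>
      <td>Thiruvananthapuram</td></tr>
  <tr><th scope="row"><a href="/wiki/Tamil_Nadu" title="Tamil Nadu">Tamil Nadu</a></th>
      <td>Chennai</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
table = sample_soup.find("table", class_="wikitable sortable plainrowheaders")

state_names = []
# Only row headers (scope="row") hold state names; column headers are skipped.
for header in table.find_all("th", attrs={"scope": "row"}):
    link = header.find("a")
    # Guard against header cells that have no link or no title attribute.
    if link is not None and link.has_attr("title"):
        state_names.append(link["title"])

print(state_names)  # ['Kerala', 'Tamil Nadu']
```

The guard inside the loop is a defensive choice: on a real page, some header cells may not contain a link, and skipping them avoids an error.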

I hope this article helped show how web scraping can be done quickly and easily using BeautifulSoup. Thanks for reading!
