Introduction to Web Scraping with Beautiful Soup and Requests

Gaurav Rajesh Sahani
Published in Analytics Vidhya
5 min read · Sep 18, 2020

Web scraping is the process of gathering information from the Internet. As data has become the new oil, web scraping has become even more important and practical in a wide range of applications.

At its core, web scraping deals with extracting information from a website. Even copying text from a website and pasting it into your local system is a form of web scraping!

Talking about Beautiful Soup: it is a pure-Python library for extracting structured data from websites. It allows you to parse data out of HTML and XML files, and it acts as a helper module that lets you interact with HTML in much the same way you would interact with a web page using your browser's developer tools.

It usually saves programmers hours or days of work, since it works with your favorite parsers, such as lxml and html5lib, to provide Pythonic ways of navigating, searching, and modifying the parse tree. The Requests module, in turn, allows you to send HTTP requests using Python.
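To make the idea of a parse tree concrete, here is a minimal sketch (the HTML snippet and variable names are purely illustrative, not part of this tutorial's target page) showing Beautiful Soup parsing a small piece of HTML and navigating to a link:

from bs4 import BeautifulSoup

html = "<h2><a href='https://example.com'>Example briefing</a></h2>"
soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install needed
print(soup.h2.a["href"])  # -> https://example.com
print(soup.h2.a.text)     # -> Example briefing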

We can install Beautiful Soup and Requests with the following commands:

pip install beautifulsoup4
pip install requests

One important point to clarify before we start: this blog is intended purely for educational purposes. It does not encourage scraping data from any website without prior written permission, or in disregard of its Terms of Service (ToS).

Now, Let’s get started!

We will be scraping the links from the website given below: https://www.whitehouse.gov/briefings-statements/

Among other things, this website contains records of presidential briefings and statements. Our goal is to extract all of the links on the page that point to those briefings and statements.

We will implement our code in an Anaconda environment and start by importing the required libraries:

import requests
from bs4 import BeautifulSoup

Using the requests module, we call the get function, passing the URL of the web page we want to access as an argument:

result = requests.get("https://www.whitehouse.gov/briefings-statements/")
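As an aside (not part of the original walkthrough), requests.get also accepts optional keyword arguments. A hedged sketch like the one below adds a timeout and an illustrative User-Agent header, which can make a scraping script a little more robust:

import requests

url = "https://www.whitehouse.gov/briefings-statements/"
result = requests.get(
    url,
    headers={"User-Agent": "my-scraper/0.1"},  # illustrative User-Agent string
    timeout=10,                                # fail instead of hanging forever
)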

To make sure that the website is accessible, we check that we received a "200" response code, which indicates that the page was retrieved successfully:

print(result.status_code)
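Alternatively (an optional pattern, not used in the rest of this tutorial), Requests can raise an exception for unsuccessful responses instead of us checking the code by hand:

result.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
print(result.ok)           # True for status codes below 400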

We can also check the HTTP header of the website to verify that we have indeed accessed the correct page:

print(result.headers)
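Since result.headers behaves like a dictionary, we can also pull out a single header of interest, for example the content type (just an illustrative aside):

print(result.headers.get("Content-Type"))  # e.g. "text/html; charset=UTF-8"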

Now, let us store the page content we retrieved with requests in a variable called "var":

var = result.content

So far, we have simply taken the content of the specific page and stored it in the variable "var". You can view the content by running: print(var)
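Because the raw page source can be very long, a small sketch like this (my own shortcut, not a required step) prints only the first few hundred bytes, which is enough to confirm we got HTML back:

print(var[:300])  # show only the first 300 bytes of the page source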

Now that we have the page source stored, we will use the Beautiful Soup module to parse and process it. To do so, we create a BeautifulSoup object based on the "var" variable we created above:

soup = BeautifulSoup(var, 'lxml')

“lxml” is the most feature-rich and easy-to-use library for processing XML and HTML in Python.
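If lxml is not installed on your machine, a reasonable fallback (an assumption on my part, not something this tutorial relies on) is Python's built-in parser:

soup = BeautifulSoup(var, "html.parser")  # slower and less lenient than lxml, but needs no extra install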

Now that the page source has been processed by Beautiful Soup, we can access specific information directly from it. For instance, say we want to see a list of all of the links on the page.

Our job is to extract all of the links on the page that point to briefings and statements. So how can we achieve that?

We can achieve this by visiting the page we want to scrape and following a few simple steps, which can serve as a reference for scraping any permitted website.

Step 1: Visit the website we want to scrape

Let’s visit the website we want to scrape, which is https://www.whitehouse.gov/briefings-statements/

Figure: 1

Step 2: Click the “Inspect” button

Right-click on the page and, at the bottom of the menu, you will see an “Inspect” option, which opens the HTML source of the page we want to scrape.

It would look like this:

Figure: 2

Now, we select the text/link we want to scrape as a reference. The HTML tags (h1, h2, h3, h4), class, and hyperlink element <a> will be the same for all the links we want to scrape.

Zooming in, let’s see which tag our link is enclosed in.

Figure: 3

Here we can see that the link we want to scrape is enclosed in an “h2” HTML tag and the hyperlink element <a>.

Step 3: Extracting the <a> elements from the HTML (optional)

Beautiful Soup provides great features for extracting specific content from the website we want to scrape. One such command is:

data = soup.find_all("a")

Here, we find every element under the hyperlink tag <a> and store the result in the variable “data”. We can inspect the data by running: print(data)
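To make the contents of “data” easier to read, a small sketch like this (just an aside, not one of the tutorial's numbered steps) loops over the tags and prints each link's text and URL:

for a_tag in data:
    print(a_tag.get_text(strip=True), "->", a_tag.get("href"))  # .get() returns None if a tag has no href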

Step 4: Storing the specific content in a list

As we saw in Step 3, we extracted the hyperlink elements individually, using the find_all command provided by Beautiful Soup.

In Step 4, we will wrap this step inside a for loop that iterates over every <a> element enclosed in an “h2” tag (please have a look at Figure 3), for all the links on the page.

We will achieve this by running the following code:

links = []  # Create a list in which we will store all our links

for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    if a_tag is not None:  # skip any h2 tags that do not contain a link
        links.append(a_tag.attrs['href'])

print(links)

These are the links we have scraped from the website, which you will see as output after running the code above.

Figure: 4 (Output data)
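As an optional alternative (my own sketch, not part of the original tutorial), the same result can be obtained more compactly with a CSS selector that matches <a> tags nested inside h2 tags:

links = [a_tag["href"] for a_tag in soup.select("h2 a") if a_tag.get("href")]
print(links)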

Well, that was all from my side. If you liked the explanation as well as the content, please do “Clap” on Medium.

Please refer to the code for this tutorial in the GitHub link I have attached below.

GitHub Link:

Also, connect with me on LinkedIn and share your genuine feedback on the tutorial.

Linkedin Link:
