Web Scraping with Beautiful Soup, Part 1

Jitendra Kumar
5 min read · Feb 2, 2023


Beautiful Soup is a third-party library in Python that allows you to parse HTML and XML files and extract useful information from them. The library is widely used for web scraping purposes as it makes it easier to extract data from web pages.

Web scraping is a process of extracting data from websites. In order to extract the desired data from a website, the first step is to download the HTML or XML source code of the page. The next step is to parse this source code and extract the desired information. This can be a tedious task as HTML and XML files are not very readable, and the data is often spread across different elements in a hierarchical structure.


Beautiful Soup helps in solving this problem by providing methods and objects that make it easier to extract data from HTML and XML files. It creates a parse tree from the source code of a web page which represents the structure of the data in a hierarchical manner. The parse tree can be used to extract the data by navigating through the tree and accessing the elements and their properties.
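For example, here is a minimal sketch of navigating such a parse tree, using a small made-up HTML snippet (the tag contents are purely illustrative):

from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration
html = "<html><body><h1>Hello</h1><p class='intro'>First paragraph</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Navigate the tree and access elements and their properties
print(soup.body.h1.text)       # Hello
print(soup.body.p["class"])    # ['intro']
print(soup.h1.parent.name)     # body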

Some of the key features of Beautiful Soup include:

  1. Easy to use: Beautiful Soup provides a simple and intuitive interface for extracting data from HTML and XML files.
  2. Supports multiple parsers: Beautiful Soup supports multiple parsers such as lxml, html5lib, and xml. You can choose the parser that best fits your needs.
  3. Unicode handling: Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
  4. Robustness: Beautiful Soup is designed to handle malformed HTML and XML files, making it a robust tool for web scraping.
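To illustrate the robustness mentioned in point 4, here is a minimal sketch that parses a deliberately malformed snippet (the missing closing tags are intentional):

from bs4 import BeautifulSoup

# Deliberately malformed HTML: the <p> and <body> tags are never closed
broken_html = "<html><body><h1>Title</h1><p>Unclosed paragraph"

soup = BeautifulSoup(broken_html, "html.parser")

# Beautiful Soup still builds a usable parse tree
print(soup.h1.text)   # Title
print(soup.p.text)    # Unclosed paragraph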

Before installing Beautiful Soup, we need to know about one more library: requests.

The requests library is a popular Python library used for sending HTTP requests. It allows you to send HTTP/1.1 requests extremely easily and it is one of the simplest libraries available.

In web scraping, you often need to send a request to a website to retrieve the HTML or XML source code of a page. The requests library provides a simple and intuitive interface for sending HTTP requests, including GET requests which are used to retrieve the source code of a web page.

pip install requests

For example, the following code uses the requests library to send a GET request to a website and retrieve the response:

import requests
url = "https://www.example.com"
response = requests.get(url)

In this code, the requests.get method is used to send a GET request to the specified URL, and the response is stored in the response variable. The response variable contains various information about the response, including the status code, headers, and the content of the response.

The content of the response can be accessed using the response.content attribute, which returns the content as a bytes object. This content can then be passed to Beautiful Soup for parsing and extracting the data.
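As a minimal sketch, here is how those pieces of the response can be inspected (the exact header values depend on the server you request):

import requests

url = "https://www.example.com"
response = requests.get(url)

# Inspect the response object
print(response.status_code)                   # e.g. 200 for a successful request
print(response.headers.get("Content-Type"))   # content type reported by the server
print(type(response.content))                 # <class 'bytes'>, the raw body of the page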

Installation

Beautiful Soup can be installed using the pip package manager by running the following command in your terminal or command prompt:

pip install beautifulsoup4

Installation is done, so let's start scraping. Here I'm using the Wikipedia website for scraping.

In the first step, make a request to the given URL:

import requests
url = "https://en.wikipedia.org/wiki/Web_scraping"
response = requests.get(url)
print(response.content)

Output: the raw HTML source of the page, printed as a bytes object. We can now pass this content to Beautiful Soup and extract data from it:

import requests
from bs4 import BeautifulSoup

# Make a GET request to the website
url = "https://www.example.com"
response = requests.get(url)

# Create a BeautifulSoup object and specify the parser
soup = BeautifulSoup(response.content, "html.parser")

# Find the desired elements in the HTML code
title_tag = soup.find("title")
header_tags = soup.find_all("h1")

# Extract the data from the elements
title = title_tag.text
headers = [header.text for header in header_tags]

# Print the extracted data
print("Title:", title)
print("Headers:", headers)

soup = BeautifulSoup(response.content, "html.parser"): This line creates a BeautifulSoup object from the content of the response, and specifies the parser as "html.parser". The BeautifulSoup constructor takes two arguments: the first is the content that you want to parse, and the second is the parser that you want to use. In this case, we are using the html.parser, which is a built-in parser in the Python Standard Library.
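If you prefer one of the other supported parsers, you pass its name as the second argument instead. For example, a minimal sketch using lxml (this assumes you have installed it separately with pip install lxml; the variable name soup_lxml is just illustrative):

# Same parsing step, but with the third-party lxml parser
soup_lxml = BeautifulSoup(response.content, "lxml")
print(soup_lxml.title.text)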

# This will print all the text present on the page
print(soup.text)

If we want to find text by a tag such as h1, h2, or title, we use the find() function, which returns the first matching element:

h1=soup.find("h1")
print(h1)

Output: the h1 element, including its HTML markup.

If we want plain text instead, we use print(h1.text), which returns only the text, i.e. "Web scraping".

Similarly, we can search by different tags such as h1, h2, p, a, and many more.
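As a minimal sketch, here is the same idea applied to the first paragraph and the first link on the page (what exactly gets printed depends on the page's markup):

first_p = soup.find("p")
first_a = soup.find("a")

print(first_p.text)          # text of the first paragraph
print(first_a.get("href"))   # destination of the first link (None if the tag has no href)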

If we want to find all the h2 tags present on the page, then we have to use the find_all() function; it will return a list of all matching elements on the page.

h2=soup.find_all('h2')
print(h2)

Output: a list of all the h2 elements found on the page.

If we want to print just the text, we have to iterate over the list:

for i in h2:
    print(i.text)

Output: the text of each h2 heading, printed one per line.

Similarly, we can find the links present on the page:

link=soup.find_all("a")
print(link)

It will return a list of all the “a” tags.

for i in link:
    print(i.get('href'))

Output: the href value of each link, printed one per line.

To get the links out of the HTML, we call the get() function on each tag and extract its “href” attribute, which contains the link.
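Putting it together, here is a minimal sketch that keeps only absolute links (those starting with "http") and skips tags without an href. The filtering rule is just an illustrative assumption; adapt it to the links you actually need:

links = []
for a in soup.find_all("a"):
    href = a.get("href")
    # Keep only absolute URLs; skip anchors, relative paths and tags without href
    if href and href.startswith("http"):
        links.append(href)

print(len(links), "absolute links found")
print(links[:5])   # a quick look at the first few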

In the next section we will cover how to find text by using a class name or an id, so follow me to keep learning web scraping :)

Hope you enjoyed learning!!
