How to get started with Beautiful Soup?

By Prakash R, published in featurepreneur, Aug 26, 2021

The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. To effectively harvest that data, you’ll need to become skilled at web scraping. The Python libraries requests and Beautiful Soup are powerful tools for the job.

Web Scraping:

Web scraping is the process of gathering information from the Internet. Even copying and pasting the lyrics of your favorite song is a form of web scraping! However, the words “web scraping” usually refer to a process that involves automation.

To use Beautiful Soup, you first need to install it:

$ pip install beautifulsoup4

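Beautiful Soup also relies on a parser. Python's built-in html.parser works without any extra installation; this tutorial uses lxml, which is generally faster but has to be installed separately.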

To install lxml:

$ pip install lxml
or, on Debian/Ubuntu:
$ apt-get install python3-lxml

To begin, we need to import Beautiful Soup and urllib, and grab the page's source code:

import bs4 as bs
import urllib.request

# download the raw HTML of the page as bytes
source = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
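A bare urlopen call waits indefinitely on a slow server and raises an exception on any failure. As a rough sketch of a more defensive fetch (the helper name fetch_html and the 10-second timeout are assumptions for illustration, not part of the original tutorial), something like this could be used:

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of url, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout value is arbitrary.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared in the response headers, if any.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"Could not fetch {url}: {exc}")
        return None


# Example call (needs network access):
# source = fetch_html('https://pythonprogramming.net/parsememcparseface/')
```

This version returns a str rather than bytes; Beautiful Soup accepts either.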

Then we create the “soup”, which is a BeautifulSoup object:

soup = bs.BeautifulSoup(source,'lxml')
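If lxml is not installed, asking for it raises bs4.FeatureNotFound. A minimal sketch of falling back to the bundled html.parser is shown here; make_soup is a hypothetical helper name and the markup is made up for the example:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is available, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


# Illustrative markup only; bytes work as well as str.
soup = make_soup(b"<html><head><title>Demo</title></head><body><p>Hi</p></body></html>")
print(type(soup).__name__)  # BeautifulSoup
print(soup.title.string)    # Demo
```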

If you print soup and source, the output looks much the same, but source is just the raw response data, while soup is an object that we can navigate by tag, like so:

# the <title> tag of the page:
print(soup.title)

# name of the tag:
print(soup.title.name)

# text inside the tag:
print(soup.title.string)

# navigating to the parent tag:
print(soup.title.parent.name)

# first <p> tag on the page:
print(soup.p)
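To see what each of these calls returns without fetching anything, here is a small self-contained sketch. The HTML is invented for illustration, and html.parser is used so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; not the page fetched above.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# html.parser ships with Python, so no extra install is needed here.
soup = BeautifulSoup(html, "html.parser")

print(soup.title)              # <title>Sample page</title>
print(soup.title.name)         # title
print(soup.title.string)       # Sample page
print(soup.title.parent.name)  # head
print(soup.p)                  # <p class="intro">First paragraph.</p>
print(soup.p["class"])         # ['intro']
```

Attributes are read like dictionary keys; class comes back as a list because an element can carry several classes.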

Finding paragraph tags <p> is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?

print(soup.find_all('p'))
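find_all returns a list-like result, so it can be counted, indexed, and narrowed with filters. A brief sketch on made-up HTML (the class name intro is an assumption for the example):

```python
from bs4 import BeautifulSoup

# Illustrative markup only.
html = """
<div>
  <p class="intro">Welcome.</p>
  <p>Details here.</p>
  <p class="intro">Welcome again.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")
print(len(paragraphs))           # 3
print(paragraphs[1].get_text())  # Details here.

# class_ (with a trailing underscore) filters by CSS class.
intro = soup.find_all("p", class_="intro")
print([p.get_text() for p in intro])  # ['Welcome.', 'Welcome again.']

# limit stops the search after the given number of matches.
first_two = soup.find_all("p", limit=2)
print(len(first_two))  # 2
```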

We can also iterate through them:

for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))

The difference between .string and .text is that .string returns a NavigableString object, while .text returns an ordinary Python string containing all the text inside the tag, including the text of any child tags. Notice that if the paragraph we call .string on contains child tags mixed in with its own text, we get None back.
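The following sketch, again on invented HTML, shows the two attributes side by side for a paragraph with and without a child tag:

```python
from bs4 import BeautifulSoup

# Two illustrative paragraphs: one plain, one containing a child <b> tag.
html = "<p>Plain text only.</p><p>Text with a <b>bold</b> word.</p>"
soup = BeautifulSoup(html, "html.parser")
simple, mixed = soup.find_all("p")

print(type(simple.string).__name__)  # NavigableString
print(simple.string)                 # Plain text only.
print(mixed.string)                  # None
print(mixed.text)                    # Text with a bold word.
print(type(mixed.text).__name__)     # str
```

When a page mixes text and inline tags, .text (or the equivalent get_text() method) is usually the safer choice.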
