Easy Web Scraping with Python BeautifulSoup

Felicia
6 min read · Jan 4, 2019


Web Scraping

I first started learning about web scraping using Selenium, an open-source framework for automated testing. We needed a way to test a browser’s user interface for correctness in legacy applications. Selenium IDE is so simple that its basic functionality can be learned in minutes. If you need any “advanced” programming features, such as for or while loops, you need to graduate to Selenium WebDriver coupled with one of the many programming languages it supports. In my case, I used Java to write automated testing scripts. However, Java isn’t the fastest language to learn, and the Eclipse IDE wasn’t the easiest to set up.

Now that I have learned Python, web scraping seems much simpler with Beautiful Soup, an open-source parsing library. You don’t have to tediously “walk” the DOM if the elements lack proper ID attributes. With Beautiful Soup, DOM elements (<a>, <div>, <p>, etc.) can be collected into a list with one command. Here’s a quick tutorial based on work created by Antonia Blair. Her explanations helped me learn Beautiful Soup in an amazingly short amount of time.

Setup

I use Vagrant as my Linux environment running Ubuntu (trusty64) v14.04. You will need to install:

  • python3 (my version is 3.4.3)
  • requests module (version 2.2.1)
  • BeautifulSoup module (version 4.2.1)

I used this to check my module versions.

Check python module versions

Note: if you use Python 2, you will use pip, not pip3.
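If you prefer to check versions from inside Python rather than from a screenshot of pip output, a small sketch like the one below works on Python 3.8+ (importlib.metadata is not available on the 3.4 interpreter used in this post; there, `pip3 show requests beautifulsoup4` does the same job):

```python
# Print the interpreter version and the versions of the two modules
# this tutorial uses. Assumes Python 3.8+ for importlib.metadata.
import platform
from importlib.metadata import version, PackageNotFoundError

print("python", platform.python_version())
for pkg in ("requests", "beautifulsoup4"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "is not installed")
```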

Basic BeautifulSoup Code

Once everything is set up, let’s see what the HTML content looks like on the PyLadies (https://www.pyladies.com) homepage.

With just a few lines of Python code, we include the modules, retrieve the contents, and then print the HTML code to the screen. It is remarkable how short this Python program is. Let’s call this program beautifulSoup.py. Also, make sure you set the correct Linux file permissions with

$ chmod 755 beautifulSoup.py

The program is below.

#!/usr/bin/python3
import requests # Include HTTP Requests module
from bs4 import BeautifulSoup # Include BS web scraping module
url = "http://www.pyladies.com" # Website / URL we will contact
r = requests.get(url) # Sends HTTP GET Request
soup = BeautifulSoup(r.text, "html.parser") # Parses HTTP Response
print(soup.prettify()) # Prints user-friendly results

To run this program, type:

$ ./beautifulSoup.py

A small screenshot of running the program is below, showing the HTTP response sent back from the pyladies.com web server:

Results of running a simple Beautiful Soup program

You can see what your browser requires to display the index.html page.

If you want to display the HTTP status code, just add a single command, where 200 is the standard response for a successful HTTP request. The program now looks like:

#!/usr/bin/python3
import requests # Include HTTP Requests module
from bs4 import BeautifulSoup # Include BS web scraping module
url = "http://www.pyladies.com" # Website / URL we will contact
r = requests.get(url) # Sends HTTP GET Request
print(r.status_code) # ---> Print HTTP status code <---
soup = BeautifulSoup(r.text, "html.parser") # Parses HTTP Response
print(soup.prettify()) # Prints user-friendly results

You can see only one line of code was added.

print(r.status_code)

The result is below:

HTTP status code of 200 (successful HTTP Request) is now outputted

In this post, the parsed data is stored in the soup variable. You, of course, can name your variable anything you want.

Finding a Match in the BeautifulSoup object

find() Method

find() is one of the best features in BeautifulSoup. It helps aggregate DOM elements easily so you can manipulate what you need.

Knowing which HTML element you want on a webpage is half the battle. To do this, I like to use the Google Chrome browser’s Inspect feature. On a Mac, hover over the element you want to grab (in this instance, the “Buy Stickers” button on pyladies.com) and 2-finger press; a menu opens with the “Inspect” option. On a Windows machine, right-click while hovering over the element for a similar menu option. To access web page elements in other browsers, read more here.

How to Inspect the DOM of a webpage
Identifying the “Buy Stickers” button on the webpage’s HTML code

Once you uniquely identify the element, then you can use BeautifulSoup’s find() to locate it. In this case, it’s

soup.find('div', id="stickers_btn")  # Use print() for the results

Printing the result displays the following.

Adding print(soup.find('div', id="stickers_btn"))
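To see find() work without hitting the live site, here is a minimal sketch using a made-up HTML snippet (the real pyladies.com markup around the button is more elaborate):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real pyladies.com page.
html = '<div id="stickers_btn"><a href="/store">Buy Stickers</a></div>'
soup = BeautifulSoup(html, "html.parser")

btn = soup.find('div', id="stickers_btn")  # locate the unique element
print(btn.get_text())                      # Buy Stickers
```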

title(), h1(), body() Methods

Here are other useful ways of locating the right HTML element:

# returns the first div on the page
soup.find('div')
# find the first div with id='welcome_message'
soup.find('div', id='welcome_message')
# attribute access finds the first matching tag of that name
soup.title
soup.h1
soup.body.div
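The attribute-access shortcuts are easiest to see on a small example. The page below is made up for illustration; any real page will be messier:

```python
from bs4 import BeautifulSoup

# A made-up page for illustration purposes only.
html = """<html><head><title>Demo Page</title></head>
<body><h1>Welcome</h1>
<div id="welcome_message">Hello there</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title.get_text())   # Demo Page
print(soup.h1.get_text())      # Welcome
print(soup.body.div['id'])     # welcome_message (first div in <body>)
```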

find_all() Method

Now, if you want to put all elements of the same type into a list, BeautifulSoup has find_all().

soup.find_all('a')      # finds all <a> elements
soup.find_all('a')[0] # reference the first <a> element
soup.find_all('a')[1] # reference the second <a> element
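A small self-contained sketch (the two-link HTML is hypothetical) shows that find_all() returns a list-like object you can measure and index:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; a real page will have many more links.
html = '<p><a href="/one">one</a> <a href="/two">two</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all('a')   # a list-like ResultSet of all <a> tags
print(len(links))            # 2
print(links[0].get_text())   # one
print(links[1]['href'])      # /two
```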

Once you have them in a list, you can iterate over your data. This is the power of using a programming language. This is when I found Selenium IDE lacking and shifted over to Selenium WebDriver and Java: looping through elements was vital for manipulating data and applying program logic.

for link in soup.find_all('a'):  # iterate over every <a> tag
    print(link)                  # print it to the screen

Print each <a> tag in pyladies.com

get_text() Method

But this can be hard to read. BeautifulSoup’s get_text() comes to the rescue. Changing the code to:

for link in soup.find_all('a'):  # iterate over every <a> tag
    print(link.get_text())       # print only its text to the screen

Print the text in each <a> tag in pyladies.com

get() Method

If you want to get the URL of every link on a page, get() is very useful.

for link in soup.find_all('a'):  # iterate over every <a> tag
    print(link.get('href'))      # print each link's URL

Using get() to find all links on a webpage
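One reason to prefer get('href') over link['href'] is that get() returns None instead of raising an error when a tag has no such attribute. A minimal sketch, with made-up HTML containing a link that lacks an href:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second <a> has no href attribute.
html = '<a href="/about">About</a> <a>no href here</a>'
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))   # /about, then None (no exception raised)
```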

Discovering the Rest of BeautifulSoup’s Methods

If you want to see the many possible commands in Beautiful Soup, you can use Python’s interactive mode and its double-tab completion feature: type the object name followed by a period (“.”), then press <tab><tab> to list the possibilities. This is similar to Python’s dir() function.

Enter the program above into the Python interpreter.

$ python3
Python 3.4.3 (default, Nov 28 2017, 16:41:13)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Then type soup. (including the period, with no spaces) and quickly hit <tab><tab>.

Generate a list of Beautiful Soup commands in python Interactive Mode using <tab><tab>
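If tab completion isn’t available in your shell, dir() gives a scriptable version of the same discovery. A sketch that lists just the find-family methods:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")

# dir() lists every attribute; filter it down to the find* methods.
finders = [name for name in dir(soup) if name.startswith('find')]
print(finders)   # includes 'find', 'find_all', 'find_next', ...
```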

In Summary

Python is a wonderful language, and its many modules make it easier to achieve your programming goals. I hope this was useful to those who, like me, have just started learning about BeautifulSoup.

Many thanks to Antonia Blair (antoniablair@gmail.com) for her tutorial, upon which this was based, and to PyLadies (New York City chapter), which is helping me master Python.

Felicia Hsieh is a software engineer in career transition, looking for a software engineering / devops role in the NYC/NJ area (or remote). She has an MBA, BSCS, and BSEE.
Github: www.github.com/feliciahsieh
Email: 214@holbertonschool.com
