Web Scraping Using Beautiful Soup and Requests in Python

Need to extract useful data from different websites? Hop on.

Siddharth Singh
The Startup
4 min read · Sep 12, 2020



What is Web Scraping?

Web Scraping is a technique for extracting large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database in table (spreadsheet) format.

Why is Web Scraping important?

Web Scraping helps you get the data you want from the Web. This data can be extracted from millions of URLs based on your requirements, and businesses use it to drive key decisions. For example, scraped prices can feed dynamic pricing, and scraped reviews help gauge a seller’s quality.

Now that you know why web scraping is important, let us move on to the libraries and modules we will be using to scrape websites.

We will be using two of the most popular tools out there: Beautiful Soup and requests.

Beautiful Soup is a Python package for parsing HTML and XML documents (including those with malformed markup, i.e. non-closed tags — it is named after “tag soup”). It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping.

The requests module allows you to send HTTP requests using Python.

I’ll be scraping https://www.trustpilot.com/ for reviews and then performing some sentiment analysis on them. This site is a good choice for beginners as it is regularly updated with new reviews.

Before starting, you need to have the following installed:

  • Python (latest version)
  • Beautiful Soup
pip install beautifulsoup4
  • requests
pip install requests

Let us see the content of the website we are going to scrape.

https://www.trustpilot.com/

Let’s Code!

Step 1: Create a Python file (say reviews.py)

Step 2: Import the libraries and modules

from bs4 import BeautifulSoup
import requests

Step 3: Send the HTTP request and store the response in a variable

url = "https://www.trustpilot.com/"
req = requests.get(url).text

The get() method sends a GET request to the specified URL.

.text returns the body of the response decoded into a plain string.

Step 4: Parse the HTML data (req)

soup = BeautifulSoup(req, 'html.parser')

html.parser is Python’s built-in structured markup processing tool. The underlying module defines a class called HTMLParser, which is used to parse HTML files, and it comes in handy for web crawling.

We create a BeautifulSoup object by passing two arguments:

  • req: the raw HTML content.
  • html.parser: the HTML parser we want to use.
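As a quick offline illustration (using a made-up HTML snippet, not Trustpilot’s actual markup), here is how a BeautifulSoup object turns raw HTML into a navigable tree:

```python
from bs4 import BeautifulSoup

# A tiny, invented HTML snippet standing in for the fetched page
html = "<html><body><h1>Reviews</h1><p class='intro'>Welcome!</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

print(soup.h1.text)                           # text inside the first <h1> tag
print(soup.find("p", class_="intro").text)    # text of the <p> with class "intro"
```

Once the soup object exists, tags can be reached either as attributes (soup.h1) or by searching with .find().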

Step 5: Searching and navigating through the parse tree (HTML data)

Now, we would like to extract some useful data from the HTML content. The soup object contains all the data in a nested structure that can be programmatically extracted. In our example, we are scraping a webpage consisting of some reviews, so we would like to create a program that saves those reviews (and all relevant information about them).

For that, we first need to inspect the website and see which class or div contains all the reviews.

https://www.trustpilot.com/

Now that we know which div class we need to target (‘reviewCard___2KiId’), let’s write the code. (Class names like this are auto-generated by the site’s build tooling and may change over time, so re-inspect the page if the selector stops matching.)

reviews = soup.find_all("div", class_="reviewCard___2KiId")

The .find_all() method is one of the most commonly used methods in Beautiful Soup. It looks through a tag’s descendants and retrieves all of them that match your filters.

Now we have stored all the reviews on the page in a variable called reviews. We can loop through it to get all the reviews and print them one by one.
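To see .find_all() in action without hitting the network, here is a small sketch on invented markup that mimics a page of review cards (the class names here are made up):

```python
from bs4 import BeautifulSoup

# Invented markup mimicking a page with several review cards
html = """
<div class="reviewCard">Great service</div>
<div class="reviewCard">Fast shipping</div>
<div class="other">Not a review</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only divs whose class matches the filter are returned
cards = soup.find_all("div", class_="reviewCard")

print(len(cards))                 # 2 matching divs
print([c.text for c in cards])    # their text contents
```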

Step 6: Looping through the extracted data to get relevant information

for review in reviews:
    author = review.find("div", class_="author___3-7MA").a.text.strip()
    rev = review.find("div", class_="reviewCardBody___2o5Ws").text
    print(author)
    print(rev)
    print()

Since the div with the class name “reviewCard___2KiId” has a lot of data in it, we need to drill down further to get the author’s name and the review itself.

Therefore, we use .find() to locate the div with class “author___3-7MA”, then navigate to the anchor tag inside it and extract its text. .strip() removes the extra surrounding whitespace.

For the review, we need to navigate to the “reviewCardBody___2o5Ws” class div and extract the text from it.

Then simply display the results.
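Putting Steps 5 and 6 together on a self-contained snippet (invented markup with the same nesting as the classes used above) shows the whole chain at work:

```python
from bs4 import BeautifulSoup

# Invented review-card markup mirroring the structure inspected on the site
html = """
<div class="reviewCard___2KiId">
  <div class="author___3-7MA"><a href="#"> Alice </a></div>
  <div class="reviewCardBody___2o5Ws">Loved it!</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for review in soup.find_all("div", class_="reviewCard___2KiId"):
    # .find() narrows to the child div, .a steps into the anchor,
    # .text pulls the string, .strip() removes the stray spaces
    author = review.find("div", class_="author___3-7MA").a.text.strip()
    rev = review.find("div", class_="reviewCardBody___2o5Ws").text
    print(author)   # Alice
    print(rev)      # Loved it!
```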

If you want to store the results in a .csv file, then visit the link.
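In the meantime, here is a minimal sketch of the CSV-writing step using Python’s built-in csv module (the author/review pairs below are example data standing in for the scraped results):

```python
import csv

# Example data standing in for the scraped (author, review) pairs
rows = [("Alice", "Loved it!"), ("Bob", "Quick delivery.")]

with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["author", "review"])   # header row
    writer.writerows(rows)                  # one row per review
```

newline="" is the documented way to open a file for csv.writer so that row endings are handled correctly across platforms.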

Here you go, you just scraped a website. Hurray!!

Full Code

Quick Note: If you want to perform sentiment analysis on each review, then visit the link (do give a star if you like it).

Check out my other article on Twitter Sentiment Analysis if you are more into sentiment analysis.

Quick Note: Web scraping can violate a website’s terms of service and is considered illegal in some cases. It may also cause your IP to be blocked permanently by a website.
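One simple courtesy check before scraping any site is its robots.txt. A sketch using the standard library’s urllib.robotparser (the rules and the example.com URLs below are invented for illustration; in practice you would load the real file with set_url() and read()):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse invented rules directly to keep the example offline.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch() tells you whether a given user agent may fetch a URL
print(rp.can_fetch("*", "https://example.com/reviews"))    # allowed
print(rp.can_fetch("*", "https://example.com/private/x"))  # disallowed
```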

Conclusion

In this article, I have shown you a simple example of how to create a web scraper in Python. From here, you can try to scrape any other website of your choice. If you have any queries, post them in the comments section below.

If you require more examples, then visit the link (do give a star if you like it).

Thanks for reading this article. I hope it’s helpful to you all!

Happy Coding !!
