Python Web Scraping Using BeautifulSoup

Navoda Nilakshi · Published in CodeX · Dec 12, 2021

Hello World! In this article, we are going to look into the nitty-gritty of web scraping in Python. But first, it is important to know what web scraping actually means and why we should care about it.

Imagine you are trying to analyze data on an e-commerce website. You need the data from that specific website to be accessible to your code so you can manipulate it. A typical way of doing this would be to make a request to the website’s API endpoint and grab the data you want. But this is not always possible, because there are websites that don’t expose a nice API. There are also instances where you have to pay in order to access the API. So how do we go about this? Well, that’s where web scraping comes in handy.

Web scraping allows us to quickly and painlessly grab the data we want from a website. As I mentioned, this is super helpful when the site you are after does not offer an API to extract the data you need.

This is not to say there are no concerns regarding web scraping. There is actually legislation around the topic of web scraping, because when it is done with bad intentions, it can lead to theft of intellectual property or an unfair competitive edge. That means it’s not lawful to grab data from a site and construct a whole other website with that data. Given below are a few rules of thumb you can abide by whenever you’re compelled to scrape.

  • If you can find a public API that can fetch the data you want, always use that API instead of scraping.
  • You can consult the robots.txt file before you actually start scraping if you want a clear idea of what the website asks you not to scrape. You can view this file by adding ‘/robots.txt’ to the website URL; for example, GitHub publishes its version at github.com/robots.txt.
  • Always be sure to request data at a reasonable pace. You can add a little time interval between requests to achieve this goal.
  • Always be polite and get the data you absolutely need rather than grabbing everything like you’re in a year-end sale.

Now let’s dive into how web scraping is actually done. In Python, we use the bs4 package, which ships BeautifulSoup as part of it. In addition, we need the requests module to actually send the HTTP requests. Once you install both the bs4 and requests modules, you can import them as shown below,
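Assuming both packages were installed with pip, a minimal setup looks something like this:

```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
```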

First, you have to make the request to the intended URL and get the response body back. There are several sites built specifically for scraping practice, Quotes to Scrape and Books to Scrape to name a couple. In this demo, I will be using the Quotes to Scrape site. You need to make the request and save the response like this,
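A sketch of that step, using the quotes.toscrape.com address the site lives at:

```python
# Request the first page of Quotes to Scrape and keep the response body
url = "http://quotes.toscrape.com"
response = requests.get(url)
html = response.text
```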

The response body we get back is just a string, so we can’t traverse or query it the way we want. The next step is to create an instance of BeautifulSoup by passing in the response text.
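Something along these lines, using Python’s built-in html.parser:

```python
# Parse the raw HTML string into a BeautifulSoup object we can query
soup = BeautifulSoup(html, "html.parser")
```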

If you inspect Quotes to Scrape using devtools, you can see that the details of each quote are placed under a div with the class ‘quote’.

After identifying the CSS selector of the container for each quote, all you have to do is select them. It can be done like this,
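.select() takes a CSS selector, so grabbing every quote container is a one-liner:

```python
# Select every div with the class 'quote' on the current page
quote_divs = soup.select(".quote")
```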

.select() returns a list even if there’s only one result. In this case we have a list of divs (quotes). We can loop over the list we just created and extract each quote and author, or anything else we want, and then finally append each record as a dictionary to another list for ease of use.

The quote text is located inside a span with a class of ‘text’, and the author name in a small tag with the class ‘author’. We use the .get_text() method to obtain the innerText of an element. With the trailing print statement you can see all the quotes and authors of the first page of the website in a list.
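Putting the last two steps together, the loop might look like this (the quotes list name is just for illustration):

```python
quotes = []

for quote_div in quote_divs:
    # The quote text sits in span.text, the author name in small.author
    text = quote_div.select(".text")[0].get_text()
    author = quote_div.select(".author")[0].get_text()
    quotes.append({"text": text, "author": author})

print(quotes)
```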

But more often than not, we need to extract data from all the available pages of the website, right? We want to follow the Next button again and again and get all the data available across all the pages rather than just one.

This is how it is done:

First we need to locate the next button or similar navigation to the next page in the current page. Then we select that next URL, and keep on grabbing data until that next URL is not there anymore (That means we’re on the last page).

The final code to grab all of the available quotes looks like this,
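Here is a sketch of that function; the scrape_quotes name and the ‘.next a’ selector for the pagination link are my own choices based on the site’s markup:

```python
from time import sleep

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://quotes.toscrape.com"


def scrape_quotes():
    quotes = []
    next_url = "/"  # start at the first page

    while next_url:
        try:
            # Wait a second between requests so we don't hammer the server
            sleep(1)
            response = requests.get(BASE_URL + next_url)
            soup = BeautifulSoup(response.text, "html.parser")
        except requests.exceptions.RequestException:
            break  # stop scraping if a request fails

        # Extract the text and author of every quote on the current page
        for quote_div in soup.select(".quote"):
            text = quote_div.select(".text")[0].get_text()
            author = quote_div.select(".author")[0].get_text()
            quotes.append({"text": text, "author": author})

        # Follow the 'Next' link; when it disappears, we're on the last page
        next_link = soup.select(".next a")
        next_url = next_link[0]["href"] if next_link else None

    return quotes
```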

In the above example, I have wrapped everything in a function for the sake of reusability. Also, by introducing a function we can run the scraping only when we need to, by calling the function, rather than running the scraping process every single time we run the file.

Note that I have added a 1-second gap between requests in the try block by calling sleep(1), so as not to be too harsh on the server. This is good practice when we’re doing web scraping.

You can also save the result to a CSV file afterwards if you prefer, like this,

First you need to import ‘DictWriter’ from the ‘csv’ module:
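```python
from csv import DictWriter
```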

Then you can write a little function which takes in the quotes list we generated and writes it to a CSV file.
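A minimal sketch; the write_to_csv name and the quotes.csv filename are just examples:

```python
def write_to_csv(quotes):
    # One column per dictionary key: the quote text and the author name
    with open("quotes.csv", "w", newline="") as file:
        writer = DictWriter(file, fieldnames=["text", "author"])
        writer.writeheader()
        writer.writerows(quotes)
```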

Finally, you can execute both functions like so to generate the CSV file with all the quotes and authors.
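Something along these lines:

```python
quotes = scrape_quotes()
write_to_csv(quotes)
```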

I hope you got a clear understanding of what web scraping is and how it is done. See you next time!
