Using Beautiful Soup’s SoupStrainer to Save Time and Memory When Web Scraping

Analyzing an incredible feature


The usual way of doing things
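Consider a minimal sketch of this approach (the requests call and the search URL here are illustrative placeholders, not the exact code from my scraper):

```python
import requests
from bs4 import BeautifulSoup as soup

# Illustrative search URL; any page you intend to scrape works the same way.
url = "https://www.newegg.com/p/pl?d=graphics+cards"
page_html = requests.get(url).text

# The constructor parses the ENTIRE page, even though we only
# care about a small section of it.
page_soup = soup(page_html, "html.parser")

# Extract just the div tags we are interested in after the full parse.
containers = page_soup.find_all("div", {"class": "item-cell"})
```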

The code above shows the conventional setup: the BeautifulSoup constructor (imported as soup) takes a site’s HTML and a string indicating the type of parser we need. This creates a BeautifulSoup object that we can use to scrape data. Scraping a page or two from a website this way should be fine. However, if you need to scrape multiple pages of data, then this is probably not the way to go. By default, the BeautifulSoup object parses the entire page of HTML that we provide, and we then have to use the find_all() method to extract the specific HTML tags we are interested in. Continuously building a parse tree for an entire page when we only need one section of it is a waste of time and memory. There has to be a better way.

How does the SoupStrainer work?

SoupStrainer allows us to specify which items within our site’s HTML are to be parsed. We can indicate that we want to parse only p tags, or only elements with a certain class or id value (these are just a few of the many options). If we are scraping multiple pages of data, then we can make sure that only the needed information is parsed. The benefits of using this feature are substantial.
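To illustrate a few of those options, here is a quick sketch (the tag name, class, and id values are only examples of the filters a SoupStrainer accepts):

```python
from bs4 import SoupStrainer

# Parse only <p> tags.
only_p_tags = SoupStrainer("p")

# Parse only elements whose class attribute is "item-cell".
only_item_cells = SoupStrainer(class_="item-cell")

# Parse only the element whose id attribute is "content".
only_content = SoupStrainer(id="content")
```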

Incorporating the strainer
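Here is a minimal sketch of that setup, again with an illustrative URL:

```python
import requests
from bs4 import BeautifulSoup as soup
from bs4 import SoupStrainer as strainer

url = "https://www.newegg.com/p/pl?d=graphics+cards"  # illustrative URL
page_html = requests.get(url).text

# Only div tags with a class value of "item-cell" will be parsed.
only_item_cells = strainer("div", {"class": "item-cell"})

# parse_only restricts the parse tree to the elements matched above.
page_soup = soup(page_html, "html.parser", parse_only=only_item_cells)
```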

The code above shows how the SoupStrainer object is used with the BeautifulSoup object. The SoupStrainer constructor (imported as strainer) is called with arguments that provide the parsing specifications, and the result is stored in the only_item_cells variable. The BeautifulSoup object is created as usual with arguments of a page’s HTML and the needed parser. However, only_item_cells is provided as a third argument, for the parse_only parameter. With parse_only set, the BeautifulSoup object parses only the specified items (in this case, only div tags with a class value of “item-cell”). Now, each time the BeautifulSoup constructor is called, we are left with a parsed HTML document containing only the div tags we want.

Note: Calling the find_all() method on a BeautifulSoup object returns a list. Because a SoupStrainer was used here, there is no need to call find_all() at all: the div elements we want are already the only contents of our parsed HTML. Instead, we convert page_soup itself into a list, as shown below, to access the data effectively.
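Continuing the sketch above, page_soup can be converted directly:

```python
# The strained soup holds nothing but the matching div tags, so
# converting it to a list exposes each graphics card individually.
containers = list(page_soup)

print(len(containers))  # the number of item-cell divs on the page
```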

Seeing the results


Once page_soup is converted into a list, it consists of 37 items (graphics cards) for this page, and each item in the list is an individual div element matching our specifications. The BeautifulSoup object has been successfully modified. We are no longer constrained to scraping everything from a web page.

Applying SoupStrainer to an actual web scraper


The scraper targets the URL of Newegg’s search results for graphics cards. Each item-cell on the page represents an individual graphics card, and the number of graphics cards on each page varies. A site like this is a perfect application for web scraping and for making use of the SoupStrainer.

Note: This URL is slightly modified in the code below with the addition of the “page” query parameter, which lets us scrape multiple pages of GIGABYTE graphics cards.

When you visit Newegg’s page for graphics cards, you will find a manufacturer checkbox that controls which graphics cards are shown. In the code, a dictionary represents a smaller version of this checkbox: each key is the name of a manufacturer, and its corresponding value is the identification number for that manufacturer. This dictionary is used to build a URL that sets the “N” query parameter to a manufacturer’s identification number. Page numbers, found beforehand by checking Newegg’s site, supply the values for the “page” query parameter. The final URL can then be used to fetch multiple pages of data for just the GIGABYTE graphics cards, and each page’s HTML is turned into a strained soup.
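The sketch below outlines that loop; the manufacturer identification number, the URL structure, and the hard-coded page count are placeholders standing in for the values read off Newegg’s site:

```python
import requests
from bs4 import BeautifulSoup as soup
from bs4 import SoupStrainer as strainer

# A smaller version of Newegg's manufacturer checkbox. The ID below is a
# placeholder; the real identification number comes from inspecting the
# checkbox links on Newegg's site.
manufacturers = {"GIGABYTE": "00000000"}

only_item_cells = strainer("div", {"class": "item-cell"})
graphics_cards = []

# Six pages of GIGABYTE cards, per the page numbers checked beforehand.
for page in range(1, 7):
    # Hypothetical URL structure: "N" selects the manufacturer and
    # "page" selects the page of results.
    url = ("https://www.newegg.com/p/pl?N=" + manufacturers["GIGABYTE"]
           + "&page=" + str(page))
    page_html = requests.get(url).text

    # Each page's HTML is strained down to just the item-cell divs.
    page_soup = soup(page_html, "html.parser", parse_only=only_item_cells)
    graphics_cards.extend(list(page_soup))
```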

I would like you to take a step back and think about how powerful this feature is. The example above scrapes six pages of data. If you had used the normal convention (BeautifulSoup alone), every element from each page would be part of the corresponding parsed document. Here, each parsed document contains only the elements associated with the graphics cards. This reduction in parsed items saves more and more time and memory as the number of pages rises.

Note: I have written a set of articles about building a simpler version of this web scraper (Building a Newegg Web Scraper, Part 1 and Part 2). Those articles cover the tools needed for the build and provide an in-depth code explanation, which is why I have not gone into full detail about how the code works here.

References

Richardson, L. (2020). Beautiful Soup Documentation. Crummy. Retrieved from https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Data Science Dojo. (2017, January 6). Intro to Web Scraping with Python and Beautiful Soup [YouTube video]. Retrieved from https://www.youtube.com/watch?v=XQgXKtPSzUI&t=205s

My GitHub repo for the web scraper code.
