Web Scraping with Python

Using the BeautifulSoup and requests libraries to crawl a website and extract data from multiple pages

Darren Willenberg
MLthinkbox
5 min read · Nov 4, 2022


What is web scraping?

Web scraping, web harvesting, or web data extraction is a technique for taking data from websites and manipulating it for further analysis. It can be used to gather market research, track prices, analyze performance metrics, or aggregate data from multiple sources. Web scraping is one of the most popular ways to extract data from the internet, and while it can be done in a variety of programming languages, Python is one of the most popular choices.

Scrape responsibly

It is important to be respectful of the websites you are scraping. This means not scraping too much data, not scraping too often, and not scraping sensitive data. Designing your web scraper carefully ensures that you do not accidentally overload the website you are scraping. Always check the robots.txt file of the target website to understand its scraping limits, e.g. https://www.enchantedlearning.com/robots.txt.

A robots.txt file may specify a “crawl delay” directive for one or more user agents, which tells a bot how quickly it can request pages from a website. For example, a crawl delay of 10 specifies that a crawler should not request a new page more often than once every 10 seconds.

What is BeautifulSoup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. You can find more details in the official documentation.

Code Example: Extracting data from the enchanted learning website

I recently had to perform a task that required scraping data from a website: crawling through hundreds of pages to generate a table of keywords from the home page, along with the list of words found by following each keyword's URL. The desired output is the pandas dataframe below.

The website is shown below.

Homepage of https://www.enchantedlearning.com/wordlist/

Solution strategy

We will utilize the following modules in our project:

  1. Initially, we will connect to the website using the requests library to pull all website data as HTML,
  2. We will then look for patterns in the data by inspecting the HTML and CSS,
  3. We will then use the beautifulsoup library to extract data falling within patterns,
  4. We will also make use of the time library so that our application can sleep for specific time intervals to prevent irresponsible web traffic on the target website,
  5. Finally, we will make use of the pandas library for displaying the data.
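The modules listed above can be brought in with a single import block:

```python
import time            # sleep between requests (responsible scraping)

import requests        # fetch the raw HTML
import pandas as pd    # tabulate the final results
from bs4 import BeautifulSoup  # parse and search the HTML
```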

The overall workflow looks something like below.

The solution

Import modules

We first use the requests library to pull the website data in HTML format.
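A minimal fetch looks something like the sketch below. The User-Agent header and timeout are my own additions, not part of the original code, but they are common courtesies when scraping:

```python
import requests

URL = "https://www.enchantedlearning.com/wordlist/"

# Identify ourselves with a simple User-Agent; some sites reject
# requests that omit one.
response = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
response.raise_for_status()  # fail loudly on a non-2xx status

html = response.text  # the page source as one HTML string
```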

Pulling the HTML content through BeautifulSoup, we can see that the data is more structured. Our words of interest are located within the <a> tag. The <a> tag defines a hyperlink, which is used to link from one page to another.

Two methods can be used to retrieve repeating patterns from a page, namely, find() or select().

  • find() matches on tag names and attributes,
  • select(), on the other hand, accepts CSS selectors as well.
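The contrast between the two can be sketched with a small stand-in snippet (hypothetical markup in the same shape as the target page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the word-list index page.
html = '<div class="card__content"><a href="/wordlist/art.shtml">Art</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find() matches on tag name (and optionally attributes); first hit only.
first_link = soup.find("a")
print(first_link.text)        # Art

# select() takes a CSS selector and returns every match as a list.
css_links = soup.select("div.card__content a")
print(css_links[0]["href"])   # /wordlist/art.shtml
```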

In this example, we will be selecting the pattern based on CSS. To do this we right-click on the webpage and click on Inspect (in Google Chrome).

We can see that our data of interest is located within the card__content CSS class.

Considering all of the above, entering our pattern into the select statement gives us a list of all links and their descriptions.

We can use .text.strip() to extract only the link descriptions.

We can use the href attribute to pull the URL.
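Putting those three steps together on a hypothetical snippet shaped like the index page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the word-list index page.
html = """
<div class="card__content">
  <a href="/wordlist/animal.shtml"> Animals </a>
  <a href="/wordlist/art.shtml"> Art </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.select(".card__content a")

# .text.strip() gives the clean link description; ["href"] gives the URL.
descriptions = [a.text.strip() for a in links]
urls = [a["href"] for a in links]

print(descriptions)  # ['Animals', 'Art']
print(urls)          # ['/wordlist/animal.shtml', '/wordlist/art.shtml']
```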

Because we are interested in the list of keywords, their URLs, and lists of words that can be found within each URL we will need to structure our dictionary accordingly. We create a dictionary called ‘categories’ which will take the link description as the key and a tuple of the URL and empty list to be populated during the crawl.
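A sketch of that structure, again on stand-in markup (the BASE prefix is an assumption to turn relative links into full URLs):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the word-list index page.
html = """
<div class="card__content">
  <a href="/wordlist/animal.shtml">Animals</a>
  <a href="/wordlist/art.shtml">Art</a>
</div>
"""

BASE = "https://www.enchantedlearning.com"
soup = BeautifulSoup(html, "html.parser")

# description -> (full URL, empty list to be populated during the crawl)
categories = {
    a.text.strip(): (BASE + a["href"], [])
    for a in soup.select(".card__content a")
}

print(categories["Art"])
# ('https://www.enchantedlearning.com/wordlist/art.shtml', [])
```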

To keep things short, I created the function below for iteration.
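The original function isn't reproduced here, but it might look roughly like this sketch, assuming the category pages use the same card__content layout as the index and honouring the crawl delay from robots.txt:

```python
import time

import requests
from bs4 import BeautifulSoup

def crawl_categories(categories, delay=10):
    """Visit each category URL and fill its word list in place.

    `categories` maps description -> (url, word_list); `delay` is the
    number of seconds to sleep between requests (crawl delay).
    """
    for description, (url, words) in categories.items():
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(response.text, "html.parser")
        # Assumption: word pages share the card__content layout.
        for a in soup.select(".card__content a"):
            words.append(a.text.strip())
        time.sleep(delay)  # scrape responsibly
```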

Inspecting the categories dictionary, we can now see that it has been populated with the new words pulled from each corresponding URL.

Comparing with the URL we can see that the words pulled are exactly as expected.

https://www.enchantedlearning.com/wordlist/art.shtml

Finally, we use pandas' from_dict to create a dataframe from the dictionary and apply various cosmetic adjustments, such as filtering data and changing column headings.
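The last step might look like the sketch below; the sample dictionary contents are made up for illustration:

```python
import pandas as pd

# Hypothetical crawl result: description -> (url, words found on that page).
categories = {
    "Art": ("https://www.enchantedlearning.com/wordlist/art.shtml",
            ["brush", "canvas", "easel"]),
}

# orient="index" uses the dictionary keys as the row index.
df = pd.DataFrame.from_dict(categories, orient="index",
                            columns=["url", "words"])
df.index.name = "category"
print(df)
```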

Conclusions

  • Using the BeautifulSoup and requests libraries is an effective way to extract data from web pages,
  • Inspecting your target web pages and understanding how to store your data is important,
  • Add a sleep timer so that you extract data responsibly,
  • The BeautifulSoup documentation is quite extensive, and there are many additional ways to extract data.

The project Jupyter notebook can be found here. If you find a better way to do this, do not hesitate to let me know! Thanks for reading!



Engineer | Analyst | Data Science Enthusiast | UCT | MLthinkbox Publication Founder