Web Scraping With Python Using BeautifulSoup and MongoDB
Web scraping refers to the extraction of data from a website.
With practically limitless data floating around the web, web scraping is an important tool for putting that data to use, and its use cases are nearly endless. It’s a skill every programmer can benefit from understanding and being able to apply.
What we’ll be building
In this post we’ll be building a scraper for http://quotes.toscrape.com/ . It’s a website made specifically for practicing web scraping. It has some famous quotes which we’ll scrape along with their author and tags and then later we’ll save them to a database.
It also has pagination, so we can learn how to iteratively scrape the quotes from each page.
Before we begin, please make sure you have set up your Python environment. If you haven’t, visit the official Python website and install it on your machine.
We’ll later also install the needed modules for this project.
Creating the project
Create a folder on your machine (I’ll call mine quote_scraper), then open your terminal and cd into that folder. Next we need to create a virtual environment, so in your terminal run:
python3 -m venv env
Wait for it to complete and you’ll see that an env folder was created inside your folder. It will hold the packages we install later, which Python needs to run your project.
Let’s activate the virtual environment. On macOS or Linux run the command below; on Windows, run env\Scripts\activate instead:
source env/bin/activate
Installing the dependencies
In the terminal we need to install the packages we’re going to use with pip, which comes preinstalled with Python. So let’s run:
pip3 install requests beautifulsoup4
Wait for them to install, then create a Python file in your project’s folder. I’ll call it index.py.
Scraping the data
The first thing to do when scraping a website is to open it and inspect its elements (HTML tags). So open your favorite browser, visit http://quotes.toscrape.com/ and open the element inspector.
Now we need to look for patterns in the website’s HTML structure. A well-written website has a semantic structure to its elements, which lets us locate the ones we’re looking for: in our case, the quotes.
After looking inside a few tags we can see that every quote element exists inside a div with the class quote.
If we look inside a single quote element, we can see the quote text inside a span with the class text, a small element with the class author, and a div with the class tags that contains several a (link) elements with the class tag. We will use these elements to get every quote, author and tag on the website.
Now that we know which elements we need to scrape, we can open the index.py file we created earlier in an editor or IDE and start coding. I’ll be using Sublime Text, but you can use any editor you like.
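A minimal sketch of a first-page scraper that matches the structure described above. The names parse_quotes and scrape_first_page are illustrative choices, and the class names (quote, text, author, tag) are the ones observed in the inspector, so double-check them against the live page:

```python
import requests
from bs4 import BeautifulSoup

URL = "http://quotes.toscrape.com/"


def parse_quotes(html):
    """Return a list of dicts (text, author, tags) for every quote in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    quotes = []
    # select() returns a list of every element with the class "quote"
    for quote_element in soup.select(".quote"):
        # Each quote has one text and one author, so select_one() is enough
        text = quote_element.select_one(".text").get_text()
        author = quote_element.select_one(".author").get_text()
        tags = []
        # A quote can carry several tags, so loop over all of them
        for tag_element in quote_element.select(".tag"):
            tags.append(tag_element.get_text())
        quotes.append({"text": text, "author": author, "tags": tags})
    return quotes


def scrape_first_page():
    """Download the first page and parse its quotes."""
    response = requests.get(URL, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return parse_quotes(response.text)
```

Ending index.py with print(scrape_first_page()) prints the scraped list when the file is run. Keeping the parsing in its own function is a design choice that makes it easy to try on a saved HTML file without hitting the site.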
In this code we have used BeautifulSoup’s select and select_one functions. They let us select elements with CSS selectors, which makes them very convenient. There are other functions as well (find, find_all, etc.).
The scraper has a for loop because select returns every element with the class quote as a list we can iterate over. Each quote has only one text element and one author element, so for those we use select_one. A second for loop collects all the tags, and finally we append the quote to the quotes list.
At the end we print all the quotes we’ve scraped. Add this code to your index.py, save it, and run it from the terminal to see the result:
python3 index.py
You may have noticed that this code only scrapes the first page of the website, yet the site has more pages with many more quotes. We can reach them by appending /page/num_of_page to the URL.
But how can we know how many pages there are? For all we know there could be 3 or 1000. We need a stopping condition, so let’s go back to the site and open the inspector once again.
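As a side note, here is a small sketch comparing the two styles on a made-up HTML fragment (SAMPLE_HTML is invented for illustration and only mimics the site’s structure). It suggests that find and find_all can reach the same elements as the CSS selectors:

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the structure of one quote on the site
SAMPLE_HTML = """
<div class="quote">
  <span class="text">An example quote.</span>
  <small class="author">Example Author</small>
  <div class="tags"><a class="tag">one</a><a class="tag">two</a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Lookups with CSS selectors
css_text = soup.select_one(".quote .text").get_text()
css_tags = [tag.get_text() for tag in soup.select(".quote .tag")]

# The same lookups with find() and find_all();
# class_ is used because "class" is a reserved word in Python
quote = soup.find("div", class_="quote")
find_text = quote.find("span", class_="text").get_text()
find_tags = [tag.get_text() for tag in quote.find_all("a", class_="tag")]

print(css_text == find_text, css_tags == find_tags)  # prints: True True
```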
Near the end of the page we find the next page link. We can use this element to tell whether there are more pages: if it exists we go to the next page, otherwise we return the result.
This is the code we have now. We’ve wrapped it in a function to make it a little nicer, nothing too fancy. We’ve added a page variable to keep track of the current page, and we select the next page element to decide whether to move on to the next page or end the loop and return the quotes.
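A sketch of the paginated version, under a few assumptions: the selector li.next > a is my reading of the site’s markup for the next page link (verify it in the inspector), and the fetch parameter is a hypothetical hook added so the loop can be tried against saved HTML instead of the live site.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://quotes.toscrape.com"


def fetch_page(page):
    """Download one page of quotes and return its HTML."""
    response = requests.get(f"{BASE_URL}/page/{page}/", timeout=10)
    response.raise_for_status()
    return response.text


def scrape_quotes(fetch=fetch_page):
    """Collect quotes page by page until no next page link is found."""
    quotes = []
    page = 1
    while True:
        soup = BeautifulSoup(fetch(page), "html.parser")
        for quote_element in soup.select(".quote"):
            quotes.append({
                "text": quote_element.select_one(".text").get_text(),
                "author": quote_element.select_one(".author").get_text(),
                "tags": [t.get_text() for t in quote_element.select(".tag")],
            })
        # Assumed selector: the last page has no "next" link, so we stop there
        if soup.select_one("li.next > a") is None:
            break
        page += 1
    return quotes
```

Ending the file with print(scrape_quotes()) prints the quotes from every page.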
Add this to your file and run it again. Now you’ll get all the quotes from the site.
Storing the quotes in the database
To store the quotes we first need a database. I’m going to use MongoDB, which offers a cloud database service called MongoDB Atlas. Head over to https://www.mongodb.com/cloud/atlas and sign up if you don’t already have an account, or use a local MongoDB database if you prefer.
After signing up for MongoDB Atlas, create a new cluster, choose the free tier and follow the instructions. Once it is created, click connect > connect to your application and copy the connection string.
Now, to interact with the database from Python, we need two modules: pymongo and dnspython. Installing pymongo with its srv extra gets both; dnspython is what lets pymongo resolve mongodb+srv:// connection strings, the kind Atlas typically gives you. The quotes keep shells such as zsh from interpreting the brackets:
pip3 install "pymongo[srv]"
We simply add a MongoDB connection and save all the quotes to the database at once. This is the final code:
Now replace ‘your mongodb connection string’ with your MongoDB Atlas connection string and run the code. You should see “inserted 100 articles” printed in the terminal. You can also view the data in your MongoDB Atlas instance, under db > quotes.
Congrats, you’ve successfully scraped the data and stored it in a database in very few lines of code!
Now the possibilities are endless: you can collect large amounts of data from all over the web in very little time. To learn more about BeautifulSoup, check out its docs.
One very important thing to remember when scraping is not to send too many requests to a website in a short time. You could disrupt its server and get blocked. Some sites also do not allow scraping at all, so use your newfound powers responsibly.
To learn more about web scraping using a hands-on approach, check out my Udemy course: Web Scraping in Python with BeautifulSoup by Example
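One way to act on this advice, sketched with the standard library only: check a site’s robots.txt rules before fetching, and pause between requests. The helper names is_allowed and fetch_politely, the one-second default delay and the sample rules are all assumptions for illustration, not something the site prescribes.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # give the server a break
        results.append(fetch(url))
    return results


# Example with made-up rules; real ones come from the site's /robots.txt
rules = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(rules, "http://example.com/page/1/"))
print(is_allowed(rules, "http://example.com/private/data"))
```

Here fetch can be any function that takes a URL and returns its content, for example a small wrapper around requests.get.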