BeautifulSoup and MongoDB: How to scrape unstructured data from the web and store it in a database with Python

Let’s learn Python concepts by developing a program that scrapes unstructured data from a webpage and stores it in a combination of SQL and NoSQL databases.

Saurabh Ghosh
Predict
8 min read · Jan 30, 2023



What to expect in this blog?

In this blog, you’ll explore the retrieval and storage of unstructured data. The focus will be on the areas below -

What is unstructured data — Unstructured data does not follow the traditional row-and-column schema that relational databases use. Each record can have a different structure, so such data is typically kept in a schema-flexible, horizontally scalable store such as a document database.

Sample unstructured data — An easy example of unstructured data is Wikipedia. In this blog though, you’ll use the IMDB webpage to retrieve details of movies. Each movie can have a different amount of metadata.

Scraping unstructured data — You’ll scrape details of movies from IMDB. Scraping will involve analyzing the page and identifying the right tag for the metadata you want to retrieve.

Storing data as documents in a NoSQL database — You’ll use MongoDB for storing the scraped data and integrate your Python program with the database using the PyMongo library. You’ll also store some details in a relational database (MySQL) from the same Python program.

Self-made GIF of the process

Some key points you’ll be exploring with this code -

  1. Using BeautifulSoup to parse and read a webpage.
  2. Using different methods of BeautifulSoup to find elements in the webpage e.g. find, find_all and select.
  3. Connecting to a MongoDB instance.
  4. Connecting to a MySQL instance.
  5. Creating a collection in MongoDB instance and inserting documents into the collection.
  6. Creating a table in MySQL instance and inserting records into the table.
  7. Using dictionary data types.
  8. Exception handling with try and except.

Disclaimer: While web scraping is an excellent way of programmatically pulling data off of websites, please do so responsibly. The script demonstrated here only uses the publicly available listing and publicly available metadata of movies. It does not try to access any URLs disallowed by the website’s robots.txt file. I am using this program only to demonstrate Python programming concepts. The scraped data is not used for any purpose other than this program. If you are exploring this demonstration yourself, please use your discretion.

Let’s plan for the work

Before starting the design and coding, let’s understand the requirement.

User interaction involved

This program does not require any user input. Once started, it runs until the top 50 movies of every genre in the predefined list have been scraped. So, there is no user interaction involved.

System processing requirement

  1. Use a predefined list of genres.
  2. For each of the genres, open the webpage from IMDB.com that lists the top 50 movies in that genre.
  3. For each movie in the top 50 list of one genre, read the title and the rank of the movie.
  4. For each movie in the top 50 list of one genre, open the webpage that contains the details of the movie.
  5. Read the basic details for each movie — genres, plot, director, writer, actor etc.
  6. The program should be flexible enough to scrape additional metadata if present.
  7. Create a collection for each genre in MongoDB. Store each movie of the genre as documents in the collection.
  8. While reading the details of the movies, store the key roles and names in a MySQL table. This will be useful for future applications.

High-level design thinking

Now you can plan the required methods and attributes.

DBHandler class

You need a class to handle the interaction with the database. This is necessary so that you do not need to update your scraper code if there is a change in database connectivity or logic. For example, if you need to interact with a different database in future (maybe Oracle instead of MySQL or another document database instead of MongoDB), you’ll only update your DBHandler class.

This class will have the methods below -

  1. __init__() — This will create the connection to the MongoDB and MySQL databases before any other operation.
  2. create_movie_roles_table() — This method will create a table in the MySQL database (if the table does not exist) to store the roles and names e.g. ‘James Cameron’ as ‘Director’ etc.
  3. insert_role_to_db() — This method will create a new role and name pair in the MySQL database table.
  4. create_movie_genre_collection() — This method will create a collection in MongoDB for the genre passed as a parameter. If the collection exists already, it’ll return the existing collection instance.
  5. insert_into_movie_collection() — This method will take the movie details in a dictionary object and insert the dictionary as a document into the MongoDB collection for the genre.
  6. close_connections() — This method will be called at the end of the program to close the connections with MongoDB and MySQL.

MovieScraper class

This class will read the web pages and retrieve the information. It’ll call necessary DBHandler methods to store the retrieved information. Although the class will have one primary method to execute the process, there are different steps of processing as described below.

This class will have the parts below -

  1. Global variables — The module of the MovieScraper class will contain multiple text constants. These are the page-specific element IDs or class names used in the program during scraping. You’ll find these in the GitHub repository linked at the end.
  2. __init__() — This method will set the user agent text and the list of genres to be used for the program.
  3. scrape_movies() — This is the primary method. Below are the primary steps -
    — Open the web page to read the list of the top 50 movies for a genre.
    — For each movie, open the page that contains more information about the movie e.g. plot, director’s name, writer’s name, actors’ names etc.
    — Call the DBHandler methods to create collection and store details about the movie in MongoDB.
    — Call the DBHandler methods to create a table and store the roles and names in the movies in the MySQL database.
    — Call the close_connections() method of the DBHandler class at the end of execution.

Let’s code

Now that you know the main methods and their purpose, let’s get started.

You can follow the comments and docstrings within the code snippets.

“DBHandler” class

Import, class definition and __init__() method

Key points —
— Connecting to a MongoDB instance.
— Connecting to a MySQL instance.
— Catching and handling exceptions
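
The full snippet is in the GitHub repository linked at the end. Below is a minimal sketch of the imports and the constructor, assuming a local MongoDB instance and a local MySQL database; the host names, database names and credentials are placeholders.

import mysql.connector
from pymongo import MongoClient
from pymongo.errors import PyMongoError


class DBHandler:
    """Handles all database interaction for the scraper."""

    def __init__(self):
        """Create the MongoDB client and the MySQL connection before any other operation."""
        try:
            # Connect to a local MongoDB instance and pick a database for the movies
            self.mongo_client = MongoClient('mongodb://localhost:27017/')
            self.mongo_client.admin.command('ping')  # force an early connection check
            self.mongo_db = self.mongo_client['movies_db']
            # Connect to a local MySQL instance (credentials here are placeholders)
            self.mysql_conn = mysql.connector.connect(
                host='localhost', user='root', password='password', database='movies')
        except (PyMongoError, mysql.connector.Error) as error:
            # Catch and report connection failures instead of failing silently
            print(f'Could not connect to the databases: {error}')
            raise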

create_movie_roles_table()

Key points —
— Creating a table in MySQL instance
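
Continuing the DBHandler sketch, this method might look like the following; the table and column names are illustrative assumptions, not necessarily those used in the repository.

    def create_movie_roles_table(self):
        """Create the MySQL table for role/name pairs if it does not exist yet."""
        cursor = self.mysql_conn.cursor()
        # The unique key lets the insert method detect duplicate role/name pairs later
        cursor.execute(
            'CREATE TABLE IF NOT EXISTS movie_roles ('
            ' id INT AUTO_INCREMENT PRIMARY KEY,'
            ' role VARCHAR(100) NOT NULL,'
            ' name VARCHAR(255) NOT NULL,'
            ' UNIQUE KEY role_name (role, name))')
        self.mysql_conn.commit()
        cursor.close()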

insert_role_to_db()

Key points —
— Inserting records into MySQL table
— Using a parameterized query to avoid SQL injection
— Handling specific SQL exceptions to detect duplicate record insertion
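
A sketch of this method, assuming the movie_roles table from the previous step. The parameterized query and the IntegrityError handling are the two points to note.

    def insert_role_to_db(self, role, name):
        """Insert one role/name pair into MySQL, skipping duplicates."""
        cursor = self.mysql_conn.cursor()
        try:
            # Placeholders keep the values out of the SQL string, avoiding SQL injection
            cursor.execute(
                'INSERT INTO movie_roles (role, name) VALUES (%s, %s)', (role, name))
            self.mysql_conn.commit()
        except mysql.connector.IntegrityError:
            # The unique key on (role, name) raises IntegrityError for duplicate records
            print(f'Skipping duplicate entry: {name} as {role}')
        finally:
            cursor.close()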

create_movie_genre_collection()

Key points —
— Creating a collection in the MongoDB instance
— Deleting existing documents from the collection
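
A possible shape of this method. In PyMongo, accessing a collection by name creates it lazily, so creating a new collection and returning an existing one are the same call.

    def create_movie_genre_collection(self, genre):
        """Return the MongoDB collection for a genre, clearing any earlier documents."""
        # Accessing a collection by name creates it if it does not already exist
        collection = self.mongo_db[genre]
        # Remove documents left over from a previous run so each run starts fresh
        collection.delete_many({})
        return collection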

insert_into_movie_collection()

Key points —
— Inserting a Python dictionary instance as a document in the MongoDB collection
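
The insert itself is a single PyMongo call; the dictionary built by the scraper is stored directly as a document.

    def insert_into_movie_collection(self, collection, movie_details):
        """Insert one movie dictionary as a document into the genre collection."""
        # PyMongo converts the Python dictionary to a BSON document on insert
        collection.insert_one(movie_details)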

close_connections()

Key points —
— Closing MySQL connection
— Closing MongoDB client
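
And the clean-up method:

    def close_connections(self):
        """Close the MySQL connection and the MongoDB client at the end of the run."""
        self.mysql_conn.close()
        self.mongo_client.close()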

“MovieScraper” class

Any web scraping program is very closely coupled with the target webpage design. So, it is very important to analyze and identify the HTML elements from which you want to retrieve data.

Before starting on the code, let’s understand briefly how the target pages and screen elements are structured, and how you’ll retrieve the details.

Self-made image
  • You’ll first open the page which contains the top 50 movies for one selected genre.
  • You’ll read the rank, title and URL for the individual movies from this page and add them to a dictionary object.
  • For each movie, you’ll open the page which contains other details of the movie.
  • You’ll read the other metadata of the movie and add them to the dictionary object for the movie.
  • You’ll add this dictionary object to the MongoDB collection for the genre.
  • Once all 50 movies in one genre are added to the MongoDB collection, you’ll repeat the above steps again for the next genre.

Now let’s code this class.

Import, class definition and __init__() method
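
A sketch of the imports and the constructor. The user-agent string, the genre list and the dbhandler module name are illustrative assumptions.

import requests
from bs4 import BeautifulSoup

from dbhandler import DBHandler  # assumed module name for the DBHandler class above


class MovieScraper:
    """Scrapes the IMDB listing and detail pages and hands the data to DBHandler."""

    def __init__(self):
        # A browser-like user agent so that the request is not rejected outright
        self.user_agent = {'User-Agent': 'Mozilla/5.0'}
        # Predefined list of genres to scrape (an illustrative subset)
        self.genres = ['action', 'comedy', 'drama', 'thriller']
        self.db_handler = DBHandler()
        self.db_handler.create_movie_roles_table()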

scrape_movies()

  • Open the top 50 list of movies for a genre and create a collection for the genre in MongoDB

Key points —
— Reading the page with BeautifulSoup
— Using the “find_all” method to locate specific divs with a class name
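
A sketch of this first step. The URL pattern and the class name passed to find_all are assumptions about the IMDB search listing; the real constants live in the repository.

    def scrape_movies(self):
        """Primary method - scrape the top 50 movies for every genre in the list."""
        for genre in self.genres:
            # Listing page with the top 50 movies of one genre (URL pattern is an assumption)
            url = (f'https://www.imdb.com/search/title/?genres={genre}'
                   '&sort=user_rating,desc&count=50')
            response = requests.get(url, headers=self.user_agent)
            soup = BeautifulSoup(response.text, 'html.parser')
            # One MongoDB collection per genre
            collection = self.db_handler.create_movie_genre_collection(genre)
            # find_all() returns every listing block that carries the class name
            movie_items = soup.find_all('div', class_='lister-item-content')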

  • Read each item in the list and get rank, title, URL

Key points —
— Using the “find” method to locate a specific element (‘div’, ‘a’) with a class name
— Using the “find_all” method to locate specific divs with a class name
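
Continuing inside the genre loop, a sketch of reading the rank, title and URL for each listed movie; again, the class names are assumptions.

            for item in movie_items:
                header = item.find('h3', class_='lister-item-header')
                # find() returns the first matching element inside the listing block
                rank = header.find('span', class_='lister-item-index').get_text(strip=True)
                link = header.find('a')
                title = link.get_text(strip=True)
                movie_url = 'https://www.imdb.com' + link['href']
                # Start the dictionary that will become the MongoDB document
                movie = {'rank': rank, 'title': title, 'url': movie_url}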

  • Open the web page for each movie and read some of the metadata for the movie

Key points —
— Reading the page with BeautifulSoup
— Using the “find_all” method to locate specific divs with a class name
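
Still inside the loop over movies, a sketch of opening the detail page and reading some basic metadata; the data-testid value and the chip class name are assumptions about the current IMDB markup.

                detail_response = requests.get(movie_url, headers=self.user_agent)
                detail_soup = BeautifulSoup(detail_response.text, 'html.parser')
                # The plot summary, if present on the page
                plot = detail_soup.find('span', attrs={'data-testid': 'plot-xl'})
                if plot:
                    movie['plot'] = plot.get_text(strip=True)
                # Genre chips listed on the detail page (class name is an assumption)
                genre_tags = detail_soup.find_all('span', class_='ipc-chip__text')
                if genre_tags:
                    movie['genres'] = [tag.get_text(strip=True) for tag in genre_tags]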

  • Read further metadata for the movie in a generic way as metadata name and value pairs

Key points —
— Using the “find_all” method to locate specific divs with a class name
— Using the “select” method to locate specific elements with a CSS selector
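
A sketch of the generic name/value part, assuming the principal-credit blocks on the detail page. Each block carries a role label such as ‘Director’ or ‘Stars’ and a list of names, which also feeds the MySQL table.

                # Each principal-credit block holds a role label and a list of names
                credit_blocks = detail_soup.select('li[data-testid="title-pc-principal-credit"]')
                for block in credit_blocks:
                    label = block.find(class_='ipc-metadata-list-item__label')
                    names = [a.get_text(strip=True) for a in block.select('ul li a')]
                    if label and names:
                        role = label.get_text(strip=True)
                        # Store the pair generically - the key is whatever label the page shows
                        movie[role] = names
                        # Also store each role/name pair in the MySQL table
                        for name in names:
                            self.db_handler.insert_role_to_db(role, name)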

  • Insert the dictionary object into MongoDB and end the method “scrape_movies()”
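
The last piece inserts the finished dictionary and, once every genre is processed, closes the database connections.

                # Insert the finished movie dictionary as a document into the genre collection
                self.db_handler.insert_into_movie_collection(collection, movie)
        # All genres processed - close the MongoDB and MySQL connections
        self.db_handler.close_connections()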

Starting/Invoking the application

This is written within the application module (the “moviescraper.py” file in GitHub), but outside the class.
Notice the check for “if __name__ == ‘__main__’:”. This ensures that the code under this condition runs only when the module is executed directly (for example with the command shown in the next section), not when it is imported from another module.
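
A minimal version of that guard block might look like this:

if __name__ == '__main__':
    # Executed only when the module is run directly, not when it is imported
    scraper = MovieScraper()
    scraper.scrape_movies()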

If you want to make a package for the application, the above code will go into the “__main__.py” file in the package.

Run the program

If you have kept the code in a package called “boxoffice”, you can start the program with the command -

“python -m boxoffice moviescraper.py”

The program will start scraping automatically -

Sample screenshot of execution

In the MongoDB collection, the documents can be viewed as shown below (with the MongoDB Compass tool) -

Screenshot from local MongoDB instance

In the MySQL database, the names against the roles can be seen as below (with MySQL Workbench).

Screenshot from local MySQL instance

Happy coding!!

Download

GitHub — https://github.com/SaurabhGhosh/MovieScraper_Web_to_MongoDB.git

Conclusion

In this blog, I hope you got some ideas about the following -

  1. Using BeautifulSoup to parse and read a webpage.
  2. Using different methods of BeautifulSoup to find elements in the webpage e.g. find, find_all and select.
  3. Connecting to a MongoDB instance.
  4. Connecting to a MySQL instance.
  5. Creating a collection in MongoDB instance and inserting documents into the collection.
  6. Creating a table in MySQL instance and inserting records into the table.
  7. Using dictionary data types.
  8. Exception handling with try and except.

Now that the data are collected, in the next blog, you’ll learn how to retrieve data from MongoDB and integrate it with a UI application built with Tkinter.

If you have any questions related to this program, please feel free to post your comments.

Please like, comment and follow me! Keep Learning!
