Scraping IMDB data using Python BeautifulSoup and lxml
In this installment of our web scraping tutorial series, we look at how to scrape data from the IMDB website. The data and the Python notebook file are available for download at the bottom of the page.
The Internet Movie Database (IMDB) is one of the most popular websites in the world. It has a vast database of movies, actors, and directors, as well as trivia and reviews on each movie.
Because it’s so popular, many people use it to find information about their favorite actors and movies. However, IMDB’s website is not easy to navigate and doesn’t always provide the most relevant information on a particular actor or movie.
If you want to build a movie recommendation engine that recommends movies according to your taste, you’ll need data sets of different movies of different genres. Scraping IMDB makes it possible to extract all of this data in an automated fashion, which can then be analyzed by computer programs. This allows you to get precisely what you want without spending hours looking through pages of unorganized information.
This blog has a few key differences from the previous tutorial. They are listed below.
- We will be scraping data from the IMDB website.
- The data we scrape will be stored in JSON format.
- We will use XPaths instead of CSS selectors to locate the elements on the HTML page.
We will extract the movies on the IMDb top 250 list using Python with BeautifulSoup, lxml, and a few other libraries. We chose this example for a couple of reasons.
- The first reason is that it is a simple website, so scraping the data is straightforward. Even people who are just learning Python should be able to build a web scraper for IMDB.
- The second reason is to introduce readers to JSON, a widely used data format. Most tutorials focus on extracting data into CSV or Excel files, and we wanted to give JSON a try.
We will be extracting the following data attributes from the individual pages.
- The movie URL: the URL that takes us to the target page
- Rank: the rank of the movie in the top list
- The movie name: the name of the movie
- Movie Year: the year the movie was released
- Genre: the genre of the movie, which could be a single genre or a list of genres
- Director Name: the name of the director
- Rating: the IMDB rating of the movie
- Actors List: the list of the movie's cast members
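To make the target concrete, here is a minimal sketch of how one scraped movie could be held in a Python dictionary and serialized to JSON. The key names are our own assumption for illustration, and the values are sample data that may not match what the live site shows.

```python
import json

# One scraped movie as a dictionary. The key names are hypothetical,
# and the values are sample data that may differ from the live site.
movie = {
    "movie_url": "https://www.imdb.com/title/tt0111161/",
    "rank": 1,
    "movie_name": "The Shawshank Redemption",
    "movie_year": 1994,
    "genre": ["Drama"],
    "director_name": "Frank Darabont",
    "rating": 9.3,
    "actors_list": ["Tim Robbins", "Morgan Freeman", "Bob Gunton"],
}

# Serialize the dictionary to a JSON string; Python lists become JSON arrays.
movie_json = json.dumps(movie, indent=4)
print(movie_json)
```

Keeping genre and actors as lists lets a movie with one genre and a movie with several share the same structure.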
Importing required Libraries
Here is the Python code to import the required libraries. We import the requests library, BeautifulSoup, the etree module from lxml, and the time, random, json, and unidecode libraries. If any of these libraries are not installed, install them first.
We will explain the use of each library in due course.
import requests
from bs4 import BeautifulSoup
from lxml import etree as et
import time
import random
import json
from unidecode import unidecode
Any web scraper needs to know where to start the scraping process. We usually refer to it as the start URL. We also need to create a user agent and an empty list to store the movie URLs.
The requests library is a Python module that lets you send HTTP requests and process the responses. You can use it for simple calls, such as retrieving information from a website or sending data to a server over HTTP.
start_url = "https://www.imdb.com/chart/top"
# request header
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
movie_urls = []
The IMDb page looks like this. The first step is to find the URLs to each of the 250 movies listed on the page.
We will use Chrome developer tools to find the link on the page. Instead of a CSS selector, we will use XPaths to locate the element. Open the developer tools and hover over the first movie name.
You can see the data is enclosed in an ‘a’ tag. This ‘a’ tag sits in a table that holds the links for all 250 movies. We can use the following XPath to get the information out of it.
//td[@class="titleColumn"]/a/@href
It says we need to go to the table and find every td cell whose class is "titleColumn". Within that cell, there is an ‘a’ tag, and @href returns the link stored in that tag. The contents attribute of a BeautifulSoup object is a list of all its child elements.
We’ve also used the BeautifulSoup and etree libraries here. BeautifulSoup is a Python library that makes it easy to parse HTML and XML documents. It’s useful for everything from quick, simple tasks to complex data mining and analysis. BeautifulSoup does not support XPath itself, which is why we pair it with etree, the lxml module for parsing and generating XML and HTML data. Its API is largely compatible with the standard library's ElementTree package, and it lets you parse, generate, validate, and otherwise manipulate XML data, including querying it with XPath.
First, we get a response using the requests library. The next step is to build a BeautifulSoup object from the response using the HTML parser. etree then converts the page into an element tree, a structure that makes programmatic navigation simple. Running the XPath against that tree gives us a list of movie URLs.
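The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the exact notebook code: the function names are hypothetical, the sample HTML is a made-up fragment that mimics the titleColumn markup described earlier, and the live page markup may differ from it.

```python
import requests
from bs4 import BeautifulSoup
from lxml import etree as et

start_url = "https://www.imdb.com/chart/top"
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/81.0.4044.138 Safari/537.36"}
MOVIE_LINK_XPATH = '//td[@class="titleColumn"]/a/@href'


def extract_movie_paths(html):
    """Return the href of every movie link found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")  # parse the raw HTML
    dom = et.HTML(str(soup))                   # build an lxml element tree
    return dom.xpath(MOVIE_LINK_XPATH)         # one string per matching link


def fetch_movie_paths(url=start_url):
    """Download the chart page and extract the movie links (hypothetical helper)."""
    response = requests.get(url, headers=header, timeout=30)
    response.raise_for_status()
    return extract_movie_paths(response.content)


# Made-up fragment used to try the extraction without touching the network.
sample_html = """
<table><tbody>
<tr><td class="titleColumn"><a href="/title/tt0111161/?ref_=sample_1">The Shawshank Redemption</a></td></tr>
<tr><td class="titleColumn"><a href="/title/tt0068646/?ref_=sample_2">The Godfather</a></td></tr>
</tbody></table>
"""
sample_paths = extract_movie_paths(sample_html)
print(sample_paths)
```

Splitting the download from the extraction keeps the XPath logic easy to check against a saved page before running it on the live site.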
Upon inspection, we can find that the data in the movies_urls_list is not in the format we need. It does not have the IMDb domain name, and the URL is too long.
We need to prepend the IMDB domain to each URL string we obtained. Upon further inspection, we can also see that even if we remove everything after the question mark (“?”), the link is still valid and goes to the same page. Only this shortened URL needs to be added to the movie_urls list. Experiment with the data, and you’ll see.
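One possible way to do this cleanup is sketched below. The helper name clean_movie_url and the sample paths are hypothetical; only the idea of prepending the domain and cutting at the question mark comes from the text above.

```python
BASE_URL = "https://www.imdb.com"


def clean_movie_url(raw_path, base_url=BASE_URL):
    """Prepend the domain and drop the '?' and everything after it."""
    path_without_query = raw_path.split("?", 1)[0]
    return base_url + path_without_query


# Made-up sample of what the XPath step might return.
movies_urls_list = [
    "/title/tt0111161/?ref_=sample_1",
    "/title/tt0068646/?ref_=sample_2",
]

movie_urls = [clean_movie_url(path) for path in movies_urls_list]
print(movie_urls)
```

Because split is called with a limit of 1, a path that has no question mark passes through unchanged apart from the added domain.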
How to add a time delay between requests using Python
It is always a good idea to add a time delay between successive requests. This ensures that we’re not burdening the target website's server. If somebody does it aggressively, it can violate a law called trespass to
We will use a simple time delay function to implement this. The function picks a random number between 2 and 5 and waits that many seconds before the next request is sent.
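A minimal sketch of such a function, assuming whole-second delays, could look like this. The name time_delay and its parameters are our own choice for illustration.

```python
import random
import time


def time_delay(min_seconds=2, max_seconds=5):
    """Sleep for a random whole number of seconds and return that number."""
    delay = random.randint(min_seconds, max_seconds)  # both ends are inclusive
    time.sleep(delay)
    return delay
```

Calling time_delay() before each request in the scraping loop spaces the requests out, and the randomness avoids hitting the server at a fixed interval.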
How to write Scraped data to a JSON file
We need to use the json library to write the scraped data into a JSON file. We will write a function that writes the elements to the JSON file and call it every time a new movie's data is scraped.
We use the json.dump function to write the scraped data, held in a dictionary, to the file we created.
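As a rough sketch of this step, the function below appends one movie dictionary to a JSON list on disk. The function name, the default file name, and the choice to store the movies as a single JSON list are assumptions made for illustration.

```python
import json
import os


def write_to_json(movie, filename="top_250_movies.json"):
    """Append one movie dictionary to the JSON list stored in filename."""
    movies = []
    if os.path.exists(filename):
        # Load what has been written so far so the file stays valid JSON.
        with open(filename, "r", encoding="utf-8") as json_file:
            movies = json.load(json_file)
    movies.append(movie)
    with open(filename, "w", encoding="utf-8") as json_file:
        json.dump(movies, json_file, indent=4, ensure_ascii=False)
```

Rewriting the whole file on every call is simple and keeps the file readable at any point, which is fine for 250 records. For much larger scrapes, collecting the records in memory and dumping once at the end would be cheaper.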