Web Scraping IMDB Top 250 Movies

4 min read · Mar 15, 2024


In the previous article, I talked about how to web scrape articles from the IGN website using the Beautiful Soup library.

Web Scraping IGN Article Using Python BeautifulSoup4

In this article, I will show you how to utilize the power of web scraping to extract information from IGN articles…

medium.com

In this article, I will explain how to scrape top movie data from the IMDb site using the same library, Beautiful Soup.

As we know, IMDb is one of the leading sources of information about movies, and with web scraping techniques we can collect and analyze data on its top-rated movies.

Okay, let’s do it.

Start Web Scraping Top 250 IMDB Movies

First, let’s install the necessary packages (if you haven’t already).

# Install all three packages at once (re ships with Python, so it needs no install)
!pip install beautifulsoup4 requests pandas

Next, we will import the necessary packages and define headers so that our request looks like it comes from a regular browser, which makes it less likely that the site treats us as a bot.

# Import the Packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Define Headers and Target URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1"
}

base_url = 'https://www.imdb.com/chart/top/'

If you are not familiar with these headers, you can check this article, which explains each one.
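If you want to see which headers would actually be sent, here is a small sketch that builds the request without sending it. The name `example_headers` is mine and it only repeats two of the headers above for brevity; nothing here touches the network.

```python
import requests

# Two of the headers defined above, repeated here so the snippet runs on its own.
example_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# Build the request but do not send it.
prepared = requests.Request(
    "GET", "https://www.imdb.com/chart/top/", headers=example_headers
).prepare()

# These are the headers that would go out with the request.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

This is only a way to double-check the headers; the actual request is still sent with `requests.get`.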

Now, we will send a GET request to the target URL, which in this case is the IMDb Top 250 movies page, and parse the returned HTML.

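# Send a GET request to the Target URL and parse the response
req = requests.get(base_url, headers=headers, timeout=10)
req.raise_for_status()  # stop early if the server returned an error status
soup = BeautifulSoup(req.content, 'html.parser')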
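If you are new to Beautiful Soup, it may help to try the same calls on a tiny page first. The sketch below uses hand-written HTML (not real IMDb markup, and the class names `card` and `title` are made up) to show the three operations used later in this article: `find_all`, `find`, and reading an attribute.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, NOT real IMDb markup.
sample_html = """
<html><body>
  <div class="card"><h3 class="title">First Movie</h3><span aria-label="Rating: 9.0">9.0</span></div>
  <div class="card"><h3 class="title">Second Movie</h3><span aria-label="Rating: 8.5">8.5</span></div>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

cards = sample_soup.find_all("div", class_="card")   # every matching tag, as a list
first_title = cards[0].find("h3", class_="title")    # first match inside one card
first_label = cards[0].find("span")["aria-label"]    # read an attribute of a tag

print(len(cards))                 # 2
print(first_title.text.strip())   # First Movie
print(first_label.split()[-1])    # 9.0
```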

After sending the GET request and parsing the HTML, the next step is to find the tags, and their class names, that hold the data we want.

Only tags that contain movie information such as title, release year, duration, and so on will be retrieved.

Please note that the class names used in this code are the ones that worked when this article was written.
If you run this code and it doesn’t work, the most likely cause is that the class names have changed, so inspect the page and adjust them.
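One way to notice a changed class name quickly is to fail with a clear message when a search returns nothing. Here is a minimal sketch; `find_all_or_fail` is a helper name I made up, and the markup is hand-written for illustration only.

```python
from bs4 import BeautifulSoup

def find_all_or_fail(soup, name, class_name):
    """Return all matching tags, or raise with a hint when nothing matches."""
    found = soup.find_all(name, class_=class_name)
    if not found:
        raise ValueError(
            f"No <{name}> with class '{class_name}' found; "
            "the class name may have changed, or the page may not have loaded as expected."
        )
    return found

# Hand-written markup for illustration only.
demo_soup = BeautifulSoup('<div class="movie-card">A</div>', "html.parser")

print(len(find_all_or_fail(demo_soup, "div", "movie-card")))  # 1

try:
    find_all_or_fail(demo_soup, "div", "old-class-name")
except ValueError as err:
    print(err)
```

Without a check like this, an empty result simply produces an empty table at the end, which is harder to trace back to its cause.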

Figure: Look for the tags that contain movie components such as the title.

# You can name the variables as you like
get_html = soup.find_all('div', class_='sc-b0691f29-0 jbYPfh cli-children')
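A note on how `class_` behaves with several class names: when you pass the full string, Beautiful Soup compares it with the exact value of the class attribute, so the order matters. Passing a single class name matches any tag that has that class. The sketch below shows this on hand-written markup; the classes `abc-1` and `xyz` are made up, and whether `cli-children` alone stays stable on the real page is an assumption on my part.

```python
from bs4 import BeautifulSoup

# Hand-written markup; "abc-1" and "xyz" are made-up class names.
html_doc = '<div class="abc-1 xyz cli-children">movie</div>'
demo = BeautifulSoup(html_doc, "html.parser")

exact = demo.find_all("div", class_="abc-1 xyz cli-children")      # same string, same order
reordered = demo.find_all("div", class_="xyz abc-1 cli-children")  # same classes, different order
single = demo.find_all("div", class_="cli-children")               # one class is enough
css = demo.select("div.cli-children")                              # CSS selector version

print(len(exact), len(reordered), len(single), len(css))  # 1 0 1 1
```

So if the generated-looking part of the class name changes, searching by a single, more descriptive class may keep working a little longer.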

After finding the tags that contain the movie components, we will next locate the tags that hold each movie’s title, release year, duration, rating, and number of viewers (the vote count).

Movie Name

Figure: Look for the tag that contains the movie name.

Release Date

Figure: Look for the tag that contains the release date.

Duration

Figure: Look for the tag that contains the duration of the movie.

Ratings

Figure: Look for the tags that contain the ratings.

Viewers

Figure: Look for the tags that contain the viewers.

After specifying the tags to retrieve and their class names, we will loop through them to get a list of 250 movies.

Here is the code.

movie_data = []
for html in get_html:

    movie_dic = {}
    # Movie Name
    movie_name = html.find('h3', class_='ipc-title__text')
    movie_dic['Movie Name'] = movie_name.text.strip() if movie_name else 'unknown movie name'

    # Release Date (first span with the metadata class)
    metadata = html.find_all('span', class_='sc-b0691f29-8 ilsLEX cli-title-metadata-item')
    movie_dic['Release Date'] = metadata[0].text.strip() if len(metadata) > 0 else 'unknown release date'

    # Duration (second span with the same class; guarded so a missing span does not raise IndexError)
    movie_dic['Duration'] = metadata[1].text.strip() if len(metadata) > 1 else 'unknown duration'


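    # Rating (last word of the aria-label; guarded so a missing tag does not raise TypeError)
    rating_tag = html.find('span', class_='ipc-rating-star')
    rating_label = rating_tag.get('aria-label', '') if rating_tag else ''
    movie_dic['Rating'] = rating_label.split()[-1] if rating_label.strip() else 'unknown rating'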
    
    # Viewers (number in parentheses with an optional M or K suffix)
    viewers_tag = html.find('span', class_='ipc-rating-star--voteCount')
    viewers = re.search(r'\(([\d.]+[MK]?)\)', viewers_tag.text) if viewers_tag else None
    movie_dic['Viewers'] = viewers.group(1) if viewers else 'unknown viewers'
    movie_data.append(movie_dic)
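The vote-count pattern used in the loop can be tried on its own. Here is a small sketch; `VOTE_PATTERN` and `extract_votes` are names I made up, and the sample strings are invented rather than scraped. It also shows why stray leading characters matter: `re.match` only matches at the start of the string, while `re.search` looks anywhere in it.

```python
import re

# Same pattern as in the loop: a number in parentheses with an optional M or K suffix.
VOTE_PATTERN = re.compile(r"\(([\d.]+[MK]?)\)")

def extract_votes(text, default="unknown viewers"):
    """Return the text inside the parentheses, or a default when nothing matches."""
    match = VOTE_PATTERN.search(text)
    return match.group(1) if match else default

# Invented sample strings, not real scraped values.
for text in ["(2.9M)", " (950K)", "(1234)", "no votes"]:
    print(repr(text), "->", extract_votes(text))

# match() needs the pattern at position 0, so a leading space makes it fail.
print(VOTE_PATTERN.match(" (950K)"))  # None
```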

After looping, the movie_data list will be converted into a pandas DataFrame.

Here is the result.

# Create a DataFrame
data = pd.DataFrame(movie_data)
data

Figure: Results from scraping the top 250 IMDb movies.
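All the scraped values are strings, so if you want to sort or do calculations, you will probably want to convert them to numbers. Below is a minimal sketch of one way to do that. The rows are made up (same column names as above, not real scraped values), the helpers `duration_to_minutes` and `votes_to_int` are my own names, and I am assuming durations look like "2h 22m" and vote counts like "2.9M". Rows that hold placeholders such as 'unknown duration' would need to be filtered out first.

```python
import re
import pandas as pd

# Made-up rows in the same shape as movie_data; not real scraped values.
sample = pd.DataFrame([
    {"Movie Name": "1. Example Movie", "Release Date": "1994", "Duration": "2h 22m", "Rating": "9.3", "Viewers": "2.9M"},
    {"Movie Name": "2. Another Movie", "Release Date": "1972", "Duration": "55m", "Rating": "9.2", "Viewers": "950K"},
])

def duration_to_minutes(text):
    """Turn strings such as '2h 22m', '3h' or '55m' into minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    return (int(hours.group(1)) * 60 if hours else 0) + (int(minutes.group(1)) if minutes else 0)

def votes_to_int(text):
    """Turn strings such as '2.9M', '950K' or '1234' into an integer."""
    multiplier = {"K": 1_000, "M": 1_000_000}.get(text[-1], 1)
    number = text[:-1] if text[-1] in "KM" else text
    return int(round(float(number) * multiplier))

sample["Year"] = sample["Release Date"].astype(int)
sample["Minutes"] = sample["Duration"].map(duration_to_minutes)
sample["Rating"] = sample["Rating"].astype(float)
sample["Votes"] = sample["Viewers"].map(votes_to_int)

print(sample[["Movie Name", "Year", "Minutes", "Rating", "Votes"]])
```

After this step you can, for example, sort by `Votes` or compute the average `Minutes`.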

Well, that’s all for this article. I hope it was useful.

Thank you for reading it.

Written by Nugroho