Web Scraping IMDB Top 250 Movies

Nugroho
4 min read · Mar 15, 2024


In the previous article, I talked about how to web scrape articles from the IGN website using the Beautiful Soup library.

Now in this article, I will explain how to do web scraping to retrieve top movie data from the IMDb site using the same library, Beautiful Soup.

IMDb is one of the leading sources of information about movies, and with web scraping we can collect and analyze data on its most popular titles.

Okay, let’s do it.

Start Web Scraping Top 250 IMDB Movies

First, let’s install the necessary packages (if you haven’t already):

# Install the three third-party packages at once
# (the re module used later ships with Python, so it needs no install)
!pip install beautifulsoup4 requests pandas

Next, we will import the necessary packages and define request headers that resemble the ones a browser sends, so the site is less likely to reject the request as coming from a bot.

# Import the Packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Define Headers and Target URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1"
}

base_url = 'https://www.imdb.com/chart/top/'

If you don’t know about each header, you can check this article. There is an explanation there.
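If you plan to request more than one page, the headers can be attached to a session once instead of being passed on every call. This is only a minimal sketch: the shortened headers dictionary and the session variable are illustrative, not part of the original code.

```python
import requests

# Shortened copy of the headers dictionary, for illustration only
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# A Session keeps its headers and reuses them for every request it sends
session = requests.Session()
session.headers.update(headers)

# Later requests could then be written as session.get(url),
# without repeating headers=headers each time.
```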

Now, we will send a GET request to the target URL, which in this case is IMDb’s Top 250 movies page, and parse the HTML that comes back.

# Send a GET request to the Target URL
req = requests.get(base_url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
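The request above is parsed whether or not it succeeded, so a blocked request would only show up later as an empty result. A possible variation, sketched below, fails early on HTTP errors and sets a timeout. The function name fetch_soup and its getter parameter are my own assumptions; getter only exists so the function can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, headers, timeout=10, getter=requests.get):
    """Download a page and return the parsed HTML, raising on HTTP errors."""
    response = getter(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Possible usage with the variables defined earlier:
# soup = fetch_soup(base_url, headers)
```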

After performing the get request and parsing the HTML, the next step is to find the tags and class names of each tag.

Only tags that contain movie information such as title, release year, duration, and so on will be retrieved.

Please note that the class names used in this code are the ones that worked when this article was written.

If you run this code and it returns nothing or raises an error, the most likely cause is that IMDb has changed its class names, so inspect the page and adjust them.

Look for the tags that contain movie components such as the title.

# You can name the variables as you like
get_html = soup.find_all('div', class_='sc-b0691f29-0 jbYPfh cli-children')
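When class_ is given a string with several class names, Beautiful Soup compares it with the exact value of the class attribute, so the match breaks as soon as one generated part changes. The sketch below contrasts that with a CSS selector on a single class. It runs on a made-up, simplified HTML snippet, and it assumes that cli-children is the more stable part of the class string, which the page does not guarantee.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup standing in for the real page
sample_html = """
<ul>
  <li><div class="sc-aaaa1111-0 abcDEF cli-children"><h3 class="ipc-title__text">1. Movie A</h3></div></li>
  <li><div class="sc-bbbb2222-0 ghiJKL cli-children"><h3 class="ipc-title__text">2. Movie B</h3></div></li>
</ul>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact class string: finds nothing once the generated parts differ
exact = sample_soup.find_all("div", class_="sc-b0691f29-0 jbYPfh cli-children")

# CSS selector on one class: finds every div that carries 'cli-children'
containers = sample_soup.select("div.cli-children")

print(len(exact), len(containers))
```

On the real page, soup.select('div.cli-children') could replace the find_all call if the exact class string stops matching.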

After finding the tags that contain the movie components, we next locate the tags inside them that hold the movie title, release year, duration, rating, and number of viewers (the vote count).

  • Movie Name: look for the tag that contains the movie name
  • Release Date: look for the tag that has the release date
  • Duration: look for the tag that has the duration of the movie
  • Ratings: look for the tag that contains the rating
  • Viewers: look for the tag that contains the number of viewers
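Besides the browser's inspector, the tags can also be listed from Python by walking through one container. This is a small sketch on an illustrative, simplified snippet, not the real IMDb markup; on the real page the same loop could be run on get_html[0].

```python
from bs4 import BeautifulSoup

# Illustrative, simplified container (not the actual IMDb markup)
sample_html = """
<div class="cli-children">
  <h3 class="ipc-title__text">1. The Shawshank Redemption</h3>
  <span class="cli-title-metadata-item">1994</span>
  <span class="cli-title-metadata-item">2h 22m</span>
  <span class="ipc-rating-star" aria-label="rating 9.3">9.3</span>
</div>
"""
container = BeautifulSoup(sample_html, "html.parser").find("div", class_="cli-children")

# find_all(True) returns every tag nested inside the container
summary = [(tag.name, tag.get("class"), tag.get_text(strip=True)) for tag in container.find_all(True)]

for name, classes, text in summary:
    print(name, classes, text)
```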

After specifying the tags to retrieve and their class names, we will loop through them to get a list of 250 movies.

Here is the code.

movie_data = []
for html in get_html:
    movie_dic = {}

    # Movie Name
    movie_name = html.find('h3', class_='ipc-title__text')
    movie_dic['Movie Name'] = movie_name.text.strip() if movie_name else 'unknown movie name'

    # Release Date and Duration share one class: the first match is the year, the second the runtime
    metadata = html.find_all('span', class_='sc-b0691f29-8 ilsLEX cli-title-metadata-item')
    movie_dic['Release Date'] = metadata[0].text.strip() if len(metadata) > 0 else 'unknown release date'
    movie_dic['Duration'] = metadata[1].text.strip() if len(metadata) > 1 else 'unknown duration'

    # Rating: the last word of the tag's aria-label attribute
    rating_tag = html.find('span', class_='ipc-rating-star')
    rating = rating_tag.get('aria-label', '').split() if rating_tag else []
    movie_dic['Rating'] = rating[-1] if rating else 'unknown rating'

    # Viewers: the vote count is written in parentheses, e.g. (2.9M)
    viewers_tag = html.find('span', class_='ipc-rating-star--voteCount')
    viewers = re.search(r'\(([\d.]+[MK]?)\)', viewers_tag.text) if viewers_tag else None
    movie_dic['Viewers'] = viewers.group(1) if viewers else 'unknown viewers'

    movie_data.append(movie_dic)

After the loop, the movie_data list is converted into a pandas DataFrame.

Here is the result.

# Create a DataFrame
data = pd.DataFrame(movie_data)
data
Results from scraping the top 250 IMDb movies
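The scraped values are all strings, so sorting or averaging them needs a conversion step first. The helpers below are a minimal sketch of that step. The names duration_to_minutes and votes_to_int are my own, and they assume durations look like 2h 22m and vote counts like 2.9M or 850K, as the regular expression in the loop suggests.

```python
import re


def duration_to_minutes(text):
    """Convert a string such as '2h 22m' to minutes; return None if it does not parse."""
    match = re.fullmatch(r"(?:(\d+)h)?\s*(?:(\d+)m)?", text.strip())
    if not match or not any(match.groups()):
        return None
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


def votes_to_int(text):
    """Convert an abbreviated count such as '2.9M' or '850K' to an integer."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MK]?)", text.strip())
    if not match:
        return None
    multiplier = {"M": 1_000_000, "K": 1_000, "": 1}[match.group(2)]
    return int(round(float(match.group(1)) * multiplier))


print(duration_to_minutes("2h 22m"), votes_to_int("2.9M"))
```

With the DataFrame from above, these could be applied column by column, for example data['Duration'].map(duration_to_minutes). Placeholder values such as 'unknown duration' come back as None.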

Well, that’s all for this article. I hope it was useful.

Thank you for reading it.

Nugroho

Enthusiastic about data, Machine Learning, web scraping, Python, SQL & data viz, I also talk about money at www.cashnug.com