Uncovering Movie Insights: A Journey of Web Scraping IMDb’s Top 1000 Movies

Unveiling the Cinematic Gems: A Data-driven Exploration of IMDb’s Top 1000 Movies

Arthur Chong

Published in

Artificial Corner

6 min read · Jul 15, 2023

Photo by Samuel Regan-Asante on Unsplash

Intro

In this article, I will share my experience working on a personal data science project in which I used web scraping to extract information from IMDb’s top 1000 movies. By leveraging Python’s Scrapy framework and applying data cleaning and transformation methods, I was able to uncover some interesting insights and create captivating visualisations! I will now take you on my journey of diving into the world of movie data and exploring the process behind this exciting project!

I have also published my dataset on Kaggle for those who are interested!

IMDB Top 1000 Movies

Exploring what makes movies stand out from the rest

www.kaggle.com

Motivation Behind the Project

I am currently a Data Science and Analytics student at the National University of Singapore (NUS) and I am on summer vacation now (yay!). Being a first-year student, I did not have any luck securing an internship, so I decided to use my free time to pick up new skills related to data science!
Thus, I enrolled in a course on web scraping (by Frank Andrade, The PyCoach) and started learning about it.
After learning the fundamentals of web scraping, I decided to do a little personal project to apply what I had learnt (what better way to reinforce a skill than to apply it somewhere, right?).
I chose IMDb’s top 1000 movies as the dataset because I believe that movie data provides a rich landscape for exploration: it has a wide range of variables to analyse (e.g. genres, release years, ratings, box office performance). Oh, and the IMDb website was fortunately not too complicated to scrape!

Overview of the Scrapy Framework: Unlocking the Web Scraping Potential

Scrapy has a high-performance crawler system that allows us to navigate through websites efficiently, crawl multiple pages, and scrape data in a short amount of time compared to other Python packages like Selenium and BeautifulSoup (I am not too sure of the technicalities behind it but hey, it works!).
Below is the code that I used to scrape the website!

import scrapy


class ImdbSpider(scrapy.Spider):
    name = "imdb"
    allowed_domains = ["www.imdb.com"]
    start_urls = ["https://www.imdb.com/search/title/?groups=top_1000&sort=user_rating,desc&count=100&ref_=adv_prv"]

    def parse(self, response):
        # Each movie on the results page sits in its own "lister-item-content" div.
        movie_container = response.xpath('//div[@class = "lister-item-content"]')
        for movie in movie_container:
            movie_title = movie.xpath('.//h3/a/text()').get()
            release_year = movie.xpath('.//h3/span[contains(@class,"year")]/text()').get()
            runtime = movie.xpath('.//p/span[@class = "runtime"]/text()').get()
            # default='' keeps .strip() from failing when no genre is listed.
            genre = movie.xpath('.//p/span[@class = "genre"]/text()').get(default='').strip()
            rating = movie.xpath('.//strong/text()').get()
            director = movie.xpath('.//p[@class=""]/a[1]/text()').get()

            # Not every movie has a Metascore; fall back to 0 when it is missing.
            metascore_exist = movie.xpath('.//div[contains(@class, "metascore")]/span/text()').get()
            metascore = metascore_exist.strip() if metascore_exist is not None else 0

            # Gross earnings are only listed for some movies; fall back to 0 otherwise.
            gross_exist = movie.xpath('.//p[@class = "sort-num_votes-visible"]/span[@class = "text-muted"]/text()').getall()
            value = movie.xpath('.//p[@class = "sort-num_votes-visible"]/span[@name = "nv"]/text()').getall()
            if "Gross:" in gross_exist and len(value) in (2, 3):
                gross = movie.xpath('.//p[@class = "sort-num_votes-visible"]/span[5]/text()').get()
            else:
                gross = 0

            yield {
                'title': movie_title,
                'director': director,
                'release_year': release_year,
                'runtime': runtime,
                'genre': genre,
                'rating': rating,
                'metascore': metascore,
                'gross': gross,
            }

        # Follow the "next page" link until there are no more result pages.
        next_page_url = response.xpath('(//a[contains(@class, "next-page")])[2]/@href').get()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)

Cleaning and Transforming the Raw Data

After scraping the website and exporting the results to a CSV file, it was time for arguably the most hated task of any data scientist: cleaning the data and converting it into the correct data types! The raw data was not perfect, and it would not have been possible to analyse it as it was. There were a few problems with the dataset:

1. Data was in the incorrect format

For example, the release years of the movies were displayed in brackets (e.g. (1999)). Also, the runtime of each movie had the letters ‘min’ in it. I had to convert these into an integer data type before I could proceed with uncovering the insights.

2. Multiple genres for a movie

Another problem was that some movies had more than one genre. This would cause problems in any analysis involving genres, because a single cell would hold several values. Hence, I had to separate the genres and duplicate the movie’s row in the data frame for each genre the movie had.

3. Missing data

Lastly, some data was missing from the IMDb website. Some movies had no rating value or no grossing value. After counting the missing values, I decided to remove those rows from the dataset, as there were not too many of them and it would not result in a huge loss of data.
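The three cleaning steps above can be sketched in plain Python. This is only an illustration, not the code from the Kaggle notebook: the sample rows are made up, the helper names (`first_int`, `parse_gross`, `clean`) are hypothetical, and the gross format `$28.34M` is an assumption about how the scraped value looks.

```python
import re

# Hypothetical raw rows in the shape the spider yields; all values are made up.
raw_rows = [
    {"title": "Movie A", "release_year": "(1999)", "runtime": "142 min",
     "genre": "Crime, Drama", "rating": "8.5", "gross": "$28.34M"},
    {"title": "Movie B", "release_year": "(2010)", "runtime": "95 min",
     "genre": "Action", "rating": "8.1", "gross": 0},  # 0 = spider's fallback for no gross
    {"title": "Movie C", "release_year": "(1975)", "runtime": "124 min",
     "genre": "Action, Adventure, Sci-Fi", "rating": "8.3", "gross": "$260.00M"},
]


def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", str(text))
    return int(match.group()) if match else None


def parse_gross(value):
    """Turn a string like '$28.34M' (assumed format) into dollars, else None."""
    match = re.fullmatch(r"\$([\d.]+)M", str(value).strip())
    return round(float(match.group(1)) * 1_000_000) if match else None


def clean(rows):
    cleaned = []
    for row in rows:
        gross = parse_gross(row["gross"])
        # Step 3: drop rows with a missing rating or gross value.
        if gross is None or row["rating"] is None:
            continue
        # Step 2: emit one row per genre the movie has.
        for genre in row["genre"].split(","):
            cleaned.append({
                "title": row["title"],
                # Step 1: "(1999)" -> 1999 and "142 min" -> 142.
                "release_year": first_int(row["release_year"]),
                "runtime": first_int(row["runtime"]),
                "genre": genre.strip(),
                "rating": float(row["rating"]),
                "gross": gross,
            })
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

One thing to keep in mind with this layout: because a movie is repeated once per genre, any statistic that is not grouped by genre (for example, the overall average rating) should be computed before the rows are duplicated.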

Unveiling Insights through Data Analysis

It was finally time for the exciting part: exploring key trends and patterns in the IMDb movie dataset! Firstly, let us take a look at the genre distribution of the top 1000 movies.

It is very obvious that ‘Drama’ is by far the most popular genre among the top-rated movies. One explanation could be that films of this genre frequently explore complicated human emotions, relationships, and experiences. They dive into universal themes like love, loss, redemption, and human development, and this emotional relatability makes Drama films compelling to a wide spectrum of viewers.
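A genre distribution like this boils down to counting rows per genre once each movie has one row per genre. Here is a minimal sketch using Python's `collections.Counter`; the rows are made up for illustration and are not taken from the real dataset.

```python
from collections import Counter

# Hypothetical cleaned rows (one row per movie-genre pair); not the real dataset.
rows = [
    {"title": "Movie A", "genre": "Drama"},
    {"title": "Movie A", "genre": "Crime"},
    {"title": "Movie B", "genre": "Drama"},
    {"title": "Movie C", "genre": "Action"},
    {"title": "Movie C", "genre": "Drama"},
]

# Count how many rows (movies) fall under each genre.
genre_counts = Counter(row["genre"] for row in rows)

# most_common() orders the genres from most to least frequent.
for genre, count in genre_counts.most_common():
    print(f"{genre:<10} {count}")
```

The resulting counts can be passed straight to any plotting library to draw a bar chart.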

‘Drama’ may be the most popular genre, but which genre tends to perform best in terms of box office? Let’s take a look!

Now this is surprising! Even though ‘Drama’ was the most popular genre, it did not perform as well as genres like ‘Action’ and ‘Adventure’ when it comes to gross earnings.
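A comparison like this can be sketched by grouping the gross values by genre and averaging each group. The sketch below assumes that "performs best" means the highest average gross per movie (a total per genre would be an equally valid reading), and the numbers are made up purely to show the mechanics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical cleaned rows; the gross values are invented for illustration.
rows = [
    {"genre": "Drama", "gross": 20_000_000},
    {"genre": "Drama", "gross": 40_000_000},
    {"genre": "Action", "gross": 150_000_000},
    {"genre": "Action", "gross": 250_000_000},
    {"genre": "Adventure", "gross": 180_000_000},
]

# Collect every gross value under its genre.
gross_by_genre = defaultdict(list)
for row in rows:
    gross_by_genre[row["genre"]].append(row["gross"])

# Average each group.
average_gross = {genre: mean(values) for genre, values in gross_by_genre.items()}

# Print the genres from highest to lowest average gross.
for genre, avg in sorted(average_gross.items(), key=lambda item: item[1], reverse=True):
    print(f"{genre:<10} ${avg:,.0f}")
```

Averages are sensitive to a handful of blockbusters, so swapping `mean` for `median` from the same `statistics` module is a quick way to check how robust the ranking is.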

Aside from these general trends, I have also explored other questions in my Kaggle notebook, such as whether the average movie duration has increased over the years and how the release years of the top movies are distributed! The examples shown here are just a few of them!

Visualising the Story: Tableau Public Dashboard

Apart from the general trends of the movie dataset, I have also created a Tableau dashboard on the top 5 directors with the most movies in the top 1000! From this, we can hopefully see the specialisations of these directors and their average gross per genre!

Top 5 highest grossing movies of all time

It appears that films in the ‘Action’ genre do indeed generally see more success. The top 5 directors all have ‘Action’ among the genres of their films! Furthermore, the top 5 highest grossing movies of all time are all Action films!

Something to note is that the grossing value shown on IMDb only includes numbers from the US and Canada and is thus not representative of the grossing value on a global scale. This makes it even crazier that a movie like Star Wars grossed almost a billion USD in the US and Canada alone! That’s a lot of money from one movie!

Unfortunately, Medium does not support the embedding of Tableau dashboards, and thus the full features of this dashboard cannot be explored here. However, I have embedded this dashboard into my Kaggle notebook, so do go check it out!

Conclusion

In this article, you have embarked with me on an exciting journey of web scraping IMDb’s top 1000 movies, cleaning and transforming the data, and unveiling hidden insights through data analysis and visualisation. By combining technical skills with creativity, we were able to gain valuable insights into the world of movies and contribute to the data science community. I hope this project inspires others to explore the vast possibilities that data scraping and analysis offer and encourages them to share their findings with the world, or at the very least, I hope that this was an interesting read for you! Thank you for reading this far!

Connect with me!

LinkedIn
Email : arthurchong01@gmail.com