Exploratory data analysis of IMDb’s movie database from a data scientist’s perspective

Ömer Faruk Eker
Published in Analytics Vidhya
3 min read · Apr 8, 2019

Playing with data has always been a passion of mine. It has been a long journey of more than 10 years, during which I have worked with a variety of data, ranging from engineering data sets such as UAV fuel system data or structural health data from a passenger aircraft wing to retail business data and fun Kaggle data sets.

Watching movies, talking about them, and critiquing them is another passion of mine, and combining the two was great fun. Recently, I carried out an exploratory data analysis of IMDb’s movie database using one of the most frequently used data science programming languages: Python.

I have always thought that 1994 was the year in which the best movies were made (Forrest Gump, Pulp Fiction, The Shawshank Redemption, etc.). That is where the idea for this analysis came from.

The graph below contains two subplots: the first gives the total number of films made each year, and the second gives the total number of voters for the films made in that year, which peaked in 2013 and dropped dramatically afterwards. This is where the fun begins! The number of films made per year keeps increasing, so why did people stop voting on recent films? Is it because movies are getting worse? Or is the IMDb website losing its popularity? The latter seems unlikely, as IMDb clearly still dominates the online film rating sector with a mind-blowing 250 million monthly visitors [read more].
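The two yearly totals described above can be computed with a single group-by. The snippet below is a minimal sketch using pandas, not necessarily the code behind the graph; the column names (`title`, `year`, `rating`, `votes`) are assumptions, and the rows are toy values rather than real IMDb figures.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT real IMDb figures.
# The column names (title, year, rating, votes) are assumed, not taken
# from the actual dataset.
films = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "year": [1994, 1994, 1999, 2013, 2013],
        "rating": [8.8, 8.9, 8.7, 7.0, 6.5],
        "votes": [1000, 1200, 1500, 400, 300],
    }
)

# One row per release year: how many films were made, and how many
# votes those films collected in total.
per_year = films.groupby("year").agg(
    n_films=("title", "count"),
    total_votes=("votes", "sum"),
)
print(per_year)
```

Plotting `per_year["n_films"]` and `per_year["total_votes"]` against the index would give two subplots of the kind described.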

That could be a separate piece of research. For now, let us continue looking for an answer to my original question:

Was 1994 “the year” for the best films?

The graph below gives the average values. The first subplot displays the average number of voters per film, again on a yearly basis, and the second gives the average film rating per year. Since the 1920s, average film ratings have tended to fluctuate around 6/10. A slight upward trend since the late 80s is noticeable; however, the difference is not statistically significant, so we cannot say that films are getting better ratings. The average film rating graph still does not answer the original question.
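The per-year averages can be sketched the same way. Again, this is an illustration under assumed column names (`year`, `rating`, `votes`) with toy values, not the actual analysis code or data.

```python
import pandas as pd

# Toy rows for illustration only, not real IMDb figures.
# Column names are assumptions.
films = pd.DataFrame(
    {
        "year": [2001, 2001, 2002, 2002, 2002],
        "rating": [6.0, 7.0, 5.0, 6.0, 7.0],
        "votes": [100, 300, 30, 60, 90],
    }
)

# Mean votes per film and mean rating for each release year.
yearly_avg = films.groupby("year").agg(
    avg_votes_per_film=("votes", "mean"),
    avg_rating=("rating", "mean"),
)
print(yearly_avg)
```

Using the mean per film rather than the yearly total removes the effect of more films being made each year, which is why the two graphs can tell different stories.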

If we examine the average voters per film in the graph above, it peaked in the 90s and early 2000s and has been dropping dramatically since. Film ratings are increasing, but the number of votes per film is dropping? How could you explain that? To me, this means that people tend to vote more when they like a film and do not bother rating one that was average or boring. But this seems to apply to recent films only, not to films from the 90s or 2000s. Could we relate this to demographics, in particular the age of the dominant voters? This is another point for further research.

In summary, did this analysis support my original assumption that 1994 is “the year”? Yes, to some extent. Clearly, the films made in the 90s and early 2000s received the most votes and attracted the most attention, with 1999 ranking highest and 1994 third.
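Ranking the years by average voters per film is a short step once the yearly averages exist. The sketch below assumes a hypothetical pandas Series named `avg_votes_per_film`, indexed by year; the years and values are toy numbers chosen for illustration and are not the results reported here.

```python
import pandas as pd

# Hypothetical per-year averages; toy values, NOT the article's results.
avg_votes_per_film = pd.Series(
    {2001: 200.0, 2002: 60.0, 2003: 350.0, 2004: 120.0},
    name="avg_votes_per_film",
)

# Rank 1 = the year whose films drew the most votes on average.
# There are no ties in this toy data, so the ranks are whole numbers.
year_rank = avg_votes_per_film.rank(ascending=False).astype(int)

# The three highest-ranked years, best first.
top_years = avg_votes_per_film.sort_values(ascending=False).head(3)
print(top_years)
```

Looking up `year_rank[some_year]` then gives the position of any single year, which is how a statement like "third place" can be read off directly.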

Another interesting result was the correlation between film duration and film rating. As seen in the figure below, film rating is highly positively correlated with film duration: the longer the film, the more likely it is to get a higher rating :) However, the ratings start to fluctuate more for films with run times of 3 hours or more.
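One way to check a relationship like this is to compute a correlation coefficient and then look at the average rating per run-time bucket. The sketch below uses Pearson correlation (the pandas default) and `pd.cut`. The column names (`runtime_min`, `rating`), the bucket edges, and the rows are all assumptions for illustration, not the real dataset or the method behind the figure.

```python
import pandas as pd

# Toy rows for illustration only, not real IMDb figures.
# Column names and bucket edges are assumptions.
films = pd.DataFrame(
    {
        "runtime_min": [80, 95, 110, 125, 150, 200],
        "rating": [5.5, 6.0, 6.4, 6.9, 7.5, 7.0],
    }
)

# Pearson correlation between run time and rating.
r = films["runtime_min"].corr(films["rating"])
print(f"Pearson r = {r:.2f}")

# Average rating per run-time bucket. Intervals are right-inclusive,
# e.g. the "91-120" bucket covers run times in (90, 120] minutes.
bins = [0, 90, 120, 180, 600]
labels = ["<=90", "91-120", "121-180", ">180"]
films["runtime_bucket"] = pd.cut(films["runtime_min"], bins=bins, labels=labels)

by_bucket = films.groupby("runtime_bucket", observed=True)["rating"].agg(
    ["mean", "std", "count"]
)
print(by_bucket)
```

The `count` column is worth watching: buckets with very few films, such as the longest run times, will naturally show noisier averages, which may be part of why the ratings fluctuate more there.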

I am looking forward to continuing to work on this data to satisfy my data science curiosity about films. Further analysis will focus mainly on predictive analytics using machine learning algorithms. Any comments and recommendations are welcome.

You can access the code and the dataset on my GitHub page:

https://github.com/omerfarukeker/imdb_work

Cheers!
