Mining Insights into Hollywood Movies

For as long as I can remember, IMDB has been my go-to place for anything and everything about movies. I owe my weird taste in movies to IMDB, and without the site I would have missed out on some truly beautiful gems.

So, when I saw a dataset of the top 5000 movies on IMDB on Kaggle, I knew I had to mine it for insights. The dataset comprises over 5000 movies, not just from Hollywood but from around the world. It has some financial information about the movies, the cast and directors, and the corresponding IMDB Rank. Do note that the dataset is in no way comprehensive. Nonetheless, it is big enough to pique my interest. If you would like to know more, check it out here.

The objective of the data exploration was to answer these 3 questions:

  • What are the most frequently used plot keywords, movie title words and genres?
  • What is the trend in Gross Revenue and Budget of the movies, in nominal as well as inflation-adjusted terms, over 100 years?
  • Who are the top actors, directors and movies for each of the past 10 decades?

Gauging Frequently used Plot key-words

Plot key-words Word-cloud

The most frequently used plot keywords seem to paint a dark and gruesome picture of the world. I believe a tragedy is what makes a movie great. Yet pummeling people with nasty words to catch their attention suggests that clickbait has been around for a long time and is not an invention of Gen Y.
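A word cloud is essentially a picture of term frequencies, so the counting behind it can be sketched in a few lines. This is only a minimal sketch and not the code behind the plot above: it assumes the keywords for each movie arrive as one string with terms separated by "|", and the sample rows are made up for illustration.

```python
from collections import Counter


def count_terms(rows, sep="|"):
    """Count how often each term appears across delimiter-separated fields."""
    counts = Counter()
    for field in rows:
        if not field:
            continue  # skip movies with no keywords recorded
        for term in field.split(sep):
            term = term.strip().lower()
            if term:
                counts[term] += 1
    return counts


# Hypothetical sample rows, NOT taken from the dataset.
sample_keywords = [
    "murder|police|revenge",
    "love|murder|friend",
    "",
    "police|death|murder",
]

keyword_counts = count_terms(sample_keywords)
top_keywords = keyword_counts.most_common(2)
print(top_keywords)
```

The same function would work for genres if they are stored in the same delimited form, and for movie titles if called with a space as the separator.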

Gauging Frequently used words in Movie Titles

Movie-Title Word Cloud

The words that appear frequently in movie titles seem pretty banal in comparison to the plot keywords. What’s surprising is how often the words “Movie”, “Tale” and “Story” appear in a movie’s title.

Most Frequent Movie Genres

Movie Genres

The most frequent movie genres are Drama and Comedy, with Romance, Action and Thriller coming in a close second.

The surprising thing here is that the frequently used plot keywords do not represent the comedy genre at all.

Revenue and Budget trends across 100 years

With the ballooning budgets and box office figures, I always felt that Hollywood movies must be making a lot more today than they did in yesteryear. With the dataset, I finally had the opportunity to test my intuition.

I checked the Gross Revenue trend for the last 100 years in nominal terms as well as in inflation-adjusted terms.
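For readers unfamiliar with the adjustment, the usual approach is to scale each nominal amount by the ratio of a price index in a chosen base year to the index in the movie's release year, and then average per decade. The sketch below assumes that approach; it is not necessarily how the plots here were produced, and the price index values and movie figures are invented placeholders rather than real CPI or box-office data.

```python
from collections import defaultdict


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1957 -> 1950."""
    return (year // 10) * 10


def adjust_for_inflation(amount, year, price_index, base_year):
    """Express a nominal amount from `year` in `base_year` money."""
    return amount * price_index[base_year] / price_index[year]


def average_by_decade(movies, price_index, base_year):
    """Return {decade: (average nominal, average inflation-adjusted)}."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # nominal sum, real sum, count
    for year, amount in movies:
        bucket = totals[decade_of(year)]
        bucket[0] += amount
        bucket[1] += adjust_for_inflation(amount, year, price_index, base_year)
        bucket[2] += 1
    return {
        decade: (nominal / n, real / n)
        for decade, (nominal, real, n) in sorted(totals.items())
    }


# Placeholder index values and revenues for illustration only (NOT real data).
price_index = {1955: 10.0, 1958: 12.5, 2015: 100.0}
movies = [(1955, 10_000_000), (1958, 20_000_000), (2015, 100_000_000)]

averages = average_by_decade(movies, price_index, base_year=2015)
for decade, (nominal, real) in averages.items():
    print(decade, round(nominal), round(real))
```

With these placeholder numbers the 1950s look small in nominal terms but larger than the 2010s once adjusted, which is the kind of reversal the two plots are meant to expose.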

Gross Revenue Trend across 100 years
Inflation Adjusted Gross Revenue trend across 100 years

Surprisingly, there is no general trend across decades for the Average Gross Revenue in nominal terms, yet in inflation-adjusted terms the Average Gross Revenue has clearly been coming down since the 1950s.

That is clearly counter-intuitive, so let’s also plot the Movie Budget across the last 100 years to see if a similar trend persists.

Movie Budget Trend across 100 years
Inflation Adjusted Movie Budget trend across 100 years

We do see a similar pattern. The Average Movie Budget does not have a clear pattern in nominal terms, but the inflation-adjusted average movie budget has been coming down since the 1960s.

That still does not explain why we see box-office figures that are humongous, sometimes the size of the GDP of a small country. To solve this mystery, let’s plot the outliers in Gross Revenue and Budget across the last 100 years.

Gross Revenue Outliers across decades
Movie Budget Outliers across decades

The above plots show that a few outliers command humongous box-office revenues, while the majority languish at the bottom of the pile, making less than what a movie in the 1950s would make (in inflation-adjusted terms).
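What counts as an outlier depends on the definition used. One common choice, and the one box plots typically rely on, is the 1.5 × IQR rule: anything more than 1.5 interquartile ranges above the third quartile or below the first is flagged. The sketch below assumes that rule, which may differ from the one used for the plots above, and runs on made-up numbers.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    if len(values) < 4:
        return []  # too few points for meaningful quartiles
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


def outliers_by_decade(movies):
    """Group (year, amount) pairs by decade and flag outliers in each group."""
    by_decade = defaultdict(list)
    for year, amount in movies:
        by_decade[(year // 10) * 10].append(amount)
    return {decade: iqr_outliers(v) for decade, v in sorted(by_decade.items())}


# Hypothetical revenues in millions, NOT taken from the dataset.
movies = [
    (2001, 10), (2002, 12), (2003, 11),
    (2005, 13), (2007, 12), (2009, 200),
]

flagged = outliers_by_decade(movies)
print(flagged)
```

A single blockbuster like the 200 in this toy decade pulls the headline figures up while the typical movie stays near the median, which is the pattern the outlier plots point to.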

Top Movies, Actors & Directors across 100 Years

Finally, what good is IMDB if it does not showcase the top movies, actors and directors of all generations? With the dataset I could mine the list of top directors, actors and movies not just by IMDB Rank, but also by Gross Revenue. The graphs below provide the top 5 list for each decade.

The list of directors changes dramatically when viewed by the Gross Revenue of the movies vs the IMDB Rank. For example, Francis Ford Coppola does not feature in the top directors by Gross Revenue.

Also, Christopher Nolan, James Cameron, George Lucas, Steven Spielberg, Robert Zemeckis and David Lean have directed top movies in more than one decade. They are the top directors of Hollywood, and the above lists definitely portray that.

As with the directors, the list of actors changes significantly when ranked by Gross Revenue vs IMDB Rank. The dataset also seems to be biased against female actors: very few of them appear in the top 5 lists across decades, especially when the list is ranked by IMDB Rank.

The above list of top 5 movies across decades includes some of my favourite movies. Even though the top 5 list differs when viewed by Gross Revenue vs IMDB Rank, all the films are a must-watch for any movie buff.
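The per-decade top 5 lists in this section boil down to grouping movies by decade and sorting each group by the chosen metric. Here is a minimal sketch of that idea, assuming each movie is a record with hypothetical "year", "score" and "gross" fields; the titles and numbers are invented for illustration.

```python
from collections import defaultdict


def top_n_per_decade(movies, key, n=5):
    """Return {decade: the n movies with the highest value of `key`}."""
    by_decade = defaultdict(list)
    for movie in movies:
        by_decade[(movie["year"] // 10) * 10].append(movie)
    return {
        decade: sorted(group, key=lambda m: m[key], reverse=True)[:n]
        for decade, group in sorted(by_decade.items())
    }


# Invented records; field names are assumptions, not the dataset's columns.
movies = [
    {"title": "Film A", "year": 1994, "score": 9.0, "gross": 30},
    {"title": "Film B", "year": 1997, "score": 7.5, "gross": 600},
    {"title": "Film C", "year": 1999, "score": 8.5, "gross": 170},
    {"title": "Film D", "year": 2008, "score": 9.0, "gross": 530},
]

by_score = top_n_per_decade(movies, key="score", n=2)
by_gross = top_n_per_decade(movies, key="gross", n=2)

print([m["title"] for m in by_score[1990]])
print([m["title"] for m in by_gross[1990]])
```

Even in this toy example the two rankings disagree for the 1990s, mirroring how the lists change between IMDB Rank and Gross Revenue. Ranking directors or actors would need one extra step, aggregating the metric per person within each decade before sorting.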


That’s quite a few insights into Hollywood movies. If you are interested in the code used to generate the graphs, check out my GitHub Repository.

If you see something interesting in these plots, or would like me to mine the data on some other parameters, do let me know in the comments.

If you liked the post, do applaud with the little green button at the bottom.