The Most Successful Movie Director of All Time

Have you ever wondered who the most successful director of all time is? If you have, you do not need to wonder anymore. Luckily, I have the answer for you.

Before I show you the results, I want to explain the reasoning behind this analysis. I am a big movie enthusiast, and I have watched countless movies throughout the first quarter of my life. From The Shawshank Redemption to American Pie, from The Dark Knight to The Lego Batman Movie, I have watched movies of every genre and format. Every movie has its own character. Many factors make a movie a “hit”, including its cast, storyline, promotion, and visual effects, but in my view the most important factor in a movie’s success is its director. Directors turn movies into the stories they want to share with the world. The best directors have a compelling vision and a groundbreaking style, and each leaves a personal stamp that cuts across films, genres, and decades.

The idea of finding out who the best director of all time is came to me after watching this episode of Epic Rap Battles of History.

Thanks to Kaggle and deepmatrix, I had data on about 5,000 movies released between 1916 and 2016. The data includes the movie title, release year, IMDB score, number of users who voted, budget, gross, director name, actor names, and more. I have been following IMDB ratings for a long time, and I am a big fan of its Top 250 list.

I am using RStudio to analyze this dataset. I want to share the analysis step by step to keep things as clear as possible.

Step 1: Load the movies.csv file into the environment and view the data.

library(tidyverse)  #loads readr (read_csv) and dplyr (%>%, mutate, filter, ...)
movies <- read_csv("~/workspace/r_working_directory/movies.csv")

The movie dataset will look something like this.

After browsing through the dataset, I realized that it has no variable stating whether a movie was successful. I drew on my experience of hunting for quality movies on Netflix to define one: a movie counts as successful if its IMDB score is more than 7. I then filtered the dataset down to the hit movies. Why? Because I am here to find the most successful director of all time, so I ignore the flops and work only with the hits.

Step 2: Add a new variable “hit” that flags whether a movie was a hit.

#movie will be a hit if IMDB score is more than 7
movies <- movies %>% mutate(hit = ifelse(imdb_score > 7, 1, 0))

#subset the hit movies (filter() also drops rows where hit is NA)
hitmovies <- movies %>% filter(hit == 1)
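
To make the rule concrete, here is a minimal sketch on a made-up four-row table. The titles and scores are hypothetical, not taken from the Kaggle file, and the sketch assumes dplyr is installed. Note that a score of exactly 7 does not count as a hit, and that a missing score produces an NA flag, which filter() quietly drops.

```r
library(dplyr)

#hypothetical toy data, not from the real dataset
toy <- tibble(
  movie_title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  imdb_score  = c(8.1, 7.0, 6.4, NA)
)

#same rule as in Step 2: strictly more than 7 is a hit
toy <- toy %>% mutate(hit = ifelse(imdb_score > 7, 1, 0))

#filter() keeps only rows where the condition is TRUE, so the NA row is dropped
toy_hits <- toy %>% filter(hit == 1)

print(toy)
print(toy_hits)
```

Only Movie A survives: Movie B sits exactly on the threshold, Movie C is below it, and Movie D has no score at all.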

Now I have a dataset of all the hit movies. The next step is to clean out the junk. I removed every movie that did not have more than 1,000 votes. Why? Would you feel confident buying a product on Amazon that has only one review?

Step 3: Keep only the movies with more than 1,000 votes and sort them by IMDB score and gross.

#sort the movies by IMDB score and gross (both descending) and keep only movies with more than 1000 votes
hitmovies <- hitmovies %>% arrange(desc(imdb_score), desc(gross)) %>% filter(num_voted_users > 1000)

After some more detailed analysis, I realized there are a few duplicate observations in the dataset. So the next obvious step was to remove them.

Step 4: Remove all the duplicate movies from the dataset.

#remove duplicate movies, keeping the first row for each title
hitmovies <- hitmovies[!duplicated(hitmovies$movie_title),]
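
One caveat worth knowing: matching on the title alone also collapses different films that happen to share a name, such as a remake. Here is a small sketch on hypothetical data that compares de-duplicating by title with de-duplicating by title and year. Using title_year as a second key is my assumption about what a stricter rule could look like, not something the analysis above does.

```r
#hypothetical toy data: row 3 is a true duplicate, row 4 is a remake with the same title
toy <- data.frame(
  movie_title = c("Movie A", "Movie B", "Movie A", "Movie A"),
  title_year  = c(2001, 2005, 2001, 1980),
  stringsAsFactors = FALSE
)

#same approach as in Step 4: keep the first row for each title
by_title <- toy[!duplicated(toy$movie_title), ]

#stricter variant: a row is a duplicate only if title AND year both match
by_title_year <- toy[!duplicated(toy[, c("movie_title", "title_year")]), ]

print(by_title)       #the remake is dropped along with the true duplicate
print(by_title_year)  #the remake survives
```

Which key is right depends on the data, so it is worth checking how many rows each version removes before picking one.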

The IMDB score does not tell us everything about a movie’s success. The net profit a movie earns is also an important factor.

Step 5: Add a new variable called profit, the difference between gross and budget.

#add a profit variable
hitmovies <- hitmovies %>% mutate(profit = gross-budget)

What should I do with the movies that do not have a director’s name? Hmmm. Right! Remove them from the dataset.

Step 6: Remove all the movies that do not have a director’s name.

#remove all the movies from the dataset which do not have a director's name
hitmovies <- hitmovies[which(!is.na(hitmovies$director_name)), ]

Converting the grouping variables into factors makes it easier to group the movies by various criteria.

Step 7: Convert the grouping variables into factors.

#convert all grouping variables into factors
hitmovies$language <- as.factor(hitmovies$language)
hitmovies$country <- as.factor(hitmovies$country)
hitmovies$content_rating <- as.factor(hitmovies$content_rating)
hitmovies$actor_1_name <- as.factor(hitmovies$actor_1_name)
hitmovies$director_name <- as.factor(hitmovies$director_name)

#convert the hit variable into a logical variable
hitmovies$hit <- as.logical(hitmovies$hit)

If you are feeling bored, lost, or tired, do not worry: you are at the final step. It’s time to reveal the secret. To find the most successful director, I group the hit movies by director and sum their IMDB scores, then rank the directors by that sum. I have also added several other measures to show off each director’s talents: the average IMDB score, average budget, average gross, and average profit of the hit movies under the director’s name.

Step 8: Group the movie dataset by director and rank the directors by the sum of the IMDB scores of their movies.

#Most successful director (no of hit movies, sum of imdb scores, avg imdb score, avg budget, avg gross, avg profit; sorted by rank)
bestdirector <- hitmovies %>% group_by(director_name) %>% summarise(movies = sum(hit), imdb = sum(imdb_score), imdb_avg = mean(imdb_score), budget = mean(budget, na.rm = TRUE), gross = mean(gross, na.rm = TRUE), profit = mean(profit, na.rm = TRUE)) %>% mutate(rank = rank(imdb, na.last = NA)) %>% arrange(desc(rank))
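
Keep in mind that ranking by the sum of scores rewards directors who simply have more hits. A small sketch with made-up directors and numbers (all hypothetical, and assuming dplyr is installed) shows the effect. It also shows how na.rm = TRUE keeps one missing gross value from turning a whole average into NA.

```r
library(dplyr)

#hypothetical toy data: Dir X has fewer but better-rated hits, and one missing gross
toy <- tibble(
  director_name = c("Dir X", "Dir X", "Dir Y", "Dir Y", "Dir Y"),
  imdb_score    = c(8.0, 7.5, 7.2, 7.4, 7.1),
  gross         = c(100, NA, 50, 60, 70)
)

ranking <- toy %>%
  group_by(director_name) %>%
  summarise(
    movies   = n(),                        #number of hit movies
    imdb     = sum(imdb_score),            #the ranking criterion
    imdb_avg = mean(imdb_score),
    gross    = mean(gross, na.rm = TRUE)   #ignore missing values when averaging
  ) %>%
  arrange(desc(imdb))

print(ranking)
```

Dir Y comes out on top by sum even though Dir X has the higher average score, so the choice between sum and average is a judgment call about what “successful” should mean.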


It’s Steven Spielberg!

Surprised? Steven Spielberg is the most successful director of all time. Do not worry, I am not going to repeat the mistake Steve Harvey made at the Miss Universe 2015 pageant, which Warren Beatty then repeated at the 2017 Oscars. Believe me. Spielberg has 20 hit movies under his belt. Here is the dataset of all the hit movies he has directed during his career, which is not over yet.

Well, I hope that after going through the movies he has directed, you will agree with the analysis. Great! What to do next? You have done a great job of solving this mystery, so treat yourself by watching one of these Steven Spielberg movies. Which one? You can always enjoy Catch Me If You Can, the journey of a man who conned millions of dollars’ worth of checks, or Saving Private Ryan, the story of a group of U.S. soldiers who go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.

Thank you for staying with me until the end of this thrilling experience. Here are some bonus steps for you to learn more about data analysis.

Bonus: Export the dataset of directors to a CSV file to share this great information with your friends.

#export the bestdirector data frame to the bestdirector.csv file
write.csv(bestdirector, "bestdirector.csv")
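
By default, write.csv also writes the row numbers as an extra unnamed first column. If you do not want that, row.names = FALSE leaves it out. Here is a minimal sketch on a hypothetical two-row table, written to a temporary file so nothing in your working directory is touched:

```r
#hypothetical stand-in for the bestdirector table
best <- data.frame(
  director_name = c("Dir X", "Dir Y"),
  movies        = c(2, 3),
  stringsAsFactors = FALSE
)

#write to a temporary file without the row-number column
path <- tempfile(fileext = ".csv")
write.csv(best, path, row.names = FALSE)

#read it back to check that only the real columns were stored
round_trip <- read.csv(path, stringsAsFactors = FALSE)
print(names(round_trip))
```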

Bonus: Which year had the most successful movies? The code below ranks the years by their number of hit movies and also reports the average IMDB score, budget, gross, and profit of the hits released each year.

#Most successful year (no of hit movies, avg imdb score, avg budget, avg gross, avg profit)
View(hitmovies %>% group_by(title_year) %>% summarise(movies = sum(hit), imdb = mean(imdb_score), budget = mean(budget, na.rm = TRUE), gross = mean(gross, na.rm = TRUE), profit = mean(profit, na.rm = TRUE)) %>% mutate(rank = rank(movies, na.last = NA)) %>% arrange(desc(rank)))

You can also find all the code for this analysis at my GitHub project.

If you want to uncover a few more mysteries, please follow me on Twitter, LinkedIn, or GitHub.