EDA PROJECT — What are the factors affecting IMDB ratings?

Büşra Türker
İstanbul Data Science Academy
Aug 3, 2021

I present to you my first Medium post and my Exploratory Data Analysis (EDA) project, which was part of the Data Science journey I started at Istanbul Data Science Academy. We completed and presented our projects after the first two weeks of training. Here, I describe what I did in the project.

When choosing a subject and a dataset, I wanted something that interested me and would be fun to analyze, so I decided to work with the IMDB movies data that I found on Kaggle. This report explores IMDB movie ratings and looks at why audiences like some movies more than others.

The fundamental questions of this report are as follows:

§ What is the reason why some movies have high ratings while others have low ratings?

§ Which features affect the ratings?

§ Which director and lead actor have the highest votes? (Just for fun)

I utilized Exploratory Data Analysis (EDA) on the data to answer these questions.

Variables

· Title: Name of the movie

· Year: Release year (between 2000 and 2020)

· Genre: Comedy, Fantasy, Romance, Drama, Music, Animation, Adventure, Action, Horror, etc.

· Duration: Total running time of the movie (in minutes)

· Country: The country of the movie

· Language: The main language of the movie

· Director: The director of the movie

· Writer: The writer of the movie

· Actor: Lead actor

· Votes: Number of votes

· Budget: The money spent on the movie

· Worldwide Gross Income: Worldwide gross earnings

· Reviews from Users: Number of user reviews

· Average vote: The mean of ratings

In particular, I examine the extent to which Genre, Duration, Country, Language, Lead Actor, Director, and Budget contribute to each movie's rating.

After investigating the data and its distribution, I moved on to visualization. For this, I used the Plotly library in Python, since I really like its graphs and visuals.

First, I looked at the effect of genre on the average vote. As you can see from the graph below, action is the genre with the highest rating ratio, followed by comedy. Movies in the musical genre, by contrast, are not favored by most voters. So, we can say that genre is a distinguishing feature for movies.
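A minimal sketch of how a genre comparison like this could be computed, assuming each movie is reduced to a (genre, average vote) pair. The rows are made-up toy values, not the IMDB data, and `mean_vote_by_genre` is a hypothetical helper name used only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy rows (made-up values, NOT taken from the IMDB dataset): (genre, average vote).
rows = [
    ("Action", 7.0),
    ("Action", 6.0),
    ("Comedy", 6.5),
    ("Comedy", 5.5),
    ("Musical", 5.0),
]


def mean_vote_by_genre(rows):
    """Return {genre: mean of average votes}, ordered from highest to lowest."""
    votes = defaultdict(list)
    for genre, vote in rows:
        votes[genre].append(vote)
    means = {genre: mean(values) for genre, values in votes.items()}
    # Sort by the mean so the best-rated genre comes first.
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


genre_means = mean_vote_by_genre(rows)
print(genre_means)
```

The sorted dictionary can then be passed to any plotting library as the x and y values of a bar chart.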

Now, let’s look at the effect of language, which in my opinion is also a vital feature of a movie. The graph below shows the distribution of movies by language in 2016. Even for this single year, the most common language is “English”. However, the highest-rated film of 2016 is in Turkish. So, we can say that other languages may also reach a higher average rating.
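The two observations here (the most common language in a year versus the language of that year's top-rated film) are separate computations. A small sketch, assuming each movie is a (title, year, language, average vote) tuple; the rows are made-up placeholders and `language_summary` is a hypothetical helper.

```python
from collections import Counter

# Toy rows (made-up values, NOT real IMDB entries):
# (title, year, language, average vote).
movies = [
    ("Movie A", 2016, "English", 7.1),
    ("Movie B", 2016, "English", 6.4),
    ("Movie C", 2016, "Turkish", 8.9),
    ("Movie D", 2016, "French", 7.0),
    ("Movie E", 2015, "English", 9.0),
]


def language_summary(movies, year):
    """Return (most common language, language of the top-rated movie) for a year."""
    in_year = [m for m in movies if m[1] == year]
    # Count how many movies of that year use each language.
    counts = Counter(m[2] for m in in_year)
    most_common_language = counts.most_common(1)[0][0]
    # Pick the single movie with the highest average vote in that year.
    top_rated = max(in_year, key=lambda m: m[3])
    return most_common_language, top_rated[2]


most_common, top_language = language_summary(movies, 2016)
print(most_common, top_language)
```

Keeping the count and the maximum apart makes it clear that the most frequent language does not have to be the language of the best-rated film.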

Next, let’s consider the country. The effect of the country can be seen as parallel to that of the language. In the pie chart below, the USA as a country and action as a genre take the biggest slice. This is consistent with “English” being the majority language in the graph above, since English is the main language of the USA and the UK. In addition, this majority group of country and genre has an average vote of almost 7.5.

As for the effect of worldwide gross income, consider these questions. Would you vote higher for a movie with a higher gross income? Or do you think a movie with a higher rating earns a higher worldwide gross income? The answer is hidden in the chart below. First, worldwide gross income has increased in recent years. Second, this feature alone cannot be a factor in getting high votes: although the average votes are concentrated between 6 and 8, movies rated close to 9 have lower worldwide gross income.
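One way to put a number on a relationship like this is the Pearson correlation coefficient between gross income and average vote. The sketch below is an illustration, not a calculation from the project: the `income` and `votes` lists are made-up toy values and `pearson` is a hand-written helper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of the coefficient).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations (denominator).
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy values (made-up, NOT from the IMDB dataset):
# worldwide gross income in millions and the matching average votes.
income = [50, 120, 300, 800, 40]
votes = [6.2, 7.0, 6.8, 7.4, 8.9]

r = pearson(income, votes)
print(round(r, 3))
```

A value near 1 or -1 would point to a strong linear relationship, while a value near 0 would suggest that income alone says little about the rating.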

Finally, let’s examine the effect of the movie’s duration. The graph below shows a positive relationship between duration and average vote: longer movies tend to have higher ratings. However, the last two films, at around 300 minutes, can be treated as outliers. We can also say that the films with the highest average vote and the highest number of votes run for around 150 minutes.
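Outliers such as very long films can also be flagged numerically rather than by eye. A common rule of thumb, swapped in here as an assumption and not necessarily what the project used, marks values more than 1.5 times the interquartile range (IQR) outside the quartiles. The durations are made-up toy values and `iqr_outliers` is a hypothetical helper.

```python
from statistics import quantiles

# Toy durations in minutes (made-up values, NOT from the IMDB dataset).
durations = [90, 95, 100, 105, 110, 120, 130, 150, 300]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


outliers = iqr_outliers(durations)
print(outliers)
```

Flagged values can then be inspected or excluded before judging the relationship between duration and average vote.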

After the analysis and visualization of the data, it is now “just for fun!” time. Let’s look at the top 5 IMDB movies between 2000 and 2020. Peter Jackson and Christopher Nolan, the directors of the Lord of the Rings and Batman films, which are my favourites, seem to have no intention of leaving the summit… What do you think?

Also, it’s nice to see a movie from my country in the top five.

I tried to share the experience I gained from this project and to give an idea about data analysis and visualization. It’s up to you to ponder the reasons more deeply. I hope it was useful.

GitHub:

https://github.com/BusraTurker/Projects/tree/main/Data%20Science

Kaggle:

https://www.kaggle.com/chenyanglim/imdb-v2/code
