Watching movies is one of my favorite things to do either with friends or just by myself. Rotten Tomatoes provides an amazing guideline for choosing movies when I can’t decide what to watch.
I came across this movie dataset on Kaggle, with movie title, description, genres, duration, director, actors, users’ ratings, critics’ ratings, and more, and I decided to explore the world of movie genres from the perspective of a data analyst. In this blog, I focus on exploratory data analysis and use graphs and tables to answer some fun questions I have.
A glance at all genres
Movies are usually categorized with multiple genres. We selected the first listed genre as the main genre for each movie to create a “main_genre” column.
Plotting the main genre as a histogram quickly shows the top 10 most popular genres. It’s not surprising to see that Drama, Comedy, and Action movies are the top choices for filmmakers.
From the boxplots above, we can see that the Horror and Television genres have the lowest ratings according to both audience_rating and tomatometer_rating. Since Television is not really a movie genre, we can conclude that Horror is the lowest-rated genre.
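As a rough illustration of how the main genre column and the top-10 counts could be built, here is a minimal sketch on toy data. It assumes that genres are stored as one comma-separated string per movie; the sample rows are made up for the example.

```python
import pandas as pd

# Toy stand-in for the Kaggle data; only the 'genres' column matters here.
movies = pd.DataFrame({
    "genres": ["Drama, Romance", "Comedy", "Horror, Mystery & Suspense",
               "Drama", None],
})

# Take the text before the first comma as the main genre
# (rows with missing genres stay missing).
movies["main_genre"] = movies["genres"].str.split(",").str[0].str.strip()

# Count movies per main genre and keep the 10 most frequent.
top_genres = movies["main_genre"].value_counts().head(10)
print(top_genres)
```

The resulting counts can be passed straight to a bar plot, for example with `top_genres.plot(kind="bar")`.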
EDA of Horror Movies
Since relatively few movies have horror as their main genre, in this analysis we count a movie as horror as long as its genres contain any horror element.
horror_movies = movies[movies['genres'].str.contains('Horror')]
horror_movies.shape
>>>(2043,23)
Only 2,043 movies (11.5%) out of a total of 17,711 contain horror elements.
1. Missing data
Using the function above, we can see that 53% of the critics consensus values are missing for horror movies. I assume Rotten Tomatoes critics don’t care about horror movies enough to write a consensus for them.
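A helper for this kind of missing-data check might look like the sketch below. The function name `missing_share` and the toy rows are hypothetical, and the column name `critics_consensus` is assumed from the dataset description.

```python
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "critics_consensus": [None, "Scary fun.", None, None],
})

report = missing_share(toy)
print(report)
```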
2. Variable Analysis
Variable analysis contains 3 major parts: Univariate, Bivariate, and Multivariate.
2.1 Univariate: highlight missing values and outliers
Categorical: Frequency table/ bar chart for distribution for each category
Continuous: Central tendency and spread using box plot or histogram
Audience Rating and Tomato-meter rating
The distribution of tomatometer_rating is very interesting. Although many horror movies received a rating of either 0 or 100, the rest of the movies are distributed almost uniformly, with an average of 52.2 and a standard deviation of 25.9.
Unlike tomatometer_rating, audience_rating doesn’t contain as many extreme values; it has an average of 46.4 with a standard deviation of 19.8. I guess audiences are not as easily triggered as the Tomatometer when it comes to rating horror movies?
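The summary numbers for the two rating columns can be reproduced with a few lines of pandas. The sketch below uses made-up values, so its results will not match the figures quoted above; it only shows how the mean, the standard deviation, and the share of extreme scores (0 or 100) can be computed.

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    "tomatometer_rating": [0, 100, 35, 60, 80, 100, 0, 45],
    "audience_rating": [30, 70, 40, 55, 62, 66, 28, 47],
})

# Central tendency and spread for each rating column.
summary = toy.agg(["mean", "std"]).round(1)

# Fraction of movies sitting at the extremes (exactly 0 or 100).
extreme_share = toy.isin([0, 100]).mean()

print(summary)
print(extreme_share)
```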
2.2 Bivariate: find relationship between two variables (correlation analysis)
Categorical vs Continuous : box plot
Continuous vs Continuous: scatter plot, regression plot
The tomatometer_rating and audience_rating columns are only very weakly positively correlated.
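One way to put a number on that relationship is the Pearson correlation coefficient, which pandas computes by default. This is a sketch on made-up values, not the result from the actual dataset.

```python
import pandas as pd

# Made-up ratings for illustration only.
toy = pd.DataFrame({
    "tomatometer_rating": [0, 100, 35, 60, 80, 20],
    "audience_rating": [40, 55, 50, 30, 60, 45],
})

# Series.corr uses Pearson correlation by default and skips missing pairs.
r = toy["tomatometer_rating"].corr(toy["audience_rating"])
print(round(r, 3))
```

Values close to 0 indicate a weak linear relationship, and values close to 1 or -1 a strong one.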
G and NC17 are the two extreme content ratings; they both have higher average ratings with lower average rating counts. Not that many people watched G and NC17 movies, but people tend to give higher ratings once they have watched them.
The following are the functions that I used for this part of the project. For more detailed analysis of each variable, check this notebook!
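A comparison like this boils down to a group-by on the content rating. In the sketch below, the column names `content_rating` and `audience_count` are assumptions about the dataset, and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "content_rating": ["G", "G", "R", "R", "R", "NC17"],
    "audience_rating": [70, 80, 40, 50, 45, 75],
    "audience_count": [100, 200, 5000, 7000, 6000, 150],
})

# Average rating and average number of ratings per content rating.
by_rating = toy.groupby("content_rating").agg(
    avg_rating=("audience_rating", "mean"),
    avg_rating_count=("audience_count", "mean"),
    movies=("audience_rating", "size"),
)
print(by_rating)
```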
3. Time for questions and answers!
Which types of horror movies have higher ratings or are more popular?
Most horror movies are mainly focused on horror, followed by Mystery & Suspense; their ratings are a little below the average (49). Documentary and Animation are the top two genres for rating, but not many of them are made. It is safe to say that Classic & Horror is the best type, given its relatively high movie count and high rating.
In terms of popularity, horror movies with romance have the highest rating counts.
When did horror movies start to be released, and when did they start to gain popularity?
The first horror movies we have records for are “The Penalty” and “The Phantom Carriage”, both released on 1920-01-01. They are both NR and, surprisingly, they both have high ratings (80 and 100!).
Horror movies started to gain popularity around the ’80s and really became more accepted after 2003. Horror movies didn’t start streaming until 1998, and it took almost 10 years for horror movies to finally start growing. The streaming release count reached its peak in 2016-2017 and started to slow down afterward.
What happened in 2016-2017?
From the graph above, we can see that 2016 had the most production companies releasing horror movies, but the number dropped in 2017. The question becomes: what happened in 2016? I came up with two hypotheses and tried to test them using data visualization.
- Maybe some companies re-released movies for streaming?
t3=horror_movies[horror_movies['streaming_year'] == 2016].groupby('original_year').rotten_tomatoes_link.count()
t3.loc[2016]/len(horror_movies.rotten_tomatoes_link)
>> 2.2026431718061676 %
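By this calculation, only 2.2% of the movies were originally released in 2016, which suggests that most of the movies streaming-released in 2016 are actually older movies. One caveat: as written, the snippet divides by the total number of horror movies rather than by the number streaming-released in 2016 (t3.sum()), so the exact share among 2016 streaming releases needs that denominator instead.
- Maybe new production companies started to produce horror movies?
sum(horror_movies.groupby('production_company').streaming_year.min().values==2017)/len(horror_movies.production_company.value_counts())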
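The ratio in the snippet above uses the number of all horror movies as its denominator. The sketch below, on made-up rows, shows the difference between that ratio and the share computed only among movies streaming-released in 2016; the variable names `share_new` and `share_of_all` are introduced here for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
horror_movies = pd.DataFrame({
    "rotten_tomatoes_link": ["m/a", "m/b", "m/c", "m/d", "m/e", "m/f"],
    "original_year": [1980, 1999, 2016, 2010, 2016, 2005],
    "streaming_year": [2016, 2016, 2016, 2016, 2017, 2012],
})

# Movies streaming-released in 2016, counted by original release year.
streamed_2016 = horror_movies[horror_movies["streaming_year"] == 2016]
t3 = streamed_2016.groupby("original_year")["rotten_tomatoes_link"].count()

# Share of 2016 streaming releases that were also originally released in 2016.
share_new = t3.get(2016, 0) / t3.sum()

# Same numerator, but divided by every horror movie in the data.
share_of_all = t3.get(2016, 0) / len(horror_movies)

print(share_new, share_of_all)
```

The two ratios answer different questions, so it is worth stating which denominator a quoted percentage uses.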
>> 32.48882265275708 %
What are the best year and the worst year for horror movies?
I selected movies originally released after 1998, since streaming releases started in 1998. In the most recent year, 2020, not many people participated in the rating, yet the average rating reached its highest. With that caveat in mind, 2020 looks like the best year for horror movies.
The average rating reached its lowest in 2005, but the number of people who rated went up compared to the previous year, which suggests that some lower-quality horror movies were released that year.
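A yearly comparison of this kind can be sketched as follows. The rows are made up, and the choice of `audience_rating` and `audience_count` as the rating and participation columns is an assumption, since the post does not say which columns were used.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "original_year": [1995, 2004, 2004, 2005, 2005, 2020],
    "audience_rating": [90, 50, 60, 30, 40, 85],
    "audience_count": [10, 1000, 3000, 4000, 2000, 50],
})

# Keep movies originally released after 1998.
recent = toy[toy["original_year"] > 1998]

# Average rating and total number of ratings per release year.
by_year = recent.groupby("original_year").agg(
    avg_rating=("audience_rating", "mean"),
    total_ratings=("audience_count", "sum"),
)

best_year = by_year["avg_rating"].idxmax()
worst_year = by_year["avg_rating"].idxmin()
print(by_year)
print(best_year, worst_year)
```

Looking at `total_ratings` next to `avg_rating` helps flag years where a high average rests on very few ratings.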
What are the best production companies for horror movies?
For the best production companies, we cannot consider ratings alone, since there are 32 companies that each released one very highly rated movie and then made no more horror movies.
Universal Pictures has the highest movie count and the highest average rating among the top 10 most prolific companies. This column is significantly skewed: about 90% of the production companies produced fewer than 5 movies, which doesn’t provide much useful information.
It is possible that bigger production companies released far more movies than the others and are winning on quantity. Thus, we remove outliers using the IQR method.
20th Century Fox and Lionsgate produced relatively high-quality horror movies relative to their total movie counts.
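For reference, the IQR method keeps values within 1.5 times the interquartile range of the first and third quartiles. The sketch below applies it to per-company movie counts, which is an assumption about the column it was used on; the studio names and counts are made up.

```python
import pandas as pd

# Made-up movie counts per production company.
counts = pd.Series(
    {"Studio A": 1, "Studio B": 2, "Studio C": 2, "Studio D": 3,
     "Studio E": 3, "Studio F": 4, "Studio G": 40},
    name="movie_count",
)

# First and third quartiles, and the interquartile range.
q1, q3 = counts.quantile([0.25, 0.75])
iqr = q3 - q1

# Keep companies whose counts fall inside the usual 1.5 * IQR fences.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = counts[counts.between(lower, upper)]
print(kept)
```

In this toy example the one very prolific studio falls outside the upper fence and is dropped, while the rest are kept.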
Thanks for reading. Please check the completed notebook here.