Movie Ratings Case Study

Salem Al Mudhaf
14 min readNov 1, 2022

--

Using Microsoft Excel Pivot Tables to analyze Rotten Tomatoes Movie Ratings

Photo by Myke Simon on Unsplash

Introduction

Throughout this case study, we will dig through a Movie dataset to answer key questions and obtain an overview of the film industry. Also, as a cinema enthusiast, nothing excites me more than exploring this data set. Through access to Maven Analytics, an educational platform, I was able to download the Movie Ratings data set, originally sourced publicly from Rotten Tomatoes, and come up with key findings.

Background info (Rotten tomatoes)

Rotten Tomatoes is an American review-aggregation website for film and television. The Rotten Tomatoes rating system uses a scale better known as the “Tomatometer”. This represents the percentage of positive reviews for a given film or show. The Tomatometer score is calculated after five reviews. As the reviews come in, The Tomatometer measures the positive reviews against the total reviews and assigns an overall Tomatometer rating out of 100. After calculating the ratings, a different scale is used to determine the Tomatometer status. A rotten status is given when less than 60% of the reviews are positive, a fresh status is given when more than 60% of the reviews are positive, and a certified fresh status is given when more than 75% of the reviews are positive. The audience also rates the movies with a rating out of 100.

Data overview

The data set used in this case study comprises ratings and reviews for 15,000+ movies reviewed by Rotten Tomatoes. Each record includes film details (title, description, rating, genre, directors, cast, release date, studio, runtime, etc.), as well as featured reviews, “Tomatometer” status, and audience ratings. The data is downloaded as a CSV file from Maven Analytics’ publicly available datasets. The dataset is also made available by Rotten Tomatoes, a public domain. The dataset contains 16,638 records and 17 fields, with each record pertaining to a unique movie.

Data Processing

Although the dataset seems to be comprehensive, there seem to be some errors that may skew the analysis. Therefore, we need to do some data cleaning to prepare for analysis.

Using the filter icon, it appears that some ratings such as “PG-13)”, amongst others, have extra characters at the end which may cause them to appear as a completely different rating as shown below.

The extra characters have therefore been removed for all records using the find and replace tool in excel.

Another data cleaning step must be taken for ease of analysis. Currently, for the genre column, each film has several different genres. This makes it difficult to obtain the film distribution by genre and therefore it will be assumed that the first genre listed is the primary genre. We must first create a new column labeled “Primary Genre” and then apply the relevant functions. The excel function we can use is the “Left” function, which returns the selected number of leftmost characters within a text. The issue is that each genre has a different number of characters and so a standard number cannot be used. Since all genres are separated by a comma, we can use the “Search” function wrapped inside the “Left” function to return all characters to the left of the first comma identified. The code and resulting column can be shown below.

=LEFT(E2,SEARCH(",",E2)-1)

However, for movies with only one genre, there seems to be an error. This is because excel is not able to find a comma within the text since it does not exist. Therefore, we will wrap our previous code with an “IFERROR” function to eliminate the error and return the genre in the original column.

=IFERROR(LEFT(E2,SEARCH(",",E2)-1),E2)

Key Questions

  1. What does the distribution of films look like by rating? By primary genre?
  2. What % of films received a Certified Fresh Tomatometer rating? What about Rotten?
  3. Explore new film releases over time. How has the volume of releases by month trended over time? What year/month were the most new films released?
  4. Compare average Tomatometer ratings by Studio. Which studios produce the highest-rated films, on average? The lowest?
  5. Compare the Tomatometer ratings against audience ratings. Which films showed the largest discrepancies between audiences and critics?

Analysis and Visualization

The first step in the analysis process would be to use highlight all the cells and insert a pivot table in a separate excel sheet.

In order to grasp the overall distribution of films by rating, the rating field was used as a row and the movie title field as columns. In pivot tables, once a qualitative field such as movie title is added to the columns tab, it defaults to a count, which is just what we want. The number of movies per rating is shown below along with a visualization of the data. To provide some context, NR movies are movies that have not been rated for any possible reason.

However, as can be seen above, 29% of movies were not rated. Therefore, in order to obtain an authentic view of the distribution of rated films by movie ratings, NR films have been removed from this subset as depicted below.

Through the pie chart above, it can be deduced that more than half of the rated films in this data set are rated R. Another thing to notice is that in both visualizations, the NC17 rating is rarely given to a movie.

An interesting thing to explore would be the distribution of films by genre. By placing the primary genre as a row in the pivot table, we are able to view the list of different primary movie genres as shown below:

After using including the movie title field in the value tab, we can see the count of movies falling into the different genre buckets. In order to avoid a crowded visualization, it was decided to only display the top 10 genres by movie count using filters. The list was then sorted by descending movie count to create a column chart displayed below.

The three most common primary genres are Comedy, Drama, and Action & Adventure. The rest of the genres have a much lower count of movies. In order to understand the proportion of movies per genre we will drag another instance of the movie title into the values tab and show the values as a percentage of the column total as shown below.

Only genres representing above 1% of the data were displayed, to focus on the most popular genres. Comedy, Drama, and Action & Adventure make up more than 60% of all movies. It is also quite interesting that movies with Mystery & Suspense as the primary genre only make up 2% of the dataset. This may be due to the fact that movies that have Mystery & Suspense as one of their genres do not consider it the primary genre. To dig deeper into this theory, we can view the total number of movies that have Mystery & Suspense as a genre. This can be done by using the label filter for genres containing “Mystery & Suspense”. The resulting total is 3325 movies. This means that for movies that include Mystery & Suspense, only 8% have it as their primary genre, which is calculated by dividing 261 by 3325.

The next aspect to focus on would be the distribution of films based on Tomatometer status. It is clear from the pie chart below that 18% of films received a Certified Fresh Tomatometer rating, which is obtained when at least 75% of the film's reviews are positive. It is also shown below that 43% of films received a Rotten Tomatometer rating.

Movies have become more popular and available to more people over time. We can visualize this by creating a line chart displaying the number of films in theatre per year for the past 20 years of data. This would pertain to 1999–2018. This is done by placing the movie field in the values tab and the release date in the rows tab. The rows tab is then filtered to only include data from 1999–2018, yielding the chart displayed below.

The graph portrays a steady increase in films released per year over the last 20 years. The trend is not completely linear as some years saw a drop in the number of films released, however, the overall trend displays a positive correlation between the two variables. Another interesting aspect to explore is the months that see the most films released. Therefore we will remove the years from the rows tab and aggregate the data by month to create a column chart. However, we must also remember to clear the filters and include all the years in the dataset.

As displayed above, the month of January experiences a large number of film releases compared to other months. This may be due to it being the start of a new year and a time of holidays. A large number of new films in January may not translate to a higher quality. Therefore we will visualize the average Tomatometer rating for films released in each individual month. This will be done by replacing the movie field with the Tomatometer rating field in the values tab and displaying the values as averages instead of the default sum.

It appears that films released in November and December receive the highest average Tomatometer rating. Surprisingly, the third highest-rated month is January, the month with the most film releases per year. February holds the lowest average Tomatometer rating out of all the months of the year.

The dataset has information on the different studios that produce the films. Therefore, it would be interesting to explore which studios produce the highest and lowest-rated films, on average. This is done by placing the studio name field in the rows tab and the Tomatometer rating field in the values tab. After examining the dataset, it seems that a lot of studios have extremely high ratings after producing only 1 or 2 movies. These studios will skew the findings and lead to inaccurate results. Therefore, we will create a value filter to only display studios which produced more than 50 movies. After that, we add another filter to display the top 10 studios by Tomatometer rating. The column chart created is displayed below.

According to the chart, the most successful studio by average Tomatometer rating is Criterion Collection. They have maintained an average rating above 90 while producing more than 50 films, which is an impressive feat. The difference between the first studio and the second studio is almost equal to the difference between the second studio and the tenth studio. This shows that the Criterion Collection studio has set itself apart in terms of film quality, measured by Tomatometer rating. Criterion Collection is not one of the most popular studios at the present moment and curiosity began to set it in. Therefore, we can look for the particular films that are associated with Criterion Collection studio. We filtered the data for this studio and sorted the movies descending by average Tomatometer ratings. In order to maintain credible information, we filtered the movies displaying only the ones which have been rated by more than 30 film critics. After that, we displayed the top 10 movies sorted by rating. Personally, I had not heard of those top-rated movies. So we will display the years those films were released in the chart below to check if they are recent films or not. The reason the studio name is not very famous today is that its most successful films were released before 1970, more than 50 years ago.

To better understand Criterion Collection’s operating period, we have organized the data to display the number of movies released per year by their studio, as shown below. Indeed, this studio is not active anymore and their most recent film was produced in 1997, 25 years ago.

To obtain a general overview of the size of the different studios within this dataset, we will sort the studios by the number of movies made in descending order. This was done by placing the studio name field in the rows tab and the newly created # of movies field in the values tab. The table will then be filtered to only show the top 10 studios by the number of movies displayed in descending order. Lastly, we will portray the information through a column chart as presented below.

Paramount Pictures have produced the most films throughout this dataset followed by Warner Bros. Pictures and Universal Pictures.

Movies at Rotten Tomatoes have two types of ratings, the Tomatometer and the audience ratings. Comparing the two values will enable us to understand any discrepancies between them. Firstly, we will include both the Tomatometer ratings and audience ratings fields in the rows tab. We will then display those values as averages to obtain a general comparison. As shown below, on average and across the whole dataset, both ratings were equal.

The next step would be to view these measurements at a film level. This is done by including the movie title field with the Tomatometer rating field and the audience rating field in the rows tab. After that, we will create a calculated field that calculates the difference between the two ratings per film as an absolute value to avoid negative answers. We can use this calculated field to identify the movies with the highest discrepancies between the two types of ratings. We will then filter the table to only show the films with the largest difference between the two ratings as depicted below.

As you can observe, large discrepancies can occur when either one of the ratings is very high or very low.

The last factors to observe are the average Tomatometer ratings and counts over time. The year field available in the dataset refers to the year the film was produced. So it would be interesting to visualize the yearly average film rating and rating count over time. The yearly average film rating is the average rating of all the films released in a particular year. The count refers to the number of people who rated the movie. The year field will be placed in the rows tab, and the Tomatometer rating and count fields placed in the values tab. A line chart will then be inserted to visualize the data as presented below.

The general pattern is quite clear from the image, the newer the film, the lower its rating and the higher its rating count. Older films have much higher yearly average ratings compared to newer films. This may be because those older films were rated by significantly fewer people as opposed to the newer films.

The audience rating count is a very useful metric to determine the popularity of a movie. When members of the audience rate a movie, then ideally, they have watched it. Therefore, we can assume that the audience rating count corresponds to the number of people who watched the film. This can be done by only using the numbers in comparison with each other and not as a basis for further calculations.

This information is important because it will be used to create a dynamic dashboard for the most popular movies. Before we create the dashboard we must input the necessary data. We will include the Movie Title field in the rows tab, and the Audience Count and Audience Rating fields in the values tab. A filter will then be placed to only show the top 10 movies by Audience Count. Once that is done, we will include a chart to display the data. This will be a combo chart where the Audience Count is shown through the columns and the Audience Rating is shown through a line on a secondary axis. Now, in order to transform this chart into a dynamic dashboard, we will add a slicer and timeline. A slicer is a tool available in Microsoft Excel that provides buttons that you can click on to filter your data. In this case, we have selected the filter to be Genre. This means that through the slicer, you can select whatever genre you want and the graph will display the 10 most popular movies in that particular genre. The second tool added is a timeline, which is a filter that can be used to limit the data to a particular timeframe. In this case, the timeline will be used to select the particular years reflected in the graph. These years correspond to the release date of the movies. One iteration of the dashboard is depicted below.

The Genre selected is Animation and the time period is from 2005 to 2015. Therefore the graph will only show Animated movies released between 2005 and 2015 sorted descending by popularity. As you can see, the most popular movies, in this case, are “Shrek the Third”, “The Simpsons Movie”, and “Ratatouille”. It is also interesting to note that although the popularity of a movie is presented in descending order, the rating does not necessarily follow the same. It can also be seen that “Ratatouille” and “Toy Story 3” have the highest audience ratings of the 10 films. A second iteration of the dashboard is presented below.

This iteration portrays the most popular “Action & Adventure” movies released from 2010–2019. The movie “Shutter Island” is presented number one, followed by “The Dark Knight Rises”, and “Marvel’s The Avengers”. To further demonstrate the applicability and functionality of the dashboard, a third and final iteration will be portrayed.

This dashboard displays the most popular “Comedy” movies released from 1987–1999, with the movie “American Pie” taking the top spot.

This has been a very interesting dataset to explore. Several variables such as genre, studio, and rating were analyzed through Microsoft Excel Pivot Tables to identify numerous trends and patterns. We were able to clean and process an expansive and informative dataset and made use of key questions to guide the general analysis and obtain an overview of the data. We also dug deeper into the dataset to perform further investigations and designed visualizations to depict our findings. It can be deduced that we have gained a newfound understanding of the film industry as a whole.

--

--