IMDb Data exploration with ggplot2 and dplyr
For this project, we need the dplyr package for SQL-like data manipulation and the ggplot2 package for visualization, so I start by loading both.
First, I load the data for the exploratory analysis, then take a glimpse at its structure and contents.
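A minimal sketch of the loading step could look like the following. The file name `movie_metadata.csv` is an assumption (the original path is not shown in this post), so the sketch falls back to a tiny made-up stand-in file when that file is missing. Only the object name `dataset` comes from the `str(dataset)` call used in this post.

```r
library(dplyr)
library(ggplot2)

# Assumed file name; replace with the path to your copy of the data.
csv_path <- "movie_metadata.csv"

# Fall back to a tiny made-up stand-in so the sketch runs without the real file.
if (!file.exists(csv_path)) {
  csv_path <- tempfile(fileext = ".csv")
  writeLines(c(
    "movie_title,title_year,imdb_score",
    "Movie A,2011,7.1",
    "Movie B,2012,6.4"
  ), csv_path)
}

dataset <- read.csv(csv_path, stringsAsFactors = FALSE)

str(dataset)      # base R: column names, types and first values
glimpse(dataset)  # dplyr: a more compact view of the same information
```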
Here is part of the str(dataset) output.
'data.frame': 5043 obs. of 28 variables:
$ color : chr "Color" "Color" "Color" "Color" ...
$ director_name : chr "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
$ num_critic_for_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
$ duration : int 178 169 148 164 NA 132 156 100 141 153 ...
$ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
$ genres : chr "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
There are 5043 entries and 28 variables in this dataset.
Now I am going to answer a few questions using dplyr and visualize the answers with ggplot2.
The first question is: how many movies were reviewed each year? To answer it, I group the entries by year (title_year) and store the number of entries in each group as n.
# A tibble: 6 x 2
  title_year n
1 1916 1
2 1920 1
3 1925 1
4 1927 1
5 1929 2
6 1930 1
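The grouping and counting described above can be sketched roughly as follows. The data frame here is a made-up toy stand-in for the real data, and dropping entries without a year is my own assumption rather than something the original script is known to do.

```r
library(dplyr)

# Toy stand-in for the real data (values are made up for illustration).
dataset <- data.frame(
  title_year = c(1929, 1929, 1930, NA, 2010, 2010, 2010),
  imdb_score = c(6.3, 8.0, 7.8, 7.1, 6.5, 7.2, 8.8)
)

movies_per_year <- dataset %>%
  filter(!is.na(title_year)) %>%  # assumption: ignore entries with no year
  group_by(title_year) %>%        # one group per year
  summarise(n = n())              # n = number of entries in each group

head(movies_per_year)
```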
It’s hard to see the variation over the years by reading a table, so I plot it as a line chart.
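A line chart of this kind might be drawn as in the sketch below. To keep it self-contained it uses only the six rows printed in the table above, so the real chart would cover many more years. The axis labels and title are my own choices.

```r
library(ggplot2)

# The six per-year counts printed above; the full table has one row per year.
movies_per_year <- data.frame(
  title_year = c(1916, 1920, 1925, 1927, 1929, 1930),
  n          = c(1, 1, 1, 1, 2, 1)
)

p <- ggplot(movies_per_year, aes(x = title_year, y = n)) +
  geom_line() +
  labs(x = "Year", y = "Number of movies", title = "Movies per year")

print(p)
```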
It’s obvious that the number of movies grows quickly after 1990. This may be because it is difficult to rate movies made before 1990 on IMDb, or because there truly are far fewer movies from before 1990.
Now, let’s find out the average IMDb rating over the years. Again, I group the entries by year, but instead of counting the entries in each group, this time I take the average of the IMDb rating (imdb_score).
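Compared with the count, only the summary function changes. Here is a hedged sketch on made-up toy data. The name `avg_score` is mine, and removing rows without a year is again an assumption.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the real data (values are made up for illustration).
dataset <- data.frame(
  title_year = c(1929, 1929, 1930, NA, 2010, 2010, 2010),
  imdb_score = c(6.3, 8.0, 7.8, 7.1, 6.5, 7.2, 8.8)
)

avg_by_year <- dataset %>%
  filter(!is.na(title_year)) %>%
  group_by(title_year) %>%
  summarise(avg_score = mean(imdb_score))  # mean rating per year

p <- ggplot(avg_by_year, aes(x = title_year, y = avg_score)) +
  geom_line() +
  labs(x = "Year", y = "Average IMDb score")

print(p)
```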
Because there are so few movies in the early years, the average score varies widely there, but it becomes more stable in recent years.
Let’s see how the average IMDb rating differs between content ratings. Because some entries don’t have a content rating, I assigned those to ‘Unspecified’.
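One way to do the relabelling and averaging is sketched below. The column name `content_rating` is an assumption, since this post does not show it, and I treat both `NA` and empty strings as missing because either can appear depending on how the file was read. The toy values are made up.

```r
library(dplyr)

# Toy stand-in; `content_rating` is an assumed column name, values are made up.
dataset <- data.frame(
  content_rating = c("PG-13", "", "TV-MA", NA, "PG-13", "PG"),
  imdb_score     = c(6.0, 7.0, 8.5, 6.5, 6.4, 6.2),
  stringsAsFactors = FALSE
)

rating_scores <- dataset %>%
  mutate(content_rating = ifelse(
    is.na(content_rating) | content_rating == "",  # missing in either form
    "Unspecified",
    content_rating
  )) %>%
  group_by(content_rating) %>%
  summarise(avg_score = mean(imdb_score)) %>%
  arrange(desc(avg_score))

print(rating_scores)
```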
It looks like programs involving mature content (TV-MA) tend to have higher IMDb ratings. On the other hand, PG-13 and PG have lower ratings. Still, the average score for every content rating is above 6.
The next question is: which directors have the highest average IMDb ratings for their movies? Here are the top 20 directors.
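A ranking like this could be produced roughly as follows. The scores in the toy data are made up, and the extra `n_movies` column is my own addition: it makes it easy to spot directors whose average rests on very few titles.

```r
library(dplyr)

# Toy stand-in for the real data (scores are made up for illustration).
dataset <- data.frame(
  director_name = c("James Cameron", "James Cameron", "Christopher Nolan",
                    "Christopher Nolan", "Sam Mendes", ""),
  imdb_score    = c(7.9, 7.7, 8.8, 9.0, 6.8, 7.0),
  stringsAsFactors = FALSE
)

top_directors <- dataset %>%
  filter(director_name != "") %>%           # skip entries with no director
  group_by(director_name) %>%
  summarise(avg_score = mean(imdb_score),
            n_movies  = n()) %>%
  arrange(desc(avg_score)) %>%
  head(20)                                  # keep the 20 highest averages

print(top_directors)
```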
It seems that among the top 20 directors, most have about the same rating, except John Blanchard, who has an average rating of 9.5.
Finally, for movies released after 2010, I want to know the top 3 movies in each year, along with their director, actors and gross. First I set the years, then I used a for loop to get the result for each year.
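The loop can be sketched along these lines. Several things here are assumptions: that "top" means highest imdb_score, that the columns are called `movie_title` and `actor_1_name`, and that the years are derived from the data instead of typed in by hand. The toy rows are made up.

```r
library(dplyr)

# Toy stand-in; `movie_title` and `actor_1_name` are assumed column names.
dataset <- data.frame(
  movie_title   = paste("Movie", LETTERS[1:7]),
  title_year    = c(2009, 2011, 2011, 2011, 2011, 2012, 2012),
  director_name = paste("Director", 1:7),
  actor_1_name  = paste("Actor", 1:7),
  gross         = c(10, 20, 30, 40, 50, 60, 70) * 1e6,
  imdb_score    = c(9.0, 6.1, 8.2, 7.4, 7.9, 6.6, 7.0),
  stringsAsFactors = FALSE
)

# Every year after 2010 that occurs in the data.
years <- sort(unique(dataset$title_year[dataset$title_year > 2010]))

top_by_year <- list()
for (yr in years) {
  top3 <- dataset %>%
    filter(title_year == yr) %>%
    arrange(desc(imdb_score)) %>%   # assumption: rank by IMDb score
    select(title_year, movie_title, director_name,
           actor_1_name, gross, imdb_score) %>%
    head(3)
  top_by_year[[as.character(yr)]] <- top3
  print(top3)
}
```

Storing each year's result in a list is optional; it simply keeps the tables available after the loop has finished printing them.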
I noticed that some movies appear more than once with exactly the same title and gross. They are actually the same movie, with minor differences in other columns. Since I want to keep only one record per movie, I adjusted my code to exclude the duplicate records.
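One way to drop such duplicates is dplyr's `distinct()`, keyed on title and gross. This is a sketch of the idea, not necessarily how the original script does it. The column name `movie_title` is assumed and the toy rows are made up.

```r
library(dplyr)

# Toy stand-in: "Movie A" appears twice and differs only in the review count.
dataset <- data.frame(
  movie_title            = c("Movie A", "Movie A", "Movie B"),
  title_year             = c(2011, 2011, 2011),
  gross                  = c(100, 100, 250),
  num_critic_for_reviews = c(300, 301, 150),
  stringsAsFactors = FALSE
)

deduped <- dataset %>%
  # Keep the first row for each (title, gross) pair, with all other columns.
  distinct(movie_title, gross, .keep_all = TRUE)

print(deduped)
```

Applying this step before ranking means each movie can take at most one of the three slots for its year.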
Now the results make more sense to me, without the duplicate records.
There are some other variables we can dig into. I will post them in the next story.
To see the full script and raw data, click here.