02. Investigate TMDb Movie Dataset (Python Data Analysis Project) — Part 2 Exploratory Data Analysis

May 18, 2018

Note: This project was completed as the second part of Investigate TMDb Movie Dataset, part of the Udacity Data Analyst Nanodegree that I finished in March 2018. For details on Part 1, see the previous article. For the full project reports, code, and dataset files, see my GitHub repository.

In this part I use the dataset cleaned in Part 1 to walk through the exploratory data analysis process and its results.

Problem Researched Part 1

General Exploration

Every time I go to a movie, it’s magic, no matter what the movie’s about.
— Steven Spielberg

We usually define a successful movie by its revenue, reviews, popularity, and so on. But which factors are associated with a successful movie? We know that box-office hits do not always have high ratings, and highly rated movies are not always trending. But how about popularity vs. revenue? It seems that once a film generates a surge of buzz, people are willing to pay to see it whether its reviews are good or bad. Let’s use the dataset to find the answer!

1. Movie Popularity Trend over Years

First I explored the trend in movie popularity over the years, from 1960 to 2015. I computed the mean popularity for each year and plotted a line chart to show the trend. Moreover, since popularity has no upper bound, the mean could be pulled up by a few extremely popular movies, so I also computed the median for each year.
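The yearly mean and median can be computed with a single groupby. The following is a minimal sketch on made-up data, not the original notebook code; the column names release_year and popularity are assumptions about the cleaned dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; the column names are assumed.
df = pd.DataFrame({
    'release_year': [1960, 1960, 1960, 2015, 2015, 2015],
    'popularity':   [0.2, 0.4, 0.6, 1.0, 2.0, 30.0],
})

# Mean and median popularity for each release year.
popularity_by_year = (
    df.groupby('release_year')['popularity'].agg(['mean', 'median'])
)
print(popularity_by_year)
```

In the toy 2015 group, a single very popular movie lifts the mean far above the median, which is the effect the median is meant to guard against.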

We can see that mean popularity trends upward year over year and peaks in 2015, while median popularity is slightly smoother in recent years. On average, popularity has been rising in recent years. The trend is reasonable given how easy it is to access movie information nowadays. In the Internet age, people can easily search for movie information, visit movie introduction pages, rate movies, and even watch the content through different sources. This background probably boosts the popularity of movies.

2. Distribution of Popularity in Different Revenue Levels in Recent Five Years

Movie popularity has been growing in recent years, but how about popularity vs. revenue? Is popularity higher at high revenue levels?

This led me to look at how popularity is distributed across revenue levels. Because revenue spans a wide range, I divided it into four levels, ‘Low’, ‘Medium’, ‘Moderately High’, and ‘High’, based on its quartiles with the cut_into_quantile function. I also restricted the data to the most recent five years in order to focus on recent patterns.

The cut_into_quantile function is written for general use and is shown below. The argument dfname is the target dataframe, and column_name is the variable to be divided.

import pandas as pd

# quartile function
def cut_into_quantile(dfname, column_name):
    # find quartiles, max and min values
    min_value = dfname[column_name].min()
    first_quantile = dfname[column_name].quantile(0.25)
    second_quantile = dfname[column_name].quantile(0.50)
    third_quantile = dfname[column_name].quantile(0.75)
    max_value = dfname[column_name].max()
    # Bin edges that will be used to "cut" the data into groups
    bin_edges = [min_value, first_quantile, second_quantile, third_quantile, max_value]
    # Labels for the four level groups
    bin_names = ['Low', 'Medium', 'Moderately High', 'High']
    # Creates the <column_name>_levels column
    name = '{}_levels'.format(column_name)
    dfname[name] = pd.cut(dfname[column_name], bin_edges, labels=bin_names, include_lowest=True)
    return dfname

Then I applied the function to the data from the most recent five years. I first filtered the dataframe for each of those years, used these dataframes as the function inputs, and cut revenue into four levels. I then used the result to compute the median popularity at each level in each year.
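As a rough illustration of this step, the sketch below filters the last five years, cuts revenue into quartile levels one year at a time, and takes the median popularity per level and year. It runs on made-up data, uses a compact hypothetical helper (cut_into_levels) in place of cut_into_quantile, and assumes the column names release_year, revenue and popularity. Cutting each year separately is one reading of the description above.

```python
import pandas as pd

def cut_into_levels(frame, column_name):
    """Hypothetical compact stand-in for cut_into_quantile."""
    frame = frame.copy()
    edges = frame[column_name].quantile([0, 0.25, 0.5, 0.75, 1]).tolist()
    labels = ['Low', 'Medium', 'Moderately High', 'High']
    frame[column_name + '_levels'] = pd.cut(
        frame[column_name], edges, labels=labels, include_lowest=True)
    return frame

# Toy data: eight movies per year for 2011-2015, plus one older movie.
rows = [{'release_year': 2000, 'revenue': 5.0, 'popularity': 9.9}]
for year in range(2011, 2016):
    for i in range(1, 9):
        rows.append({'release_year': year,
                     'revenue': i * 10.0,
                     'popularity': i * 0.5})
df = pd.DataFrame(rows)

# Keep only the five most recent years present in the data.
last_year = df['release_year'].max()
recent = df[df['release_year'] > last_year - 5]

# Cut revenue into levels within each year, then recombine.
leveled = pd.concat(
    [cut_into_levels(group, 'revenue')
     for _, group in recent.groupby('release_year')])

# Median popularity for each (year, revenue level) pair.
median_pop = (
    leveled.groupby(['release_year', 'revenue_levels'], observed=True)
    ['popularity'].median()
)
print(median_pop)
```

Note that pd.cut raises an error when two bin edges coincide, which can happen on real data if many movies share the same revenue value; the toy data avoids that case.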

Plot the bar chart.

We can see that movies at higher revenue levels have a higher median popularity in the recent five years. Movies at the high revenue level in particular have a markedly higher median popularity than the other levels!

The result is consistent with my earlier point: in this data, high-revenue movies tend to be more popular than movies at lower revenue levels!

But what about the distribution of score ratings across revenue levels? Do high-revenue movies have high score ratings as well? I don’t think that is necessarily so. Let’s explore the question!

3. Distribution of Score Rating in Different Revenue Levels in Recent Five Years

Similarly, I used the same procedure as before to plot the mean score rating at each revenue level in the recent five years.

From the chart above, we can see that there is no big difference in movie rating between revenue levels. So, based on this dataset, high-revenue movies do not have notably higher score ratings!

Problem Researched Part 2

Find the Properties Associated with Successful Movies

I have a very simple definition of a good movie: a good movie makes you forget you’re watching a movie.
— Michael Cimino

In Problem Researched Part 1 I found that movies at high revenue levels tend to be more popular than those at low revenue levels. Now in Part 2 I am going to look for the properties associated with highly popular and highly rated movies.

1. Function and Sample Preparation

The potential properties associated with highly popular or highly rated movies are runtime, budget, cast, director, keywords, genres, and production companies. In the dataset, these fall into two types: quantitative data and categorical data. Runtime and budget are quantitative; the others are categorical.

To find the properties of successful movies for each type, I created two procedures, one for each kind of data.

For quantitative data, I divided the data into levels and looked for the levels associated with more successful movies. I took the whole dataset and used the cut_into_quantile function to divide runtime and budget into four levels by quartile over the full time range: ‘Low’, ‘Medium’, ‘Moderately High’, and ‘High’. Then I found which runtime and budget levels go with higher movie popularity and voting score.
For categorical data, which are cast, director, keywords, genres, and producer, I focused only on movies with high popularity or high rating. I filtered the top 100 most popular and top 100 highest voting score movies in each year, so there were 100 movies for each of the 56 years (from 1960 to 2015). Then I applied the find_top function to the filtered dataframe to count the occurrences of every value in each category and took the top 3 as the good properties. Furthermore, in case the most frequent values also appeared among the least popular or lowest voting score movies, I filtered the bottom 100 movies in every year as well and compared the result with that of the top 100.
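One way to build the yearly top and bottom samples is to sort once and keep the first rows of each year. This is only a sketch of the idea on made-up data; the helper name top_n_per_year and the column names are hypothetical, and the original filtering code may differ.

```python
import pandas as pd

def top_n_per_year(frame, sort_column, n=100, worst=False):
    """Keep the n highest (or, with worst=True, lowest) rows per release year."""
    ordered = frame.sort_values(sort_column, ascending=worst)
    return ordered.groupby('release_year').head(n)

# Toy data with assumed column names.
df = pd.DataFrame({
    'release_year':   [1960, 1960, 1960, 2015, 2015, 2015],
    'original_title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'popularity':     [0.3, 0.9, 0.1, 5.0, 2.0, 8.0],
})

# n=2 keeps the toy example small; the article uses 100 per year.
top2 = top_n_per_year(df, 'popularity', n=2)
worst2 = top_n_per_year(df, 'popularity', n=2, worst=True)
print(top2)
print(worst2)
```

Passing a different sort column, for example the voting score column, would give the high-rating samples in the same way.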

The cut_into_quantile function was shown in Problem Researched Part 1. The find_top function is as follows:

import pandas as pd

# Split pipe-separated values and count how many times each one appears.
# Arguments: dataframe_col is the target dataframe column;
# num is the number of top values to return.
def find_top(dataframe_col, num=3):
    # join the whole column into one string, then split it into a list
    alist = dataframe_col.str.cat(sep='|').split('|')
    # turn the list into a dataframe
    new = pd.DataFrame({'top': alist})
    # count how many times each value appears and keep the top `num`
    top = new['top'].value_counts().head(num)
    return top
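To show what find_top returns, here is a small usage sketch on a made-up genres column. The function body is repeated so the example runs on its own; the data is invented for illustration.

```python
import pandas as pd

def find_top(dataframe_col, num=3):
    # Same logic as above: join, split on '|', count, keep the top `num`.
    alist = dataframe_col.str.cat(sep='|').split('|')
    new = pd.DataFrame({'top': alist})
    return new['top'].value_counts().head(num)

# Three toy movies with pipe-separated genres.
genres = pd.Series(['Drama|Comedy', 'Drama|Thriller', 'Drama|Comedy|Action'])
top_genres = find_top(genres, num=2)
print(top_genres)
```

The result is a Series whose index holds the values (here Drama, then Comedy) and whose entries are their counts, which is why the summary table later reads the .index of each result.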

The following is the top 100 popular/high rating data I filtered for categorical data.

2. Which budget level is associated with high-popularity movies?

I used the cut_into_quantile function to divide the budget data into four levels by quartile, ‘Low’, ‘Medium’, ‘Moderately High’, and ‘High’, and created a level column.

Then I plotted a bar chart of the median popularity at each level.

From the chart above, we can see that movies at higher budget levels have higher popularity. The result is reasonable, since a larger budget may include more spending on promotion and advertising, and with heavier promotion people have more chances to hear about these movies.

3. Which runtime level is associated with high-popularity movies on average?

Similarly, I used the cut_into_quantile function to divide the runtime data into four levels by quartile, ‘Low’, ‘Medium’, ‘Moderately High’, and ‘High’, and created a level column.

Then I plotted a bar chart of the mean popularity at each level.

We can see that more popular movies have longer runtimes.

4. Which casts, directors, keywords, genres, and production companies are associated with high popularity?

First, I found the three most frequent values in each category among the top 100 popular movies, as shown before, and stored the resulting tables in variables in order to create a summary table.

Use the result above to create a summary table.

#Use the result above to create a summary dataframe.
df_popular = pd.DataFrame({'popular_cast': a.index, 'popular_director': b.index, 'popular_keywords': c.index, 'popular_genres': d.index, 'popular_producer': e.index})
df_popular

Finally, I used the same procedure to find the three most frequent values in each category among the 100 least popular movies.

The summary is as follows:

Cast associated with high popularity movies: Robert De Niro and Bruce Willis. This seems reasonable, because I have seen a lot of heavily promoted movies starring them in my country. I think they have indeed been hugely popular over the years!
Director associated with high popularity movies: Steven Spielberg. It is no surprise that he takes first place, since he has won so many awards and honors for his high-quality and popular work!
Both the most popular and the least popular movies are associated with the same three main genres: Drama, Comedy, and Thriller. I infer that these genres are simply common in the movie industry.
Keywords associated with high popularity movies: based on novel and dystopia. This result is no surprise either, especially for based on novel. Nowadays many movies are adapted from novels, such as Harry Potter and The Hunger Games, and they were also famous in my country.
Producer associated with both high popularity and low popularity movies: Warner Bros., Universal Pictures and Paramount Pictures. These three giants of the movie industry have produced movies of widely varying quality.

5. Which budget level is associated with movies that have a high voting score?

Following a similar procedure, I used the cut_into_quantile function to divide the budget data into four levels, then plotted a bar chart of the median voting score at each level.

We can see that there is no big difference in median voting score between budget levels. So, from this result, a high budget is not necessary for a good-quality movie!

6. Which runtime level is associated with movies that have a high voting score?

Following a similar procedure, I used the cut_into_quantile function to divide the runtime data into four levels, then plotted a bar chart of the median voting score at each level.

We can see that there is no big difference in median voting score between runtime levels. So, from this result, a long runtime is not necessary for a good-quality movie.

7. Which directors, keywords, and genres are associated with voting score?

Using the same technique as in question 4, I found the three most frequent values in each category among the yearly top 100 highest-rated movies.

And among the yearly 100 lowest-rated movies.

Comparing the two tables above, we can find that:

Martin Scorsese and Clint Eastwood have, on average, made top-quality movies over the years since 1960, while Woody Allen has made movies with ratings across a wide range.
The top-quality movies since 1960 carry the keywords based on novel and woman director. The based on novel keyword is also among the top popular movies, but the woman director result amazed me!

Problem Researched Part 3

Top Keywords and Genres Trends by Generation

We do things differently. You don’t have to worry about being part of a particular genre. You just go for it.
— Tyler Joseph

We have found the properties associated with high popularity and high rating. Since the dataset has plentiful keyword and genre data covering 1960 to 2015, it would be interesting to find the keyword and genre trends by generation!

To do this, I divided the procedures into two steps:

Step one: group the dataframe into five generations: 1960s, 1970s, 1980s, 1990s and 2000s.
Step two: use the find_top function to find the most frequent keyword and genre in each generation’s dataframe, then use the output to create charts.
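The two steps can be sketched as follows. This is an illustration on made-up data, assuming the columns are named release_year and genres and that the 2000s group covers 2000 to 2015; the original grouping code may have been written differently.

```python
import pandas as pd

def find_top(dataframe_col, num=3):
    # Same logic as the find_top function shown earlier.
    alist = dataframe_col.str.cat(sep='|').split('|')
    new = pd.DataFrame({'top': alist})
    return new['top'].value_counts().head(num)

# Toy data; the values are invented for illustration.
df = pd.DataFrame({
    'release_year': [1965, 1968, 1985, 1987, 2003, 2014],
    'genres': ['Drama|Romance', 'Drama', 'Comedy',
               'Comedy|Drama', 'Drama|Thriller', 'Drama'],
})

# Step one: label each movie with its generation.
edges = [1959, 1969, 1979, 1989, 1999, 2015]
labels = ['1960s', '1970s', '1980s', '1990s', '2000s']
df['generation'] = pd.cut(df['release_year'], edges, labels=labels)

# Step two: most frequent genre within each generation that has movies.
top_genre = {
    generation: find_top(group['genres'], num=1).index[0]
    for generation, group in df.groupby('generation', observed=True)
}
print(top_genre)
```

The same loop with the keywords column in place of genres would give the keyword trend.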

1. Keywords Trends by Generation

We can see that the number of keywords seems very low in the 1960s and 1970s in the database, so I took a quick look at the number of movies released.

Indeed, the number of releases was low in the 1960s and 1970s. According to this article, the movie industry went through a hard time during the 1960s because of financial difficulties, and the boom in home TV sets caused movie audiences to decline.

Back to the keywords: according to the keyword trends figure, the keyword based on novel dominated the 1960s and 1970s. Looking back at the history, this article mentions that:

During the early to mid 1960s, Hollywood looked to literary works and the history books for many of its films. The studios were increasingly willing to pay for film rights to various novels and literary works.

The top keyword in the 1980s was nudity. In the 1990s it was independent film, the era when most studios created independent film divisions. After 2000, woman director was the most common keyword. Nice trend :P!

2. Genres Trends by Generation

As you can see, Drama was the most filmed genre in almost every generation. Only the 1980s were dominated by Comedy.

Conclusions

The goal of this research was primarily to explore three groups of questions:

Part one: General Exploration

In part one, I explored some general questions. The results show that movie popularity has, on average, been growing since 1960. I then focused on high-revenue movies and found that movies at higher revenue levels have higher popularity, on average, in the recent five years. However, movies at higher revenue levels do not have notably higher score ratings in the recent five years. These results made me want to learn more: which properties are associated with high popularity movies? Which properties are associated with a high voting score?

Part two: Find the Properties Associated with Successful Movies

In this part, I first found the properties associated with high popularity movies: high budget levels and longer runtimes. The cast members associated with high popularity movies are Robert De Niro and Bruce Willis; the director is Steven Spielberg; the genres are Drama, Comedy, and Thriller, but these also appear in the least popular movies; the keywords are based on novel and dystopia; and the producers are Warner Bros., Universal Pictures and Paramount Pictures, but these also appear in the least popular movies.

Then I found the properties associated with a high voting score. Rating scores do not differ noticeably between levels of either runtime or budget. In other words, a movie with a low budget or a short runtime may still have a high rating. Martin Scorsese and Clint Eastwood have, on average, made top-quality movies since 1960, and the top-quality movies since 1960 carry the keywords based on novel and woman director.

Part three: Top Keywords and Genres Trends by Generation

In this part, I explored the trend in the number of movies released year by year. Then I explored the keyword and genre trends by grouping the dataframe into five generations: 1960s, 1970s, 1980s, 1990s and 2000s.

The number of movies released is increasing year by year, and the growth is accelerating. In the 1960s and 1970s, the top keyword was based on novel; in the 1980s, it was nudity; in the 1990s, independent film became the top keyword; and after 2000, movies with the keyword woman director were released most. Furthermore, Drama is the most filmed genre in almost every generation. Only the 1980s are dominated by Comedy.

To sum up, I found a lot of interesting information in the dataset. It is rich enough that I could dig out the properties of successful movies using several kinds of metrics and cross-compare the results. I performed the basic data analysis process using the methods I have learned so far. I will keep learning new techniques and look forward to exploring more!