Analyzing Box Office Data
Introduction
I’m a pretty big cinephile. I like going to the movies and thinking about movies, so for my first data science project I decided to analyze box office data. I wanted to see what trends and interesting tidbits I could uncover about financially successful movies.
Data
On Kaggle.com I found a dataset entitled “Top 10 Highest Grossing Films (1975–2018)” (link). It lists the ten highest-grossing films for each year from 1975 to 2018, along with some information about each one: genre, IMDb rating, runtime, rank within the top ten, MPAA rating, studio, and worldwide gross. The dataset had no missing values, so data cleaning was minimal. I just had to strip the commas and dollar signs from the worldwide gross so it could be treated as a number.
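A minimal sketch of that cleaning step in pandas. The column name `worldwide_gross` is a stand-in (the Kaggle file may label it differently), and the three rows are made up for illustration:

```python
import pandas as pd

# Toy rows standing in for the Kaggle file; the column name is an assumption.
df = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "worldwide_gross": ["$1,000,000", "$250,500,000", "$75,000"],
})

# Remove dollar signs and commas, then convert the strings to integers.
df["worldwide_gross"] = (
    df["worldwide_gross"]
    .str.replace(r"[$,]", "", regex=True)
    .astype("int64")
)

print(df)
```

Once the column is numeric, sums, averages and standard deviations all work on it directly.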
Questions
From this data I wanted to answer three questions:
1) What does the typical successful movie look like (in terms of genre, runtime, MPAA rating, etc.)?
2) How much do a handful of studios dominate the business?
3) Has Hollywood become more dominated by a handful of studios over time?
Answers
Most of the answers to the first question came easily. The average runtime is 119.87 minutes, and the most common MPAA rating is PG-13. Genre is more nuanced. Strictly by count, the most common genre among these successful movies is Thriller, followed by Comedy and then Fantasy. The least common genres are History, Sports and Horror.
That’s not the end of the story, though. If you look at the average worldwide gross of the movies in each genre, you’ll see that fantasy, adventure and animation films do the best on average, while thrillers and comedies are in the middle of the pack. So the movies that are a little bit less likely to succeed, such as fantasy and adventure films, are the ones that wind up doing exceptionally well. Meanwhile, comedies and thrillers seem to be the safe bets: they’re likely to do well, but not exceptionally well. History, sports and horror films rank low in both frequency and average box office, so they seem to be the least likely to be successful.
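These summary numbers can be computed with a few pandas one-liners. The sketch below uses made-up rows and assumed column names (`runtime`, `mpaa_rating`, `genre`), and it assumes each film carries a single genre label, which may not match the real file:

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "runtime": [100, 120, 140, 120],
    "mpaa_rating": ["PG-13", "PG-13", "R", "PG"],
    "genre": ["Thriller", "Comedy", "Thriller", "Fantasy"],
})

# Average runtime in minutes.
average_runtime = df["runtime"].mean()

# Most common MPAA rating (mode() returns a Series, so take the first entry).
most_common_rating = df["mpaa_rating"].mode()[0]

# Number of films per genre, most frequent first.
genre_counts = df["genre"].value_counts()

print(average_runtime, most_common_rating)
print(genre_counts)
```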
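To see frequency and average gross side by side, one option is a single grouped summary. This is a sketch on made-up numbers, with `genre` and `worldwide_gross` as assumed column names:

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "genre": ["Thriller", "Thriller", "Thriller", "Fantasy", "Comedy", "Comedy"],
    "worldwide_gross": [100, 200, 300, 900, 100, 200],
})

# For each genre: how many films appear, and what they gross on average.
genre_summary = (
    df.groupby("genre")["worldwide_gross"]
    .agg(films="count", mean_gross="mean")
    .sort_values("mean_gross", ascending=False)
)

print(genre_summary)
```

In this toy table the genre with the most films is not the one with the highest average, which is the same kind of gap described above.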
The second question was how much a handful of studios dominate the movie business. To answer this I summed the box office each studio earned across the dataset (1975–2018) and produced the graph below.
Five studios make most of the money in Hollywood (in descending order): Warner Bros., Walt Disney, 20th Century Fox, Paramount Pictures and Universal Pictures. Hollywood is a business dominated by the big players. If you want a successful movie, it helps to have it produced by one of these studios.
*Interesting tidbit: The National Air and Space Museum appears in this dataset as a studio. It made the second-highest-grossing movie of 1976 (Rocky was the first), a twenty-seven-minute documentary called “To Fly!” The film is an outlier in another way too: it is the shortest film in the dataset.
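The studio totals described above come down to a group-and-sum. Here is a sketch on made-up rows, with `studio` and `worldwide_gross` as assumed column names. The `top_n` cutoff and the share calculation are additions for illustration and are not taken from the original notebook:

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "studio": ["Studio A", "Studio B", "Studio A", "Studio C", "Studio B", "Studio A"],
    "worldwide_gross": [500, 300, 400, 100, 200, 100],
})

# Total gross per studio, largest first.
studio_totals = (
    df.groupby("studio")["worldwide_gross"].sum().sort_values(ascending=False)
)

# Fraction of all box office earned by the top_n studios (hypothetical cutoff).
top_n = 2
top_share = studio_totals.head(top_n).sum() / studio_totals.sum()

print(studio_totals)
print(f"Top {top_n} studios earn {top_share:.1%} of the total")
```

A single share figure like this puts a number on "most of the money" instead of relying on the shape of a bar chart.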
The third question was whether Hollywood is becoming less balanced over time. Is Hollywood moving to a state where only a handful of movies make all the money? To answer this I found the standard deviation of the worldwide gross of the top ten films in each year.
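A sketch of the per-year standard deviation, again on made-up rows with assumed column names (`year`, `worldwide_gross`). Note that pandas computes the sample standard deviation (`ddof=1`) by default. Whether the original analysis used the sample or population version is not stated:

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "worldwide_gross": [100, 200, 300, 100, 100, 700],
})

# Spread of grosses within each year (sample standard deviation, ddof=1).
yearly_std = df.groupby("year")["worldwide_gross"].std()

print(yearly_std)
```

A year where one film towers over the rest, like the toy 2001, shows up as a larger value.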
There is an upward trend in standard deviation, but this could be attributed to inflation, since the spread in dollar terms grows as grosses themselves grow. My conclusion is that Hollywood is not becoming more unbalanced as time goes on.
*Interesting tidbit: the spikes in 1997 and 2009 are due to Titanic and Avatar respectively.
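One way to probe the inflation caveat, which is not part of the original analysis, is to divide each year's standard deviation by that year's mean gross. This ratio is the coefficient of variation, and it does not change if every gross in a year is scaled by the same factor. The sketch below uses made-up rows and assumed column names:

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001, 2001],
    "worldwide_gross": [100, 200, 300, 100, 100, 700],
})

by_year = df.groupby("year")["worldwide_gross"]

# Standard deviation relative to the year's average gross (unitless).
relative_spread = by_year.std() / by_year.mean()

print(relative_spread)
```

If this ratio also trended upward on the real data, inflation alone would not explain the trend. If it stayed flat, that would support the conclusion above.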
Conclusion
From analyzing this dataset I found that if you want to make a successful movie, it should be about 120 minutes long and have an MPAA rating of PG-13. It shouldn’t be a history, sports or horror movie, and it should be produced by one of the top five movie studios.
Some other factors worth investigating for their effect on a movie’s success are whether it is part of a franchise (and how many previous films are in that franchise) and how much was spent on marketing it. The dataset does not include this information, so further research is required.
Link to Pandas notebook