How Data Science & Machine Learning are Revolutionizing The Film Industry

Where Mathematics Marries Commerce

Aditya Shukla
DataX Journal
12 min read · Jul 16, 2020


Data Science and Machine Learning have been capturing the world’s attention for at least the last decade. From CS majors to business students, everybody is intent on using this technology to improve decision-making, turning it into an empirical process rather than an instinctual one. This is generally a better approach, since instinctual or “gut” decisions often carry subconscious biases and are rarely objective.

The use of data analytics models to improve organizational decision-making is not a new concept, though. The Oakland Athletics baseball team, led by Assistant General Manager Paul DePodesta, used sabermetrics to make better and less biased decisions when scouting players. Sabermetrics is the empirical analysis of baseball that can be used to find players who may be undervalued or overlooked because of superficial “defects” but who can prove to be excellent picks. This is chronicled in the 2011 Academy Award-nominated movie Moneyball, starring Brad Pitt and Jonah Hill.

A still from Moneyball where Dr. Peter Brand goes over his sabermetric model for scouting players

The Houston Rockets, an NBA basketball team, used a similar approach to optimize their playing strategy, climbing several positions in the standings between 2014 and 2017.

Data Science & Films

Coming to the application of data analysis and machine learning models in the entertainment industry, we will look at a machine learning application built by two researchers, Michael T. Lash and Kang Zhao, of the Departments of Computer Science and Management Sciences (respectively) at the University of Iowa. You may access their detailed research paper here: https://arxiv.org/pdf/1506.05382v2.pdf

When we begin to talk about movie production, we must touch upon the ever-present dilemma of movie producers and financiers: the Box Office collection.

Producing movies is, for the most part, similar to investing in the stock market: you may have a general idea of which movie will perform well at the box office and which will bomb, but no certainty. We see this constantly: supposed blockbusters headlined by megastars flop at the box office, while sleeper hits, independently produced movies, go on to make many millions of dollars.

A key term that is often overlooked here is ROI. Several sophisticated analytics models exist that try to predict the box office takings or footfall for movies. However, they often fail to factor in the budget, and consequently the return on investment. A movie may earn 150 million dollars at the box office, but if its budget was 180 million dollars, it has lost money, while a movie that earned 20 million dollars on a budget of 3 million dollars is very profitable; producers would prefer the latter.

When making a predictive model on the revenue earned by a movie, feature engineering is incredibly important. In general, a movie may be studied or analyzed based on three broad categories of features.

  1. Audience-based Features
  2. Release-based Features
  3. Movie-based Features

Audience-based features capture the audience’s reception of a movie, whether optimistic or pessimistic: these can be retrieved from tweets about the movie, the volume of discussion, and sentiment analysis of critic reviews and trailer comments on YouTube.

Release-based features include the number of theatres the movie was released in, time of the year of release, number of competing films, etc.

Movie-based features include details of the cast and what the movie itself is about. Of the cast, the lead or “star” is analyzed and evaluated for their “star power,” calculated from their past profitable ventures. In addition, the model leveraged a social network between the actors, a collaborative network, to evaluate whether prior collaborations could positively affect their chemistry. As for what the movie is about, the genre, the MPAA rating, and the plot synopsis were utilized.

However, Lash and Zhao wanted to build a predictive model that could take the things we know about a movie in pre-production and predict whether it could be profitable at the box office. Hence, in their early prediction model, out of the three only movie-based features could be used since audience reception and release details are both evaluated in post-production.

The feature set that was used was primarily 4-fold:

  1. “Who” features — who was involved in the movie ~ the star power of leads, the star power of the director, and the chemistry between the cast.
  2. “What” features — what the movie is about ~ audience interest in the genre, metadata, plot synopses.
  3. “When” features — when a movie is slated to be released.
  4. “Hybrid” features — a match between “what” and “who” and a match between “what” and “when.”

System Framework

The MIAS framework (“Movie Investor Assurance System”) proposed by Lash and Zhao is as shown in figure 1.

The 4 phases of the framework are:

  1. Data Acquisition: the model is based on historical data extracted from IMDb and BoxOfficeMojo. The former was used for general movie synopses, accessed through the IMDb API, while BoxOfficeMojo, which has more detailed information on budgets, revenues, etc., was collected with a simple web scraper since it does not offer an API.
  2. Data Cleaning: the data was then cleaned, transformed, and consolidated into a structured database. Extraneous characters that could hinder the matching of titles were removed, stop words such as ‘the’ were dropped, and a Porter stemmer was applied to reduce words to their root forms.
  3. Feature Engineering: several features were constructed, as described below, including “hybrids” of the basic “who,” “what,” and “when” features.
  4. Training: with a well-rounded set of features in place, a predictive model was trained. Cross-validation was employed to choose the optimal parameters as well as the best-performing algorithm.
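The training phase can be sketched as follows. This is not the authors’ code; it is a minimal illustration of choosing among candidate algorithms by 10-fold cross-validation, using scikit-learn on synthetic toy data in place of the engineered movie features:

```python
# Sketch of the training phase: pick the best algorithm by cross-validation.
# X and y stand in for the engineered features and profitability labels
# described in the article (synthetic toy data here, not the real dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# 10-fold cross-validation on AUC, one of the metrics used in the study
scores = {name: cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In practice the study compared a larger pool of algorithms (listed later in this article), but the selection logic is the same.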

Feature Engineering

I. “Who” Features

  1. Star Power — the star power of actors or directors is inevitably an important factor when producing a movie. Stars command the public’s imagination and carry bankability. Since the goal is to predict profitability, the revenues and profits of the stars’ past movies were taken into account. The five features used here are:
  • Actor Tenure — their time in the industry
  • Actor Gross — revenue generated by their movies in their tenure
  • Director Gross — gross for their past movies
  • Actor’s Profit — profits of past movies
  • Director’s Profit — profits of past directed movies

For each of these features, the total and average were taken to supplement a wider analysis.
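As a toy sketch (hypothetical numbers, not the study’s data), the total-and-average aggregation of star-power features might look like this with pandas:

```python
# Hypothetical sketch: deriving total and average star-power features
# from an actor's past filmography (toy figures in $M).
import pandas as pd

films = pd.DataFrame({
    "actor": ["A", "A", "A", "B", "B"],
    "gross": [120.0, 80.0, 40.0, 10.0, 30.0],
    "budget": [50.0, 60.0, 30.0, 5.0, 20.0],
})
films["profit"] = films["gross"] - films["budget"]

# Total and average gross/profit per actor, as described above
star_power = films.groupby("actor")[["gross", "profit"]].agg(["sum", "mean"])
print(star_power)
```

The same pattern would apply to director gross and profit, and tenure can be computed as the span between an actor’s first and most recent release.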

2. Network-based features — to capture team characteristics, Lash and Zhao constructed a dynamic social collaborative network among actors based on their past collaborations. Put simply, the network consisted of nodes (actors) joined by edges (connections) representing past collaborations; the more often a pair of actors had worked together, the greater the weight of their edge. Thus, for any given year, the aggregated network represented past and present collaborations.

The network consists of the following static features:

  • Network Homogeneity — for each movie, team diversity was measured by examining the structural similarity between actors, putting every pair of actors through a cosine similarity function. Higher similarity suggests that team members have been working with similar peers (and often with each other).
  • Average Degree — represents unique collaborations per actor, intended to measure the new expertise brought onto a movie set.
  • Total and Average Betweenness Centrality — identifies people with high social capital who can bridge otherwise unconnected groups.
  • Actor-Director Collaboration — considers both the frequency of actor-director collaborations and their profitability.
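As an illustrative sketch (toy graph, hypothetical names), two of the static network features above can be computed with the networkx library:

```python
# Toy weighted collaboration network: edge weight = number of past
# collaborations between a pair of actors.
import networkx as nx

G = nx.Graph()
G.add_edge("Actor1", "Actor2", weight=3)
G.add_edge("Actor1", "Actor3", weight=1)
G.add_edge("Actor3", "Actor4", weight=2)

# Average degree: unique collaborators per actor
avg_degree = sum(dict(G.degree()).values()) / G.number_of_nodes()

# Betweenness centrality: actors who bridge otherwise unconnected groups
betweenness = nx.betweenness_centrality(G)
print(avg_degree, betweenness["Actor3"])
```

Here Actor1 and Actor3 sit on the shortest paths between the others, so they score the highest betweenness.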

In addition to these static features, the network also includes two dynamic features that capture how the aggregated network changes when a new movie is added: new edges form between the nodes (actors’ relationships), and this affects the network broadly.

II. “What” Features

To reflect what a movie is about, the “what” features included the expected genre (comedy, action, romance, thriller, etc.) as well as the expected MPAA rating (PG-13, R). Another important feature here is the movie’s plot synopsis.

For synopses, simple unigrams and bigrams can be utilized but they will have high dimensionality and hence result in the problem of data sparsity. Instead, the model uses Latent Dirichlet Allocation (LDA) which takes a textual corpus of the plot synopses as its input and outputs a group of topics. Each plot synopsis is assigned a probabilistic distribution over all the topics. Such a topic distribution of a movie reflects its plot and can be used as a feature for predictive modeling.
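As a toy illustration (not the authors’ code), the LDA step might look like this with scikit-learn, here with 2 topics on a four-synopsis corpus rather than the study’s 30 topics:

```python
# Minimal sketch of the LDA step on plot synopses (toy corpus).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

synopses = [
    "a detective hunts a serial killer in the city",
    "two friends fall in love during a summer road trip",
    "a detective investigates a murder in a small town",
    "a couple falls in love at a wedding",
]

# Bag-of-words counts feed the LDA model
counts = CountVectorizer(stop_words="english").fit_transform(synopses)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(counts)  # one topic distribution per synopsis

# Each row is a probability distribution over topics, usable as features
print(topic_dist.shape)  # (4, 2)
```

Each movie’s row of topic probabilities replaces thousands of sparse n-gram columns, which is exactly the dimensionality benefit described above.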

III. “When” Features

The time at which a movie is released, including the time of year as well as the overall profitability of other movies in that year or decade, impacts its box office collection, and hence these two factors are included:

  1. Average Annual Profit — in the year prior to the year of release of a given movie, to address market conditions
  2. Release date — to incorporate festival or holiday releases, etc.

IV. “Hybrid” Features

It may be important to form a team of actors based on their previous experience with the genre of the movie being planned, instead of just their star powers. For this reason, it is important to look at combinations of features.

  1. “What” + “Who”

For an actor, we can build a vector of the proportion of movies they have done in each genre (26 genres in this model), so that an actor’s strong points appear as larger proportions in the vector. For example, Adam Sandler may have a high value for comedy, while Arnold Schwarzenegger’s highest proportion may be action, since most of his movies fall in that genre.

Similarly, each movie can have a genre vector, since a movie does not have to belong to one genre only, and actors whose strong genres match the movie’s main genres may be better suited to it. Using these vectors, Lash and Zhao created three features addressing an actor’s genre expertise for a movie, as well as a cast novelty feature, which captures stars appearing in movies that are out of type for them, such as Adam Sandler taking a serious, dramatic role. This matters because the novelty itself can be a selling point.
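The genre-matching idea can be sketched as a cosine similarity between an actor’s genre-proportion vector and a movie’s genre vector. This is a hypothetical toy with 3 genres instead of the model’s 26:

```python
# Hypothetical sketch: matching an actor's genre-proportion vector
# against a movie's genre vector with cosine similarity.
import numpy as np

genres = ["comedy", "action", "drama"]

actor_vec = np.array([0.7, 0.1, 0.2])   # a comedy-heavy filmography
movie_vec = np.array([1.0, 0.0, 0.0])   # a pure comedy

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

match = cosine(actor_vec, movie_vec)
print(round(match, 3))  # high value: the actor's expertise fits the genre
```

A low score for an otherwise bankable star would instead feed the cast-novelty feature described above.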

2. “What” + “When”

Consumer preferences change over time. While romantic comedies were a staple in the 90s and 2000s, they are much rarer now, with superhero and horror movies having taken their place at the box office. Competition may also hurt a movie’s collection if it is released at a time when it faces several competitors. Thus, we consider “when” a movie is released, how movies in its genre performed in the previous year, and the level of competition around its release. The following features were used:

  1. Annual Profitability Percentage by Genre — it is the percentage of profitable movies with the same genre as a given movie in the year previous to its year of release. This feature reflects the degree of success for the movies of that genre.
  2. Annual Weighted Profitability by Genre — it is derived from a weighted sum of the cosine similarities of a movie with similar movies from the previous year.
  3. Competition — it reflects what other movies will be released during a similar time period. It is calculated by considering the average star power of all other movies released within 1 month of the movie’s release.
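The first of these features can be sketched in a few lines of pandas (toy prior-year data, not the study’s):

```python
# Toy sketch of the "annual profitability percentage by genre" feature:
# the share of same-genre movies that were profitable in the prior year.
import pandas as pd

prior_year = pd.DataFrame({
    "genre": ["comedy", "comedy", "comedy", "action", "action"],
    "profitable": [True, True, False, False, True],
})

pct_by_genre = prior_year.groupby("genre")["profitable"].mean()
print(pct_by_genre["comedy"])  # 2 of 3 comedies were profitable
```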

Dataset

The original dataset, collected from BoxOfficeMojo and IMDb, was narrowed down to the 11-year period from 2000 to 2010 (inclusive), since this period is recent enough to reflect the current state of the industry and the relevant revenue data is readily available.

Movies for which box office data was not available were removed, as were those with an unknown rating or genre. Documentaries were removed since they are rarely released theatrically. Franchise movies and sequels were also removed, since their success depends heavily on that of their predecessors and may be influenced by completely different factors.

The final dataset was of 2,506 movies whose distribution (by genre) is shown below.

Except for foreign films, whose revenue data can be difficult to standardize, the dataset is quite representative of the movies released in that period. Based on the plot synopses, the LDA algorithm generated 30 topics. The top keywords for each of these topics are listed below.

While the experiment predicts the success of 2,506 movies, the collaborative network built for the study incorporates collaborations between actors across all 14,097 movies in the original dataset.

Success of a Movie

As the metric for success, return on investment (ROI) was used to give an accurate picture of a movie’s profitability. It considers both profit and budget, and of course, the higher the ROI, the more profitable the movie. The formula is:

ROI = (Box Office Gross − Budget) / Budget
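As a quick sketch, the ROI metric can be expressed as a function, here applied to the figures from the earlier example ($150M gross on a $180M budget vs. $20M gross on a $3M budget):

```python
def roi(gross: float, budget: float) -> float:
    """Return on investment as a fraction of the budget."""
    return (gross - budget) / budget

print(roi(150, 180))  # negative: the movie lost money
print(roi(20, 3))     # strongly positive despite the smaller gross
```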

Classification

Predicting a movie’s success or failure can be framed as a classification task, with any given film classified as “profitable” or “unprofitable.” However, there is no industry gold standard for what ROI is considered ideal, other than the higher, the better.

For both binary and multi-class classification, a host of algorithms were tried, including:

  • Logistic Regression
  • Naive Bayes
  • Multi-Layer Perceptron
  • Decision Trees
  • Random Forest
  • LogitBoost Classifier

The best algorithm was chosen on the basis of overall performance on the following metrics. All results were evaluated with 10-fold cross-validation, where a higher value indicates better performance.

  1. Area under the Receiver Operating Characteristic Curve (AUC) — the ROC curve plots the true positive rate against the false positive rate; an AUC of 1 indicates perfect classification and 0.5 a random guess.
  2. Classification Accuracy — the percentage of correctly predicted instances.
  3. Precision — the number of movies classified as successful that are actually successful, divided by all movies classified as successful.
  4. Recall — the number of movies classified as successful that are actually successful, divided by all movies that actually are successful.
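These metrics are all available in scikit-learn. A toy sketch (hypothetical labels and scores, where 1 = profitable):

```python
# Computing the four evaluation metrics above on toy predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]  # predicted probabilities

print(roc_auc_score(y_true, y_score))   # uses scores, not hard labels
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
```

Note that AUC is computed from the predicted probabilities, while accuracy, precision, and recall are computed from the hard class labels.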

In addition to these, the performance of the model was compared to that of two benchmark models from previous studies, chosen because they followed a similar early-prediction approach.

In binary classification, a movie is assigned to one of two classes: successful or unsuccessful. The researchers evaluated two decision boundaries, each ensuring that a movie classified as successful reaches a certain ROI.

  1. The first decision boundary classified a movie as successful if it was within the top 30% of all movies by ROI, which translates to an ROI ≥ 24%. The Random Forest classifier (n=200) and the LogitBoost classifier performed best, with the Random Forest leading in AUC, accuracy, and recall, while LogitBoost had higher precision. Below is a table comparing their performance, along with their performance without the hybrid features.

2. The second boundary required an ROI ≥ 67%. Compared to the top-30% boundary, this further raises the bar for a movie to be considered successful. Defining profitability this way actually made the predictive task easier, as evidenced by improved model performance: the top-performing algorithms reached AUC and accuracy over 0.9.
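The labeling step behind the first boundary can be sketched with numpy’s percentile function (toy ROI values, not the study’s):

```python
# Sketch of the binary labeling step: a movie counts as "successful" if
# its ROI falls in the top 30% of all movies.
import numpy as np

rois = np.array([-0.5, -0.2, 0.1, 0.24, 0.3, 0.8, 1.5, 2.0, 5.0, 0.05])

threshold = np.percentile(rois, 70)       # boundary for the top 30%
labels = (rois >= threshold).astype(int)  # 1 = successful
print(threshold, labels.sum())
```

On the real 2,506-movie dataset, the same percentile computation yields the ROI ≥ 24% cutoff mentioned above.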

As is evident, the new hybrid features contribute immensely, proving their effectiveness. In both cases, the model also keeps an edge over the benchmark models, consistently performing roughly 25% better.

Conclusion

Researchers Michael Lash and Kang Zhao proposed the MIAS (Movie Investor Assurance System) in their study to aid investors’ decisions on movie production. It utilized historical data on movie profitability, the social network between collaborators, market trends, and audience tastes, as well as a host of newly engineered hybrid features, to reach an impressive level of accuracy in its predictions. Beyond its obvious applications for movie studios, it could also have theoretical implications, such as revealing how actors and directors collaborate and how this influences movie success. Going forward, movie screenplays could be an interesting additional input for evaluating movies, analyzed with a similar LDA approach.

All in all, this model is quite successful in predicting the profitability of movies at an early stage and can be a great starting point in predictive models to be used in the industry by movie producing houses as well as investors.
