Data Analytics and Hollywood (Part 1)

Christopher Du
NYU Data Science Review
8 min read · Feb 23, 2024

How film companies predict box office hits with predictive analytics

Photo by Ahmet Yalçınkaya on Unsplash

“Nobody knows anything…Not one person in the motion picture field knows if a movie is going to work…” — William Goldman, 1983 [1].

But with recent advances in predictive analytics and machine learning, does screenwriter William Goldman’s claim from forty years ago still apply to the entertainment industry? Data science and filmmaking sound like an unlikely combination. The mere mention of “analytics” or “algorithm” conjures up images of green digital patterns, far from anything we associate with artistic endeavors. But walk into the offices of a major film studio like Marvel or A24 today, and you’ll likely find more analytical reports and code snippets than screenplays.

But how could this be?

Photo by GR Stocks on Unsplash

To preface, filmmaking has always been an inherently risky process. As Alejandro González Iñárritu, director of The Revenant and Birdman, famously said: “To make a good film is war. To make a very good film is a miracle” [2]. Every step of production, from the moment the screenplay first lands on the director’s desk, to the tented-up film shoots, to the final marketing and distribution, is riddled with risks and financial pitfalls for the production company.

Photo by Chris Murray on Unsplash

When an actor forgets his lines and a shoot runs longer than expected, extra location and HR costs accumulate, sometimes into the millions. When a company releases a film in the same month as a major social media or political event that draws audiences’ attention away, even a barely noticeable controversy can result in heavy losses in sales. To make the right management and financing decisions, film companies need the foresight not only to recognize every small problem in the moviemaking process but also to predict whether a potential film will resonate with audiences at all.

So how can they possibly have this clairvoyant ability?

This is where data analytics comes in.

Nearly every major studio in Hollywood is backed by large data teams that use predictive modeling and data analysis to inform studio decisions. Whether it’s finding the right audience to market a film to or choosing the right star to cast, these data models are rooted in every step of the moviemaking process. This article is the first part in a series that explores the major ways in which data science is used in the film industry. Here we take a deep dive into the specific analytical frameworks used to predict box office results, while the next part will explore how these tools are used in marketing and production management.

Photo by Tech Daily on Unsplash

Box Office Predictions

By far the most common use case for data analytics among film production companies and studios is predicting film revenue. Using data from sources like social media commentary and historical box office records, film studios have built predictive models that relate an array of variables, such as release date or budget, to a movie’s potential success.

Photo by Jake Hills on Unsplash

Regression Models

Regression models are predictive analytics tools that relate an outcome to a set of explanatory variables. They can also estimate how much the outcome is affected by each individual variable [3]. Given the many elements that might affect a movie’s success, these analytical frameworks are especially useful for box office predictions.
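To make the idea concrete, here is a minimal sketch of a multiple linear regression written in plain Python. Everything in it is an assumption for illustration: the six films, their budgets, the holiday-release flag, and the function names are invented, and the grosses are generated from an exact linear rule (10 + 2 * budget + 35 * holiday) so that the recovered coefficients are easy to check. A real studio model would use far more data and most likely a library such as scikit-learn or statsmodels.

```python
# Toy multiple linear regression in plain Python (no third-party libraries).
# All film figures below are invented for illustration only.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x

def fit_linear_regression(features, targets):
    """Ordinary least squares with an intercept: solve (X^T X) beta = X^T y."""
    rows = [[1.0] + list(f) for f in features]  # leading 1.0 is the intercept term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(coefficients, feature_row):
    """Apply the fitted intercept and slopes to one film's features."""
    slopes = coefficients[1:]
    return coefficients[0] + sum(c * v for c, v in zip(slopes, feature_row))

# Hypothetical films: (budget in millions, 1 if holiday release else 0)
films = [(50, 0), (80, 1), (120, 0), (150, 1), (200, 1), (30, 0)]
# Hypothetical grosses in millions, generated as 10 + 2 * budget + 35 * holiday
gross = [110, 205, 250, 345, 445, 70]

coefficients = fit_linear_regression(films, gross)
intercept, per_budget_million, holiday_bonus = coefficients

print(f"Each extra million of budget adds about {per_budget_million:.2f} million in gross")
print(f"A holiday release adds about {holiday_bonus:.2f} million in gross")
print(f"Forecast for a 100M holiday release: {predict(coefficients, (100, 1)):.1f} million")
```

Each fitted coefficient reads as the estimated change in gross for a one-unit change in that variable with the others held fixed, which is what lets analysts say how much each factor matters.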

Recently, Mr. Xu Liang, the former CFO of Bona, the largest non-state-sponsored film studio in China, presented his research on predicting film revenue and ratings. He built his forecasts on a multivariate regression model, a type of regression that can relate multiple variables to multiple outcomes. Categorizing data on a film’s IP (intellectual property), casting, and genre, he predicted the revenues of 1,000 past Chinese productions with an accuracy of 74.6% [4].

Predicted gross income of movie The Legend of LuoXiaoHei in first five weeks after opening: https://www.mdpi.com/2078-2489/13/6/299

In a 2022 study on predicting film ROI, another group of Chinese data scientists employed multiple linear regression, a similar type of predictive modeling, to relate a film’s director, release schedule, and budget to its future box office. Drawing on data from the previous twenty years, the model reached an accuracy of 84.64% [5]. Its accuracy was especially evident when it predicted a box office of 325.4 million for the 2016 movie The Great Wall, close to the actual figure of 334.9 million [5]. Although these published findings are experimental, from speaking with the former CFO of Bona, I’ve learned that these exact methods are already employed at many film companies in the US. He also disclosed that within large Hollywood film companies, such analytics might have more of a say in which movies are greenlit than even the creative judgment of the film producers [6].
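As a quick check on how close The Great Wall forecast was, the gap can be worked out directly from the two figures quoted above. This relative error is only one way of measuring closeness, and it is an assumption on my part rather than the study’s own metric; the 84.64% accuracy figure may well be computed differently.

```python
predicted = 325.4  # million, the model's forecast quoted above
actual = 334.9     # million, the actual box office quoted above

absolute_error = actual - predicted
relative_error = absolute_error / actual  # gap as a share of the actual gross

print(f"Off by {absolute_error:.1f} million ({relative_error:.1%} of the actual gross)")
```

In other words, the forecast undershot the real result by 9.5 million, or a little under 3%.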

Decision Tree

Decision trees are predictive modeling methods that divide data into a tree-like pattern of smaller and smaller subsets. When predicting the outcome of a given scenario, the tree follows the branches that match the event until it reaches the subset the event falls into, arriving at a narrower and more precise conclusion at each step [7]. These models may not always provide a clear-cut answer; rather, they present options so that data analysts can make an informed decision.

Example of a decision tree model taken from official research report: https://www.iaeng.org/publication/IMECS2

Researchers in a 2015 study on film ROI constructed a decision tree model to predict the box office of US films. They employed the CART algorithm, a specific method for building decision trees, and sourced data from 104 films. The resulting model related a film’s genre, opening month, budget, and duration to its box office results and generated predictions with an accuracy of 72.6% [8]. Interestingly, the researchers also ranked the variables by how much they affected a film’s financial return, and found that a movie’s opening month had the highest correlation with gross income.

Image taken from official research report: https://www.iaeng.org/publication/IMECS2015/IMECS2015_pp274-279.pdf
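The splitting idea behind a CART-style tree can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions, not the researchers’ model: the eight films, their opening months and budgets, the “hit” and “flop” labels, and the function names are all invented, and the sketch uses Gini impurity with a maximum depth of two. In practice a library such as scikit-learn would normally be used.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 when all labels agree, larger when they are mixed."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    """Find the (feature, threshold) whose binary split has the lowest weighted Gini."""
    best = None  # (weighted_gini, feature_index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue  # a split must put at least one film on each side
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=2):
    """Recursively split until a node is pure or max_depth is reached."""
    can_split = depth < max_depth and gini(labels) > 0
    split = best_split(rows, labels) if can_split else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    _, feature, threshold = split
    left = [(r, l) for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, l) for r, l in zip(rows, labels) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    }

def classify(tree, row):
    """Walk from the root to a leaf, following the branch that matches the film."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical films: (opening month, budget in millions)
films = [(1, 100), (2, 150), (3, 60), (6, 50), (7, 120), (12, 80), (11, 10), (9, 20)]
outcomes = ["flop", "flop", "flop", "hit", "hit", "hit", "flop", "flop"]

tree = build_tree(films, outcomes)
print(tree)
print(classify(tree, (7, 100)))  # a July release with a 100M budget
```

On this toy data the first split the tree chooses is on opening month and the second is on budget, which mirrors how a tree surfaces the variables that separate outcomes most cleanly.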

Another study, from Mumbai University in 2014, also employed a decision tree model to predict the gross income of Bollywood movies. But in addition to variables like budget, cast, and adult/PG rating that are commonly used in film predictive modeling, the researchers also took into account the number of musical numbers in the film. Even when the analytical framework is the same, it seems that the structure of a film prediction model can differ greatly across industries [9].

Monte Carlo Simulation

But even with these accurate estimates, there are insights in film analytics that regression and decision tree methods fail to capture. Film companies and financiers rarely develop projects one at a time; instead, they balance multiple productions. To maximize the return from investing in multiple movies, studios employ a different model. Monte Carlo simulation is a technique that uses random sampling to calculate the probability of various outcomes by running repeated iterations [10]. As such, the simulation can account for the risk inherent in uncertain variables, such as how much more an increased budget would help a comedy film than a horror movie. This makes the technique especially well suited to assessing the risk of balancing or investing in multiple movies at a time. Irving Ebert, an investor working for the Ottawa Angels Alliance, used Monte Carlo simulation to approximate an overall return at the 75th percentile from investments in 50 films [11]. Because the risk of investing in a single film is high even with quantitative backing, this method offers insights into portfolio management that single-film models cannot.

Image taken from official research report: https://nofilmschool.com/2016/04/
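The portfolio intuition can be illustrated with a small Monte Carlo sketch in plain Python. It is a toy under stated assumptions, not Ebert’s or Relativity Media’s actual model: each film’s return multiple (revenue divided by cost) is drawn from a lognormal distribution whose parameters I made up so that most films lose money while a few large hits pull the average above break-even. The function names and the seed are likewise illustrative.

```python
import random
import statistics

def simulate_portfolio_returns(num_films, num_trials, rng, mu=-0.2, sigma=1.0):
    """Return one average return multiple per simulated portfolio.

    Each film's return multiple (revenue / cost) is a random lognormal draw.
    mu and sigma are made-up values, not estimates from real box office data.
    """
    portfolio_returns = []
    for _ in range(num_trials):
        multiples = [rng.lognormvariate(mu, sigma) for _ in range(num_films)]
        # Equal investment in every film, so the portfolio return is the mean.
        portfolio_returns.append(statistics.fmean(multiples))
    return portfolio_returns

def probability_of_loss(returns):
    """Share of simulated outcomes that return less than the money invested."""
    return sum(r < 1.0 for r in returns) / len(returns)

rng = random.Random(2024)  # fixed seed so repeated runs give the same numbers
single_film = simulate_portfolio_returns(num_films=1, num_trials=5000, rng=rng)
fifty_films = simulate_portfolio_returns(num_films=50, num_trials=5000, rng=rng)

print(f"Chance of losing money on one film:  {probability_of_loss(single_film):.0%}")
print(f"Chance of losing money on 50 films: {probability_of_loss(fifty_films):.0%}")
```

Under these assumptions a single film loses money more often than not, while the 50-film portfolio loses money far less often because the occasional hit offsets many small losses. The result is only as good as the assumed distribution, which is exactly where a real model can go wrong.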

However, the effectiveness of Monte Carlo simulation in predicting movie performance is also debatable. Relativity Media, a film investment startup founded in 2004, claimed that its use of the Monte Carlo model was the future of Hollywood. But after drawing in massive investments from multiple hedge funds, Relativity Media filed for bankruptcy in 2015, eleven years later [11]. It seems that simply employing the right data modeling tools isn’t enough. To predict a film’s success effectively, production companies and film investors also need the right analytical minds to structure these predictive frameworks for their specific needs. Regardless, these examples demonstrate the already prevalent use and the potential role of Monte Carlo simulation in the film industry.

Other Methods

Aside from these more popular models and techniques, production companies and film investors have also developed a multitude of predictive algorithms, each relating unique variables to a film’s potential success. Cinelytic, a film investing startup, uses a sophisticated combination of natural language processing techniques to analyze movie trailers. The platform identifies and extracts key elements, such as visual cues, audio patterns, and narrative elements, that correspond to box office outcomes [12]. Another film analytics platform, Slated, similarly combines natural language processing with neural networks to forecast revenue [13]. Whatever their methods, these examples illustrate how firmly predictive modeling and data analytics are already established in the entertainment industry.

Image taken from the official Cinelytic site: https://www.cinelytic.com/platform/

But with these advancements, how has the filmmaking process changed? And what will this mean for the entertainment landscape? Stay tuned for part 2, where we will explore the other ways film analytics have been employed beyond movie sales predictions, and how these technologies will change the film industry and the movies we will see in the future.

References:

[1] P. Debruge, With One Line, William Goldman Taught Hollywood Everything It Needed to Know (2019), Variety

[2] G. Ozroll, Alejandro Gonzalez Inarritu (2019), IMDB

[3] G. Castillo, Machine Learning Regression Explained (2019), Seldon

[4] Y. Ni, Movie Box Office Prediction Based on Multivariate Linear Regression (2022), MDPI

[5] D. Siyan, W. Ting, L. Xin, Film Box Office Prediction based on Multiple Linear Regression (2019), ProQuest

[6] Y. Ni, interview by C. Du (November 1, 2023)

[7] B. Gopulan, Is Decision Tree a classification or regression model? (2019), Numpy

[8] M. Burgos, M. Campanario, J. Lara, D. Lizcano, Using Decision Trees to Characterize and Predict Movie Profitability on the US Market (2015), ICDM

[9] A. Mandhare, Application of Decision Tree to Predict Gross Income of a Movie (2014), Mumbai University

[10] A. Agarwal, Monte Carlo Simulation (2022), Investopedia

[11] A. Zheng, Why the Algorithm That Promised to Save Hollywood Destroyed Relativity Media (2017), No Film School

[12] J. Vincent, Warner Bros. signs AI startup that claims to predict film success (2022), The Verge

[13] G. Doucette, How Data Analytics is changing film packaging, financing, and distribution (2022), Medium

Christopher Du
NYU Data Science Review

Film & TV Major at NYU Tisch | NYU Data Science Club | Passionate about Data Analytics, Entrepreneurship, Finance