Linear Regression Part I — Box Office Revenues/Critic Score Prediction of movies made by Marvel Studios/Other Studios

Anaswar Jayakumar
32 min read · Feb 25, 2024


Overview

This project is the first of a two-part series on the analysis of movie data, specifically box office revenues and critic scores. Part I analyzes movies made by Marvel Studios as well as other movie studios, while Part II analyzes movies made by Marvel Studios that are part of the Marvel Cinematic Universe (MCU).

Marvel Studios, a subsidiary of Walt Disney Studios, creates films, TV series, and digital content based on Marvel Comics characters. Since 2008, it has produced 33 films and 10 TV series, notably contributing to the success of the Marvel Cinematic Universe (MCU). Established in 1996, Marvel Studios gained momentum with films like “Blade” and “X-Men” before Disney’s acquisition in 2009. Kevin Feige leads the studio, which moved to the Disney lot in 2017 and has expanded into Disney+ series production. Feige’s role evolved to Chief Creative Officer of Marvel in 2019. Marvel Studios also secured a licensing deal with Stan Lee Universe and regained distribution rights to “The Incredible Hulk.” Marvel Studios Animation was launched for animated content, including “What If…?” and “X-Men ‘97”. (https://en.wikipedia.org/wiki/Marvel_Studios)

In this particular dataset, data pertaining to box office revenue was gathered from a film industry data website called The Numbers which tracks box office revenue in a systematic, algorithmic way. The Numbers, established by Bruce Nash in 1997 and owned by Nash Information Services, is a consulting firm based in Los Angeles and provides free access to movie business information. It serves industry professionals, investors, and movie enthusiasts, initially tracking 300 movies and now expanded to over 20,000. Nash Information Services offers various data and research services, serving clients like major studios and investors. Their services include revenue estimation, future release analysis, and customized reports like the Comp Analysis Report. They provide sophisticated modeling tools and direct access to movie data through OpusData, with the goal of keeping The Numbers as a free resource for all interested in the movie industry.

https://en.wikipedia.org/wiki/The_Numbers_(website)

https://www.the-numbers.com/research-analysis

https://the-numbers.com/

In this project, Python was the language of choice, although R could certainly have been used as well. I personally find Python better suited than R here because the regression analysis portion of this project involves machine learning techniques for which Python is better suited. The data was obtained from Kaggle, a website that hosts data science competitions and datasets. The following is the link to the CSV file that was used for this project: https://www.kaggle.com/datasets/monkeybusiness7/marvel-cinematic-universe-box-office/data?select=mcu_box_office.csv

Objective

The objective of this project is to predict the following dependent variables of interest: worldwide box office revenues adjusted to today’s inflation rates (InflationAdjustedWorldwide), opening weekend revenues adjusted to today’s inflation rates (InflationAdjustedOpeningWeekend), and meta score of a particular movie (MetaScore).

The first two dependent variables are of interest in this project because they provide insight into movie box office revenues when adjusted for inflation. Adjusting for inflation when analyzing movie box office performance is important because it provides a more accurate representation of a film’s financial success over time. Without adjusting for inflation, the nominal box office revenue figures may not reflect the true value of the revenue earned by a movie, as the purchasing power of currency changes over time due to inflation. By adjusting for inflation, we can compare the financial performance of movies across different time periods on a more equal footing and make more informed decisions about their success relative to each other.
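The adjustment described above can be sketched as scaling a nominal amount by the ratio of two price-index levels. This is only an illustration: the function name adjust_for_inflation and the index levels are hypothetical, not real CPI data, and the dataset already ships with its own inflation-adjusted columns, whose exact method is not stated in this article.

```python
def adjust_for_inflation(nominal, index_then, index_now):
    """Scale a nominal amount by the ratio of two price-index levels."""
    return nominal * (index_now / index_then)


# Hypothetical index levels chosen purely for illustration (not real CPI data).
adjusted = adjust_for_inflation(nominal=100_000_000, index_then=100.0, index_now=125.0)
print(f"{adjusted:,.0f}")  # 125,000,000
```

With prices 25% higher today than at release, a $100 million gross is worth $125 million in today's dollars, which is what makes movies from different years comparable.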

On the other hand, the last dependent variable is of interest because in essence, the metascore is a weighted average of many reviews coming from reputed critics. The Metacritic team reads the reviews and assigns each a 0–100 score, which is then given a weight, mainly based on the review’s quality and source. That being said, there are a few notable downsides with using the metascore when gauging and evaluating movie ratings, some of which include the following:

  • The weighting coefficients are confidential, so you won’t get to see the extent to which each review counted in the metascore.
  • You’ll have a rough time finding metascores for less-known movies that appeared before 1999, the year Metacritic was created.
  • Some recent movies whose main language is not English aren’t even listed on Metacritic. For example, the Romanian movies Two Lottery Tickets (2016) and Eastern Business (2016) are not listed on Metacritic, while they are on IMDB, with ratings.

https://www.freecodecamp.org/news/whose-reviews-should-you-trust-imdb-rotten-tomatoes-metacritic-or-fandango-7d1010c6cf19/
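To make the idea of a weighted average concrete, here is a minimal sketch. The scores and weights below are invented for illustration only, since the real Metacritic weights are confidential, and weighted_score is a hypothetical helper name.

```python
def weighted_score(scores, weights):
    """Weighted average: each score counts in proportion to its weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical review scores (0-100) and hypothetical weights.
scores = [90, 70, 50]
weights = [2.0, 1.0, 1.0]

meta = weighted_score(scores, weights)
simple_mean = sum(scores) / len(scores)
print(meta, simple_mean)  # 75.0 70.0
```

Because the first review carries twice the weight of the others, the weighted score (75) sits above the plain average (70).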

In order to predict the dependent variables of interest, linear regression models will be implemented. This project will analyze the following aspects of the dataset on box office revenues and critic scores of movies produced by Marvel Studios and other movie studios:

  • Movie Release, Ownership, Budget, Run Time
  • Domestic Box Office Revenue, Opening Weekend
  • International, Worldwide Box Office Revenue
  • Audience Reviews

Review of Data Sources

Two datasets were used: marvel_box_office.csv, which covers movies produced by Marvel Studios as well as other movie studios, and mcu_box_office.csv, which covers movies produced by Marvel Studios under the Marvel Cinematic Universe (MCU) franchise. The pandas library was used to load the datasets into the respective dataframes: all_movies_data and mcu_movie_data. The all_movies_data dataframe has shape (66, 23), that is, 66 rows by 23 columns, while the mcu_movie_data dataframe has shape (33, 15). In both dataframes, the majority of the columns are numerical (non-object) columns, while a few are object columns. Lastly, the all_movies_data dataframe contains null values, specifically 33 null values in the column Phase, while the mcu_movie_data dataframe contains none. Therefore, imputation is required for the all_movies_data dataframe but not for the mcu_movie_data dataframe.
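The loading and inspection steps can be sketched as follows. To keep the example self-contained, it reads a hypothetical three-row sample from an in-memory string instead of the real marvel_box_office.csv; the column values are made up.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for marvel_box_office.csv.
csv_text = """Title,Phase,Budget
Movie A,Phase One,140000000
Movie B,,80000000
Movie C,Phase Two,200000000
"""

# With the real file this would be pd.read_csv("marvel_box_office.csv").
all_movies_data = pd.read_csv(io.StringIO(csv_text))

print(all_movies_data.shape)   # (rows, columns)
print(all_movies_data.dtypes)  # text vs numerical columns

# Count missing values per column to see where imputation is needed.
null_counts = all_movies_data.isnull().sum()
print(null_counts)
```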

Data Preparation

Prior to preparing the data for analysis, copies of the dataframes were created in order to ensure that the original dataframe stays intact. The first step in the data preparation process was to rename the columns and remove extraneous characters such as spaces, hyphens, etc.
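One way to do the copy-and-rename step is sketched below. The raw column names are hypothetical examples of headers with spaces, hyphens, and parentheses; the actual headers in the CSV may differ.

```python
import re

import pandas as pd

# Hypothetical raw headers containing spaces, hyphens, and parentheses.
raw = pd.DataFrame(
    [["May", 126, 7.9]],
    columns=["Release Month", "Run-Time (Minutes)", "IMDb Score"],
)

# Work on a copy so the original dataframe stays intact.
movies = raw.copy()

# Keep only letters and digits in each column name.
movies.columns = [re.sub(r"[^0-9A-Za-z]", "", name) for name in movies.columns]

print(list(movies.columns))  # ['ReleaseMonth', 'RunTimeMinutes', 'IMDbScore']
```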

After renaming the columns, the next step was to impute the null values. As noted earlier, null values were present in the column Phase of the all_movies_data dataframe. Because Phase is a non-numerical (object) column, a mean or median is not defined for it, so the mode (the most frequent value) was used to fill the null values.
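Mode imputation can be sketched in a couple of lines of pandas. The Phase values below are hypothetical placeholders, not the actual values in the dataset.

```python
import pandas as pd

# Hypothetical Phase column with two missing entries.
movies = pd.DataFrame(
    {"Phase": ["Phase One", None, "Phase Two", "Phase One", None]}
)

# mode() ignores missing values and returns the most frequent value(s);
# [0] takes the first one if there is a tie.
most_common = movies["Phase"].mode()[0]

# Fill every missing entry with that value.
movies["Phase"] = movies["Phase"].fillna(most_common)

print(most_common)
print(movies["Phase"].isnull().sum())
```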

The next step in the data preparation process is to check whether the columns are of the correct data type and to convert any column whose data type was incorrectly identified. Sometimes pandas infers a column's data type incorrectly, for example by reading a numerical column as text, so this step is integral to ensuring data quality and integrity. Fortunately, in the case of the all_movies_data_copy dataframe, the column data types were correctly identified, so no conversion was needed.
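Although no conversion was needed here, the check might look like the sketch below. The dataframe is hypothetical: its Budget column is deliberately stored as text with thousands separators to show what a wrongly inferred type looks like and how it could be fixed.

```python
import pandas as pd

# Hypothetical frame where Budget was read as text because of the commas.
movies = pd.DataFrame(
    {"Budget": ["140,000,000", "80,000,000"], "RunTimeInMinutes": [126, 112]}
)

# Inspect the inferred types; a text dtype on Budget signals a problem.
print(movies.dtypes)

# Strip the separators, then convert the text to numbers.
movies["Budget"] = pd.to_numeric(movies["Budget"].str.replace(",", "", regex=False))

print(movies.dtypes)
```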

The last step in the data preparation process is to convert categorical columns into numerical columns via a mapping. Depending on the number of categories that a certain categorical variable has, categorical variables can certainly be used as predictor variables and should therefore be converted to numerical variables. To do so, a mapping was implemented in which a specific value is assigned to a specific category. For example, the categorical variable Ownership has five categories (Marvel Studios, Sony Pictures, 20th Century Fox, Lionsgate Films, and New Line Cinema), so a value of 1 is assigned to ownership category 1 (Marvel Studios), a value of 2 is assigned to ownership category 2 (Sony Pictures), and so on. A similar mapping was implemented for the categorical variable ReleaseMonth as well.
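A minimal sketch of such a mapping is shown below. The codes for Marvel Studios and Sony Pictures follow the text; the codes 3 to 5 are an assumption that the remaining studios are numbered in the order they are listed.

```python
import pandas as pd

# Codes 1 and 2 come from the text; 3-5 are assumed to follow the listed order.
ownership_codes = {
    "Marvel Studios": 1,
    "Sony Pictures": 2,
    "20th Century Fox": 3,
    "Lionsgate Films": 4,
    "New Line Cinema": 5,
}

# Hypothetical rows for illustration.
movies = pd.DataFrame(
    {"Ownership": ["Marvel Studios", "Sony Pictures", "New Line Cinema"]}
)

# Replace each studio name with its integer code.
movies["Ownership"] = movies["Ownership"].map(ownership_codes)

print(movies["Ownership"].tolist())  # [1, 2, 5]
```

Integer codes like these imply an ordering between studios that a linear model will take literally. One-hot encoding, for example with pd.get_dummies, is a common alternative when that ordering is not meaningful.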

The next step of this project is to perform exploratory data analysis (EDA) and the following variables were used to predict box office revenue and critic scores of movies produced by Marvel Studios as well as other movie studios:

Predictor Variables — Movie Release, Ownership, Budget, Run Time

  • Release Month — month movie was released in
  • Release Day — day movie was released on
  • Release Year — year movie was released in
  • Ownership — movie studio that produced movie
  • Budget — movie budget
  • Inflation Adjusted Budget — movie budget adjusted for current inflation rates
  • Run Time In Minutes — duration of movie in minutes

Domestic Box Office Revenue, Opening Weekend

  • Domestic Box Office — box office revenues in the US
  • Inflation Adjusted Domestic — domestic box office revenues adjusted for current inflation rates
  • Opening Weekend — revenues after the first weekend it was released

International, Worldwide Box Office Revenue

  • International Box Office — box office revenues everywhere but the US
  • Inflation Adjusted International — international box office revenues adjusted for current inflation rates
  • Worldwide Box Office — total box office revenues (domestic + international)

Audience Reviews

  • IMDB Score — score of movie on IMDB
  • Tomatometer — score from Rotten Tomatoes (the percentage of professional critic reviews that are positive)
  • Rotten Tomato Audience Score — how audience from Rotten Tomatoes scored the movie

Exploratory Data Analysis (EDA)

EDA was the next step of this project, the goal being to get a better understanding of the data at large. The EDA comprised three components: descriptive statistics, histograms, and correlation analysis. For the purposes of this article, I will focus on the histograms and the correlation analysis, since both were instrumental in the subsequent regression analysis portion of this project.

Histograms were generated to better understand the underlying distribution of the independent variables while correlation analysis was instrumental in determining the predictor variables that will ultimately be used to predict box office revenue and critic scores of movies produced by Marvel Studios as well as other movie studios. In particular, the EDA focused on the following aspects of the box office revenue and critic scores of movies produced by movie studios dataset:

  • Predictor Variables — Movie Release, Ownership, Budget, Run Time
  • Domestic Box Office Revenue, Opening Weekend
  • International, Worldwide Box Office Revenue
  • Audience Reviews
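The descriptive statistics and histogram steps can be sketched as follows. The revenue figures are hypothetical (in millions of dollars) and were chosen to include one very large value so that the right skew is visible in the numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical opening-weekend revenues in $ millions, with one blockbuster.
revenue = pd.Series([50, 60, 70, 80, 90, 100, 400], name="OpeningWeekend")

# Mean, standard deviation, quartiles, and so on.
summary = revenue.describe()
print(summary)

# Sample skewness: positive means a longer right tail, negative a longer left tail.
skewness = revenue.skew()

# Histogram as raw bin counts; revenue.hist() would draw the same bins
# as a chart if matplotlib is installed.
counts, bin_edges = np.histogram(revenue, bins=4)
print(counts)
```

In a positively skewed distribution the mean is pulled above the median by the long right tail, which is a quick numerical cross-check on what the histogram shows.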

Histograms — Movie Release, Ownership, Budget, Run Time

The distributions of the variables ReleaseMonth and ReleaseDay both appear positively (right) skewed, while the distribution of the variable ReleaseYear is less clear. The means of the variables ReleaseMonth, ReleaseDay, and ReleaseYear are 4.151515, 10.469697, and 2013.469697 respectively, while the standard deviations are 3.084664, 8.304252, and 6.612854 respectively. The distributions of the variables imply the following:

  • The distribution of the variable ReleaseMonth shows a pronounced positive skew. On average, the bulk of movies are typically released between months 2 and 4 (February to April), indicating a concentration of releases during this period. However, there is a notable elongated tail extending towards later months (May through December), suggesting that while the majority of releases occur earlier in the year, there is still a significant number of movies distributed throughout the latter part of the year. This positively skewed distribution implies that while the central tendency of release dates is concentrated in the early months, there are instances of movies being released later in the year, albeit less frequently.
  • In the distribution of the variable ReleaseDay, most movies tend to be released early in the month, typically within the first ten days. However, the distribution is positively skewed, indicating that there’s a significant number of movies released later in the month as well, especially between day 15 and day 30. This skewness suggests that while the majority of releases occur earlier, there’s still a notable presence of movies hitting theaters in the latter part of the month. This phenomenon could be attributed to various factors such as strategic scheduling, marketing considerations, or production timelines.
  • The distribution analysis of the variable ReleaseYear presents some ambiguity, making it challenging to derive a definitive conclusion. However, upon closer examination, certain trends emerge. There appears to be a notable peak around the year 2020, suggesting a substantial influx of movie releases during that period. Additionally, another peak is observed around 2005, indicating a surge in movie production during that particular year. Furthermore, an interesting pattern emerges when considering years preceding 2015. There is a discernible decline in the number of movie releases leading up to 2005. However, contrary to the overall trend, the year 2005 exhibits a significant uptick in movie releases, signifying a notable shift in the industry dynamics during that time. This pattern extends further back, with years like 2000 showing a similar trend of increased movie production.

It’s important to note that fluctuations in movie release patterns can be influenced by various factors such as industry trends, economic conditions, technological advancements, and cultural phenomena. For instance, the surge in movie releases around 2020 might coincide with the proliferation of streaming platforms and the growing demand for diverse content. Similarly, the spike in 2005 could be attributed to the release of blockbuster films or shifts in audience preferences. Moreover, deeper analysis could involve examining genre-specific trends, regional variations, or the impact of major events on the film industry. Additionally, considering the distribution of movie budgets or box office performance alongside release years could provide further insights into the dynamics of the film industry over time.

The distributions of the variables Ownership, Budget, InflationAdjustedBudget, and RunTimeInMinutes all appear positively (right) skewed. The means of the variables Ownership, Budget, InflationAdjustedBudget, and RunTimeInMinutes are 1.969697, $161,784,800, $203,965,400, and 123.606061 respectively, while the standard deviations are 1.149845, $73,959,790, $86,007,540, and 17.606711 respectively. The distributions of the variables imply the following:

  • The distribution of movie ownership reveals an interesting pattern. On average, the majority of movies are produced by prominent studios like Marvel Studios, Sony Pictures, and 20th Century Fox. However, this distribution is positively skewed, indicating a long tail of movies attributed to various other studios such as Lionsgate Films and New Line Cinema. This skewness implies that while a few studios dominate the market, there exists a diverse array of smaller studios contributing to the production landscape, enriching it with varied content and perspectives.
  • In the distribution of movie budgets, particularly within Marvel Studios as well as other studios, there is a notable skew towards higher values. On average, the bulk of movies fall within the range of approximately $100,000,000 to $200,000,000. However, the distribution is positively skewed, indicating that while most movies adhere to this range, there is a long tail of films with budgets exceeding this range. This skewness suggests that there are relatively fewer movies with exceptionally high budgets beyond $200,000,000, but they still exist, contributing to the overall distribution’s elongated right tail.
  • In the distribution of the variable “InflationAdjustedBudget,” the majority of movies produced by Marvel Studios as well as other studios, had an average inflation-adjusted budget of around $200,000,000. However, this distribution exhibits a positively skewed pattern, indicating that while most movies fall within this range, there is a notable number of films with significantly higher budgets. This long tail of higher budget movies suggests that a select few productions command budgets well beyond the average, potentially indicating large-scale blockbuster releases or high-profile projects within the industry.
  • The distribution of the variable RunTimeInMinutes reveals an interesting pattern, particularly within movies produced by Marvel Studios as well as other studios. On average, the majority of these movies fall within the range of approximately 120 to 140 minutes in duration. However, what distinguishes this distribution is its positively skewed nature. While most movies cluster around this central range, there exists a significant number of films with notably longer runtimes. This skewed tail illustrates that while the typical Marvel Studios movie may adhere to a certain runtime, there are also noteworthy outliers that contribute to a broader spectrum of viewing durations. This positively skewed distribution suggests that while the majority of movies may follow a certain trend, there’s still substantial variability in runtime among Marvel Studios productions, with some films significantly deviating from the norm towards longer durations.

Histograms — Domestic Box Office Revenue, Opening Weekend

The distributions of the variables DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend all appear positively (right) skewed. The means of the variables DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend are $260,921,300, $327,468,200, and $97,296,230 respectively, while the standard deviations are $183,523,700, $214,986,900, and $66,748,790 respectively. The distributions of the variables imply the following:

  • The distribution of box office revenues for movies, particularly those produced by Marvel Studios as well as other studios, typically exhibits a positively skewed pattern. On average, these movies rake in between $200 million to $400 million at the United States box office. However, this average is heavily influenced by a few blockbuster hits that garner significantly higher revenues, pushing the tail of the distribution towards the higher end. This skewness suggests that while most movies fall within the mentioned range, there are notable outliers with exceptionally high box office earnings, reflecting the occasional phenomenal success of certain films in this category.
  • The distribution of the variable “InflationAdjustedDomestic” reveals an interesting trend, particularly when analyzing movies produced by Marvel Studios and other production houses. On average, a significant portion of these films falls within the range of $200,000,000 to $400,000,000 in inflation-adjusted domestic box office revenue. However, what makes this distribution noteworthy is its positively skewed nature. While most movies cluster around this average range, there exists a tail of movies with substantially higher inflation-adjusted domestic box office revenue, indicating occasional blockbuster hits that significantly outperform the average. This skew towards higher revenue values suggests that while the majority of films may perform well within a certain range, there are notable outliers that achieve extraordinary success at the box office.
  • In the distribution of the variable OpeningWeekend, the majority of films produced by Marvel Studios and other movie studios tend to generate opening weekend revenues ranging from approximately $50,000,000 to $100,000,000. However, the distribution is positively skewed, indicating a long tail of movies that surpass this range, with some achieving exceptionally high opening weekend revenues. This skewness implies that while the bulk of movies fall within the mentioned revenue range, there are notable outliers with significantly higher opening weekend earnings, possibly driven by blockbuster releases or highly anticipated sequels.

Histograms — International, Worldwide Box Office Revenue

The distributions of the variables InternationalBoxOffice and InflationAdjustedInternational both appear positively (right) skewed, while the distribution of the variable WorldwideBoxOffice is less clear. The means of the variables InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice are $393,784,300, $487,371,200, and $654,705,600 respectively, while the standard deviations are $332,294,000, $385,346,600, and $502,810,700 respectively. The distributions of the variables imply the following:

  • In examining the distribution of the International Box Office variable, it’s evident that the majority of films, including those from Marvel Studios and other production houses, tend to yield box office revenues ranging from approximately $250,000,000 to $500,000,000 outside the United States. However, this distribution is positively skewed, indicating a long tail of movies that surpass this range with significantly higher box office revenues internationally. This skewness suggests that while most movies fall within the mentioned revenue bracket, there exists a notable number of exceptionally successful films that substantially contribute to the overall distribution, extending the tail towards higher revenue values. Such skewness underscores the presence of outliers or exceptionally high-performing movies that exert a considerable influence on the distribution’s shape and dynamics.
  • In the distribution of the variable “InflationAdjustedInternational,” the majority of movies produced by Marvel Studios and other movie studios boast an inflation-adjusted box office revenue outside of the United States ranging between $250,000,000 and $500,000,000. However, this distribution exhibits a positively skewed pattern, indicating that while most films fall within this range, there is a significant number of outliers with substantially higher inflation-adjusted box office revenues. This skewness suggests that there are a few exceptionally successful movies that contribute to an extended tail on the higher end of the revenue spectrum, potentially indicating blockbuster hits or cult favorites that have enjoyed remarkable international success.
  • The distribution of the variable WorldwideBoxOffice presents an intriguing pattern. Initially, when considering movies with a worldwide box office revenue exceeding $1,000,000,000 (derived from the sum of domestic and international box office revenues), the distribution appears predominantly positively skewed. This suggests that a notable proportion of movies achieve exceptionally high revenues, pulling the distribution towards the higher end. However, upon closer examination of movies with worldwide box office revenues below $1,000,000,000, the distribution becomes less clear. There is no evident trend or conclusive pattern regarding the distribution of revenues for movies falling below this threshold. This ambiguity suggests a diverse range of performance outcomes within this subset of films. Additionally, it’s worth exploring further characteristics of this distribution, such as the frequency distribution of revenue ranges, the presence of outliers, and any potential factors influencing revenue disparities among films. Further analysis could shed light on the dynamics driving the variability in worldwide box office earnings across different movies.
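Since worldwide revenue is defined as the sum of domestic and international revenue, that identity makes a simple sanity check on the data. The rows below are hypothetical, but the last line uses the means reported in this article, which do satisfy the identity.

```python
import pandas as pd

# Hypothetical rows for illustration.
movies = pd.DataFrame(
    {
        "DomesticBoxOffice": [300_000_000, 150_000_000],
        "InternationalBoxOffice": [500_000_000, 250_000_000],
        "WorldwideBoxOffice": [800_000_000, 400_000_000],
    }
)

# Worldwide should equal domestic plus international on every row.
expected = movies["DomesticBoxOffice"] + movies["InternationalBoxOffice"]
is_consistent = bool((expected == movies["WorldwideBoxOffice"]).all())

# The reported means obey the same identity:
# $260,921,300 + $393,784,300 = $654,705,600.
reported_means_match = 260_921_300 + 393_784_300 == 654_705_600

print(is_consistent, reported_means_match)  # True True
```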

Histograms — Audience Reviews

The distributions of the variables IMDbScore, Tomatometer, and RottenTomatoAudienceScore all appear negatively (left) skewed. The means of the variables IMDbScore, Tomatometer, and RottenTomatoAudienceScore are 6.822727, 66.454545, and 74.196970 respectively, while the standard deviations are 0.973463, 25.618612, and 18.372404 respectively. The distributions of the variables imply the following:

  • In the realm of IMDbScore distribution, it’s notable that the majority of films from Marvel Studios and other cinematic entities tend to exhibit an IMDb score ranging approximately between 6.5 and 8. However, this distribution is significantly left-skewed, indicating a prolonged tail of movies with lower IMDb scores. This skewness suggests that while most movies achieve ratings within the mid-range, there are outliers on the lower end of the spectrum, indicating a notable portion of films that receive lower scores from viewers and critics alike.
  • Similarly, when considering the Tomatometer distribution, it becomes evident that the bulk of ratings for movies produced by Marvel Studios and other film studios fall within the range of 80 to 100. Yet this distribution also skews leftward, signifying a substantial tail of movies with lower Rotten Tomatoes scores, which pulls the mean (about 66) well below that range. Despite the majority of films garnering favorable reviews from professional critics, there exists a notable subset that receives comparatively lower ratings.
  • Lastly, in examining the RottenTomatoAudienceScore distribution, it’s apparent that the average scores for movies from Marvel Studios and other production houses typically hover between 70 and 100. However, akin to the other distributions, this distribution also demonstrates a leftward skew, indicating a prolonged tail of movies with lower Rotten Tomatoes Audience scores. Despite the general trend of positive audience reception, there remains a segment of films that elicit less favorable responses from viewers, contributing to the negative skew observed in the distribution.

Correlation Analysis

Correlation matrices were generated to better understand the relationship between the variables of interest and the following dependent (response) variables:

  • InflationAdjustedWorldwide — worldwide box office revenues adjusted to today’s inflation rates
  • InflationAdjustedOpeningWeekend — opening weekend box office revenues adjusted to today’s inflation rates
  • MetaScore — weighted average of many reviews coming from reputed critics

The correlation matrices will also be crucial in determining which variables of interest best predict InflationAdjustedWorldwide, InflationAdjustedOpeningWeekend, and MetaScore. In other words, the correlation matrices will be used to determine which variables of interest will end up being the independent variables in the regression model.

It's also worth noting that variables with a correlation greater than 0.3 or less than -0.3 were treated as suitable predictors of the dependent variables of interest, since a correlation of 0.3 indicates a moderate positive relationship while a correlation of -0.3 indicates a moderate negative relationship. While this threshold is certainly not a hard and fast rule for choosing the independent variables, correlation values serve as a guideline for choosing appropriate predictor variables.
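The threshold rule can be sketched as follows: compute each variable's correlation with the dependent variable and keep those whose absolute value exceeds 0.3. The five rows of data are hypothetical and only meant to show one variable that passes the threshold and one that does not.

```python
import pandas as pd

# Hypothetical data: Budget tracks the target closely, ReleaseDay barely at all.
movies = pd.DataFrame(
    {
        "Budget": [100, 150, 200, 250, 300],
        "ReleaseDay": [10, 20, 15, 20, 10],
        "InflationAdjustedWorldwide": [400, 650, 800, 1100, 1200],
    }
)

target = "InflationAdjustedWorldwide"

# Pearson correlation of every variable with the target (target itself dropped).
correlations = movies.corr()[target].drop(target)

# Keep variables whose correlation is above 0.3 or below -0.3.
selected = correlations[correlations.abs() > 0.3].index.tolist()

print(correlations.round(2))
print(selected)  # ['Budget']
```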

Dependent Variable 1 — InflationAdjustedWorldwide

Predictor Variables — Movie Release, Ownership, Budget, Run Time

The correlation between the dependent variable InflationAdjustedWorldwide and the following independent variables was determined: ReleaseMonth, ReleaseDay, ReleaseYear, Ownership, Budget, InflationAdjustedBudget, and RunTimeInMinutes.

The correlation between the dependent variable InflationAdjustedWorldwide and the independent variable Ownership signifies a strong negative relationship, while the correlation between the dependent variable and the independent variable RunTimeInMinutes indicates a strong positive relationship. Likewise, the correlation values indicate a very strong positive relationship between the dependent variable and the independent variable Budget as well as a very strong positive relationship between the dependent variable and the independent variable InflationAdjustedBudget.

On the other hand, the correlation values indicate a weak positive relationship between the dependent variable InflationAdjustedWorldwide and the independent variable ReleaseMonth, a negligible relationship between the dependent variable and the independent variable ReleaseDay, and a weak positive relationship between the dependent variable and the independent variable ReleaseYear.

Therefore, based on the correlation values, the variables Ownership, RunTimeInMinutes, Budget, and InflationAdjustedBudget are good predictor variables of the dependent variable InflationAdjustedWorldwide, while the variables ReleaseMonth, ReleaseDay, and ReleaseYear aren't good predictor variables of the dependent variable.

Predictor Variables — Domestic Box Office Revenue, Opening Weekend

The correlation between the dependent variable InflationAdjustedWorldwide and the following independent variables was determined: DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend. The correlation values all indicate a very strong positive relationship between the dependent variable and the independent variables. Therefore, based on the correlation values, the independent variables DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend are good predictor variables of the dependent variable InflationAdjustedWorldwide.

Predictor Variables — International, Worldwide Box Office Revenue

The correlation between the dependent variable InflationAdjustedWorldwide and the following independent variables was determined: InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice. The correlation values all indicate a very strong positive relationship between the dependent variable and the independent variables. Therefore, based on the correlation values, the independent variables InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice are good predictor variables of the dependent variable InflationAdjustedWorldwide.

Predictor Variables — Audience Reviews

The correlation between the dependent variable InflationAdjustedWorldwide and the following independent variables was determined: IMDbScore, Tomatometer, and RottenTomatoAudienceScore. The correlation values indicate a strong positive relationship between the dependent variable and the independent variables. Therefore, the variables IMDbScore, Tomatometer, and RottenTomatoAudienceScore are good predictor variables of the dependent variable InflationAdjustedWorldwide.

Dependent Variable 2— InflationAdjustedOpeningWeekend

Predictor Variables — Movie Release, Ownership, Budget, Run Time

The correlation between the dependent variable InflationAdjustedOpeningWeekend and the following variables was determined: ReleaseMonth, ReleaseDay, ReleaseYear, Ownership, Budget, InflationAdjustedBudget, and RunTimeInMinutes.

The correlation between the dependent variable InflationAdjustedOpeningWeekend and the independent variable Ownership indicates a strong negative relationship, while the correlation between the dependent variable and the independent variable RunTimeInMinutes indicates a strong positive relationship. Likewise, the correlation values indicate a very strong positive relationship between the dependent variable InflationAdjustedOpeningWeekend and the following independent variables: Budget and InflationAdjustedBudget.

Conversely, the correlation between the dependent variable InflationAdjustedOpeningWeekend and the independent variable ReleaseMonth indicates a weak negative relationship while the correlation values indicate a negligible relationship between the dependent variable and the following independent variables: ReleaseDay, ReleaseYear.

Therefore, based on the correlation values, the variables Ownership, RunTimeInMinutes, Budget, and InflationAdjustedBudget are good predictor variables of the dependent variable InflationAdjustedOpeningWeekend, while the variables ReleaseMonth, ReleaseDay, and ReleaseYear aren't good predictor variables of the dependent variable.
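
The selection rule used throughout this section (compute the correlation, describe its strength, keep the strong predictors) can be sketched in a few lines of Python. This is only an illustration: the cut-offs behind labels such as "negligible", "weak", and "very strong" are assumptions, since the thresholds are not stated here, and the sample values are made up rather than taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strength_label(r):
    """Map a correlation to a verbal label. Thresholds are assumptions."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        strength = "very strong"
    elif magnitude >= 0.6:
        strength = "strong"
    elif magnitude >= 0.4:
        strength = "moderate"
    elif magnitude >= 0.2:
        strength = "weak"
    else:
        return "negligible"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

# Hypothetical values (in millions) for five movies; not the real dataset.
opening_weekend = [55.0, 120.0, 85.0, 200.0, 150.0]
candidates = {
    "InflationAdjustedBudget": [90.0, 180.0, 140.0, 260.0, 210.0],
    "ReleaseDay": [12, 3, 25, 7, 18],
}

for name, values in candidates.items():
    r = pearson_r(values, opening_weekend)
    print(f"{name}: r = {r:.2f} ({strength_label(r)})")
```

A variable would then be kept as a predictor when its label is "strong" or "very strong", which mirrors the reasoning applied to each group of variables in this section.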

Predictor Variables — Domestic Box Office Revenue, Opening Weekend

The correlation between the dependent variable InflationAdjustedOpeningWeekend and each of the following independent variables was determined: DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend. All three correlation values indicate a very strong positive relationship with the dependent variable. Therefore, based on the correlation values, DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend are good predictor variables of InflationAdjustedOpeningWeekend.

Predictor Variables — International, Worldwide Box Office Revenue

The correlation between the dependent variable InflationAdjustedOpeningWeekend and each of the following independent variables was determined: InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice. All three correlation values indicate a very strong positive relationship with the dependent variable. Therefore, based on the correlation values, InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice are good predictor variables of InflationAdjustedOpeningWeekend.

Predictor Variables — Audience Reviews

The correlation between the dependent variable InflationAdjustedOpeningWeekend and each of the following independent variables was determined: IMDbScore, Tomatometer, and RottenTomatoAudienceScore. The correlation values indicate a strong positive relationship with the dependent variable. Therefore, IMDbScore, Tomatometer, and RottenTomatoAudienceScore are good predictor variables of InflationAdjustedOpeningWeekend.

Dependent Variable 3— MetaScore

Predictor Variables — Movie Release, Ownership, Budget, Run Time

The correlation between the dependent variable MetaScore and the following variables was determined: ReleaseMonth, ReleaseDay, ReleaseYear, Ownership, Budget, InflationAdjustedBudget, and RunTimeInMinutes.

The correlation between the dependent variable MetaScore and the independent variable Ownership indicates a strong negative relationship while the correlation between the dependent variable and the independent variable RunTimeInMinutes indicates a strong positive relationship. Likewise, the correlation values indicate a very strong positive relationship between the dependent variable MetaScore and the following independent variables: Budget, InflationAdjustedBudget.

Conversely, the correlation between the dependent variable MetaScore and the independent variable ReleaseMonth indicates a moderate negative relationship, while the correlation between the dependent variable and the independent variable ReleaseYear indicates a weak positive relationship. Lastly, the correlation between the dependent variable MetaScore and the independent variable ReleaseDay indicates a negligible relationship.

Therefore, based on the correlation values, the variables Ownership, RunTimeInMinutes, Budget, InflationAdjustedBudget, and ReleaseMonth are good predictor variables of the dependent variable MetaScore, while the variables ReleaseYear and ReleaseDay aren't good predictor variables of the dependent variable.

Predictor Variables — Domestic Box Office Revenue, Opening Weekend

The correlation between the dependent variable MetaScore and each of the following independent variables was determined: DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend. The correlation values indicate a strong positive relationship between the dependent variable and all three independent variables. Therefore, based on the correlation values, DomesticBoxOffice, InflationAdjustedDomestic, and OpeningWeekend are good predictor variables of MetaScore.

Predictor Variables — International, Worldwide Box Office Revenue

The correlation between the dependent variable MetaScore and each of the following independent variables was determined: InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice. All three correlation values indicate a strong positive relationship with the dependent variable. Therefore, based on the correlation values, InternationalBoxOffice, InflationAdjustedInternational, and WorldwideBoxOffice are good predictor variables of MetaScore.

Predictor Variables — Audience Reviews

The correlation between the dependent variable MetaScore and each of the following independent variables was determined: IMDbScore, Tomatometer, and RottenTomatoAudienceScore. The correlation values indicate a very strong positive relationship with the dependent variable. Therefore, based on the correlation values, IMDbScore, Tomatometer, and RottenTomatoAudienceScore are good predictor variables of MetaScore.

Regression Analysis

Now that the EDA portion has been completed, the last step is to perform a regression analysis in order to determine the best performing model and ultimately which model best predicts the following dependent (response) variables:

  • Inflation Adjusted Worldwide
  • Inflation Adjusted Opening Weekend
  • MetaScore

As part of the regression analysis, a series of initial and additional models were created to predict worldwide box office revenues adjusted to today's inflation rates, opening weekend box office revenues adjusted to today's inflation rates, and the IMDb MetaScore. In particular, four initial models and seven additional models were created for each dependent variable of interest, resulting in eleven models per dependent variable. For the purposes of this article, the best performing model from the set of initial models and from the set of additional models will be evaluated for each dependent variable.

To evaluate model performance, two linear regression metrics are used: R-Squared and RMSE. R-Squared measures goodness of fit as the proportion of the variance in the dependent variable that the model explains. RMSE (root mean squared error) is the square root of the average squared residual, where a residual is the difference between an actual value and the value the model predicts. It can be read as the typical distance between the data points and the regression line of best fit, expressed in the units of the dependent variable.
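
Both metrics can be computed directly from a list of actual values and a list of predictions. The following is a minimal sketch in plain Python; the function names and the sample numbers are illustrative assumptions, not the code or data used for this article.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared residual."""
    n = len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(sse / n)

def r_squared(actual, predicted):
    """Proportion of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - sse / sst

# Hypothetical MetaScore values and model predictions (not from the dataset).
actual = [60.0, 70.0, 80.0, 90.0]
predicted = [62.0, 68.0, 83.0, 87.0]

print(f"RMSE: {rmse(actual, predicted):.2f}")                      # RMSE: 2.55
print(f"R-Squared: {100 * r_squared(actual, predicted):.2f}%")     # R-Squared: 94.80%
```

Here the residuals are -2, 2, -3, and 3, so the squared error sums to 26 out of a total sum of squares of 500, which gives the two printed values.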

In summary, both RMSE and R-Squared measure the goodness of fit of a linear regression model. Ideally, the model that has the highest R-Squared and the lowest RMSE is the model of choice for predicting worldwide box office revenues adjusted to today's inflation rates, opening weekend box office revenues adjusted to today's inflation rates, and the IMDb MetaScore. The best performing models from the sets of initial and additional models are evaluated below. Lastly, it's worth noting that a train-test split was not implemented because the movie studio production dataset is quite small (66 rows, 23 columns), so every reported metric is computed on the same rows the model was fitted to.
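
For readers who want an out-of-sample estimate from a dataset this small, leave-one-out cross-validation is one commonly used alternative to a train-test split. It was not used in this article; the sketch below only illustrates the idea with a single predictor, hypothetical helper names, and made-up numbers.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def in_sample_rmse(xs, ys):
    """RMSE when the model is scored on the same rows it was fitted on."""
    intercept, slope = fit_line(xs, ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / len(xs))

def loocv_rmse(xs, ys):
    """RMSE when each row is predicted by a model fitted without that row."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        intercept, slope = fit_line(train_x, train_y)
        squared_errors.append((ys[i] - (intercept + slope * xs[i])) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical budget and opening weekend values in millions (not the real data).
budget = [50.0, 80.0, 120.0, 150.0, 200.0, 250.0]
opening = [20.0, 45.0, 60.0, 95.0, 110.0, 160.0]

print(f"In-sample RMSE: {in_sample_rmse(budget, opening):.2f}")
print(f"LOOCV RMSE:     {loocv_rmse(budget, opening):.2f}")
```

For ordinary least squares the leave-one-out RMSE is never smaller than the in-sample RMSE, so the gap between the two gives a rough sense of how optimistic the in-sample figure is.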

Dependent Variable 1 — Inflation Adjusted Worldwide

Best Performing Model — Initial Models

The analysis below is based on the RMSE and R-Squared values of the four initial models:

Analysis of R-Squared:

  • Model 3 has an R-Squared value of 99.94%, meaning the predictor variable(s) in the model explain an exceptionally high proportion of the variance in the response variable.
  • Model 2 follows with an R-Squared value of 91.91%, still demonstrating strong explanatory power but noticeably lower than Model 3.
  • Model 1 exhibits an R-Squared value of 68.11%, indicating a moderate proportion of variance explained.
  • Model 4 has the lowest R-Squared value of 48.36%, the smallest proportion of variance explained among the models.

Analysis of RMSE:

  • Model 3 has the lowest RMSE of 1.406774e+07 (about 14.07 million), indicating a relatively small average distance between the observed data values and the predicted values.
  • Model 2 has a higher RMSE compared to Model 3 but is still significantly lower than Models 1 and 4.
  • Model 1 has a higher RMSE compared to Models 2 and 3 but performs better than Model 4.
  • Model 4 has the highest RMSE, indicating the largest average distance between the observed data values and the predicted values among the models.

Overall Implications:

  • Model 3 emerges as the best-performing model considering both R-Squared and RMSE values. It has the highest R-Squared value, indicating strong explanatory power, and the lowest RMSE value, indicating accurate predictions.
  • Model 2 also performs well, with a relatively high R-Squared value but a higher RMSE compared to Model 3.
  • Model 1 performs moderately well but is outperformed by both Model 3 and Model 2 in terms of both R-Squared and RMSE.
  • Model 4 has the weakest performance among the models, with the lowest R-Squared value and the highest RMSE.

In summary, Model 3 stands out as the best choice, providing the highest R-Squared value and the lowest RMSE value, indicating both strong explanatory power and accurate predictions. However, it’s essential to consider the specific context and requirements of the analysis before finalizing the choice of model.

Best Performing Model — Additional Models

The analysis below is based on the RMSE and R-Squared values of the seven additional models:

Analysis of R-Squared:

  • Model 5 and Model 9 have R-Squared values of 100.00%, indicating perfect explanatory power. However, a perfect R-Squared is rare with real data and usually points to overfitting or an issue with the data (for example, predictors from which the response can be calculated exactly) rather than genuine model performance.
  • Model 7 has an R-Squared value of 99.96%, closely following the perfect scores of Models 5 and 9, suggesting extremely high explanatory power. Model 11 follows with an R-Squared value of 99.95%, also demonstrating excellent explanatory capability.
  • Model 6 has an R-Squared value of 93.23%, indicating a strong proportion of variance explained, though lower than the aforementioned models. Model 10 has a slightly lower R-Squared value of 92.95%, still showing a relatively high explanatory power compared to the remaining models.
  • Model 8 exhibits a lower R-Squared value of 79.11%, indicating a moderate proportion of variance explained, comparatively weaker than the other models.

Analysis of RMSE:

  • Model 5 and Model 9 have RMSE values of 0.000000e+00, implying a perfect fit between observed and predicted values. However, similarly to the R-Squared values, this might signal potential issues such as overfitting.
  • Model 7 has the lowest RMSE among the models with a non-zero value, indicating a relatively small average distance between observed and predicted values. Model 11 follows with a slightly higher RMSE than Model 7 but still performs well compared to the other models.
  • Model 6 has a significantly higher RMSE compared to Models 7 and 11 but is still relatively lower compared to the remaining models. Model 10 has a slightly higher RMSE than Model 6, suggesting slightly poorer predictive accuracy.
  • Model 8 has the highest RMSE among the models, indicating the largest average distance between observed and predicted values.

Overall Implications:

  • Models 5 and 9 achieve perfect scores for both R-Squared and RMSE, which could be indicative of overfitting or data issues rather than genuine model performance. These models should be approached with caution.
  • Models 7 and 11 perform exceptionally well, with high R-Squared values and low RMSE, suggesting strong explanatory power and accurate predictions.
  • Models 6 and 10 exhibit slightly lower R-Squared values and higher RMSE compared to Models 7 and 11 but still perform relatively well overall.
  • Model 8, while having a moderate R-Squared value, shows the highest RMSE among the models, indicating poorer predictive accuracy compared to the others.

In summary, Models 7 and 11 appear to be the best choices, as they demonstrate excellent performance in terms of both R-Squared and RMSE, indicating strong explanatory power and accurate predictions. Models 5 and 9, despite achieving perfect scores, should be treated with caution due to the possibility of overfitting or data issues. It’s essential to consider the specific context and potential limitations of each model before making any final decisions.
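
One way a regression can report exactly 100.00% and an RMSE of 0 is when the predictors algebraically determine the response. The sketch below is a hypothetical illustration of that situation, not a description of Models 5 and 9: it assumes a worldwide total that is simply domestic plus international revenue, uses made-up numbers, and fits a two-predictor least squares model by hand.

```python
import math

def fit_two_predictors(x1, x2, y):
    """OLS with an intercept and two predictors, solved on centered data."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    cy = [v - my for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * t for a, t in zip(c1, cy))
    s2y = sum(b * t for b, t in zip(c2, cy))
    det = s11 * s22 - s12 * s12  # zero only if the two predictors are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

def fit_metrics(y, predicted):
    """Return (R-Squared, RMSE) for a set of predictions."""
    mean_y = sum(y) / len(y)
    sse = sum((a - p) ** 2 for a, p in zip(y, predicted))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - sse / sst, math.sqrt(sse / len(y))

# Hypothetical revenues in millions; worldwide is defined as the exact sum.
domestic = [100.0, 250.0, 400.0, 180.0, 320.0]
international = [150.0, 300.0, 700.0, 200.0, 610.0]
worldwide = [d + i for d, i in zip(domestic, international)]

b0, b1, b2 = fit_two_predictors(domestic, international, worldwide)
predicted = [b0 + b1 * d + b2 * i for d, i in zip(domestic, international)]
r2, error = fit_metrics(worldwide, predicted)
print(f"coefficients: {b0:.2f}, {b1:.2f}, {b2:.2f}")
print(f"R-Squared: {100 * r2:.2f}%  RMSE: {error:.6e}")
```

The fit recovers both slopes as 1 and the intercept as 0, so every residual vanishes. A result like this says the model has rediscovered an accounting identity, which is why perfect scores are worth checking against the list of predictors before trusting them.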

Dependent Variable 2 — Inflation Adjusted Opening Weekend

Best Performing Model — Initial Models

The analysis below is based on the RMSE and R-Squared values of the four initial models:

Analysis of R-Squared:

  • Model 13 has the highest R-Squared value (99.68%), indicating that it explains the largest proportion of the variance in the response variable among the models provided.
  • Model 14 follows with an R-Squared of 92.50%, which is significantly lower than Model 13 but still relatively high.
  • Model 12 and Model 15 have substantially lower R-Squared values, indicating that they explain less of the variance in the response variable compared to Model 13 and Model 14.

Analysis of RMSE:

  • Model 13 has the lowest RMSE (4285176.85, or about 4.29 million), indicating that it has the smallest average distance between the observed and predicted data points among the models.
  • Model 14 has a higher RMSE compared to Model 13, indicating that its predictions have a larger average distance from the observed data points.
  • Model 12 and Model 15 have even higher RMSE values, suggesting poorer performance in terms of prediction accuracy compared to Model 13 and Model 14.

Overall Implications:

  • Model 13 appears to be the best performing model based on both R-Squared and RMSE. It explains a high proportion of the variance in the response variable (99.68%) while also having the smallest average prediction error.
  • Model 14, while having a lower R-Squared and higher RMSE compared to Model 13, still performs reasonably well.
  • Model 12 and Model 15 have significantly lower R-Squared values and higher RMSE values, indicating poorer performance in both explaining variance and making accurate predictions compared to the other models.

In conclusion, Model 13 is the preferred choice among the provided models due to its high R-Squared value and low RMSE, suggesting it provides the best balance of explaining the variance in the data and making accurate predictions. Because RMSE is expressed in the units of the response variable, RMSE values in the millions need to be judged against the scale of opening weekend revenues rather than in absolute terms. It is still worth investigating whether predictive accuracy can be improved across the models.
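
One way to put an RMSE of this size in context is to divide it by the mean of the observed response, which gives a unitless ratio. The sketch below uses the RMSE reported for Model 13, but the opening weekend values are hypothetical placeholders (the dataset is not reproduced here), so the printed percentage is illustrative only.

```python
def normalized_rmse(rmse_value, observed):
    """RMSE divided by the mean of the observed response (a unitless ratio)."""
    return rmse_value / (sum(observed) / len(observed))

# RMSE reported for Model 13 in this article.
model_13_rmse = 4285176.85

# Hypothetical opening weekend revenues in dollars; NOT the article's data.
hypothetical_opening = [45e6, 80e6, 120e6, 65e6, 190e6]

ratio = normalized_rmse(model_13_rmse, hypothetical_opening)
print(f"RMSE is {100 * ratio:.1f}% of the mean opening weekend")
```

With a mean opening weekend of 100 million in this made-up sample, an error of roughly 4.29 million is a few percent of the typical value, which is consistent with an R-Squared close to 100%.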

Best Performing Model — Additional Models

The analysis below is based on the RMSE and R-Squared values of the seven additional models:

Analysis of R-Squared:

  • Models 16, 17, 20, and 21 have exceptionally high R-Squared values ranging from 99.69% to 99.82%, indicating that they explain nearly all of the variance in the response variable.
  • Models 18 and 22 have lower but still relatively high R-Squared values of 94.09% and 92.69%, respectively.
  • Model 19 has the lowest R-Squared value among all models at 72.96%, indicating it explains a smaller proportion of the variance compared to the other models.

Analysis of RMSE:

  • Models 16, 17, 21, and 20 have relatively low RMSE values, indicating they have smaller average distances between observed and predicted data points.
  • Models 18 and 22 have higher RMSE values compared to Models 16, 17, 21, and 20, suggesting larger prediction errors.
  • Model 19 has the highest RMSE value among all models, indicating it has the largest average distance between observed and predicted data points, implying poorer model fit compared to the other models.

Overall Implications:

  • Models 16, 17, 21, and 20 perform exceptionally well in terms of both R-Squared and RMSE values, indicating they explain a large proportion of the variance in the response variable and make accurate predictions with relatively small errors.
  • Models 18 and 22 have slightly lower R-Squared values but still perform reasonably well. However, their RMSE values are higher compared to Models 16, 17, 21, and 20, suggesting slightly poorer predictive accuracy.
  • Model 19 has the lowest R-Squared value (72.96%) and a significantly higher RMSE than the other models, indicating poorer predictive accuracy and suggesting that it might not be as reliable for making accurate predictions.

In conclusion, Models 16, 17, 20, and 21 appear to be the best performing models overall due to their combination of high R-Squared values and low RMSE values, indicating strong explanatory power and predictive accuracy. Models 18 and 22 also perform reasonably well but have slightly higher RMSE values. Model 19 has both the lowest R-Squared value and a substantially higher RMSE, indicating it might not be as reliable for making accurate predictions.

Dependent Variable 3— MetaScore

Best Performing Model — Initial Models

Analysis of R-Squared:

  • Model 26 has an R-Squared value of 92.23%, indicating that it explains a high proportion of the variance in the response variable. Model 23 follows with an R-Squared value of 56.20%, still indicating moderate explanatory power but considerably less than Model 26.
  • Model 24 has an R-Squared value of 52.61%, indicating a lower proportion of variance explained compared to Models 26 and 23. Model 25 has the lowest R-Squared value of 51.97%, the smallest proportion of variance explained among the models.

Analysis of RMSE:

  • Model 26 has the lowest RMSE of 4.05, indicating a relatively small average distance between the observed data values and the predicted values. Model 23 has a higher RMSE compared to Model 26 but a lower RMSE than Models 24 and 25.
  • Model 24 has a higher RMSE compared to Models 26 and 23 but performs better than Model 25. Model 25 has the highest RMSE, indicating the largest average distance between the observed data values and the predicted values among the models.

Overall Implications:

  • Model 26 is the best performing model considering both R-Squared and RMSE values. It has the highest R-Squared value, indicating strong explanatory power, and the lowest RMSE value, indicating accurate predictions.
  • Model 23 is a distant second, with only a moderate R-Squared value (56.20%) and a higher RMSE compared to Model 26. Models 24 and 25 perform less favorably compared to Models 26 and 23 in terms of both R-Squared and RMSE.

In summary, Model 26 stands out as the best choice, providing the highest R-Squared value and the lowest RMSE value, indicating both strong explanatory power and accurate predictions. However, it’s essential to consider the specific context and requirements of the analysis before finalizing the choice of model.

Best Performing Model — Additional Models

Analysis of R-Squared:

  • Model 27 has an R-Squared value of 95.60%, indicating that it explains a high proportion of the variance in the response variable. Model 30 follows closely with an R-Squared value of 94.87%, indicating strong explanatory power, although slightly less than Model 27.
  • Model 32 has an R-Squared value of 93.13%, indicating a slightly lower proportion of variance explained compared to Models 27 and 30. Model 33 also has a relatively high R-Squared value of 93.07%, similar to Model 32.
  • Model 28 has a moderate R-Squared value of 67.44%, indicating a lower proportion of variance explained compared to the earlier models. Model 29 follows with an R-Squared value of 65.64%, slightly lower than Model 28.
  • Model 31 has the lowest R-Squared value of 53.75%, the smallest proportion of variance explained among the models.

Analysis of RMSE:

  • Model 27 has the lowest RMSE of 3.05, indicating a relatively small average distance between the observed data values and the predicted values. Model 30 follows closely with an RMSE of 3.29, indicating a slightly higher but still low average distance compared to Model 27.
  • Model 32 has an RMSE of 3.81, indicating a slightly higher average distance compared to Models 27 and 30. Model 33 also has a relatively low RMSE of 3.82, similar to Model 32.
  • Model 28 has a higher RMSE of 8.29, indicating a larger average distance between the observed data values and the predicted values compared to the earlier models. Model 29 follows with an RMSE of 8.51, slightly higher than Model 28.
  • Model 31 has the highest RMSE of 9.88, indicating the largest average distance among the models.

Overall Implications:

  • Models 27, 30, 32, and 33 are the best performing models considering both R-Squared and RMSE values. They have relatively high R-Squared values, indicating strong explanatory power, and low RMSE values, indicating accurate predictions.
  • Models 28 and 29 perform moderately well in terms of R-Squared but have higher RMSE values compared to Models 27, 30, 32, and 33, suggesting less accurate predictions.
  • Model 31 has the weakest performance among the models, with the lowest R-Squared value and the highest RMSE, indicating both lower explanatory power and less accurate predictions.

In summary, Models 27, 30, 32, and 33 stand out as the best choices, providing high R-Squared values and low RMSE values, indicating both strong explanatory power and accurate predictions. However, it’s essential to consider the specific context and requirements of the analysis before finalizing the choice of model.
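
Because no train-test split was used, R-Squared and RMSE for a given response are two views of the same residual sum of squares, which is why the two metrics rank the models in the same order. A quick consistency check on the MetaScore figures reported above makes this concrete. It assumes that RMSE was computed as the square root of the squared error divided by the number of rows; under that assumption RMSE = sd × sqrt(1 − R²), where sd is the standard deviation of MetaScore, so every model should imply the same sd.

```python
import math

# (R-Squared in %, RMSE) as reported above for the MetaScore models.
reported = {
    "Model 26": (92.23, 4.05),
    "Model 27": (95.60, 3.05),
    "Model 28": (67.44, 8.29),
    "Model 29": (65.64, 8.51),
    "Model 30": (94.87, 3.29),
    "Model 31": (53.75, 9.88),
    "Model 32": (93.13, 3.81),
    "Model 33": (93.07, 3.82),
}

def implied_response_sd(r2_percent, rmse_value):
    """Standard deviation of the response implied by an (R-Squared, RMSE) pair.

    Uses RMSE = sd * sqrt(1 - R^2), which holds when both metrics come from
    the same rows and RMSE divides the squared error by n (an assumption).
    """
    return rmse_value / math.sqrt(1.0 - r2_percent / 100.0)

implied = {name: implied_response_sd(r2, err) for name, (r2, err) in reported.items()}
for name, sd in implied.items():
    print(f"{name}: implied MetaScore standard deviation = {sd:.2f}")
```

All eight models imply a MetaScore standard deviation of roughly 14.5, so the reported pairs are mutually consistent. It also gives a useful yardstick: an RMSE of about 3 to 4 points is small relative to a spread of about 14.5 points, while an RMSE near 10 is not.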

Anaswar Jayakumar

Data Scientist - Leverages data science and statistical techniques to make recommendations that align with business priorities.