How to Use Linear Regression for Predicting Movie Revenue

Merve Cengiz
İstanbul Data Science Academy
6 min read · Feb 23, 2022

Let’s admit it: when we want to watch a movie, we ask someone for advice, read the comments, or check the rating, because we want to watch a great movie. I think everyone agrees that IMDB.com is the best site in this regard. Many features make a movie great and popular, especially when it generates high revenue. So, can we find out which features are important for a movie?

For this purpose, in the 2nd project at Istanbul Data Science Academy Bootcamp, my team and I scraped data from imdb.com using the BeautifulSoup library in Python. Our aim was to find out which variables impact the popularity of the top 1000 movies according to imdb.com, and to predict the revenue of upcoming movies. Below is the project we did:

Methodology:

1- Extract features from IMDB.com by web scraping,

2- Preprocess the data (Data cleaning, missing value imputation),

3- Exploratory Data Analysis (by using Tableau)

4- Build ML models (Linear Regression, Lasso and Ridge Regression)

1- Web Scraping:

Web scraping means collecting data from a website: we learn its structure, parse it with BeautifulSoup, and obtain a dataset.

To find the elements we wanted, we defined a function for requesting and parsing an IMDB movie page.

For example, when you call the function with a URL, you can easily get a specific element by using “find()”.
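A helper along these lines could look as follows. This is an illustrative sketch, not the project’s actual code: the function names `get_soup` and `get_title` are my own, and real IMDB pages may need different selectors.

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Request a page and return it parsed as a BeautifulSoup object."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def get_title(soup):
    """Pull one specific element from the parsed page with find()."""
    tag = soup.find("h1")
    return tag.get_text(strip=True) if tag else None
```

Calling `get_soup()` with a movie URL and passing the result to `get_title()` returns the movie name; the same `find()` pattern extends to runtime, rating, and the other fields.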

We created code to get the following features for the movies on IMDB.com: Movie Name · Released Year · Runtime · Budget · Genres (Genre1, Genre2, Genre3) · Rating · User Reviews · Critic Reviews · Metascore · Released Date · Count Videos · Count Photos · Trailer Duration · Director · Stars (Star1, Star2, Star3) · Companies (Company1, Company2, Company3) · Movie Revenue (Gross International)

You can visit my GitHub page for the full web-scraping code for IMDB.com.

2- Preprocess the data:

  1. Data Cleaning

As you can see from the data, the variables are too messy to analyse directly.

We used functions to split, strip, and remove unnecessary symbols, numbers, etc. from the variables (GitHub), and finally the data was ready for the next techniques: missing value imputation, EDA, and so on.
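As a small illustration of this kind of cleaning (the column names and values here are made up, not the project’s real data), scraped strings like runtime and gross can be stripped of symbols and cast to numbers:

```python
import pandas as pd

# Illustrative messy values, as they come out of the scraper
df = pd.DataFrame({
    "runtime": ["142 min", "175 min"],
    "gross": ["$28,341,469", "$134,966,411"],
})

# Strip unit suffixes and currency symbols, then cast to integers
df["runtime"] = df["runtime"].str.replace(" min", "", regex=False).astype(int)
df["gross"] = df["gross"].str.replace(r"[$,]", "", regex=True).astype(int)
```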

2. Missing Value Imputation

You can see which features have null values in both plots below:

This matrix shows us whether the null values are random or correlated with other features. For example, when count of videos is missing for a movie, count of photos and trailer duration are missing as well. The budget may be associated with metascore or gross (revenue). We can see which features are correlated in terms of missing values in the heatmap below:

If missing values are correlated with other features, then we can’t fill them with the mean or median. Therefore, we used the K-Nearest Neighbours technique to predict the missing values. The categorical variable Month was filled with “February”, its mode.
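A minimal sketch of this imputation step, assuming numeric columns such as “budget” and “metascore” (values here are invented): `KNNImputer` fills each missing entry from its k nearest rows, while the categorical “month” gets its mode.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "budget": [25e6, np.nan, 55e6, 60e6],
    "metascore": [80, 74, np.nan, 82],
    "month": ["February", None, "February", "June"],
})

# KNN imputation for numeric features correlated with each other
imputer = KNNImputer(n_neighbors=2)
df[["budget", "metascore"]] = imputer.fit_transform(df[["budget", "metascore"]])

# Mode imputation for the categorical month
df["month"] = df["month"].fillna(df["month"].mode()[0])
```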

3. Exploratory Data Analysis

We performed EDA using Tableau, which is one of the great visualization tools for data science. It is very effective for creating dynamic plots and easy to use. Besides, you can download the data behind the graphs you create.

The plot below shows the maximum revenue for each genre:

Movies in the action and drama genres generated the most revenue among the 14 genres in total.

You can see in which movies the directors achieved the maximum revenue.

You can see in which movies the stars achieved the maximum revenue.

You can see in which movies the movie companies achieved the maximum revenue.

We also created plots showing the total revenue for each director, star, and company, and used that data to create the grouped features. You can find all plots about the data in my Tableau profile.

Feature Extraction

First of all, I grouped the total revenue by stars and categorized them as top, very high, high … lowest:

The same procedure was performed for companies and directors. Then I merged the 3 dataframes using only the keys from the left frame (a left join).
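The grouping-and-merging step can be sketched like this. The column names, bin edges, and labels are illustrative placeholders, not the project’s actual values:

```python
import pandas as pd

movies = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "star1": ["Star X", "Star Y", "Star X"],
    "gross": [300.0, 50.0, 120.0],
})

# Total revenue per star, then binned into categories
star_totals = movies.groupby("star1", as_index=False)["gross"].sum()
star_totals["star_group"] = pd.cut(
    star_totals["gross"],
    bins=[0, 100, 300, float("inf")],
    labels=["low", "high", "top"],
)

# Left join: keep only the keys from the movies frame
merged = movies.merge(star_totals[["star1", "star_group"]],
                      on="star1", how="left")
```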

Then, I grouped each quantitative variable by evaluating its descriptive statistics and distribution.

The feature “user_reviews” ranges from 22 to 11,100; we categorized it into 6 groups using the qcut function, since I didn’t want to change the distribution. qcut() categorizes a feature by its quantiles:
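Here is qcut in miniature, with invented review counts in the same 22–11,100 range (the labels are illustrative). Because it splits on quantiles, each of the 6 bins ends up with roughly the same number of movies, so the shape of the distribution across groups is preserved:

```python
import pandas as pd

reviews = pd.Series(
    [22, 150, 480, 900, 1500, 2400, 3100, 4200, 5600, 7800, 9500, 11100]
)

# 6 quantile-based bins: equal counts per bin, not equal widths
bins = pd.qcut(reviews, q=6,
               labels=["lowest", "low", "mid", "high", "very_high", "top"])
```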

The feature “budget” ranges from 18,000 to 30,000,000,000, so the standard deviation is high; we can see that from the difference between the 75th percentile and the maximum value:

We categorized it into 10 groups:

The maximum values of “count_photos” and “count_videos” were both 99. While 75% of the “count_photos” values were above 90, the feature “count_videos” was almost the opposite:

They were categorized into 2 groups, “much” and “little”. I grouped the rest of the features similarly. In the final stage of the feature engineering part, interaction terms were added and features were combined as follows:

The reviews were combined into “total_reviews”, and the counts of photos and videos were combined into “total_documents”. Interactions of rating-metascore and director-stars-companies were added to the data. We added 1 to each term before multiplying, as they all contain zeros. We thought that the company, director, and star could affect the revenue together rather than as single features.
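A sketch of those combined features, assuming the grouped variables have already been encoded as small integers (the column names and values here are my own illustrations). Adding 1 before multiplying matters because a zero in any factor would wipe out the whole interaction term:

```python
import pandas as pd

df = pd.DataFrame({
    "user_reviews_grp": [2, 5], "critic_reviews_grp": [1, 3],
    "count_photos_grp": [0, 1], "count_videos_grp": [1, 0],
    "rating_grp": [3, 0], "metascore_grp": [2, 4],
    "director_grp": [0, 2], "star_grp": [1, 3], "company_grp": [2, 0],
})

# Sums of related counts
df["total_reviews"] = df["user_reviews_grp"] + df["critic_reviews_grp"]
df["total_documents"] = df["count_photos_grp"] + df["count_videos_grp"]

# Interaction terms: shift by 1 so zeros don't zero out the product
df["rating_x_metascore"] = (df["rating_grp"] + 1) * (df["metascore_grp"] + 1)
df["dir_star_comp"] = ((df["director_grp"] + 1)
                       * (df["star_grp"] + 1)
                       * (df["company_grp"] + 1))
```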

ML Models and Results

We split the data into random train and test subsets with a test size of 0.30. We used cross-validation (k=5) to prevent overfitting. First, linear regression was implemented using two different libraries (statsmodels and sklearn):
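The sklearn side of that setup can be sketched as follows, on synthetic data (the real models were fit on the engineered IMDB features): a 70/30 split, 5-fold cross-validation, and a plain `LinearRegression` scored by RMSE and R-squared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the engineered feature matrix and revenue target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
cv_r2 = cross_val_score(model, X_train, y_train, cv=5)   # k=5 folds
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
r2 = model.score(X_test, y_test)
```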

RMSE (statsmodels) : 1301423702.15, R-squared : 0.79

RMSE (sklearn) : 198580875.75, R-squared : 0.52

RMSE and R-Squared statistics of Lasso and Ridge Regression were as follows:

Lasso RMSE : 169382102.38, R-squared : 0.71

Ridge RMSE : 169316040.8, R-Squared : 0.71
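The regularized models follow the same pattern; here is a sketch on synthetic data (alphas are illustrative, not the tuned values). Note the characteristic difference: Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data with two truly irrelevant features (last two coefficients)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: sparse coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrunk coefficients
```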

In linear regression, there is a distributional assumption on the dependent variable. The reasons for the low R-squared value are this distributional assumption and multicollinearity between variables. In the case of multicollinearity, Least Squares estimates have high variance. Lasso and ridge regression do not rely on the distributional assumption and are great techniques for overcoming the multicollinearity problem. The important variables according to the Ridge and Lasso models were as follows:

Feature importance for both Lasso and Ridge models

Both the main effects of star, director, and company and their interaction term are important features for a movie’s revenue. Count of videos and critic reviews also have an effect on a movie’s revenue.
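One common way to read importance off a fitted linear model, sketched here on synthetic data with invented column names, is to rank features by the absolute size of their coefficients (this assumes comparably scaled features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic features; "dir_star_comp" is built to be the strongest signal
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(100, 3)),
                 columns=["dir_star_comp", "count_videos", "critic_reviews"])
y = 4 * X["dir_star_comp"] + 2 * X["count_videos"] + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)

# Rank features by absolute coefficient magnitude
importance = (pd.Series(ridge.coef_, index=X.columns)
              .abs().sort_values(ascending=False))
```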

Future Work

In this project, revenue had a right-skewed distribution with high variance. To predict more accurately, transformations can be applied to the target.
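For example, a log transform (one standard choice for a right-skewed target; the revenue figures below are invented) compresses the long right tail, and predictions can be mapped back afterwards:

```python
import numpy as np

revenue = np.array([1e6, 5e6, 2e7, 8e8])   # illustrative, right-skewed

log_revenue = np.log1p(revenue)    # train the model on this instead
recovered = np.expm1(log_revenue)  # invert the transform after predicting
```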

Working with more variables related to the dependent variable; thus, we can further explain the variance in revenue.

Performing advanced feature engineering on features.

Many thanks for reading; please visit my Github account and Tableau profile. I posted the article on my Linkedin profile. Please don’t hesitate to connect with me for your questions.
