
Streaming Platforms and Movies Analysis

13 min read · Jun 12, 2022


By Qiyue Chen, Mehul Kotadia, Yi Huang, Hyunmin Kim, Aisha Saleem

Introduction

The increasing bandwidth of broadband networks has created opportunities for OTT services to enter and erode traditional broadcasting markets. In the United States, subscriptions to traditional cable television services shrank to 94 million households in 2018 (74% of the estimated 127 million US households) (Spangler, 2018). This trend is especially strong among younger viewers (ages 18–34), who are much more likely to opt for alternative video delivery services.

This project has two aims. On the one hand, it uses visualization techniques to help viewers select a streaming platform from among Netflix, Hulu, Prime Video, and Disney based on the quality of the content each one offers. On the other hand, it builds an ML model that uses attributes such as movie budget, genre, and vote count to predict the rating of a movie, which can help streaming platforms decide which content to produce or acquire.

This project is based on two datasets that are retrieved from Kaggle. The first dataset contains basic movie information across the following 4 platforms: Netflix, Disney, Hulu, and Prime Video. The second dataset contains general movie metadata. These two datasets were explored and analyzed individually. This analysis is divided into two parts — customer-based and platform-based analysis. The analysis of the first dataset is designed to demonstrate which streaming platform is worth subscribing to and illustrate customer classifications of each platform. The second dataset is used to analyze and help streaming platforms decide what types of movies should be displayed to increase customer base and retention.

Customer-Based Analysis

Exploratory Data Analysis

The movie streaming dataset is used to determine the types of movies available on each streaming platform, which platform is worth subscribing to, and how the customers of each platform are segmented. The 4 streaming platforms analyzed are Netflix, Amazon Prime Video, Disney, and Hulu. The dataset has 9515 records and 14 attributes, which are displayed below:

The unnamed column and the ID column are not needed in this study. The Type attribute is filled entirely with zeros, so it carries no useful information. These 3 columns are therefore dropped. The values for IMDb and Rotten Tomatoes are stored as strings such as ‘7.8/10’ (Rotten Tomatoes uses ‘/100’ as the denominator). Only the numerator is needed, so the denominator is replaced with an empty string and the remaining value is converted to a numeric (float) type. Below is the code to do this:

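import pandas as pd

# Turn the ratings into numeric values
# cust is the name of the dataframe

# IMDb: strip the "/10" denominator, then convert to a number
cust['IMDb'] = cust['IMDb'].str.replace("/10", "", regex=False)
cust['IMDb'] = pd.to_numeric(cust['IMDb'])

# Rotten Tomatoes: strip the "/100" denominator, then convert to a number
cust['Rotten Tomatoes'] = cust['Rotten Tomatoes'].str.replace("/100", "", regex=False)
cust['Rotten Tomatoes'] = pd.to_numeric(cust['Rotten Tomatoes'])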

The percentage of missing values for each attribute is calculated and visualized below as a matplotlib bar graph:

The “Age” attribute has significantly more missing values than the other attributes, so it is discarded. The “Directors”, “Language”, and “Country” attributes are also dropped because they are not needed in this analysis. Using the Runtime, IMDb, Genres, and Rotten Tomatoes attributes, a separate data frame is created for each of the 4 platforms to make the analysis easier.
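The missing-value percentages described above could be computed along the following lines. This is a minimal sketch on a made-up three-column frame; the `cust` name follows the article, but the rows are hypothetical and the article's exact code is not shown.

```python
import pandas as pd

# Hypothetical miniature stand-in for the streaming dataframe (`cust` in the article).
cust = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Age": [None, "13+", None, None],
    "IMDb": [7.8, None, 6.1, 5.4],
})

# Share of missing values per column, as a percentage, highest first.
missing_pct = cust.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# With matplotlib installed, missing_pct.plot.bar() would draw the bar graph.
```

Sorting the percentages first makes the attribute with the most gaps (here "Age") stand out at the left of the bar graph.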

Visualization

Firstly, the team uses a Venn diagram to visualize how many movies are unique to each platform and how many overlap to give a perspective on content uniqueness and help customers choose. For this part, Hulu and Disney are combined as one since there are not that many movies available in the dataset for each of these two platforms. In addition, Hulu and Disney have a subscription plan together so they are combined as one.

Amazon Prime has the highest number of movie titles (4113), followed by Netflix (3692). Disney and Hulu together have 1963 movies. According to this dataset, 3901 movies are unique to Prime Video, 3550 are unique to Netflix, and only 1816 are unique to Hulu and Disney combined. This shows that Prime Video and Netflix both offer a much wider range of movies to choose from.

Looking at the overlaps, Prime Video and Netflix have only 101 movies in common, so a customer looking for a wide variety of movies will find it worthwhile to subscribe to both. Hulu+Disney and Netflix have only 36 movies in common, while Prime Video and Hulu+Disney share 103. A customer who already subscribes to Hulu+Disney might therefore add Netflix rather than Prime Video to limit the overlap.
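The unique and shared counts behind a Venn diagram reduce to set operations on the titles. The sketch below uses invented titles and a hypothetical helper, `unique_to`; the real sets would come from each platform's data frame.

```python
# Hypothetical title sets; the real ones would come from each platform's data frame.
netflix = {"Movie A", "Movie B", "Movie C"}
prime = {"Movie C", "Movie D", "Movie E", "Movie F"}
hulu_disney = {"Movie F", "Movie G"}

def unique_to(target, *others):
    """Titles that appear on `target` and on none of the other platforms."""
    return target - set().union(*others)

print(len(unique_to(netflix, prime, hulu_disney)))  # titles only on Netflix
print(len(netflix & prime))                         # shared by Netflix and Prime Video
print(len(prime & hulu_disney))                     # shared by Prime Video and Hulu+Disney
```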

The graph above shows the counts of recent movies on each platform. Netflix has the highest count of recent movies, with Prime Video next, while Prime Video has the highest number of movies across all years. Disney has the fewest movies from 2019 and 2020.

The graphs below indicate the movie genres distribution across each platform. There are 27 unique genres available in the entire dataset.

Amazon Prime has close to 1900 drama movies, which is again the highest among the 4 platforms. It has the lowest number of mystery movies. Netflix has 1600+ drama and 1400+ comedy titles, while mystery, family, and adventure occur the least, in that order. Drama and comedy have the highest counts on Hulu, whereas sci-fi, crime, and adventure are less common there. Disney has been able to differentiate its content: it has the highest number of family movies, which is not surprising given Disney's popularity among kids. It has fewer action movies.
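Genre counts like those above can be tallied by splitting each title's genre string. The sketch below assumes each Genres value is a comma-separated string, which the article does not state explicitly. The rows are invented and `count_genres` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical rows; assumes each Genres value is a comma-separated string.
genres_column = ["Drama,Comedy", "Drama", "Family,Adventure", "Comedy,Drama"]

def count_genres(values):
    """Count how many titles carry each genre label."""
    counts = Counter()
    for value in values:
        counts.update(g.strip() for g in value.split(",") if g.strip())
    return counts

genre_counts = count_genres(genres_column)
print(genre_counts.most_common(2))
```

Running the same function on each platform's data frame separately would give the per-platform distributions.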

The Word Cloud shown above displays the most frequently used words in the movie titles among all the 4 platforms. The team also created individual Word Clouds for each platform. For more information, please refer to the Appendix.

Analyzing the individual word clouds, Love, Christmas, Girl, Story, Day, Life, and Man are popular words on Netflix, Hulu, and Prime Video. Star Wars, Shark, and Story are more common in movie titles on Disney, and Mickey, Adventure, and Dog also appear in a significant number of titles. Disney seems to have distinguished itself from the other players in terms of content, as it caters to its own niche segment: kids.

Movie rating is one of the most important attributes. A larger number of highly rated movies is a strong incentive for customers to subscribe to a particular platform. The density plot shows the rating distribution of the 4 platforms:

Disney has the most movies with a rating around 7.5, which is considered above average. Because Netflix and Amazon Prime carry many more movies, their rating distributions are bound to sit lower.

This graph shows the rating distribution for the 4 platforms from Rotten Tomatoes:

The Rotten Tomatoes ratings also support this observation, with Disney and Hulu having more movies with higher ratings than Netflix and Prime.

From the above visualizations, Amazon Prime and Netflix emerge as the better choices for customers to subscribe to. Customers who prefer family content can also consider subscribing to Disney.

Platform-Based Analysis

Exploratory Data Analysis

The team performs a platform-based analysis across all streaming platforms rather than analyzing the platforms individually. This analysis aims to build an efficient, well-performing algorithm that streaming platforms can use to decide which movies to add or produce in order to attract and retain customers. A classification model is used to predict the rating class of a movie as either 0 (poorly rated) or 1 (highly rated), and platform companies can decide whether to add a movie based on the result.

This segment deals with the movie meta dataset. The movie meta dataset contains 45,466 rows and 24 columns. Below are the available attributes of the dataset:

Having 24 attributes for analysis is a bit overwhelming so the dataset is simplified by discarding certain attributes. Below are the attributes and their description that are necessary for the analysis:

In addition to the attributes above, a new column for the return on investment is created using the following formula: roi = revenue/budget for each instance.

Next, the data is cleaned before feeding it to the models. The revenue column has 38052 missing values and the budget column has 36573. A subset of the data is created in which the roi column has no null values, leaving a total of 5381 rows for the analysis. This subset contains 4157 null values in the belongs_to_collection column. A non-missing value indicates that the movie belongs to a collection, and a missing value indicates that it does not, so present values are set to 1 and missing values to 0. After these modifications, the dataset contains no missing values.
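The roi calculation and the collection flag described above could look roughly like this. It is a sketch on invented rows, assuming budget and revenue are already numeric; the column names follow the article, while `movies` and `subset` are hypothetical variable names.

```python
import pandas as pd

# Hypothetical rows; column names follow the article.
movies = pd.DataFrame({
    "budget": [10.0, None, 40.0, 20.0],
    "revenue": [30.0, 5.0, None, 10.0],
    "belongs_to_collection": ["{'name': 'Some Collection'}", None, None, None],
})

# roi = revenue / budget; a missing budget or revenue leaves roi missing.
movies["roi"] = movies["revenue"] / movies["budget"]

# Keep only rows where roi could be computed.
subset = movies.dropna(subset=["roi"]).copy()

# Present value -> 1 (part of a collection), missing value -> 0.
subset["belongs_to_collection"] = subset["belongs_to_collection"].notna().astype(int)
print(subset)
```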

The genre column looks like this:

Each instance contains a list of dictionaries for each genre. Each instance is converted to a plain list containing the genres for that particular movie instance. Then, the team creates dummy variables for each unique genre present in the dataset and drops the original genres column.
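One way to carry out this conversion is sketched below. It assumes each cell is a string holding a list of dictionaries with a 'name' key, which matches the description above but is not confirmed by the article. The rows and the `genre_names` helper are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical rows; assumes each cell is a string holding a list of dicts with a 'name' key.
movies = pd.DataFrame({
    "title": ["Film 1", "Film 2"],
    "genres": [
        "[{'id': 1, 'name': 'Comedy'}, {'id': 2, 'name': 'Romance'}]",
        "[{'id': 3, 'name': 'Action'}]",
    ],
})

def genre_names(cell):
    """Turn the stringified list of dicts into a plain list of genre names."""
    return [entry["name"] for entry in ast.literal_eval(cell)]

movies["genres"] = movies["genres"].apply(genre_names)

# One 0/1 column per genre, then drop the original list column.
dummies = pd.get_dummies(movies["genres"].explode()).groupby(level=0).max().astype(int)
movies = movies.drop(columns="genres").join(dummies)
print(movies)
```

Exploding the lists before calling `get_dummies` lets a movie with several genres receive a 1 in each of the matching columns.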

Here are the updated columns in the dataset:

After the roi column, all the remaining columns are genres. However, some of them are unusual labels such as ‘Carousel Productions’ and ‘GoHands’, which occur rarely. The sum of each genre column is computed, which gives the number of times 1 appears for that genre dummy variable. Genre columns where 1 occurs fewer than 10 times are dropped. This reduces the number of attributes from 39 to 26, with 5381 rows. The dataset is now ready for modeling.
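Dropping the rare genre columns can be sketched as follows. The frame is invented and the threshold is lowered from the article's 10 to 2 so that the toy data shows the effect; `MIN_COUNT` is a hypothetical name.

```python
import pandas as pd

# Hypothetical dummy columns; 'GoHands' stands in for a rarely occurring label.
genres = pd.DataFrame({
    "Drama":   [1, 1, 0, 1],
    "Comedy":  [0, 1, 1, 1],
    "GoHands": [0, 0, 1, 0],
})

MIN_COUNT = 2  # the article uses 10; lowered here to suit the toy data

# Column sums give the number of movies tagged with each genre.
counts = genres.sum()
rare = counts[counts < MIN_COUNT].index
genres = genres.drop(columns=rare)
print(list(genres.columns))
```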

Modeling

Before applying the classifier, the dataset is divided into 2 data frames. The first contains only the following numeric attributes: budget, revenue, belongs_to_collection, runtime, vote_count, roi, and vote_average. A linear regression model is then run on this first data frame, and attributes with a p-value below the alpha = 0.05 level are treated as significant. Genres are not included in this model because all the basic genres that a movie could fall into are to be used in the final classification model. Even if the model found ‘adventure’ not to be significant, it could not simply be discarded, since a platform might want to see whether a movie that combines ‘adventure’ with other genres is highly rated or poorly rated. The second data frame contains all the genre columns plus the continuous variables found significant by the linear regression model.

Linear Regression

Below are the results of running linear regression on the following attributes:

With an alpha value of 0.05, revenue and roi are not significant in predicting the rating of a movie. This means that the streaming companies do not have to be concerned with the revenue and roi of a movie in order to predict the rating. The budget of the movie is significant because a higher budget, in most scenarios, means better quality content, which is highly related to the ratings.
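The significance check can be illustrated with ordinary least squares on synthetic data. This is a NumPy-only sketch, not the team's code: it uses the large-sample rule that |t| > 1.96 corresponds to p < 0.05, whereas a statistics package such as statsmodels would report exact p-values. The predictors `x1` and `x2` are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical predictors: x1 drives the response strongly, x2 is unrelated noise.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])          # intercept + predictors
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares coefficients
residuals = y - X @ beta
sigma2 = residuals @ residuals / (n - X.shape[1])  # residual variance
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

# Large-sample rule of thumb: |t| > 1.96 corresponds to p < 0.05 (two-sided).
significant = np.abs(t_stats) > 1.96
print(dict(zip(["intercept", "x1", "x2"], significant)))
```

A predictor whose t-statistic stays inside the threshold, as revenue and roi did in the article's results, would be dropped before the classification step.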

Pre-processing

To proceed with the logistic regression model, revenue and roi are dropped. Then the vote_average column is converted to 0s and 1s, where 0 stands for a vote_average equal to or below 7 and 1 for a value above 7. Next, the team discretizes the budget, runtime, and vote_count columns using the pandas quantile-based discretization function qcut. Each attribute has 4 bins with an approximately equal number of instances.
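The two pre-processing steps can be sketched as follows on invented values. The `rating_class` and `runtime_bin` column names are hypothetical; the article overwrites the original columns instead.

```python
import pandas as pd

# Hypothetical values; column names follow the article.
movies = pd.DataFrame({
    "vote_average": [5.5, 7.0, 7.1, 8.2, 6.3, 7.5, 4.9, 6.8],
    "runtime":      [81, 95, 102, 143, 88, 120, 76, 110],
})

# 1 = highly rated (vote_average above 7), 0 = otherwise.
movies["rating_class"] = (movies["vote_average"] > 7).astype(int)

# Quantile-based discretization: 4 bins with roughly equal counts.
movies["runtime_bin"] = pd.qcut(movies["runtime"], q=4)
print(movies["runtime_bin"].value_counts().sort_index())
```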

Below are screenshots of the before and after view of discretized columns:

Next, these intervals are converted to values that can be used by the algorithm. Each bin is assigned an ordinal label using sklearn’s preprocessing tool, LabelEncoder. An ordinal labeling technique is used because the order of these bins matters (e.g., for runtime, 60 minutes comes before 80 minutes, and so on). Since 4 bins are created for each of these attributes, each bin receives a value from 0 to 3. Below are the labels corresponding to each bin value for each attribute:

Each label corresponds to a bin ‘category’. This process facilitates applying the classifier and also interpreting the results to the streaming companies. For instance, budget category 2 refers to any budget value between 17 and 40 million dollars. If the classifier predicts movies with a runtime label of 3 as highly rated, the platform can focus on adding more movies with a runtime longer than 120 minutes.
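The article uses LabelEncoder for this step. As an alternative sketch, the ordered categories that qcut returns already carry integer codes, which guarantees that label 0 is the lowest bin and label 3 the highest. The budget values below are invented.

```python
import pandas as pd

budget_millions = pd.Series([1, 5, 12, 20, 35, 60, 90, 150])  # hypothetical budgets

bins = pd.qcut(budget_millions, q=4)   # ordered interval categories
labels = bins.cat.codes                # 0 for the lowest bin ... 3 for the highest

# Lookup table from ordinal label back to the interval it stands for.
label_to_interval = dict(enumerate(bins.cat.categories))
print(labels.tolist())
print(label_to_interval)
```

Keeping a lookup table like `label_to_interval` makes it straightforward to translate a predicted category back into a budget or runtime range when presenting results.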

Logistic Regression

Benchmark

The best model is selected on a performance basis. The evaluation metrics used are precision, recall, and f1-score. Running the logistic regression on the pre-processed dataset, below is the model result:

Even though the accuracy appears to be good enough, the recall and the f1-score are poor for the highly rated class. The main metric is the recall, or stratified accuracy, of the minority class, which is calculated from the confusion matrix as TP / (TP + FN). The stratified accuracy for the minority class is 38%, which is undoubtedly poor. This is expected because the model is biased towards the majority class.
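As a worked example of the recall formula, the small function below uses illustrative counts chosen to give 38%. They are not the values from the article's confusion matrix.

```python
def recall(true_positives, false_negatives):
    """Recall for one class: TP / (TP + FN)."""
    return true_positives / (true_positives + false_negatives)

# Illustrative counts only, not the article's confusion matrix:
# of 100 truly highly rated movies, 38 are caught and 62 are missed.
print(recall(true_positives=38, false_negatives=62))
```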

Here is the class distribution for the response variable:

There are 4322 instances of class 0 and only 1058 instances of class 1. This significant class imbalance explains the poor model results. To improve model performance, the team balances the class weights using logistic regression’s class_weight parameter.

Balanced Class Weights

Logistic regression has a built-in parameter called class_weight. The keyword ‘balanced’ is passed to the parameter to balance the weights of the classes. The model then automatically gives more weight to the minority class and less weight to the majority class to reduce bias.
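To give a sense of the weights involved, scikit-learn documents the 'balanced' heuristic as n_samples / (n_classes * class_count). Applying that formula by hand to the class counts reported above gives the following.

```python
# Class counts reported in the article.
class_counts = {0: 4322, 1: 1058}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# scikit-learn's documented 'balanced' heuristic: n_samples / (n_classes * class_count).
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in class_counts.items()
}
print(balanced_weights)
```

Under this heuristic each minority-class instance counts roughly four times as much as a majority-class instance. In scikit-learn the same effect is requested with `LogisticRegression(class_weight="balanced")`.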

Here are the results after balancing the class weights:

Even though the accuracy drops and the precision decreases, the recall improves significantly and the f1-score improves slightly. The stratified accuracy for the minority class has increased to 80%, which is a great improvement! Next, the team adds manual weights using a grid search to see if the model can be improved further.

Manual Weights + Grid Search

Grid search is used to optimize model hyperparameters to improve model performance. It evaluates each combination of the candidate hyperparameter values and keeps the model with the best results.
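The article does not list the grid that was searched or the scoring metric, so the following is only a sketch of how manual class weights might be tuned with scikit-learn's GridSearchCV. The synthetic data, the candidate weights, and the choice of recall as the score are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the movie features (assumption).
X, y = make_classification(n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=0)

# Candidate weight pairs: the minority class (1) gets w, the majority class gets 1 - w.
minority_weights = np.linspace(0.5, 0.95, 10)
param_grid = {"class_weight": [{0: 1.0 - w, 1: w} for w in minority_weights]}

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, scoring="recall", cv=5)
search.fit(X, y)
best = search.best_params_["class_weight"]
print(best, search.best_score_)
```

The result of such a search depends heavily on the scoring metric: optimizing for accuracy tends to favor the majority class, which may be one reason a manual grid can end up close to the benchmark.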

Here are the model results after adding the manual weights:

Unfortunately, this approach yielded a much worse model result compared to the previous approach. The results are similar to the benchmark results. So far using the logistic regression’s class weight parameter has yielded the best model results based on the stratified accuracy.

Next, the team applies the Synthetic Minority Oversampling Technique (SMOTE) from Python’s imbalanced-learn library.

SMOTE

After applying SMOTE, both classes have the same number of instances, which is 4322 instances per class. Here are the model results:

Precision, recall, and f1-score have increased significantly. Even though the overall accuracy dropped, this can be concluded as the best-performing model. This is a model that can be utilized by streaming companies to predict a new movie instance as highly rated or not.
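For readers unfamiliar with SMOTE, its core idea is to create new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The pure-Python sketch below is a simplified illustration of that idea, not imbalanced-learn's implementation; the `smote_like` function and the toy points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (3.0, 3.0)]
majority_size = 10
new_points = smote_like(minority, n_new=majority_size - len(minority))
print(len(minority) + len(new_points))
```

With imbalanced-learn itself, the equivalent step is `SMOTE().fit_resample(X, y)`, which returns the resampled features and labels with both classes at the same size.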

Predicting a new movie instance

Let us say a movie falls into the budget category of 3, runtime category of 1, vote_count category of 2, and has the following genres: comedy, romance, action.

Feeding the above movie instance to the model, the algorithm predicts class 0, meaning not highly rated. A movie with these attribute values is therefore not expected to be highly rated, and the streaming company can take that into account when deciding whether to add it.

Conclusion

  • What platforms are worth subscribing to in general and also what types of customers are each platform built for

After performing the customer-based analysis, there are several practical findings for the streaming platforms. Netflix and Prime Video have obviously more movies than Hulu and Disney. As a result, if customers choose movies on a quantity basis, Netflix and Prime Video are preferred. However, based on the genre distribution, each platform has different specialties. If customers are interested in movies for children, Disney is certainly a good choice. The other 3 platforms have similar genre distribution, so the final choice on these platforms can be decided by a specific movie that only a platform can provide.

  • What movies should streaming platforms pay attention to, including movies that are highly rated, use algorithms to learn what features/genres relate to rating

Reviewing the results for the classification algorithm, the best performing model is the logistic regression classifier applied on the oversampled data with precision of 77%, recall of 81%, and f1-score of 79%. This is the optimal performance of the algorithm, hence, for the purpose of this project, this is the final model. Features such as revenue and roi of a movie are irrelevant in predicting if a movie will be highly rated. Features such as the budget, runtime, vote count and genre are significant in predicting if a movie will be highly rated or not.
