Determining the Most Popular Movie Genre Group Within Streaming Services

Emmitt Anton
INST414: Data Science Techniques
3 min read · Apr 1, 2024

Most movies are not classified by a single genre; they also carry sub-genres based on their narrative and scenes. Streaming services such as Netflix, Hulu, and Paramount+ organize their recommended and recently released shows and movies primarily by genre, while the main recommendation algorithm promotes titles based on the movie a user watched previously. The question these services ask is: which genre groups are best to promote to users when they open the service on their personal devices? The answer should help streaming services improve their current guide algorithm so that it shows movies from the most suitable genre group first and users spend less time scrolling.

Network data on movies, including their main and side genres, ratings, gross value, and actors, can be grouped into genre clusters that narrow the catalog down to small groups of movies with popular genres, which can be displayed first when a user loads a streaming service. The data was collected from Kaggle and contains movies up to 2022.

Using the main genres, sub-genres, and movie titles, the code applies K-Means clustering from scikit-learn (sklearn). A model is fit so that each cluster represents a group of genres in the dataset, and a data frame is built to count how many movies fall into each cluster. The fit method learns the genre and movie combinations from the dataset’s training data, after which the fitted model provides an array of integer labels, one per movie, indicating the cluster that movie belongs to (available through labels_ or fit_predict). The “groupby” method then groups the movies by those labels.
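The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the original code: the toy rows, the column names (title, main_genre, side_genre), the multi-hot encoding of genres, and the small n_clusters value are all assumptions made so the example runs on its own.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy rows; the real Kaggle column names may differ.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "main_genre": ["Thriller", "Thriller", "Comedy", "Comedy", "Action", "Action"],
    "side_genre": ["Crime, Drama", "Mystery", "Romance", "Drama", "Adventure", "Sci-Fi"],
})

# Combine each movie's main genre with its comma-separated side genres.
genre_sets = [
    [main] + [g.strip() for g in side.split(",") if g.strip()]
    for main, side in zip(movies["main_genre"], movies["side_genre"])
]

# One 0/1 column per genre (assumed encoding, chosen for illustration).
encoder = MultiLabelBinarizer()
features = encoder.fit_transform(genre_sets)

# n_clusters is 3 only because the toy data is tiny; the article uses K = 13.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
movies["cluster"] = model.fit_predict(features)  # one integer label per movie

# Count how many movies landed in each cluster.
cluster_sizes = movies.groupby("cluster")["title"].count()
print(cluster_sizes)
```

Because K-Means starts from random centroids, fixing random_state keeps the cluster labels reproducible between runs; the label numbers themselves carry no meaning beyond identifying a group.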

The primary movie genres include action, comedy, drama, western, thriller, horror, fantasy, mystery, romance, sports, adventure, and sci-fi, which is why K is set to 13, the number of main genres within movies. I believe each cluster represents one of these main genres, such as action or comedy, and contains the top movies that fit that genre. For example, the thriller cluster, cluster 0, contains the movies The Whole Truth and Nightcrawler.

Popular Movie Genres and their box office value from 2022

Based on the data collected, the genre cluster with the most movies is the thriller cluster, which suggests that users on Netflix, Hulu, and Paramount+ watch more thriller movies than any other genre. Collecting both the main and sub-genres of each movie provides a useful quantitative count of how many movies contain a specific genre.

Categories of Movie/TV Show Genres from different Streaming Services

The dataset is missing an ID for each movie, which would make movies easier to identify in the cluster model. Also, the final output displays only 10 clusters, because a ValueError is raised while processing the 10th cluster: a sample larger than the population cannot be taken when replace=False. This is a potential source of bias, since the remaining clusters never appear in the output. There is another bias in the dataset: because new streaming-exclusive movies become available every month, comparing the popularity of newly released movies with that of movies that have spent more time in the catalog results in an unbalanced ratio.
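One possible way around the sampling error is to cap the sample size at the size of each cluster. The sketch below assumes the example titles are drawn with pandas' DataFrame.sample; the data frame, the column names, and the sample_titles helper are hypothetical.

```python
import pandas as pd

# Hypothetical clustered data: cluster 1 has fewer movies than the sample size.
clustered = pd.DataFrame({
    "title": [f"Movie {i}" for i in range(7)],
    "cluster": [0, 0, 0, 0, 0, 1, 1],
})

def sample_titles(group, n=3, seed=0):
    """Sample up to n titles without replacement, never more than the group holds."""
    k = min(n, len(group))  # avoids the "larger sample than population" ValueError
    return group.sample(n=k, replace=False, random_state=seed)["title"].tolist()

# Build one list of example titles per cluster.
examples = {
    label: sample_titles(group)
    for label, group in clustered.groupby("cluster")
}
print(examples)
```

Capping the sample keeps every cluster in the output, at the cost of showing fewer example titles for small clusters.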

Movies with multiple genre combinations that are popular

Bugs others may encounter include key errors on the movie title column, since a movie can be missing when the matrix and the cluster model are created. To prevent this, I used a for loop over the movie title column to ensure each movie is read from the dataset and transferred into the matrix. There is also a bug where, if the code that builds the movie title list for each cluster has no else clause, another key error is raised for the movie title.
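The pattern described above might look something like the following sketch. It is an assumption about the shape of the code rather than the original: the titles, the genre lists, and the cluster_titles dictionary are made up, and the if/else guard stands in for the else clause mentioned in the text.

```python
import pandas as pd

# Hypothetical movies with their combined genre lists.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": [["Thriller", "Crime"], ["Thriller", "Drama"], ["Comedy"]],
})

# Build a title-by-genre matrix of zeros, then fill it with a for loop
# over the title column so every movie in the dataset gets a row.
all_genres = sorted({g for genre_list in movies["genres"] for g in genre_list})
matrix = pd.DataFrame(0, index=movies["title"], columns=all_genres)
for title, genre_list in zip(movies["title"], movies["genres"]):
    for genre in genre_list:
        matrix.loc[title, genre] = 1

# Hypothetical cluster membership that includes a title absent from the matrix.
cluster_titles = {0: ["Movie A", "Movie B", "Missing Movie"]}

found, skipped = [], []
for label, titles in cluster_titles.items():
    for title in titles:
        if title in matrix.index:
            found.append((label, title, int(matrix.loc[title].sum())))
        else:
            # Without this branch, matrix.loc[title] would raise a KeyError.
            skipped.append(title)

print(found)
print(skipped)
```

Collecting the skipped titles instead of silently dropping them makes it easier to see which movies never reached the matrix.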
