The Power of Recommendation Systems Applied to Streaming Video

Pedro Flores
Machine Learning Reply DACH
11 min read · Nov 21, 2022
Photo by Kelly Sikkema on Unsplash

Nowadays, every online service or app on the market tracks and stores every user interaction in the form of data. This data is used for targeted ads, improved recommendations, or statistical analysis, among many other purposes. In general, this collected information is available to any individual who requests a copy of it.

This article will analyze the data gathered by a streaming video platform, reveal some insights about it, and later use a database of movies to recommend new titles while applying different methodologies.

The approach followed comprises these three points:

I. Understanding the Data Available

Here the available files are enumerated and explained for a better understanding of the data that will be used. Finding and treating missing values is also discussed.

II. Analysis and Visualization

In this section, the analysis is carried out with the help of some visualizations.

III. Recommendation Systems

Finally, the different recommendation systems are described and applied to find titles similar to what the user likes to watch.

1. Understanding the Data Available

Streaming Video Platforms allow active users to get their data from the account page. By clicking on the option to download personal information, anyone can ask the platform to email them their data. Once the request is processed, a folder containing several subfolders is received, such as IP_ADDRESSES, PAYMENT_AND_BILLING and CONTENT_INTERACTION:

Data available from the streaming video platform

Inside this last folder is the file of interest, ViewingActivity.csv, which contains the data the Streaming Video Platform holds about us.

Overview of information inside ViewingActivity file

As can be seen, Profile Name, Duration, Title, Device Type and Country among other categories are available. Analyzing this information, it is possible to get insights like Average Watching Time, Distribution of Viewing Time by Day or Hour and so on.

As a way to enrich the information available, Cinemagoer will be used. Cinemagoer is a Python package for retrieving and managing the data of the popular site IMDb, a movie database. This way it is feasible to also have access to data such as genre, original title, or country where the movie was produced. To access the data, the following code can be used:

from imdb import Cinemagoer  # Cinemagoer (formerly IMDbPY) is imported from the "imdb" package

movie_title = "The Adam Project"  # example title from the viewing history
ia = Cinemagoer()
search_result = ia.search_movie(movie_title)  # search IMDb and keep the best match
movie_id = search_result[0].getID()
movie = ia.get_movie(movie_id)  # full record: plot, genres, country, year, ...

First, it is necessary to create an instance of the Cinemagoer class. Then, the search_movie method is used for searching the desired movie_title. Once this data is collected, it is possible to get the id of the title, and with this id, information like plot, country and year is retrieved. For instance, for the movie The Adam Project, this is the information available after using Cinemagoer:

Additional information retrieved with Cinemagoer

Finally, the MovieLens Dataset will be used for making recommendations. It describes ratings and free-text tagging activity from more than 100,000 users between 1995 and 2015. Each user is represented by an id, and the file movies_metadata.csv contains the information needed.

Data manipulation and cleaning

Now that the available data has been introduced, it is time to start transforming it as necessary. The focus will be, on the one hand, on looking for missing values that may affect the analysis and recommendations, and on the other hand, on anonymizing the users in the dataset.

First, an exploration will be carried out to find missing values in ViewingActivity and the MovieLens Dataset:

The columns Attributes, Supplemental Video Type, belongs_to_collection, homepage and overview contain cells with no values. Of these, the only one that will be used is overview, which holds the plot of every film in the MovieLens Dataset. This data will be especially useful later for recommending similar titles based on their plots, so no missing values can be tolerated here. Therefore, rows with no data in overview will be deleted.
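As a rough illustration, a minimal pandas sketch of this step could look as follows; the DataFrame names viewing_df and movies_df, and the file paths, are hypothetical, assuming ViewingActivity.csv and movies_metadata.csv have been loaded with pandas:

import pandas as pd

# Hypothetical file locations; adjust to where the exports were saved
viewing_df = pd.read_csv("CONTENT_INTERACTION/ViewingActivity.csv")
movies_df = pd.read_csv("movies_metadata.csv", low_memory=False)

# Count missing values per column to see which ones need treatment
print(viewing_df.isna().sum())
print(movies_df.isna().sum())

# Only the plot in "overview" is needed later, so drop rows without it
movies_df = movies_df.dropna(subset=["overview"])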

Regarding user anonymization, the only necessary transformation is to replace the profile names in the dataset with generic ones; User 1, User 2 and so on will be used for this purpose:

User anonymization
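A minimal sketch of this anonymization step, assuming the profile names live in a column called Profile Name (as shown in the overview above) and that viewing_df is the hypothetical DataFrame from the previous sketch:

# Map each real profile name to a generic "User <n>" label
profiles = viewing_df["Profile Name"].unique()
anonymous_names = {name: f"User {i + 1}" for i, name in enumerate(profiles)}
viewing_df["Profile Name"] = viewing_df["Profile Name"].map(anonymous_names)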

2. Analysis and Visualization

Once all the necessary information has been collected, it is possible to use it for the analysis. The focus will be on analyzing the viewing activity to understand how users spend their time consuming content.

A good starting point is the distribution of viewing time, since consumption can be broken down by day of the week.

From the picture above, users tend to spend an average of less than 24 minutes (0.4 hours) per day watching content, with a slight increase during the weekend, likely due to the availability of more free time.
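A sketch of how this distribution could be computed, assuming the export contains Start Time and Duration columns (the Start Time column name is an assumption based on typical viewing-history exports) and reusing the hypothetical viewing_df:

import pandas as pd

# Parse timestamps and durations
viewing_df["Start Time"] = pd.to_datetime(viewing_df["Start Time"], utc=True)
viewing_df["Duration"] = pd.to_timedelta(viewing_df["Duration"])

# Average hours watched per day, grouped by day of the week
daily = (viewing_df
         .assign(date=viewing_df["Start Time"].dt.date,
                 weekday=viewing_df["Start Time"].dt.day_name(),
                 hours=viewing_df["Duration"].dt.total_seconds() / 3600)
         .groupby(["weekday", "date"])["hours"].sum()   # hours per calendar day
         .groupby("weekday").mean())                    # average per weekday
print(daily)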

Another interesting feature to investigate is the country from which the platform was accessed. This can give an idea of where the users live and from where they typically use the service.

As can be seen in the picture above, most accesses to the platform are made from Spain and Germany, with a marginal presence of other countries that could suggest short trips or the use of a VPN to access the service.
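This breakdown can be obtained with a simple value count over the Country column (again using the hypothetical viewing_df):

# Number of viewing events per country of access
country_counts = viewing_df["Country"].value_counts()
print(country_counts)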

Finally, looking at the heatmap of the viewing activity gives an idea of the days and times when User 1 is most likely to use the service. As the distribution of viewing time already showed, weekends see the most use of the service, but here it can also be seen that the period between 17:00 and 21:00 is the preferred time slot for consuming content on the video platform.

Sum of content viewed by User 1 in each time slot
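A minimal sketch of such a heatmap using seaborn, assuming the columns introduced above (Profile Name, Duration) plus the assumed Start Time column:

import seaborn as sns
import matplotlib.pyplot as plt

user1 = viewing_df[viewing_df["Profile Name"] == "User 1"].copy()
user1["weekday"] = user1["Start Time"].dt.day_name()
user1["hour"] = user1["Start Time"].dt.hour
user1["hours_watched"] = user1["Duration"].dt.total_seconds() / 3600

# Total hours watched in each (weekday, hour) slot
heatmap_data = user1.pivot_table(index="weekday", columns="hour",
                                 values="hours_watched", aggfunc="sum", fill_value=0)
sns.heatmap(heatmap_data, cmap="viridis")
plt.show()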

3. Recommendation Systems

Recommendation systems are one of the most popular applications of data science. They are used to predict the “rating” that a user would give to an item and are employed by retail companies for product suggestions to customers, by social networks to recommend pages and people to follow, and by streaming platforms to recommend content, among other applications.

Recommendation systems can be grouped into 4 categories: Simple Recommendations and Suggestions, Content-Based Recommendations, Collaborative Filtering and Matrix Factorization.

In the following sections, the mentioned recommenders will be explained and applied to the MovieLens Dataset to produce a list of recommendations for User 1 with each of them.

3.1 Simple Recommendations and Suggestions

This approach is based on the popularity of movies; the underlying concept is that the most popular titles have a higher chance of being liked by the average audience. Using this idea and counting the number of reviews per title in the dataset, the following table is obtained:

However, just because a movie has been watched by a lot of people does not necessarily mean that viewers enjoyed it. To understand how a viewer actually felt about a movie, more explicit data is useful. To do so, the average rating of each movie in the MovieLens Dataset is calculated followed by sorting the result to find the movies with the highest average rating:

The last two methods have their weaknesses. Finding the most frequently watched movies shows what has been watched, but not how people explicitly feel about it. Averaging the ratings has the opposite problem: there is explicit customer feedback, but titles rated by only a handful of viewers skew the results. It is now time to combine the two previous methods and compute the average rating only for movies that have been reviewed more than a threshold number of times, defined here as forty thousand. The titles obtained this way are more reliable because they are based on both popularity and feedback:
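A minimal pandas sketch of this combined ranking, assuming the MovieLens ratings have been merged with the movie titles into a hypothetical ratings_df with title and rating columns:

import pandas as pd

# Number of ratings and average rating per title
movie_stats = ratings_df.groupby("title")["rating"].agg(["count", "mean"])

# Keep only frequently rated titles and sort by average rating
popular = movie_stats[movie_stats["count"] > 40_000]
print(popular.sort_values("mean", ascending=False).head(10))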

One last method within Simple Recommendations is to identify pairs of films that have been watched and rated together by computing all possible ordered pairs of titles per user (the permutations). Once this is done, the next step is to count how often each pair of titles appears and sort the list from highest to lowest, as shown in the sketch below.
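A sketch of this pairing step, again assuming the hypothetical ratings_df, here with userId and title columns:

from itertools import permutations
import pandas as pd

def title_pairs(titles):
    # All ordered pairs of titles rated by the same user
    return pd.DataFrame(list(permutations(titles, 2)),
                        columns=["movie_a", "movie_b"])

pairs = ratings_df.groupby("userId")["title"].apply(title_pairs).reset_index(drop=True)

# How often each pair of titles was watched and rated together
pair_counts = (pairs.groupby(["movie_a", "movie_b"])
                    .size()
                    .sort_values(ascending=False))
# pair_counts can then be filtered for a seed title to get recommendations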

Since there is data available on titles watched by User 1, it is possible to make recommendations for specific titles based on the list calculated with the permutations. Applying this method to Batman Begins, the following titles are recommended for User 1:

As can be seen, some well-known films are among the suggestions.

3.2 Content-Based Recommendations

Content-based filtering uses item features to recommend other items similar to what the user likes, based on their previous actions or explicit feedback. The underlying concept is that when a user likes a particular item, he or she will also like an item that has similar attributes.

Recommender based on title plot

In this first content-based recommender, a Natural Language Processing problem will be solved. The aim here is, given a title that User 1 has watched before, to recommend other films from the MovieLens Dataset that have a similar plot and that the user will therefore probably like.

First, LanguageDetector() (from the spaCy ecosystem) is applied to keep only the films that have an overview in English. Then, because all films with the same title share the same plot, the rows with repeated titles can be deleted, leaving a table with unique titles and plot descriptions. This way, the processing time of the NLP step will be lower. Once this is done, a total of 2,769 unique titles are ready to be analyzed.

Cosine similarity will be applied to the film plots to find titles similar to Batman Begins (which User 1 has already watched). After applying it to all plots and ordering the results from highest to lowest similarity score, the following list of recommendations is obtained:

As expected, all films recommended are remarkably similar to the one selected to make recommendations from.
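A sketch of how this plot-based recommender could be implemented; the use of TF-IDF vectors is an assumption (the article does not state how the plots were vectorized), and plots_df is a hypothetical DataFrame holding one row per unique title with its English overview:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Turn each plot into a TF-IDF vector, ignoring common English stop words
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(plots_df["overview"])

# Cosine similarity of every plot against every other plot
similarity = cosine_similarity(tfidf_matrix)
similarity_df = pd.DataFrame(similarity,
                             index=plots_df["title"],
                             columns=plots_df["title"])

# Titles whose plots are most similar to Batman Begins
print(similarity_df["Batman Begins"].sort_values(ascending=False).head(10))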

Recommender based on Jaccard Index

In this other content-based recommender, the metric used to measure similarity between items is the Jaccard similarity index. It is the number of attributes that two items have in common divided by the total number of their combined attributes.

Jaccard similarity for two sets: J(A, B) = |A ∩ B| / |A ∪ B|

This score will always be between 0 and 1 and the more attributes the two items have in common, the higher the score.

To get all these similarities at once, a helpful function from the SciPy package will be used. pdist (short for pairwise distance) finds all the distances at once, using Jaccard as the metric argument. Note that pdist calculates the Jaccard distance, which is a measure of how different rows are from each other. As the aim is to calculate its complement, the similarity, these values will be subtracted from 1.
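A sketch of this computation, assuming movie_genres is a hypothetical DataFrame of one-hot encoded genre columns indexed by film title:

import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Pairwise Jaccard distances between all rows (films)
jaccard_distances = pdist(movie_genres.values, metric="jaccard")

# Turn the condensed distance vector into a square matrix and take the complement
jaccard_similarity = 1 - squareform(jaccard_distances)
similarity_df = pd.DataFrame(jaccard_similarity,
                             index=movie_genres.index,
                             columns=movie_genres.index)
# similarity_df["The Terminator"] then holds the scores shown in the table below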

Applying this methodology to the genre attributes of all films and wrapping the results in a DataFrame, as in the sketch above, makes it easy to find similar titles by looking up any pairing. For example, for the film The Terminator, the following results are obtained:

3.3 Collaborative Filtering

Collaborative filtering uses similarities between users and items simultaneously to provide recommendations. This allows for serendipitous recommendations; that is, collaborative filtering models can recommend an item to user A based on the interests of a similar user B.

The approach starts by creating a table that collects the data on userId, titles and ratings. Each user's ratings are then centered around 0 by subtracting that user's average rating. Once this is done, cosine similarity is applied to the transposed matrix to find the similarity between items. This yields a square matrix in which each row and column represents a film title and each cell contains the similarity score for a pair of titles; the diagonal therefore has the value 1.
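A sketch of this pipeline, reusing the hypothetical ratings_df with userId, title and rating columns:

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Users as rows, titles as columns, ratings as values
ratings_pivot = ratings_df.pivot_table(index="userId", columns="title", values="rating")

# Center each user's ratings around 0 and fill the remaining gaps with 0
ratings_centered = ratings_pivot.sub(ratings_pivot.mean(axis=1), axis=0).fillna(0)

# Transpose so rows are titles, then compute item-item cosine similarity
item_similarity = cosine_similarity(ratings_centered.T)
item_similarity_df = pd.DataFrame(item_similarity,
                                  index=ratings_pivot.columns,
                                  columns=ratings_pivot.columns)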

With this information it is possible to select a title, for example Jurassic Park III, sort the values in descending order and obtain a list of similar titles:

3.4 Matrix Factorization

In this last method the basic idea is to decompose the user-rating matrix into the product of two lower dimensionality matrices.

A common challenge with real-world ratings data is that most users will not have rated most items, and most items will only have been rated by a small number of users. This results in a very empty, or sparse, DataFrame. To deal with this situation, matrix factorization is applied. Factors can be found as long as there is at least one value in every row and column; in other words, every user has given at least one rating, and every item has been rated at least once.

There are many ways to find the factors of a matrix, but here a technique called singular value decomposition (SVD) will be used. Before applying this, it is necessary to normalize the data by subtracting each row’s mean from each value in that row and filling in the remaining empty values with 0.

Like any matrix factorization approach, singular value decomposition finds factors for some matrix and in this case the following components are obtained:

  • U is the user matrix
  • V transpose represents the item features
  • Sigma is a diagonal matrix that can be thought of as the weights of the latent features, or how large an impact each of them has

Finally, once the three matrices are determined, it is possible to multiply them back together and add back the average rating of each row to obtain a complete matrix of ratings without missing values. By sorting the resulting matrix, it is possible to obtain the ratings that the user would have given to all films, ranked from highest to lowest, as depicted in the following table:
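A sketch of the full decomposition and reconstruction using NumPy, assuming ratings_pivot is the users x titles rating matrix built in the previous section:

import numpy as np
import pandas as pd

# Normalize: subtract each user's mean rating and fill missing values with 0
avg_ratings = ratings_pivot.mean(axis=1)
ratings_centered = ratings_pivot.sub(avg_ratings, axis=0).fillna(0)

# Singular value decomposition of the centered matrix
U, sigma, Vt = np.linalg.svd(ratings_centered.values, full_matrices=False)
sigma = np.diag(sigma)

# Multiply the factors back together and add the user averages back in
reconstructed = U @ sigma @ Vt + avg_ratings.values.reshape(-1, 1)
predicted_ratings = pd.DataFrame(reconstructed,
                                 index=ratings_pivot.index,
                                 columns=ratings_pivot.columns)

# Predicted ratings for one (hypothetical) user id, sorted from highest to lowest
user_id = 1
print(predicted_ratings.loc[user_id].sort_values(ascending=False).head(10))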

Conclusion

This blog post tried to give an overview of some common recommendation techniques used in the video streaming industry to make suggestions to users to discover new films. A basic explanation and a practical approach were also provided to understand how the results were obtained while presenting real recommendations based on what User 1 enjoyed watching.

Although the recommendations here were made for films, the same techniques have applications in many other fields, such as product recommendations, social media, home listings, music streaming and restaurants, among others.

At Machine Learning Reply, we guide and support all our customers in the development of their IT capabilities towards Machine Learning, data, or cloud Use Cases, regardless of their current phase.
