Applied data science: content recommendation systems

Jackson Taylor
7 min read · Apr 21, 2022

After spending the past semester learning about data science concepts in the context of linear algebra, I’ve become increasingly interested in how these mathematical methods are applied in the real world. One use case that has seen widespread adoption as of 2022 is content recommendation systems.

Motivations of recommendation systems

In the last 20 years, technology products have revolutionized the way that users create, discover, and consume content. Entertainment companies such as Netflix and Amazon have collected enormous film catalogs, far larger than any individual user could browse effectively. As video content has become much easier to produce, social media apps such as Instagram and TikTok have been flooded with even larger video catalogs produced by their users. Once again, this leads to a much larger selection of content than any user could feasibly discover. Furthermore, as traditional businesses move to a web-first model, a similar problem arises with massive online selections of physical products, digital goods, or any other discoverable content.

This problem of extremely large catalogs that users nonetheless want to browse effectively has motivated the development of recommendation systems as an application of data science. If one user who loves cat videos logs in to YouTube, and another user who prefers vlogs logs in, how can YouTube help each of them discover their preferred content as quickly as possible?

Data science can solve this problem!

Recommendation system approaches

Data scientists have developed a number of approaches for building modern, effective content recommendation systems, where effective means accurately predicting user behavior. Here, the focus will be on two methods: (1) content-based filtering, which recommends content based on the features of that content and the user’s preferences for those features, and (2) collaborative filtering, which learns deeper patterns from the user behavior data itself.

Regardless of which method we choose to apply, it is important to first understand the data and its meaning before we solve the problem.

Storing data

For the purposes of content recommendation we like to represent all users and content as a weighted graph where users are connected to content based on the way they interact with it. For example, on YouTube we can represent the observed user behavior as a graph where each user is connected to each of the videos they have watched.

It is important to understand that we can choose to assign the weights of this graph in any way we like, but they will likely use a combination of implicit features and explicit features. In the previous example, the connections between users and the videos they have seen represent an implicit feature, because the user’s positive rating of the content is implied by viewing it. An explicit feature, on the other hand, is one in which the user intentionally labels the data for us, such as by clicking the like button.

In practice, content recommendation systems will use many implicit features such as views, watch time, or session duration as well as several explicit features such as likes, comments, or ratings to determine the weight of connection between a user and content.
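As a rough illustration of how such signals might be blended into a single edge weight, here is a minimal sketch. The function name `interaction_weight`, its parameters, and the specific weighting rule are all hypothetical choices made for this example; real systems tune or learn these weights.

```python
def interaction_weight(watch_fraction, liked=False, disliked=False):
    """Blend one implicit and one explicit signal into a weight in [-1, 1].

    Hypothetical rule: an explicit like or dislike overrides everything,
    otherwise the implicit watch fraction gives a weaker positive weight.
    """
    if disliked:
        return -1.0
    if liked:
        return 1.0
    # Clamp the implicit signal to [0, 1], then scale it down so that
    # passive viewing never counts as much as an explicit like.
    clamped = max(0.0, min(1.0, watch_fraction))
    return 0.5 * clamped


weight_passive = interaction_weight(0.8)             # watched 80%, no click
weight_liked = interaction_weight(0.2, liked=True)   # explicit like wins
```

The design choice here is simply that explicit feedback is trusted more than implicit feedback, which is one reasonable assumption among many.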

For the sake of simplicity let’s imagine a graph representing a simplified version of YouTube’s user-content graph. In this case a -1 means the user has seen the video and disliked it, a 0 means the user has seen the video and did not engage with it, and a 1 means the user has seen the video and liked it.

Most users will not have seen or interacted with the majority of videos on YouTube, meaning the matrix representing this graph will be quite sparse. The goal of content recommendation is to predict these missing entries as floating point values, where a value near 1 indicates a video the user is most likely to be interested in and a value near -1 indicates one that is least interesting to this particular user.
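To make the matrix form concrete, here is a small sketch using NumPy. The three users, four videos, and their ratings are made up for illustration, and using NaN to mark unseen videos is just one convenient convention.

```python
import numpy as np

# Rows are users, columns are videos (all values are made up).
# -1 = disliked, 0 = seen without engagement, 1 = liked,
# np.nan = the user has never seen the video.
ratings = np.array([
    [1.0,    np.nan, -1.0,   np.nan],
    [np.nan, 0.0,    np.nan, 1.0],
    [1.0,    np.nan, np.nan, np.nan],
])

observed = ~np.isnan(ratings)      # True where an interaction exists
n_observed = int(observed.sum())   # number of known entries
sparsity = 1.0 - observed.mean()   # fraction of entries still unknown

print(n_observed, "observed entries out of", ratings.size)
print(f"{sparsity:.0%} of entries are missing")
```

Even in this toy example more than half of the entries are missing; on a real platform the fraction would be far higher.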

Content-based filtering

One of the simplest methods of making these predictions is content-based filtering. For this method we introduce a layer of features between the users and content, allowing for more possible connections and improving prediction accuracy.

For example, we can modify our previous graph to use content-based filtering by first assigning categories to each of the videos. Using categories such as “cat videos” and “vlogs” we can generate a mapping between categories and content. Some videos may explicitly be “cat videos” while others might be more of a mixture, as in the case of a vlogger who shows their cats in the video. Similarly, we can generate a mapping between users and categories based on each user’s preferences.

We can then use this additional layer between the users and the content to make predictions about the unobserved entries in our original matrix. To do this, we multiply the user-to-category matrix by the category-to-content matrix. The product represents our recommended content graph.
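The multiplication described above can be sketched in a few lines of NumPy. The two users, three videos, and every number below are invented for illustration, and the names `user_prefs` and `content_cats` are hypothetical.

```python
import numpy as np

# Hypothetical categories: index 0 = "cat videos", index 1 = "vlogs".
# user_prefs[u, c]: how much user u likes category c, in [-1, 1].
user_prefs = np.array([
    [1.0, -0.5],    # loves cat videos, mildly dislikes vlogs
    [-1.0, 1.0],    # the opposite taste
])

# content_cats[c, v]: how strongly video v belongs to category c, in [0, 1].
content_cats = np.array([
    [1.0, 0.0, 0.5],    # cat-video share of each video
    [0.0, 1.0, 0.5],    # vlog share of each video
])

# (users x categories) @ (categories x videos) -> (users x videos)
predicted = user_prefs @ content_cats

# Index of the highest-scoring video for each user.
top_video = predicted.argmax(axis=1)
print(predicted)
print(top_video)
```

Here the third video is the vlogger-with-cats mixture: it lands between the two pure videos for both users, which is what the mapping is meant to capture.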

While this is a valid implementation of a recommendation system, there are many drawbacks to this approach, including: (1) large data requirements, (2) label inaccuracy and (3) not accounting for emergent patterns in the data.

In order to properly implement content-based filtering, it is necessary to first have collected data about the categories represented in content and the categorical preferences of users. The first issue that presents itself is the scale of the data collection necessary to accomplish this. For extremely large video libraries, it may be possible for uploaders to define a broad category, but more nuanced categories would be difficult to collect data for. Furthermore, users would be required to manually enter a painfully large amount of preference data, which is inherently prone to human error because people often fail to describe their own preferences accurately.

The final major issue that can emerge when implementing content-based filtering is the limited number of features. If each feature is a manually labeled category, there is a natural limit to what humans are able to observe easily and spend time labeling. Most datasets contain deeper underlying patterns that would be unlikely to be noticed and classified by a human.

Features vs. latent features

This kind of emergent pattern is called a latent feature. Typically these are discovered using machine learning methods, particularly dimensionality reduction methods. Using latent features in recommendation systems opens up new methods that bring a number of benefits.

Collaborative filtering

One such method is known as collaborative filtering. This method works similarly to content-based filtering, but rather than defining a mapping which generates the predicted graph, the observed graph is used to generate the mapping.

The matrices we generate will map users to latent features and latent features to content. The method for generating this mapping uses a process known as matrix factorization. Matrix factorization is a minimization problem which aims to reduce the squared difference between values in the observed matrix and the predicted matrix. The primary methods for performing matrix factorization are stochastic gradient descent, which is more tunable, and weighted alternating least squares, which is designed specifically for collaborative filtering use cases and converges faster.
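As a rough sketch of the stochastic gradient descent variant, the code below fits two factor matrices to the observed entries only. The function name `factorize`, the hyperparameter defaults, the small regularization term, and the toy ratings are all assumptions made for this example rather than a reference implementation.

```python
import numpy as np

def factorize(ratings, k=2, epochs=500, lr=0.05, reg=0.01, seed=0):
    """Fit ratings ~= U @ V.T on the observed (non-NaN) entries with SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = rng.normal(scale=0.1, size=(n_users, k))   # users -> latent features
    V = rng.normal(scale=0.1, size=(n_items, k))   # content -> latent features
    rows, cols = np.nonzero(~np.isnan(ratings))
    for _ in range(epochs):
        # Visit the observed entries in a fresh random order each epoch.
        for idx in rng.permutation(len(rows)):
            i, j = rows[idx], cols[idx]
            err = ratings[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            # Gradient step on the squared error plus L2 regularization.
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V


# Made-up data: np.nan marks videos a user has not seen.
ratings = np.array([
    [1.0,    np.nan, -1.0,   np.nan],
    [np.nan, 0.0,    np.nan, 1.0],
    [1.0,    np.nan, np.nan, -1.0],
])
U, V = factorize(ratings)
predicted = U @ V.T                    # dense matrix with every entry filled in
mask = ~np.isnan(ratings)
mse = np.mean((predicted - ratings)[mask] ** 2)
print(predicted.round(2))
```

The product `U @ V.T` has a value in every position, so the entries that were NaN in `ratings` now hold the model's predictions.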

There is still one problem with this method. In order to make proper predictions, the unobserved entries must be taken into account. Otherwise the factorization is only fit to reproduce the observed data and has no incentive to produce sensible values anywhere else. This could be solved by setting all unobserved entries to 0 and applying SVD, but this method is ineffective in practice: because the observed matrix is typically quite sparse, treating every missing entry as a true 0 pulls the predictions toward zero. Instead, by using weighted matrix factorization we can add a tunable hyperparameter which assigns a weight to the unobserved entries.
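One way to write down the weighted objective is sketched below: squared error on the observed entries, plus a down-weighted penalty that treats unobserved entries as 0. The function name `weighted_loss` and the parameter name `w0` are assumptions for this example, as is the NaN convention for unobserved entries.

```python
import numpy as np

def weighted_loss(ratings, U, V, w0=0.1):
    """Weighted matrix factorization loss for the factors U and V.

    Observed entries contribute their full squared error. Unobserved
    entries (np.nan) are treated as 0 but only count with weight w0.
    """
    observed = ~np.isnan(ratings)
    predicted = U @ V.T
    # Error against the true value where one exists, 0 elsewhere.
    obs_err = np.where(observed, np.nan_to_num(ratings) - predicted, 0.0)
    # Distance from 0 for the entries we never observed.
    unobs_err = np.where(observed, 0.0, predicted)
    return (obs_err ** 2).sum() + w0 * (unobs_err ** 2).sum()


# Tiny made-up case: one user, two videos, one latent feature.
ratings = np.array([[1.0, np.nan]])
U = np.array([[1.0]])
V = np.array([[1.0], [2.0]])
loss = weighted_loss(ratings, U, V, w0=0.1)
```

With `w0=0` this reduces to fitting only the observed entries, and with `w0=1` it matches the fill-with-zeros approach, so the hyperparameter interpolates between the two extremes described above.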

Not only is collaborative filtering often a much better predictor of user behavior on content platforms, it also offers a form of data compression. The latent feature mappings generated by matrix factorization can approximately reconstruct the observed graph, and they are far smaller than a full user-by-content matrix: two matrices with one row per user and one row per piece of content, each only as wide as the number of latent features. Predictions can be served from these compact mappings alone, which can significantly reduce data storage requirements.

Conclusion

Content recommendation is a broad and important use case for data science. Both in its more primitive form of content-based filtering and the more accurate method of collaborative filtering, insightful data is made available to software engineers leading to an improved content discovery experience for users.
