The Art of Fabulous Recommendations: Recommender Systems

Co-Authored by Liam Manatt

From Netflix to Facebook, nearly every online platform we use today employs some form of recommender system, which predicts what an individual may like based on location, trends, and, most importantly, their past interactions. One of the most visible uses is media recommendation. Netflix, for example, uses a recommender system to display a match percentage for each title.

Figure 1: Netflix Recommender Example

As seen in the figure above, Netflix’s recommendation algorithm puts Bird Box at a 97% match with the user, indicating that Netflix finds the movie to be a great fit for that user. Netflix and other online platforms have used recommender systems since the early 2000s, but the search for better ones intensified in 2006, when Netflix offered a one-million-dollar prize for an algorithm 10% more accurate than its “Cinematch” algorithm. With updated algorithms, recommender systems began to use data from other users to recommend media to a particular user.

Data

Data associated with recommender systems is generally found as either a bipartite, weighted graph or as the corresponding adjacency matrix. An edge between a user and a movie exists if the user has rated the movie. The weight of an edge represents how high the rating was. See the below bipartite graph with edge weight shown as the width of the line connecting a user and a movie (a thinner, lighter line represents a lower rating):
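A graph like this can be sketched in code. The following is a minimal illustration (all user names, movie names, and ratings are invented): each user maps to the movies they rated, so an edge exists only where a rating was given.

```python
# A small ratings graph stored as a dict of dicts: outer keys are users,
# inner keys are movies, and values are edge weights (ratings).
# Only edges that exist (movies the user actually rated) are stored.
ratings = {
    "A": {"movie1": 5, "movie3": 2},
    "B": {"movie2": 4},
    "C": {"movie1": 1, "movie2": 3, "movie4": 5},
}

def rating(user, movie):
    """Return the edge weight, or None if the user never rated the movie."""
    return ratings.get(user, {}).get(movie)

print(rating("A", "movie1"))  # 5
print(rating("B", "movie1"))  # None -- no edge in the bipartite graph
```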

Figure 2: Bipartite Recommender Graph Example

Though it may be hard to decipher at first glance, this bipartite graph is incomplete. That is, not every node on the left is connected to every node on the right; in essence, not every user assigned a rating to every movie. In the real world, such graphs are quite sparse, because websites like Netflix maintain a huge catalog of movies and we cannot expect each user to watch every one of them. This incompleteness is easier to see in the above graph’s adjacency matrix:

Figure 3: Adjacency Matrix Corresponding to Figure 2

The highlighted ?’s show the movies that a user didn’t rate. For example, user B did not rate movie 1. We can also more clearly quantify each rating; for example, user E rated movie 6 a 5.

The biggest weakness of these representations is that they take a lot of memory. In a sparse dataset, many of the entries in the adjacency matrix will be empty (?’s in this case). If there are u users and m movies in the dataset, we would have to allocate a u ⨉ m matrix (u·m entries), with most of the entries empty. From a graphical perspective, a complete bipartite graph has u·m edges, but practically, a recommender systems graph will have far fewer (<< u·m) edges. For example, say a small streaming service has 100 users and 100 movies. If each user rates, on average, 3 movies, our adjacency matrix will have 300 non-empty cells compared to 9,700 empty cells. Each user would have to rate around 50 movies just for the matrix to be half-full. There are more space-efficient ways to store such data, but that is a different conversation.
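The 300-versus-9,700 arithmetic above can be checked directly. The sketch below uses the same assumed numbers (100 users, 100 movies, 3 ratings per user) and contrasts a dense matrix, where every cell is allocated, with a sparse dict that stores only the ratings that exist:

```python
import random

random.seed(0)
n_users, n_movies, per_user = 100, 100, 3

# Dense representation: u * m cells, most of them empty (None plays the "?").
dense = [[None] * n_movies for _ in range(n_users)]
# Sparse representation: only store (user, movie) -> rating entries.
sparse = {}
for u in range(n_users):
    # Each user rates `per_user` distinct movies with a random 1-5 rating.
    for m in random.sample(range(n_movies), per_user):
        r = random.randint(1, 5)
        dense[u][m] = r
        sparse[(u, m)] = r

filled = sum(cell is not None for row in dense for cell in row)
print(f"dense cells: {n_users * n_movies}, non-empty: {filled}")  # 10000 vs 300
print(f"sparse entries stored: {len(sparse)}")                    # 300
```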

Content Filtering

The most obvious approach for a recommender system is Content Filtering. In content filtering, the algorithm combines rating data from the user with attribute data from the movie to measure how similar the movie is to the user’s preferences. In its most basic form, the user’s data comes either from the user’s previous behavior or from a preferences survey. With a preferences survey, the data is represented as a vector, with each entry holding a value from, say, 0–5, indicating how much the user enjoys a particular attribute of a movie. Taking the dot product of this preference vector with a generated movie attribute vector yields a score representing the similarity of the movie to the user. However, this setup is unrealistic, as it would be nearly impossible to get users to fill out a survey long enough to make the algorithm accurate. The solution is to use the user’s movie ratings together with each movie’s attribute vector to work backward to an estimated preference vector. To achieve this, the algorithm uses a user-movie rating matrix. With u users and m movies, the matrix would be m ⨉ u. The figure below shows an example matrix.

Figure 4: Content Based Recommender Systems Example

In this simple example, the algorithm assumes there are only 2 attributes for every movie. The movie attribute matrix, shown on the right, is then m ⨉ 2. The algorithm assumes that the value at position (i, j) is the dot product of user j’s preference vector and movie i’s attribute vector. The objective of the algorithm is to learn the preference vector for every user. We can measure the error of each preference vector via the Mean Squared Error (MSE) formula. For the MSE, we must define a few more variables: let r(i, j) = 1 if user j has rated movie i, let Y(i, j) be the rating of user j on movie i, let U be the parameter (preference) vector, and let M be the movie feature vector. The prediction is then (U⁽ʲ⁾)ᵀ(Mᵢ), and we define z to be the number of movies rated. We can find a minimum of the MSE using a gradient descent update. This update would then be:

U⁽ʲ⁾ := U⁽ʲ⁾ − (𝛼/z) Σ_{i : r(i, j) = 1} ((U⁽ʲ⁾)ᵀMᵢ − Y(i, j)) Mᵢ

where 𝛼 is the gradient descent learning rate.

Once this descent has converged, we will have found an optimal set of parameters for each user. However, this is an antiquated system with a number of flaws that will be discussed in the next section.
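The content-filtering descent above can be sketched in a few lines of NumPy. This is a toy illustration, not a production implementation: the movie feature matrix M, the ratings Y, and the mask r are all invented numbers, and the update is the per-user gradient step described above, applied to all users at once.

```python
import numpy as np

# Known movie feature vectors (this is what content filtering assumes we have).
M = np.array([[0.9, 0.1],   # movie 0: mostly attribute 1
              [0.2, 0.8],   # movie 1: mostly attribute 2
              [0.5, 0.5]])  # movie 2: an even mix
Y = np.array([[5.0, 1.0],   # Y[i, j] = rating of user j on movie i
              [1.0, 5.0],
              [0.0, 0.0]])  # movie 2 is unrated by both users
r = np.array([[1, 1],
              [1, 1],
              [0, 0]])      # r[i, j] = 1 iff user j rated movie i

alpha = 0.1
U = np.zeros((Y.shape[1], M.shape[1]))  # one preference vector per user

for _ in range(2000):
    pred = M @ U.T                 # pred[i, j] = (U^(j))^T M_i
    err = (pred - Y) * r           # only rated entries contribute
    # Per-user gradient step, dividing by z = number of movies each user rated.
    U -= alpha * (err.T @ M) / r.sum(axis=0, keepdims=True).T

# Predict the missing ratings for movie 2 from the learned preferences.
print((M @ U.T)[2])
```

After convergence, the learned preference vectors reproduce the observed ratings and extrapolate to the unrated movie.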

Collaborative Filtering

A major concern with the content filtering approach, as mentioned before, is that in order to generate predictions we must have data on the attributes of each movie and the specific preferences a user has for those attributes. Obtaining some attributes of a movie (e.g. the degree to which it is comedy, drama, action, etc.) can be extremely tedious because it requires someone to watch the movie from start to finish, and this has to be done for each movie in the dataset. Similarly, measuring a user’s movie preferences requires that person to fill out a form about what types of movies they like and don’t like. This process is largely arbitrary because, in many cases, human behavior is unquantifiable.

Thus, collaborative filtering starts with what is directly quantifiable — the rating that a user gives a movie once they have finished watching it. The goal of collaborative filtering, like that of content filtering, is to predict the ratings for movies that a user did not watch. In short, the approach is to reverse-engineer both the user-movie preference data and the attribute data for each movie. This way the only data needed is the ratings that users gave the movies they actually watched, which, by nature, is simple to collect.

Reconsider the ratings data example from before. Here it is again:

Figure 3 (Repeat)

Recall that in content filtering, we had an n ⨉ t matrix P and a t ⨉ m matrix A, where n is the number of users, m is the number of movies, and t is the number of attributes/characteristics for each movie. P contained the preferences of each user and A contained the attributes of each movie. Due to the nature of content filtering, these matrices were given as parameters to the algorithm, as we already knew this data. The matrix R = P ⨉ A contained the estimated rating that each user would give each movie. However, in collaborative filtering, we do not have P and A; we want to reverse-engineer them from R.
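The forward direction — the one content filtering relied on — is just a matrix product. The sketch below uses invented numbers for a 2-user, 3-movie, 2-attribute example: with P and A known, every entry of PA is a predicted rating. Collaborative filtering runs this in reverse.

```python
import numpy as np

P = np.array([[5.0, 1.0],   # user 0: loves attribute 1, not attribute 2
              [0.0, 4.0]])  # user 1: the opposite
A = np.array([[1.0, 0.2, 0.0],   # how strongly each of 3 movies
              [0.0, 0.9, 1.0]])  # exhibits each of the 2 attributes

# R_est[i, j] = predicted rating of user i on movie j
R_est = P @ A
print(R_est)  # [[5.  1.9 1. ]
              #  [0.  3.6 4. ]]
```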

Figure 5: Basis of Collaborative Filtering

Side Note: Notice how the columns/rows in P and A respectively are no longer named “drama” and “action” (instead they are P₁/P₂ and A₁/A₂). This is because we are no longer defining what exactly P and A measure. The resulting columns/rows, and the values in each, are arbitrary features found through the mathematical properties of the matrix R.

Formally, the mathematical problem statement is:

Given an n ⨉ m matrix R, find the best possible n ⨉ t matrix P and the best possible t ⨉ m matrix A such that PA best estimates R.

Due to the nature of the problem statement, collaborative filtering can be seen simply as an optimization problem, so we have to define what “best estimates” means. We can define a mean squared loss function that computes the difference between an element (PA)ᵢⱼ and the corresponding Rᵢⱼ, summed only over the entries of R that actually exist (the observed ratings). Define the corresponding MSE:

MSE = (1/z) Σ_{(i, j) : Rᵢⱼ exists} ((PA)ᵢⱼ − Rᵢⱼ)²

where z is the number of observed ratings.
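The key detail in this loss is the mask: missing entries of R (the ?’s) must not contribute. A minimal sketch, with all matrix values invented for illustration and NaN standing in for the ?’s:

```python
import numpy as np

R = np.array([[5.0, np.nan],
              [np.nan, 3.0],
              [4.0, 2.0]])        # nan marks a missing rating ("?")
P = np.array([[1.0, 0.5],
              [0.2, 1.0],
              [0.8, 0.4]])        # n x t user matrix (some initial guess)
A = np.array([[4.0, 1.0],
              [2.0, 2.0]])        # t x m movie matrix (some initial guess)

def masked_mse(R, P, A):
    """Mean squared error of PA against R, over observed entries only."""
    mask = ~np.isnan(R)
    diff = (P @ A)[mask] - R[mask]
    return np.mean(diff ** 2)

print(masked_mse(R, P, A))  # ~0.2 for these toy matrices
```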

In any optimization problem, once a loss function is defined, the next step is to minimize it with respect to the desired parameters. A typical way in machine learning to minimize a loss function is gradient descent, which iteratively converges toward a minimum of a given function. First, initialize both P and A to contain random values. Gradient descent then updates each element of the matrices A and P as follows (aᵢⱼ⁽ᵏ⁾ represents the element at row i, column j of A at iteration k of gradient descent):

aᵢⱼ⁽ᵏ⁺¹⁾ = aᵢⱼ⁽ᵏ⁾ − 𝛼 ∂MSE/∂aᵢⱼ   and   pᵢⱼ⁽ᵏ⁺¹⁾ = pᵢⱼ⁽ᵏ⁾ − 𝛼 ∂MSE/∂pᵢⱼ

where 𝛼 is the gradient descent learning parameter.

Once all the values in A and P have converged (assuming that they do converge), we have found the matrices whose product most closely resembles our rating data. Call these matrices A* and P*. Now, there are two options:

1. Take the matrix product P*A* and completely reconstruct the matrix R, with all values redefined by the matrix product.

2. Use the matrix product P*A* only to fill in the rating values that were missing in the original R matrix; in other words, all existing ratings in R remain unchanged and only the missing ones are filled in from the results of the algorithm.
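The whole collaborative filtering loop fits in a short NumPy sketch. All ratings below are invented; P and A start random and are refined by gradient descent on the masked squared error, and the second option from the text (fill in only the missing entries) is applied at the end:

```python
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5.0, 4.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 5.0, 2.0],
              [1.0, 1.0, 5.0]])   # 4 users x 3 movies, nan = unrated
n, m, t = R.shape[0], R.shape[1], 2
mask = ~np.isnan(R)
Y = np.where(mask, R, 0.0)        # observed ratings, 0 where missing

P = rng.standard_normal((n, t)) * 0.1   # random initialization
A = rng.standard_normal((t, m)) * 0.1
alpha = 0.01

for _ in range(5000):
    err = (P @ A - Y) * mask      # errors only on observed ratings
    grad_P = err @ A.T            # gradient of the masked squared error
    grad_A = P.T @ err
    P -= alpha * grad_P
    A -= alpha * grad_A

# Option 2: keep the observed ratings, fill in the missing ones from P*A*.
R_filled = np.where(mask, R, P @ A)
print(np.round(R_filled, 2))
```

Note that this is the plain unregularized version; practical systems usually add a regularization term to the loss so that P and A do not overfit the few observed ratings.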

Conclusion

Content and collaborative filtering provide a baseline for how a recommender system can work. Both let software generate missing ratings from appropriate, similar data, but they differ in where they start and in the assumptions and data they use. However, systems like those used at Netflix and YouTube are far more complex than the procedures outlined in this article, mainly due to the vast amount of data they work with. For example, matrix multiplication is both time- and space-intensive, so companies must develop optimizations that keep their sites quick and user-friendly. In spite of this, understanding the basis of these intricate algorithms makes it clearer how such software is deployed and what steps it takes to arrive at its ultimate recommendations.
