Recommendation System — Basic Concepts

Tuan Nguyen
The Startup
Published in
10 min readFeb 7, 2021

Introduction

As a netizen, probably you have met the following facts many times:

  • Youtube automatically plays clips related to the one that you are watching. And especially when you are at its homepage, Youtube also automatically suggests clips that you find interesting.
  • Have you ever noticed the section [Frequently bought together] while doing shopping on Amazon?
  • Have you ever wondered that Facebook is secretly listening to your conversation and display the ads relevant to your favorite items even though they are only in your mind? Or friend suggestions?
  • Netflix suggests movies to its users.

There are actually many examples of systems that are capable of automatically suggesting to users the items that they may be interested in. Thanks to such systems, user-oriented advertisement campaigns have been showing their significant advantage in business marketing. The background algorithms adopted in those systems are actually Machine Learning algorithms named Recommender Systems or Recommendation Systems.

The recommendation system is an application case of Machine Learning and recently has received great attention from the community as the Internet has been exploding for past10–15 years. There are basically 2 main entities in an RS, i.e. users and items. Users are users and items are things like movies, songs, books, video clips, or other users in friend suggestion case. The main objective of RSes is to figure out the interest of the user at an item so as to come up with an efficient recommendation strategy.

Long Tail in Commerce

Let’s consider the basic difference between a store in reality and an online store from the perspective of choosing items to advertise.

Probably you might hear about the Pareto principle (a.k.a 20/80 principle): for many outcomes, roughly 80% of consequences come from 20% of the causes. Most of what we use in daily conversation comes from a subset of words in the dictionary. Most of the properties are owned by some individuals. Same in commerce, the top-selling products are a small portion of the whole store.

In reality, the stores are typically organized into 2 areas — one as a showroom, one for stock. The basic strategy to achieve a high revenue is to display the most popular products in front to grab the shopper’s eyes and encourage them to pick up while keeping the less popular ones in stock. A drawback of this strategy is that those popular products, i.e. high-end smartphones, might not suit a particular customer, i.e. senior people. The store has what a customer is looking for but not has it sold just because the customer did not see it on the shelf. One more thing, due to the limit of physical space, it is impossible to show the products of all lines with just a few for each line. Here you can see that most of the revenue (80%) comes from a small portion of popular products (20%). If you list all the products in order based on the sales from high to low, you will see some products creating a large revenue and a long tail list of products with a small contribution. This is known as the long-tail phenomenon — the long tail of less popular products.

These limits can be avoidable by online stores. With an unlimited online showcase area, all the products can be displayed. Bringing the right products to customers is facilitated when doing online with a very low (~0) cost and as a result, increasing the revenue.

Recommendation System Classification

Building a recommendation system can be done in various ways depending on which technologies are adopted. In general, these systems can be classified into two groups.

  • Content-based systems that suggest items based on the items’ properties. For instance, if a user usually watches fantasy drama television series like Game of Thrones, The Lord of the Rings, then streaming services like Netflix, Amazon Prime Video should recommend the same types of series like The Witcher, Stranger Things instead of Breaking Bad or Prison Breaking. This requires classifying items into separate groups based on their specific features (or properties, attributes). Of course, there are exceptional cases in which the items cannot be put in any groups, and therefore classifying them is a challenging task.
  • Collaborative filtering systems that suggest items based on the similarity between users and/or items. The idea of this approach is to recommend an item to a user by taking other users with similar behaviors into account. For instance, the system somehow knows users A, B are members of Taylor Swift’s fan club but has no idea whether or not user C is in that group. So based on the fact that users A, B are “similar” to user C in the sense that they are all big fans of Micheal Jackson, there is a possibility that user C also likes Taylor Swift.

In this article, I will focus on the first group, i.e. Content-based systems. The other will be mentioned in the next one.

Utility matrix

Examples

As mentioned earlier, there are two entities in Recommendation Systems, i.e. users and items. Each user owns a degree of preference on each item. This degree, if known, is represented by a value corresponding with a pair of user-item. The value is also called rating based on the assumption that it can be measured via the event triggered when a user rates an item. All the ratings between users and items including the unknown ones (which need to be figured out) construct a so-called utility matrix.

Figure 1: Example of utility matrix for song recommendation system. Songs are rated from 0 to 5 stars. The grey cells are the unknown data and need to be filled by the recommendation system.

As explained by the illustrative figure, a recommendation system is responsible for filling incomplete matrix cells and then provides users with recommendations. This is sometimes called the matrix completion problem.

In the example, the first three songs are of dancing pop while the others are of Alternative rock. From this data, we can see that A, B like pop songs, and C, D, E, F are favorite rock songs. The system should suggest Rain on me to B, Gangnam style to A, and Willow to D, E, F. Assuming there are only 2 genres and given a new song has just released, we just classify it into one genre and recommend it to each user.

Typically, there are a number of users and items in a system, and each user rates quite a few items or even never rates (for these ones, the best way is to recommend the most popular items). As a result, there are a large number of grey cells in the utility matrix (obviously a small number of the already filled cells).

As more cells are filled out, the more precise the recommendation can be performed. Therefore, the systems always ask users about their interest in the products and try to have users rate the products as many as possible. By doing so, it helps not only other users to be aware of product quality but also the system to know of user’s interest and come up with efficient ads policies.

Build a Utility Matrix

Utility matrix plays a core role in the recommendation systems in order for them to make recommendation decisions to users apart from providing the most popular ones. However, it is challenging to construct this matrix in practice. There are two methods to determine rating values for each pair of user-item in the utility matrix.

  • Ask users to rate items. Amazon, Walmart, and many other systems keep asking users to rate their products by sending emails to remind them. This straightforward approach has a disadvantage in that users rarely rate items if they did not choose to do it from the beginning. And even if they do latter, those are highly biased reviews.
  • Based on user behaviors. If a user buys an item on Amazon, watches a clip on Youtube (a couple of times perhaps), or does some searches to compare the price, then we have glue to say that the user is highly interested in that item. Facebook also relies on what you like to decide which content should be placed on your newsfeed. Because the more time you spend on this social network platform, the more it can benefit from, Facebook always finds a way to bring useful information to attract you. In this way, the utility matrix is composed of cells 0 and 1 where 1 means the user likes an item, and 0 just indicates that it does not have any information yet. Note that in this case, 0 does not mean to be comparable to 1 but the fact that the user has yet to provide any glue about his interest, that’s it, no more. Of course, there are various options to enrich the matrix data, i.e. with values greater than 1 based on the experience time or the number of times the user spends on the item. It is not sure at the background, how Facebook leverages the data provided every time the user chooses to dislike something, but straightforwardly, we can think of some values like -1.

Content-based recommendation

Item profile construction

The systems following this approach usually try to build a profile for each item, which can be mathematically represented as a vector of features. In simple cases, this vector can be directly extracted from the item. Let’s look at some data that can be used in a music recommendation system:

  • Singer: For Despacito, some like the original track by Luis Fonsi with Daddy Yankee while the others might like Justin Bieber.
  • Songwriter: The pop listener can choose either Taylor Swift or Katy Perry
  • Year: Old songs like Thriller, Baby one more time vs modern ones like Despacito, Rain on me.
  • Genre: Pop, Rock, Classic, etc.

Apart from Genre which can be challenging sometimes to classify a song, many other features can be used.

Figure 2: Example of feature vectors for items at the last column. For each user, we need to find an optimal corresponding model θi

In Figure 1, we simplify the problem by constructing a 2-dimensional feature vector for each song: Dance pop and Alternative rock. The dataset is composed of 5 songs. Assuming that five feature vectors,

corresponding to the five songs are pre-defined (somehow) as given in Figure 2. Similarly, a user behavior, i.e. rating a certain song, is represented as the set θ. Input data to construct user behavior’s model θ_u is basically a pair (item profile, ratings) for each item rated by the user. Filling incomplete data of utility matrix means to predict the interests of the user according to the model θ_u. The output is actually a function whose form varies by different problems.

Loss function

We define:

  • N, M as the sets of users
  • Profile matrix
  • Utility matrix where the cell at m-th row and n-th column shows the interest, i.e. the number of rating stars, of the user n on the item m that the system has collected. There are many missing cells in this matrix that need to be filled by the system.
  • Binary matrix rated or not R indicating whether a user has rated an item, i.e. 1, or not, i.e. 0.

Assuming that we can find a model for each user in terms of the following column vector

and bias

to represent the interest of a user on an item by linear regression function:

The input data (used to train the model) of the user n is composed of all the filled cells of the n-th column of the utility matrix (y_n of Y), the linear regression with l_2 regularization as follows:

where the second component is regularization, λ is a positive number, s_n is the number of rated items or the sums of the n-th column of the binary matrix rated or not R, or

Note that the regularization is typically not applied to the bias b_n.

Since the above loss function depends only on the rated items, we can have it reduced into the sub-vector

constructed by extracting the filled elements of the matrix Y’s n-column, i.e. y_n. We also set the X’s sub-matrix with the rated items by the user n (a bit unclear? no worry, you will get it with the provided example), given as

Now we have the reduced version of the loss function of the user n’s model:

where e_n is the column identity vector. This is basically the loss function of ridge regression with the solution, i.e.

that can be found by using gradient descent. Note that the vector w_n is only determined if there is at least one item rated by the user n.

Example of loss function for user E

Back to the example in Figure 2, the feature matrix for items, i.e. songs, where each column corresponds to an item, is

For user E with n = 5,

Since user E only rates the first and fourth items, we have

The loss function with the corresponding coefficients for user E is given as

In the next article, I will explain how to apply all the things into a movie recommendation system with the MovieLens 100k dataset.

Acknowledge

I would like to send my big thanks to Vu Huu Tiep for the permission to translate his post.

--

--