RECOMMENDER SYSTEMS: A GLOBAL OVERVIEW

Cyrine Hlioui
fifty-five | Data Science
12 min read · Oct 5, 2021

This article presents practical guidelines for recommendation system use cases. We’ll first introduce the subject, then go through the different types of systems and their limits. These systems will not be described in full detail, but relevant articles are referenced for those interested. We’ll also talk about how to build the train/test datasets and how to evaluate these systems. Finally, we’ll address the deployment of Machine Learning (ML) models in production.

Introduction

From a business impact standpoint, recommendation systems help companies increase their ROI by personalizing content based on user preferences.

Let’s take a concrete example. When you watch a video on YouTube and you see a list of videos to watch next, that list is built by a recommendation system. Recommendation engines are not just about suggesting products to users; they can also suggest users to products. Generally speaking, keep in mind that recommendation systems are not limited to products that can be bought, as shown by Facebook friend suggestions and Instagram post suggestions, both powered by recommendation systems.

To sum up, recommendation systems are about personalization. Below is a diagram that summarizes the different types of recommender systems, which all have in common that they seek to predict the “preference” a user would give to an item (film, book, video, etc.).

Recommender systems types

To illustrate these types, we’ll consider a business use case that we’ll solve differently depending on the recommender system considered.

Use case:

A platform wants to recommend films to logged-in users. In order to build a recommender system, the data science team has at its disposal raw data about users, about films, and about the interactions between users and films (ratings from 1 to 5).

1- Types of recommender systems

a- Content based (CB)

This type of recommendation is based on item and/or user declarative features. It is a classic supervised machine learning regression problem. Below is the input we need to prepare to run a content-based recommendation system.

Each observation describes a user-item pair, and the model should learn how to link their features in order to later predict the rating each user would give to each item.
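
As a minimal sketch under assumptions (the feature names and values below are made up, and gradient boosting is just one possible regressor), this is what the supervised framing can look like with scikit-learn:

    # Content-based recommendation framed as a supervised regression problem.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    # One row per (user, film) pair: declarative features plus the observed rating.
    data = pd.DataFrame({
        "user_age":      [25, 34, 25, 41],
        "user_country":  [0, 1, 0, 2],        # already label-encoded
        "film_genre":    [3, 1, 2, 3],        # already label-encoded
        "film_duration": [120, 95, 140, 120],
        "rating":        [4, 2, 5, 3],
    })

    X = data.drop(columns=["rating"])
    y = data["rating"]

    model = GradientBoostingRegressor()
    model.fit(X, y)

    # Predicting the rating for a new user-film pair means building the same
    # feature row (user features + film features) and calling predict on it.
    print(model.predict(X.head(1)))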

Different approaches exist to train a content based recommendation model. This article details some of them.

Pros of CB models:

  • Transparent system
  • No ‘cold-start’ problem: a new item with which no interaction has taken place can be recommended

Cons of CB models:

  • Requires a lot of domain knowledge about the items
  • Limited serendipity: the model has little ability to expand on the users’ existing interests

b- Collaborative filtering (CF)

In collaborative filtering, you don’t need any metadata about the items or the users. Instead, the proposed recommendations are based only on the user-item interactions, which can be represented as a sparse matrix where each entry Aij is the feedback of user Ui on item Ij. It could be explicit (a rating) or implicit (time spent on a video).

This approach is based on collaborative knowledge: a user will like/dislike an item that similar users like/dislike.

This article explains the different types of collaborative filtering approaches very well. In a nutshell, there are 2 main methods:

  • Memory based: no model is trained to generate recommendations. This method relies on computing similarities between users or between items. The closest users or items are determined using cosine similarity, Pearson correlation coefficients, or non-parametric methods such as k-nearest neighbors, which rely only on arithmetic operations with no parameters to learn. The recommendations for user i can be generated in two different ways (an item-item sketch follows this list):
  1. From his closest users: their top-K popular items will be recommended to him (user-user based)
  2. From his favorite item: the items closest to the product he rated the highest will be recommended to him (item-item based)
  • Model based: a model is trained in order to generate recommendations. This method is based on matrix factorization, which consists of decomposing the interaction matrix into a user matrix and an item matrix. The resulting latent vectors can be learned with different algorithms: SVD, NMF, ALS, or WALS. We can also use a deep learning model such as NCF to learn them.
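
For the memory-based, item-item flavor, here is a minimal sketch (the toy ratings are made up) that computes cosine similarities between the item columns of the interaction matrix:

    # Item-item memory-based collaborative filtering (illustrative).
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Rows = users, columns = items; 0 means "no interaction".
    ratings = np.array([
        [5, 3, 0, 1],
        [4, 0, 0, 1],
        [1, 1, 0, 5],
        [0, 1, 5, 4],
    ])

    # Similarity between items = similarity between column vectors.
    item_sim = cosine_similarity(ratings.T)

    # For user 0: rank the items he has not seen by their similarity
    # to his highest-rated item.
    favorite = ratings[0].argmax()
    unseen = np.where(ratings[0] == 0)[0]
    ranked = unseen[np.argsort(-item_sim[favorite, unseen])]
    print(ranked)  # candidate items, most similar first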

Regardless of the method, once the latent factors are trained, recommendations for user i are obtained by computing dot products between his embedding and the embeddings of unseen items; the top-K items are then recommended, as in the sketch below.
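
Here is a hedged sketch of that model-based logic, using a plain truncated SVD on a toy dense matrix (a simplification: real implementations such as ALS or WALS only fit the observed entries rather than treating missing ones as zeros):

    # Model-based CF sketch: truncated SVD standing in for matrix factorization.
    import numpy as np

    ratings = np.array([
        [5., 3., 0., 1.],
        [4., 0., 0., 1.],
        [1., 1., 0., 5.],
        [0., 1., 5., 4.],
    ])

    # Low-rank factorization, keeping k = 2 latent factors.
    U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
    k = 2
    user_emb = U[:, :k] * s[:k]   # shape (n_users, k)
    item_emb = Vt[:k, :].T        # shape (n_items, k)

    # Scores for user 0 = dot products between his embedding and item embeddings.
    scores = item_emb @ user_emb[0]
    unseen = np.where(ratings[0] == 0)[0]
    top_k = unseen[np.argsort(-scores[unseen])][:2]
    print(top_k)  # unseen items ranked by predicted preference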

For the use case introduced at the beginning, only the interaction data will be used, as shown in the figure above.

Pros of CF models:

  • Serendipity: users can discover new interests, unlike with the CB approach
  • The model only needs the interaction matrix to be trained

Cons of CF models:

  • Cold start problem: if an item has not been seen during the training phase, no embedding is available for it, and therefore it cannot be proposed to a user
  • No use of metadata: taking it into account could arguably increase the quality of the model

c- Hybrid

This type of recommendation combines the content based and collaborative filtering methods. Nowadays, hybrid approaches are used in many large-scale recommender systems. They consist of building a single model, often a neural network, that takes as input item and user features as well as their latent representations, as shown below.

The Factorization Machine (FM) is a hybrid model that is widely used for personalization use cases. Its particularity is that it models feature interactions up to order d while reducing the number of parameters to learn through a low-rank hypothesis.

In practice, a second-order FM model is often sufficient, since there is rarely enough information to model more complex interactions. This article gives a good explanation of this approach.
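
To make the low-rank idea concrete, here is a small sketch of a second-order FM prediction; the feature vector and the parameters w0, w and V are random placeholders, not learned values:

    # Second-order Factorization Machine prediction (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    n_features, k = 6, 3             # k = rank of the latent interaction factors

    x = rng.random(n_features)       # one (user, item) feature vector
    w0 = 0.1                         # global bias
    w = rng.random(n_features)       # linear weights
    V = rng.random((n_features, k))  # one k-dimensional latent vector per feature

    # Pairwise term sum_{i<j} <v_i, v_j> x_i x_j, computed in O(n*k)
    # with the usual identity instead of the naive O(n^2) double loop.
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    y_hat = w0 + w @ x + pairwise
    print(y_hat)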

Recently, architectures that fuse FM and Deep Neural Network based systems have been increasingly used. DeepFM is one of them, and consists of building a neural network on top of the FM model.

Pros of hybrid models:

  • Uses all available data
  • Models complex interactions that simpler models can’t capture

Cons of hybrid models:

  • Hyperparameter tuning is time-consuming, and sometimes we have to make empirical choices
  • Low interpretability

2- Building correct train and test datasets

Evaluating a model is an essential task in any machine learning project.
The first step is to build a train set and a test set. In general, the initial dataset is randomly split into 80% and 20%, representing respectively the train and test data.

The idea is to train the model for multiple sets of hyperparameters, compute an evaluation metric on a single test set and, finally, select the set of hyperparameters that gave the best value. In some use cases, splitting randomly can bias the results, so we need to be careful when splitting the dataset. This is especially true in recommendation systems, where there is a temporal component: random splits are not coherent.

Let’s take an example as shown below.

The user U1 interacted with 4 products, with P4 being the last product U1 interacted with. Remember that the main objective of any machine learning model is to predict future behavior based on past behaviors. So in this case, it won’t make sense to have the interaction (U1, P1) in the test set and the interaction (U1, P4) in the train set.

So, when training a recommender model, it is important to split the dataset in a coherent way, as shown below.
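
Here is a hedged sketch of such a split with pandas (the column names are assumptions): for each user, the most recent interaction goes to the test set and everything earlier goes to the train set.

    # Leave-last-out temporal split: hold out each user's latest interaction.
    import pandas as pd

    interactions = pd.DataFrame({
        "user_id":   ["U1", "U1", "U1", "U1", "U2", "U2"],
        "item_id":   ["P1", "P2", "P3", "P4", "P2", "P5"],
        "rating":    [4, 3, 5, 2, 1, 5],
        "timestamp": pd.to_datetime([
            "2021-01-03", "2021-02-10", "2021-03-01", "2021-04-22",
            "2021-01-15", "2021-02-02",
        ]),
    })

    interactions = interactions.sort_values("timestamp")
    test = interactions.groupby("user_id").tail(1)   # latest interaction per user
    train = interactions.drop(test.index)
    print(train, test, sep="\n\n")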

Once the model is trained with the train set, we will use it to predict the ratings of the interactions in the prediction set. Based on the predictions and the test set, the model will be evaluated. The next section details this aspect.

3- Evaluate recommender system models

Evaluation is a crucial step in ML projects in order to guarantee the quality of the trained model. The choice of the metric is also important, as it can be misleading if it is badly chosen. For a very long time, the RMSE, for Root Mean Squared Error, was the metric used to evaluate recommender system models. The formula is shown below.
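
For reference, over a test set of n ratings r_ui with predicted ratings r̂_ui, it reads:

    RMSE = sqrt( (1/n) * Σ_(u,i) ( r_ui − r̂_ui )² )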

But such a metric isn’t always relevant in the real world, because what matters most when using a recommender system is the order in which the products are suggested to the user, not the prediction itself. That being said, there is a correlation between a low RMSE and a relevant ranking, but there are counterexamples, as shown below:
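
As a made-up illustration of such a counterexample, consider one user and three candidate films:

    true ratings:       P1 = 5    P2 = 4    P3 = 1
    predicted ratings:  P1 = 4.5  P2 = 4.6  P3 = 1.4

    RMSE = sqrt( (0.5² + 0.6² + 0.4²) / 3 ) ≈ 0.5

The errors are small, yet the predicted ordering puts P2 ahead of P1.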

So a model with an RMSE of around 0.5, which is quite a good score, doesn’t necessarily give us a relevant ranking.

There is a metric widely used in this context: the precision at k. Different formulas exist. One of them consists of computing the proportion of users for whom we recommended relevant products. In other words, let p be the precision at k. Based on the sets created previously, for each user we will:

  • Compute the scores of the interactions in the prediction set
  • Order them and take the top k, 10 for example
  • If the product, with which the user interacted in the test set, appears in the top k products to recommend to the user, then p=p+1

Finally, after looping over all users in the test set, p = p/nb_users
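
A minimal sketch of this computation (the variable names and data structures are assumptions, not a standard API):

    # Precision@k as described above: the share of users whose held-out test item
    # appears among the top-k items scored on their prediction set.
    def precision_at_k(scores_by_user, test_item_by_user, k=10):
        hits = 0
        for user, scores in scores_by_user.items():
            # scores: {item_id: predicted rating} over the user's unseen items
            top_k = sorted(scores, key=scores.get, reverse=True)[:k]
            if test_item_by_user[user] in top_k:
                hits += 1
        return hits / len(scores_by_user)

    # Toy usage with two users and k=2.
    scores = {"U1": {"P1": 4.2, "P2": 3.9, "P3": 1.0},
              "U2": {"P1": 2.0, "P4": 4.8, "P5": 4.5}}
    test_items = {"U1": "P2", "U2": "P5"}
    print(precision_at_k(scores, test_items, k=2))  # -> 1.0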

Now, once the evaluation metric is chosen and computed, the question is: how do I know if my precision at k is good or bad? How do I know if 30% is a significant value?
In fact, it is a question we need to ask ourselves in any machine learning problem, and in order to answer it, a baseline needs to be considered.

So, what is a baseline? It can be a simple model that we create, not necessarily with ML algorithms. For example, for this use case, we can say that our baseline is the model that randomly recommends products to a user. The precision at k of this baseline is then the probability of recommending, for each user, his product from the test set, which is k/nb_products.
We can consider a more advanced baseline that recommends to every user the top-k most popular products, or even the top-k popular products of his most viewed category, while excluding products the user has already seen.
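
As a hedged sketch, this kind of popularity baseline (with made-up interactions) can then be scored with the same precision@k routine as the trained model:

    # Popularity baseline: recommend the globally most popular items to every user,
    # excluding items the user has already seen.
    from collections import Counter

    def popularity_baseline(train_interactions, seen_by_user, users, k=10):
        popularity = Counter(item for _, item in train_interactions)
        ranked = [item for item, _ in popularity.most_common()]
        return {u: [i for i in ranked if i not in seen_by_user[u]][:k] for u in users}

    train = [("U1", "P1"), ("U2", "P1"), ("U2", "P2"), ("U3", "P2"), ("U3", "P3")]
    seen = {"U1": {"P1"}, "U2": {"P1", "P2"}, "U3": {"P2", "P3"}}
    print(popularity_baseline(train, seen, users=["U1", "U2", "U3"], k=2))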

It is true that among offline evaluation metrics, the precision @k is the most common one. But it has its limits. The main objective of a recommender system is to recommend to a user a product he doesn’t know, but also a product that is interesting for him. Evaluating that with offline metrics is close to impossible, because in the test set we already know his next interaction. So, if we train an offline model (not in real time) that perfectly predicts the next natural interactions of users, why do we even need to make recommendations? The goal is certainly to have a relatively good precision compared to a baseline, but not necessarily a precision @k of 100%. The objective is not to perfectly predict the natural behavior of users, because they would have consumed those products anyway.

Online evaluation methods, such as A/B testing, also exist but are much more complicated to implement. They answer the question: how impactful are my model’s recommendations compared to a basic recommendation? An evaluation metric here could be the click-through rate (CTR).

There is another limit to the metrics mentioned above: they don’t measure the serendipity of the model, i.e. the diversity, novelty and unexpectedness of its recommendations. These aspects are useful for a successful long-term recommendation strategy, because too much focus on precision will lead to bored users and a poor assortment.

Generally speaking, choosing the evaluation metric is an important part of any data science project. Many parameters are involved, from the business need to the technical choices to be made. It is not just about taking a model, tuning hyperparameters and keeping the one that gave the best performance: the metric needs to be chosen according to the business need behind the project.

Despite all these limits, we still use precision because it is by far the best metric we have for offline evaluation. Keep in mind that it is important to compare the performance of the created model with the performance of a baseline model, in order to demonstrate the quality of the model and justify the need for an ML model.

4- Putting ML models into production on GCP

The end goal of an ML project is putting the model into production and maintaining it. It goes without saying that building and serving production ML models is a core skill of any ML engineer. Cloud providers, such as GCP (Google Cloud Platform), AWS (Amazon Web Services) and Microsoft Azure, have already designed ML system architectures on which we can rely.

Note that in the production environment, the ML model represents only 5% of the whole system, as shown below.

Image by Google ML unit

Typically, an ML system should be composed of the following components, explained below.

Image by Google ML unit

Data ingestion: handles the ingestion of the data
Data analysis + validation: checks if the data is coherent
Data transformation: transforms the data in order to prepare it for the model
Trainer/Tuner: trains the model and tunes the hyperparameters
Model evaluation + validation: tells us whether the model is good enough to go on production environment
Serving: handles the interaction of the built model with the users
Logging: useful for debugging when errors appear

The orchestration of these components is handled by Cloud Composer.

In the case of recommendation systems, the mentioned components constitute the system. Automatic refresh of the model is also implemented to take into account new products/users. Below is an overview of the system.

In the real world, recommendation systems are trained in two different ways: batch training and online training, the latter being less common because of its complexity and its difficulty to scale.
To illustrate the architecture, we assume a batch training approach. The idea is to program a regular end-of-day schedule.

In simpler terms, at the end of each business day, we instruct our Cloud Composer environment to (a sketch DAG follows this list):

  • Bring fresh data into our ML training dataset by sending a task to BQ (BigQuery), where the latest Google Analytics data live, and then have BQ run an export job to GCS (Google Cloud Storage)
  • Trigger a new training job to Cloud ML Engine to retrain our recommendation model
  • Deploy the model on App Engine to make it available as an API endpoint to whichever server needs to make these calls. This can be our website, which calls the recommendation API and displays the top 5 news articles for each user when they visit
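
A hedged sketch of what such an orchestration could look like as an Airflow DAG on Cloud Composer (assuming Airflow 2; the project, dataset, bucket and job names are placeholders, and the BashOperator commands stand in for the dedicated GCP operators):

    # Illustrative daily batch pipeline: export data, retrain, redeploy.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_recommender_refresh",
        start_date=datetime(2021, 10, 1),
        schedule_interval="0 22 * * *",   # end of the business day
        catchup=False,
    ) as dag:

        # 1. Export the latest interaction data from BigQuery to Cloud Storage.
        export_to_gcs = BashOperator(
            task_id="export_bq_to_gcs",
            bash_command=(
                "bq extract 'my_project:analytics.interactions' "
                "gs://my-bucket/training/interactions.csv"
            ),
        )

        # 2. Launch a training job on AI Platform (formerly Cloud ML Engine).
        train_model = BashOperator(
            task_id="train_recommender",
            bash_command=(
                "gcloud ai-platform jobs submit training reco_$(date +%Y%m%d) "
                "--region europe-west1 --module-name trainer.task "
                "--package-path ./trainer --staging-bucket gs://my-bucket"
            ),
        )

        # 3. Redeploy the serving API on App Engine.
        deploy_api = BashOperator(
            task_id="deploy_app_engine",
            bash_command="gcloud app deploy app.yaml --quiet",
        )

        export_to_gcs >> train_model >> deploy_api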

But what if the data we need to train the model on is not available in BigQuery? In this case, scheduling a periodic refresh is not necessarily relevant.
The data could, for example, arrive as a CSV file. So, instead of a scheduler, we can use a Cloud Function that watches for CSV files uploaded to GCS. When one arrives, it triggers the Cloud Composer workflow to run the rest of the processing, from ingesting the data into BQ to redeploying on App Engine.
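
A hedged sketch of such a trigger, assuming an event-driven (background) Cloud Function and a Composer environment exposing the Airflow 2 REST API; the web server URL, client id and DAG id are placeholders:

    # Cloud Function triggered whenever an object lands in the watched GCS bucket.
    # If it is a CSV file, trigger the Cloud Composer (Airflow) DAG.
    import requests
    from google.auth.transport.requests import Request
    from google.oauth2 import id_token

    AIRFLOW_WEB_SERVER = "https://<composer-webserver-url>"  # placeholder
    DAG_ID = "daily_recommender_refresh"                     # placeholder
    CLIENT_ID = "<iap-oauth-client-id>"                      # placeholder

    def on_gcs_upload(event, context):
        """'event' describes the uploaded object (bucket, name, ...)."""
        if not event["name"].endswith(".csv"):
            return
        # Authenticate against the Composer web server, then trigger a DAG run.
        token = id_token.fetch_id_token(Request(), CLIENT_ID)
        response = requests.post(
            f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DAG_ID}/dagRuns",
            headers={"Authorization": f"Bearer {token}"},
            json={"conf": {"source_file": event["name"]}},
        )
        response.raise_for_status()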

Conclusion

▹There are three types of recommender systems:

  • Content based (based on item and/or user declarative features)
  • Collaborative filtering (based only on the user-item interactions)
  • Hybrid (mixes several models together, based on declarative and latent features)

▹In order to properly evaluate the recommender system model, it is important to pay attention when creating the training and the test sets as explained above.

▹Different metrics exist for evaluating a recommender system model. RMSE has been widely used but is not robust enough to conclude on the effectiveness of the model. We should rather base our evaluation on the order of the products we recommend to users.

▹The precision @k is a metric based on the order of the recommendations. But we still have to compare it to a baseline.

▹Putting a recommender system model into production is the fulfillment of a business requirement. Cloud providers, such as Google, have already worked on architectures on which we can rely. Therefore, if the data is available in the cloud, there is no need to reinvent the wheel: we only need to configure Cloud Composer so that it orchestrates the different tasks.
