Deep learning for recommender systems

How we improve vehicle recommendations at mobile.de

Published in

Berlin Tech Blog (by Kleinanzeigen and mobile.de)

12 min readFeb 5, 2018

Finding a car that fits your preferences can be a very time-consuming task and may drive you crazy. On the other hand, with approximately 1.5 million cars on our platform, vehicle descriptions that are constantly changing and users that are still exploring may also drive us as the solution provider crazy. Under these circumstances, matching cars with users’ preferences is challenging and by the time you’ve found a match, your perfect car might be gone.

Sample recommendations from Amazon, Netflix, and Spotify

Even if you haven’t searched for a car yet, you’ve probably faced similar problems in other domains like news, consumer products, or entertainment. On Spotify, you don’t like to explicitly outline your music tastes before you start enjoying music. You also don’t want to search through the whole IMDB to find your next favorite series on Netflix. The sheer number of possibilities creates a burden of choice. We simply can’t manually collect all information before making decisions. We want decision-making support that is smooth, personalized and in particular we expect it to be automatic.

People tend to blame Information Overload, which describes the large amount of information invading our minds and demanding fast and accurate processing. But there is another perspective. In 2008, Clay Shirky, a professor at New York University, said: “It’s not Information Overload, it’s Filter Failure.” And at mobile.de this highlights what we’re trying to achieve: improve our filters to save you from information overload and help you to find the right car quickly — and without any hassle.

To do that we use a Recommender System that infers user preferences from billions of interactions and hundreds of features to come up with personalized recommendations. The figure below shows this reasoning: we address the perceived information overload problem as a by-product of digitization by using techniques that became powerful through digitization.

A bit of background

Before we dive into the details, we need to set the stage and clarify some basic vocabulary:

The three basic data sources for a recommender system are users, items, and the interactions among them. Interactions can be either implicit or explicit. Implicit interactions occur without the intent of expressing preference or disregard, whereas explicit interactions reflect preference on purpose. For example, clicking on specific vehicle ads or bookmarking them are mainly implicit signals as the user’s main intent was not to rate them, for example like putting stars on a product bought on Amazon. However, we can safely assume that you have a strong preference for a car whose dealer you contact. We store these interactions between a set of users U and a set of items I in a rating matrix R as shown below. This matrix has m rows for users and n columns for items. Each entry (i, j) contains the specific interaction. Here we distinguish between views, bookmarks and mailings.

Depending on the type of feedback these matrices are unary, meaning that we have just a single positive value that denotes feedback, such as “1”. All unobserved feedback leaves entries empty. We used to denote the known matrix entries as ratings. In addition, these matrices are usually very sparse, meaning that a large share (generally > 95%) of the matrix is empty. Just guess how many products Amazon offers and how few of them you ever clicked or bought.

In addition to the information on interactions, we also have data on the entities that interact (users and items). Items come with certain features and analyzing user-item interactions allows us to draw conclusions on users’ preferences for these features. For instance, we can infer a user’s favorite car maker. There may also be some user-specific features, like the user’s device or age. We can leverage this information to improve our recommendations.

But what does a recommender actually aim for? The basic goal of a recommender system is to predict future user-item interactions based on previous interactions and features. Referring to the image above, we want to quantify the question marks to see which car the user is most likely to see next. This goal is denoted as relevancy of recommendations and just one of many like trustworthiness, diversity, or robustness. Scalability is another important and production-oriented goal that forces us to provide recommendations to many users fast (within tens or hundreds of milliseconds). Perfect recommendations don’t matter if you need to grab a coffee while you wait.

The standard technique to approach these goals in recommender systems is collaborative filtering (CF). CF uses similarities between items or users to find appropriate items to recommend. In model-based CF, we use matrix factorization to find dense matrices representing users and items, whose product then reconstructs the original rating matrix. Thus, we come up with lower-dimensional representations for users and items that we can leverage to make predictions. The factorization increases the information density of our sparse matrix. For the vehicle recommendation experiment, we have analyzed approximately 8 million ratings between 100,000 users and nearly 1.7 million vehicles. This translates into 5 out of 100,000 possible interactions. Getting deep insights from this sparse information is pretty challenging.

But sparsity is just one problem, another is that we usually do not know what to recommend to users that come for the first time. For example, see the fourth user in the figure above who didn’t interact with any of our cars. This is described as the cold-start problem. To address this challenge, content-based filtering (CBF) techniques have evolved that leverage user and item features. This makes recommender systems less sensitive to missing collaborative signals. But inferring user preferences towards item features still requires them as we will explore later. In general, content knowledge on users and items provides relief and leverage, but has still some dependencies.

Nevertheless, combining collaborative and content-based information can be even more powerful. We call recommenders that combine different techniques Hybrid Recommenders.

So we know the basic vocabulary of recommender systems, but as this is about deep learning, we need to go deeper.

Deep Learning for Vehicle Recommendations

Using deep learning to improve vehicle suggestions, we have two basic goals:

Increase the relevance of recommendations
Provide them in a scalable way

We have to estimate preference based on observed interactions and user/item features. Let’s distill the task:

Given a user u and an item i, we calculate a score that serves as a proxy for preference. This means that high scores mirror high preference and vice versa. This score allows us to rank a set of items according to their relevance for a given user. The score can be a user’s probability to interact with an item. Thus, we aim to predict user-item interaction probabilities. To compute this probability, we use a deep neural network with a single output unit. This unit uses the sigmoid function for activation, which yields output values in the interval (0, 1). Thus, we can interpret the network output as probability and use it for ranking. The network is then trained on distinguishing between preference and disregard. Therefore, we label all positive user-item combinations with 1 and negatives with 0. As a result, the learning task presents itself as a binary classification task. Excelling at this task can enhance the relevance of our recommendations.

With a naive approach, we compute these probabilities for all possible user-item combinations. Based on that, we would select the k topmost items and present them to the user as personalized recommendations. Due to the large number of users and items in recommendation scenarios this is practically infeasible. We need to find a computationally cheap and fast way to narrow down the corpus of items to candidate sets for each user which still contain items that are likely to be relevant. Therefore, we augment our deep learning approach with approximate nearest neighbor search. We execute this search on dense representations for users and items which requires us to create them in a first step. Thereafter, approximate nearest neighbor search can find good candidates fast and help us to scale well.

Traditional techniques such as CF or CBF calculate these scores using linear techniques which fail to anticipate underlying nonlinear patterns. To capture these nonlinearities, we build learning models with higher complexity. Ultimately, we want to achieve the same thing, but just change the underlying model. Our model combines approximate nearest neighbor search for candidate generation with binary classification for ranking.

User and item representations (collection of feature values) are provided as real-valued, high-dimensional, and highly sparse vectors. Starting with them, we need to solve the following tasks:

Embedding
Candidate Generation
Ranking

We solve all these tasks within our overall model. But this is just one part of our approach, the other is how we handle our data. So we need to get our data in line first before continuing with the model.

Data and a short story on user preference mining

In a first step, we select features that users and items have in common. We split these into continuous features (such as vehicle price or mileage), and categorical ones (such as color and vehicle type). We further define time periods for training and testing our model. Now, we fetch all events that occurred within these time frames and the respective item features valid at the time.

By doing so, we can concentrate on the events and associated items for each individual user. This lets us determine user preferences as an aggregation of the associated item features. By calculating the mean and standard deviation of all vehicles a user viewed, we get some insights on the preferred price range, but also how the user trades off between price and mileage. We can build these user representations using a Bayesian approach and thus craft the same features for users as we have for vehicles. Even though they relate to the same concepts, user representations are stochastic, whereas vehicle representations are deterministic. For instance, the figure below shows a profile, where the user viewed five vehicles. Two of them were black, two were grey, and one was red. This reveals a preference for restrained colors that is reflected by the probability distribution inferred and now part of its profile on the right. For a continuous feature like price, we can do the same. Taking mean and average of vehicles the user interacted with leaves us with an estimate of the user’s preferred price range. Whereas a vehicle can be either black, white, or grey and has a single price.

We can now create a representation based on users’ past interactions. For more on user profiling, you may also enjoy this post on Calculating User Profiles in Real Time.

As we have users and items set, we need to target the interactions themselves. We simplified this problem choosing a binary classification approach. Thus, we label all observed interactions with 1 to denote preference. Since we lack negative feedback signals, we artificially generate them from the positives with a technique called negative sampling. As a result, we get an equal amount of observed positively labeled interactions as we have negatively labeled. Now, we are set to go deeper.

Preference Prediction Model

The overall network consists of three subnetworks as you can see in the following figure: UserNet, ItemNet and RankNet. These networks are combined and trained jointly. Afterwards, we split them to present an overall architecture capable of serving the recommendations in production.

1. Embeddings

UserNet and ItemNet transform sparse user and item representations u and i into dense embeddings eᵤ and eᵢ. Thus, they heavily reduce the dimensionality by compressing the information. Although embeddings lose their human readability, we benefit from them in the latter process, as they are much more memory-efficient. Since the representations differ (stochastic users vs. deterministic items) but relate to the same features, we use separate networks.

With these handy representations for users and items, we begin the process of deriving good items for a given user (embedding). This process is two-fold and comprises the generation of a subset of items from the overall corpus as well as ranking those candidates.

2. Candidate Generation

To quickly find candidates that are likely to be relevant for a user, we use approximate nearest neighbor search. Starting with a user embedding as query, we can efficiently fetch the T closest items for a specific distance metric, e.g. cosine or Euclidean distance. These geometrically similar candidates provide a good repertoire to draw our suggestions from. We use this technique since ranking all available items is too computationally complex. The approximate nearest neighbor search becomes the key to scalability.

There are many implementations, including Locally Optimized Product Quantizations (LOPQ) from Yahoo or Approximate Nearest Neighbor Oh Yeah (ANNOY) provided by Erik Bernhardsson from Spotify.

3. Ranking

As we now have T item candidates for our user, we can use the RankNet to score each candidate. Finally, we just sort the candidates by decreasing score and take the top k most promising ones. These items are then provided as recommendations — and with significant success.

The steps above describe the inference process. Training the model is pretty similar, just without the additional candidate generation.

Results

Lots of data preprocessing, intense modeling and some fancy TensorFlow debugging finally paid off and led to remarkable results. In order to compare our approach, we used different traditional methods as benchmarks and determined the right parameterization with grid search. Working with Python on the bright side of Data Science, we recommend LightFM as a lightweight implementation of different traditional recommender techniques. This allowed us to focus on the deep learning part.

We chose pure CF as well as a hybrid recommender that combines CF and CBF for baselines. We measured relevance using the mean average precision @ k (MAP@k).

We look at the evaluation using different k, whereas k=5 best approximates our use case at mobile.de. For this case, the deep learning recommender outperforms traditional CF by almost 143%. We also improved the powerful hybrid recommender by 73%. Looking at the overall results in the graph below, one might say that a MAP@5 of approximately 1.05% doesn’t look promising, but there are two points. First, given the enormous data sparsity, our restriction to a small time-span as well as a limited user subset, the range of absolute values is artificially kept low. Second, the significant point is the relative increase, which resembles a doubling in recommendation relevance.

But relevance is just one side of the coin. Despite a multitude of other goals, scalability was of specific interest here. Thanks to the candidate generation we were able to generate relevant recommendations in tens of milliseconds and scale to millions of users and items — even though constant vehicle feature updates make this hard.

Conclusions

We see that deep learning is a great tool towards better personalization of services. At mobile.de we are currently bringing our approach into production to help our customers find their perfect car fast. With TensorFlow Serving, there is a convenient application for serving models in production and changing them smoothly. We are also considering the model for other purposes, like improving our search results to leverage user information for higher relevance of search results. To put it in Alan Turing’s words: “We can only see a short distance ahead, but we can see plenty there that needs to be done.”

The full academic work is available in my GitHub repository. I am currently also working on applying the deep learning recommender to the RecSys Challenge of 2017 generating job recommendations with data from the German networking platform XING.

We are not the only ones working in this amazing intersection of research. For a good overview of the current state-of-the-art in deep learning for recommender systems, see this presentation from last year’s Recommender Systems Conference. If you are new to recommender systems, the University of Minnesota offers a helpful specialization on Coursera.

I hope this gives you a good overview. If you feel so, don’t forget to clap before getting your own hands dirty in this field.