Exploring collaborative filtering versus a content-based approach for similar classified ads

leboncoin tech Blog
7 min read · Apr 8, 2024

By Justin Reboullot, Data scientist

With about 70 million classified ads online at any given time and 29 million users (as of March 2023), leboncoin is the second most popular e-commerce website in France.

Leboncoin is an online marketplace where people can post classified ads for items they want to sell or services they offer. These ads are then accessible to other users who can contact the sellers if they’re interested in making a purchase or arrangement.

With so many ads available, recommendations are crucial to help users find what they are looking for, which is why we have several areas on the website where recommendations are offered. In this article, we'll focus on what we call similar ads: recommendations placed at the bottom of item pages to suggest similar items.

We will be comparing collaborative filtering with a content-based approach for similar ads based on our experience at leboncoin.

The RecSys specifics of leboncoin

To fully understand the analysis in this blog post, you need to keep in mind the RecSys specifics of leboncoin.

In a nutshell, we’ve got:

  • Many users: About 29 million.
  • Many ads: About 70 million.
  • Many interactions between users and ads: About 10 billion views per month.
  • Ads with a short lifetime (unlike the catalogs of music or video streaming platforms).
  • No stock: Ads can be bought only once (as opposed to Amazon).
  • Often no record of who buys what: Most people pay in person rather than through online transactions.
  • A wide range of quality in ads: Each user writes their own ads.
  • A wide variety of ads: Cars, real estate, jobs, goods, clothing, holiday rentals…

With this context in mind, let’s examine what it means for classified ads to be similar on leboncoin.

What does it mean for two ads to be similar?

Two ads are considered similar if they have been seen consecutively by some users on leboncoin’s website. This is akin to next-item prediction or masked-prediction techniques, and can be phrased as “those who have seen this have also seen that.”

Now that the terminology has been clarified, let’s dive into the first approach we tried: Collaborative filtering.

1st approach: Collaborative filtering

In this post, we define collaborative filtering as any approach that uses only ad IDs. One way to build a recommendation model following this approach is to use Word2vec.

The Word2vec algorithm

The Word2vec algorithm was published in 2013 by Google researchers. It’s a widely used technique for generating word embeddings, which are dense vector representations of words in a continuous vector space. These embeddings capture semantic relationships between words based on their context in large text corpora.

In the CBOW version, the algorithm learns these embeddings by trying to predict each word given its neighbors.

For example, given the sentence “I use leboncoin every day”:

  • Predict “I” given [“use”]
  • Predict “use” given [“I”, “leboncoin”]
  • Predict “leboncoin” given [“use”, “every day”]
  • Predict “every day” given [“leboncoin”]
Source: Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al. 2013)
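As a toy illustration of how those (context, target) prediction pairs could be generated with a context window of one (a sketch, not the paper's implementation):

```python
# Toy illustration: generate CBOW (context -> target) pairs from a tokenized
# sentence with a symmetric context window of 1.
sentence = ["I", "use", "leboncoin", "every day"]
window = 1

for i, target in enumerate(sentence):
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"Predict {target!r} given {context}")

# Predict 'I' given ['use']
# Predict 'use' given ['I', 'leboncoin']
# Predict 'leboncoin' given ['use', 'every day']
# Predict 'every day' given ['leboncoin']
```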

Applying the Word2vec algorithm to similar ads

For similar ads, we use ads’ IDs as words and click sequences as word sentences.

For example:

  • Given click sequence: ad_1, ad_2, ad_3, ad_4, ad_5
  • Predict:
    - P(ad_2 | ad_1)
    - P(ad_3 | ad_2, ad_4)
    - P(ad_4 | ad_3, ad_5)
    - P(ad_5 | ad_4)
  • Repeat this for a lot of sequences
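As a rough sketch of this setup with Gensim (the library mentioned below), assuming `click_sequences` is a hypothetical list holding one sequence of ad IDs per user session:

```python
from gensim.models import Word2Vec

# Hypothetical input: one "sentence" per user session, each token being an ad ID.
click_sequences = [
    ["ad_1", "ad_2", "ad_3", "ad_4", "ad_5"],
    ["ad_2", "ad_6", "ad_3"],
    # ... millions of sequences in practice
]

# CBOW (sg=0): predict each ad from its neighbors in the click sequence.
model = Word2Vec(
    sentences=click_sequences,
    vector_size=64,  # embedding dimension
    window=2,        # context size around each ad
    min_count=1,     # keep rare ads for this toy example
    sg=0,            # CBOW objective
)

# Ads clicked in similar contexts end up close in the embedding space.
print(model.wv.most_similar("ad_3", topn=3))
```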

Implementation details

Initially, leboncoin used Gensim, a natural language processing library whose Word2vec implementation wraps optimized C code.

Our decision to switch to TorchRec was based on the following reasons:

  • Unlike with Gensim, the architecture can be customized.
  • The embedding layer can be sharded (tensor parallelism): TorchRec allows the matrix to be divided into smaller pieces and sent to different GPUs.
  • Efficient embedding layers are provided.
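For illustration only, here is a minimal sketch of how the ad-ID embedding table might be declared with TorchRec; the table name, sizes, and feature name are made-up placeholders, and actual sharding would additionally require wrapping the module with TorchRec's distributed model-parallel tooling:

```python
import torch
import torchrec

# Made-up sizes; in production the table covers tens of millions of ad IDs.
NUM_ADS = 10_000_000
EMB_DIM = 64

# One embedding table keyed by a hypothetical "ad_id" feature.
ad_table = torchrec.EmbeddingBagConfig(
    name="ad_id_table",
    embedding_dim=EMB_DIM,
    num_embeddings=NUM_ADS,
    feature_names=["ad_id"],
)

# EmbeddingBagCollection holds the embedding layer(s). Wrapping it with
# TorchRec's distributed model-parallel tooling is what shards the table
# across GPUs (tensor parallelism), which was the point of the switch.
ebc = torchrec.EmbeddingBagCollection(tables=[ad_table], device=torch.device("meta"))
```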

Production architecture

Several times a week (we could do it more often, but we have to consider the costs), we extract a training dataset from our data lake and train our model with Word2vec.

This training results in ad embeddings. Using them, we can create a k-NN index, a data structure that makes nearest-neighbor search efficient.
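The post does not name a specific index library, but as an example, a simple inner-product index could be built with FAISS from the trained embeddings (random placeholders stand in for them here):

```python
import faiss
import numpy as np

# Placeholder embeddings: one row per ad, as produced by the Word2vec training.
ad_embeddings = np.random.rand(1000, 64).astype("float32")
faiss.normalize_L2(ad_embeddings)  # inner product on unit vectors ~ cosine similarity

index = faiss.IndexFlatIP(ad_embeddings.shape[1])
index.add(ad_embeddings)

# Retrieve the 10 nearest neighbors of the first ad (its own row ranks first).
scores, neighbor_rows = index.search(ad_embeddings[:1], 10)
```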

Isn’t that great? However, this collaborative filtering approach has one major drawback…

What about ads that have never been seen?

Ads that have never been seen are not considered in the collaborative filtering approach; this is known as the cold-start problem. Because of the short lifespan of our ads, it is particularly problematic for us.

Therefore, we complemented it with another approach, using a content-based model.

2nd approach: Content based

The content-based approach can use all possible features except the ad ID. Although we call it content based, it also uses user-item interaction data.

Training dataset

In each line, there are two ads, called x and y. Users see y after x. The algorithm attempts to determine whether x and y are similar.
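For illustration, the dataset could look like the hypothetical rows below; the actual feature set is not detailed in this post, so the column names are invented:

```python
# Hypothetical rows of the pairwise training dataset: features of ad x and of
# ad y, where a user viewed y right after x. Column names are illustrative only.
training_pairs = [
    {"x_title": "Blue chair", "x_category": "furniture", "x_city": "Paris",
     "y_title": "Blue armchair", "y_category": "furniture", "y_city": "Montreuil"},
    {"x_title": "iPhone 12", "x_category": "phones", "x_city": "Lyon",
     "y_title": "iPhone 12 Pro", "y_category": "phones", "y_city": "Lyon"},
]
```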

In order to achieve this, we use a Siamese network architecture.

Siamese network architecture

The embedding of an ad is generated by feeding the features of the x ad into a neural network, the ad encoder. The y ad is then processed in the same way using the same ad encoder. Taking the dot product of these two embeddings gives the similarity score.
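Here is a minimal PyTorch sketch of this two-tower setup, with a single placeholder feature vector per ad standing in for the real text, image, and tabular features:

```python
import torch
import torch.nn as nn

class AdEncoder(nn.Module):
    """Toy stand-in for the ad encoder: maps ad features to an embedding."""

    def __init__(self, feature_dim: int = 128, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 256),
            nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features)

encoder = AdEncoder()             # shared weights: the same encoder for both ads
x_features = torch.randn(1, 128)  # placeholder features of ad x
y_features = torch.randn(1, 128)  # placeholder features of ad y

x_emb = encoder(x_features)
y_emb = encoder(y_features)
similarity = (x_emb * y_emb).sum(dim=-1)  # dot product = similarity score
```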

We use in-batch negative sampling to train this Siamese network.

In-batch negative sampling

Suppose a batch of three pairs: x_1 looks like y_1, x_2 looks like y_2, and x_3 looks like y_3.

We run our ads through the neural network. Six ads result in six embeddings. Then we do a dot product between each x embedding and each y embedding.

This gives us nine scores. The diagonal will represent positive labels, whereas the scores outside it will represent negative labels.

We know that x_1 looks like y_1. As a result, s_11 will be labeled 1. However, we assume that y_2 does not look like x_1. Thus, instead of sampling the negative scores among all available ads, we take them from within the batch. Then the loss can be calculated.

What is the purpose of doing that?

Using only 2*B forward passes through our neural network (for a batch of B pairs), we get B*B scores. That is quite useful when your neural network is compute-intensive.
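Concretely, this in-batch objective can be written as a cross-entropy over the B×B score matrix, where the correct class for row i is column i. A sketch, not necessarily leboncoin's exact loss:

```python
import torch
import torch.nn.functional as F

def in_batch_loss(x_emb: torch.Tensor, y_emb: torch.Tensor) -> torch.Tensor:
    """x_emb, y_emb: (B, D) embeddings of the x ads and their matching y ads."""
    scores = x_emb @ y_emb.T                 # (B, B): s_ij = dot(x_i, y_j)
    targets = torch.arange(scores.size(0))   # positives sit on the diagonal
    return F.cross_entropy(scores, targets)  # off-diagonal scores act as negatives

# With a batch of B = 3 pairs, 2*B = 6 forward passes yield B*B = 9 scores.
loss = in_batch_loss(torch.randn(3, 64), torch.randn(3, 64))
```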

OpenAI and YouTube are two examples of companies that use this approach.

Production architecture

The production architecture differs slightly from that of the collaborative filtering approach. Instead of having one pipeline that we repeat frequently, we have two. The first pipeline runs infrequently (every 3 to 6 months) to train the Siamese network. The second pipeline runs at least once a day to update the k-NN index.

We do not need to train the Siamese network frequently because all the features it uses are quite stable (there are no IDs). For example, the model will learn that a blue chair in Paris is similar to a blue chair in a town near Paris. This same property also enables this approach to handle the cold-start problem: if another blue chair is added to our website in Paris tomorrow, we simply compute its embedding with the pretrained Siamese network and add it to the k-NN index.

However, since we update the index only in batches, there may still be some cold-start issues. That is why we are currently transitioning to real-time index updates, so that we can get rid of this cold-start problem entirely.
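Under that scheme, handling a brand-new ad boils down to one forward pass through the frozen encoder plus an index insertion. A sketch reusing the toy `encoder` and FAISS `index` from the earlier examples:

```python
import faiss
import torch

# `encoder` and `index` are the toy objects from the earlier sketches.
new_ad_features = torch.randn(1, 128)  # placeholder features of the new blue chair

# One forward pass through the frozen, pretrained encoder...
with torch.no_grad():
    new_emb = encoder(new_ad_features).numpy().astype("float32")

# ...then insert the embedding into the k-NN index, with no retraining needed.
faiss.normalize_L2(new_emb)
index.add(new_emb)
```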

Our learnings

  • The collaborative filtering approach requires no feature engineering. It is a good way to start, as it is easy to set up and you do not have to worry about how you represent the ad's text, image, or parameters: the ID is all that matters.
  • Collaborative filtering requires frequent retraining: The model must be retrained several times a week, versus every 3 to 6 months for the Siamese network.
  • A content-based approach can suffer from missing features: Either because you forgot them or because they are too complicated to take into account.
  • You can fail to combine features properly in your content-based architecture: For us, the problem occurred while attempting to combine an ad's text, image, and tabular data. Sometimes you add a feature, but the resulting embedding barely takes it into account.
  • When dealing with large amounts of interaction data, we found that the collaborative filtering approach is more effective.

However, keep in mind that all those observations may change with foundation models (Amazon Titan and OpenAI models, for instance).
