Hybrid Retrieval for optimized search relevance at leboncoin

5 min read · Sep 18, 2023

By Matyas Amrouche, Senior Data Scientist in the Search Team

Dual lexical and semantic similarities with Elasticsearch. Photo by Matt Nelson on Unsplash.

Relevance is a critical component of any search system. In a previous post we introduced how we improved leboncoin’s search engine relevance by ranking the most desired ads at the top of our users’ search results. Today, we won’t be looking at ranking anymore but at its complementary side: retrieval, which focuses on recall.

According to Wikipedia, recall is the fraction of relevant instances that were retrieved. For us, it means the pool of relevant ads matching a user’s query: the more relevant ads we retrieve, the better the search engine performs.

Traditionally, the retrieval part of leboncoin’s search system was done by lexical matching (aka keyword matching) using the good old BM25 algorithm built into Elasticsearch.

NB: In short, BM25 is a keyword-based matching algorithm that weights words’ importance up or down within a corpus of documents. It is an improvement on the famous TF-IDF algorithm you may already know.
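For the curious, here is a minimal sketch of the Okapi BM25 formula in Python, using the smoothed IDF variant found in Lucene/Elasticsearch; the toy corpus and the default k1/b values are just for illustration:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query (Lucene-style smoothed IDF)."""
    n_docs = len(corpus)
    avg_dl = sum(len(doc) for doc in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        df = sum(1 for doc in corpus if term in doc)          # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # rarer term => higher weight
        freq = tf[term]
        norm = k1 * (1 - b + b * len(doc_terms) / avg_dl)     # document length normalization
        score += idf * freq * (k1 + 1) / (freq + norm)
    return score

corpus = [
    ["airpods", "pro", "quasi", "neufs"],
    ["écouteurs", "sans", "fil", "bluetooth"],
    ["coque", "iphone", "13"],
]
print(bm25_score(["écouteurs", "sans", "fil"], corpus[1], corpus))
```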

Although BM25 is a solid baseline for information retrieval, its lack of semantic knowledge makes our users miss relevant ads from our catalogue. For instance, a wireless earphones query can’t be matched with an AirPods ad, nor a wizard suit query with a Harry Potter disguise ad. There is no lexical overlap between these queries and ads, yet they are clearly semantically close.

What a shame… If only there was a way to leverage such semantic understanding to have more relevant ads to display to our users… 🙏

I guess you can see what’s coming here 🤗! Deep learning models and, more specifically, large language models (LLMs) carry such knowledge within their pre-trained parameters.

By leveraging this embedded knowledge, we expected our search system to become more robust to misspellings, improve its synonym and semantic awareness, and bring additional relevant content to our end users.

And that’s what we saw after multiple A/B experiments. Our new hybrid search, combining lexical BM25 scoring with semantic scoring computed by an off-the-shelf pre-trained LLM, greatly improved leboncoin’s search efficiency and drastically reduced the number of empty search results! You can have a look at real-world examples in the Appendix section.

🤖 The Model

So, is this semantic retriever model completely off the shelf, really 🤔…? Well, not exactly but almost!

Figure 1. Pre-training and fine-tuning of Search CamemBERT

1) CamemBERT is a 110M-parameter language model pre-trained on ~140GB of French text with a masked language modeling (MLM) task [1], learning a probability distribution over the French vocabulary for masked words in a text.
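As a quick illustration of what MLM pre-training gives you, the vanilla checkpoint can be poked with Hugging Face’s transformers pipeline (camembert-base is the published checkpoint; the predictions in the comment are indicative):

```python
from transformers import pipeline

# Vanilla CamemBERT, as published on the Hugging Face hub.
fill_mask = pipeline("fill-mask", model="camembert-base")

# The MLM head scores the whole French vocabulary for the <mask> slot.
print(fill_mask("Le camembert est <mask> :)"))  # e.g. "délicieux", "excellent", ...
```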

2) In a second step, vanilla CamemBERT is fine-tuned on a dense passage retrieval (DPR) task [2] [3]. This task maps questions to positive passages (containing the answer) and negative ones (not containing the answer). Given that this task is closely related to semantic search, we named this fine-tuned model Search CamemBERT (Figure 1).
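Here is a minimal sketch of how such a bi-encoder can be used at inference time, assuming a hypothetical checkpoint name and simple mean pooling (the actual pooling strategy depends on how the model was fine-tuned):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical checkpoint name; any CamemBERT bi-encoder fine-tuned
# on a DPR-style task would play the same role here.
MODEL_NAME = "my-org/search-camembert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed(text: str) -> torch.Tensor:
    """Map a query or an ad title to a single 768-d vector (mean pooling)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Lexically unrelated, semantically close:
query_vec = embed("écouteurs sans fil")
ad_vec = embed("AirPods Pro quasi neufs")
print(torch.dist(query_vec, ad_vec, p=2))  # l2 distance: lower means closer
```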

Unfortunately, Search CamemBERT’s hidden vector representation has 768 dimensions… The curse of dimensionality strikes again ⚡

To deal with this curse, we applied the principal component analysis (PCA) charm to a subset of our users’ queries and catalogue ads’ title vector representations, projecting the original 768-dimension vectors into a 6x smaller space of 128 dimensions 🪄

Figure 2. Dimension Reduction head on top of the Search CamemBERT

The most computation-intensive tasks, the pre-training (MLM) and the fine-tuning (DPR), were already done. The trick was to add a proper dimension reduction head on top of Search CamemBERT to make it scale to our traffic and ads catalogue.
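A minimal sketch of that head with scikit-learn, assuming you have already collected a sample of 768-dimension vectors (the random data below is a placeholder for real query and ad-title embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for a real sample of 768-d query / ad-title embeddings.
sample_embeddings = np.random.randn(10_000, 768).astype(np.float32)

# Fit the projection once, offline, on the sampled vectors.
pca = PCA(n_components=128)
pca.fit(sample_embeddings)

# The fitted projection matrix then acts as a frozen "dimension
# reduction head": every new vector is reduced before being indexed.
vec_768 = sample_embeddings[0:1]
vec_128 = pca.transform(vec_768)
print(vec_128.shape)  # (1, 128)
```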

Told you, almost off the shelf! 😉

🚀 The Serving

Enough teasing! Now let’s see how it works under the hood 👀

Figure 3. Ads are stored in Elasticsearch as strings and dense vectors of floats.

First things first: when a user places an ad on leboncoin, it is stored in our Elasticsearch cluster as text data AND as a vector of floats [4] produced by our semantic retriever (as shown in Figure 3).
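Here is what such an index could look like with the official Elasticsearch Python client; the index and field names are illustrative, and embed_and_reduce is a hypothetical stand-in for the Search CamemBERT + PCA pipeline described above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # illustrative endpoint

def embed_and_reduce(text: str) -> list[float]:
    """Stand-in for the Search CamemBERT + PCA pipeline: returns 128 floats."""
    raise NotImplementedError  # hypothetical helper, wired up in production

# Illustrative mapping: each ad is stored as text AND as a 128-d float
# vector, indexed for ANN search with the l2_norm distance.
es.indices.create(
    index="ads",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "title_embedding": {
                "type": "dense_vector",
                "dims": 128,
                "index": True,
                "similarity": "l2_norm",
            },
        }
    },
)

es.index(
    index="ads",
    document={
        "title": "AirPods Pro quasi neufs",
        "title_embedding": embed_and_reduce("AirPods Pro quasi neufs"),
    },
)
```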

Now that we have ads in our Elasticsearch cluster, it is time for our users to search them!

Figure 4. The retrieval task, the core of search ❤️

So what’s going on here in Figure 4?

When a user submits a query, it is sent to both:

  • the lexical retriever
  • the semantic retriever

On the one hand, the lexical matching is done with Elasticsearch inverted indexes, as usual.

On the other hand, the semantic matching is done with an approximate nearest neighbors (ANN) [5] strategy using Elasticsearch’s built-in HNSW graph algorithm [6].

Lots of buzzwords here, so let’s see what they mean. In short, it would take too long to compute all the semantic similarity scores (l2_norm distance in our case) between the query and every ad in our catalogue. So a reduced pool of potentially interesting ad candidates is identified first; that is the approximation. Next, distances are computed only between the query’s and those candidates’ vector representations. Then we filter out ads with a semantic similarity score below a predefined threshold, and here we have our semantically similar retrieved ads 🧑‍🍳.
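Continuing the sketch above, that whole recipe boils down to a single knn clause; the k, num_candidates, and similarity cut-off values below are illustrative, not our production settings:

```python
query = "déguisement sorcier"  # "wizard suit"

response = es.search(
    index="ads",
    knn={
        "field": "title_embedding",
        "query_vector": embed_and_reduce(query),
        "k": 100,               # neighbours returned after the approximation
        "num_candidates": 500,  # size of the candidate pool explored per shard
        # Minimum similarity cut-off (supported in recent Elasticsearch
        # versions); weaker matches are filtered out.
        "similarity": 0.5,
    },
)
```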

Finally, ads from both lexical and semantic retrievers are mixed into a ranked search result, which will be sent to our Reranker model for optimal sorting before being served to the end users (you can refer to our previous post for more details about the reranking part).
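One way to express this mix in a single request is to combine a match query with a knn clause, letting Elasticsearch sum the two scores (the boosts here are illustrative):

```python
response = es.search(
    index="ads",
    # Lexical leg: classic BM25 over the inverted index.
    query={"match": {"title": {"query": "déguisement sorcier", "boost": 1.0}}},
    # Semantic leg: ANN over the HNSW-indexed vectors; Elasticsearch
    # sums both scores, and the merged list then goes to the Reranker.
    knn={
        "field": "title_embedding",
        "query_vector": embed_and_reduce("déguisement sorcier"),
        "k": 100,
        "num_candidates": 500,
        "boost": 1.0,
    },
)
```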

Et voilà! 👌

To conclude, starting from a simple POC on a single category, we ended up scaling and deploying our Hybrid Retriever to almost all of leboncoin’s categories. This team journey led to great results for our users and our business!

We always have more tricks in our bag to improve our search system, so stay tuned for the next post! 😉

Big-up to the Search Team’s Backend Engineers (Gilles Barsotti, Paul Dennetiere, Pâris Douady, and Clément Sauvage), who truly nailed the integration of this hybrid solution with our existing stack and made it scale to our whole catalogue!

All images are by the author unless otherwise specified.

References:

[1] CamemBERT model
[2] Dense Passage Retrieval paper
[3] Etalab model
[4] Elasticsearch vector DB
[5] ANN explained
[6] HNSW graph

Appendix:

A. Real examples from leboncoin search results

Real examples of AirPods ads appearing in wireless earphones search results, and of Harry Potter disguise ads in wizard suit search results.

B. Search CamemBERT with Dimension Reduction head recipe

Step 1 (MLM) and Step 2 (DPR)
Step 3: PCA dimension reduction projection matrices added on top of Search CamemBERT
