Dense retrieval of similar products in a C2C marketplace

Xavier Cabanes Bosacoma
Published in Inside_Wallapop
Apr 2, 2024 · 12 min read

This article has been collaboratively written with Iñaki Gorostiaga.

Introduction

Wallapop is a second-hand marketplace founded in Barcelona in 2013 that currently operates in three countries: Spain, Italy, and Portugal. The platform connects the 19 million users who visit it each month, collectively creating over 100 million ads annually. On Wallapop, it's possible to buy and sell products from all consumer goods categories easily, quickly, and securely, and it's also a reference point in categories like automotive, where it's a leader among individuals in Spain.

Wallapop aims to create a unique inventory ecosystem of reused products that facilitates a more humane and sustainable consumption model. Its mission is to empower people to embrace a more conscious and human way of consumption, by connecting sellers and buyers of second-hand products looking for great pricing and convenience. The Personalization team at Wallapop contributes to this goal by trying to give the best possible recommendations of products to our users.

Items at Wallapop are unique: each one can be sold only once. This means two different sellers can upload the exact same product with different titles, descriptions, prices… It leads not only to a multiplicity problem but also to sparsity (few interactions per item) and volatility (items don't last long, only a few weeks).

Some of the most well-known recommendation techniques, such as collaborative filtering, next-product recommendation, frequently-bought-together, or popularity aggregations, become much harder and far from straightforward to implement.

In this article, we want to showcase what in the industry are called similar items recommendations, which by their nature can be used even with the singularities of Wallapop’s marketplace.

The wall

It is common to find recommendations along different touchpoints such as the home page, the product detail, the post-purchase screen…

At Wallapop, we call the homepage “The Wall”. The Wall is divided into different sections, showing diverse types of content to the user: products, categories, brands, users, searches…

The content for each section is retrieved individually based on different data about the user and their context. Several key recommendations include products from brands you may like, recently visited products, products from your most viewed category, popular searches done near you, new products from your last search…

We will be using an existing recommendation at Wallapop called Similar items to sold to showcase how we applied dense retrieval of similar products in the platform.

Similar items to sold

Do you remember that time you saw a product that you loved but when you went back to buy it, it was gone? The Similar items to sold initiative aims to find products that are similar to one the user already showed interest in, but that they couldn’t purchase because it was bought by someone else. In essence, we track the latest products each user has favorited, started a chat with, or attempted to purchase. When one of these products is sold to another user, we proactively identify and recommend alternative products that meet the user’s needs.

The key part of this recommendation lies in how to find products that are similar to a given one.

A product at Wallapop is characterized by different types of information, both structured and unstructured. It mainly contains:

Title, Description, Images, Categorization, Price, Brand…

Some of these attributes can have a finite set of values, such as categories or brands. However, other fields are unstructured and generated by the sellers, like the pictures or the Title and Description. This causes all products to be unique and can lead to low-quality and noisy data for some products. This poses a big challenge when trying to use this data, and it drives the decision on which type of approach and algorithm to choose when processing it.

To sum up, we want to retrieve items that are similar to a given one, which we will call the seed item, and to do so, we need to find a way to compare an item against our whole catalog.

Dense search vs sparse search

To compare two different items, we can use the available information that we mentioned previously: title, description, price, brand, image… Not all of this data is natively represented by numerical values that have structured meaning. Thus, we first need to encode them with a numerical representation. This way, we can define a similarity function between a pair of items. This function can be used to find items with a high similarity score in our catalog.

In this section, we want to discuss two different approaches to follow when trying to represent them: dense and sparse representations. Both generate numerical vectors given some input data (generally text and/or other types of data) to represent these documents (catalog items). The main differences between these two approaches are the size of these vectors and the amount of non-zero values they hold.

Dense vs sparse vectors from: Dense Vectors: Capturing Meaning with Code

In sparse representations, the vector size is normally in the order of tens of thousands (the size of the vocabulary). However, almost all of the entries of such vectors are 0, which allows them to be served efficiently from inverted indices. Some well-known algorithms in this domain are TF-IDF and BM25.
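To make the sparse approach concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer; the catalog and seed texts are toy placeholders, and this is an illustration rather than our production setup.

```python
# Minimal sketch of sparse retrieval with TF-IDF (scikit-learn); the catalog
# and seed texts are toy placeholders, not real Wallapop data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog = [
    "iphone 11 64gb black, good condition",
    "mountain bike 29 inch, barely used",
    "nintendo switch with two controllers",
]
seed = ["iphone 11 black 64 gb"]

vectorizer = TfidfVectorizer()
catalog_vectors = vectorizer.fit_transform(catalog)  # sparse, vocabulary-sized vectors
seed_vector = vectorizer.transform(seed)

# Most entries are zero, which is what inverted indices exploit at scale
scores = cosine_similarity(seed_vector, catalog_vectors).ravel()
print(scores.argsort()[::-1])  # catalog items ranked by similarity to the seed
```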

On the other hand, dense representations (also known as embeddings) are generally smaller and of fixed size, typically in the order of hundreds of dimensions (independent of the vocabulary size), and all of their entries are non-zero, making them more challenging to index and retrieve. For this reason, various indexes have been developed to facilitate efficient dense retrieval with exact k-NN and Approximate Nearest Neighbor (ANN) search, including methods such as LSH, HNSW, and IVF.

Sketch depicting the logical flow behind similarity based on embeddings
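As an illustration of approximate nearest-neighbor search over dense vectors, the sketch below builds an HNSW index with the faiss library; the random vectors stand in for real item embeddings and the parameters are arbitrary choices for the example.

```python
# Hedged sketch of ANN search with an HNSW index (faiss); random vectors stand in
# for real item embeddings.
import numpy as np
import faiss

dim = 384                              # typical sentence-embedding size, in the order of hundreds
catalog = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(catalog)            # normalized vectors: inner product == cosine similarity

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # 32 = HNSW graph connectivity
index.add(catalog)

seed = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(seed)
scores, ids = index.search(seed, 10)   # approximate top-10 most similar items
print(ids[0], scores[0])
```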

Not only has retrieval evolved, but also the quality of the information compressed in the vector has advanced rapidly in recent years, especially with the rise of the Transformers architecture. Some of these models include Doc2Vec (DBOW), ELMo (LSTM), and more recently, BERT (Encoder), GPT (Decoder), and similar.

The choice between dense and sparse search methodologies reflects a nuanced trade-off. Sparse searches excel in swift retrieval, offering a dependable baseline without the need for fine-tuning. Their strength lies in exact term matching, ensuring explainability. An industry standard model is BM25, which is supported in Lucene-based search engines such as Solr or Elasticsearch.

However, these advantages come at the cost of vulnerability to vocabulary mismatches, multilingualism, misspellings, and synonyms, along with limitations in capturing the intricacies of semantics and abstract meanings. This challenge becomes evident at Wallapop due to the necessity of offering international items across various country markets and the reliance on user-generated data. Users manually input details such as titles, descriptions, brands, and models, leading to potential inconsistencies. For example, two users may upload the same item but mention different brands or write titles and descriptions with typos.

Conversely, dense retrieval elevates semantic understanding to avoid the already-mentioned traditional constraints. Yet, this requires an additional fine-tuning effort to adapt between domains, in order to successfully surpass other sparse retrieval methods. In our case, having a very broad catalog also poses a challenge when adapting models to very diverse domains. Moreover, vector databases to index these representations require extensive computational resources, and can incur high infrastructure costs due to the lack of traditional sparse inverted indexing.

The strengths of one approach often mirror the weaknesses of the other: embeddings perform well for semantic expansion when the corpus is small, but suffer with computational efficiency when the corpus is large. A combined approach provides a well-balanced and more precise solution. Typically, this ensemble involves boosting scores in both dense and sparse fields, or even integrating them at various stages or merging them in different queries. An alternative approach designed to achieve this balance is SPLADE, which positions itself between these paradigms. SPLADE utilizes term expansion driven by embeddings, which allows the use of inverted indexes while adding the semantic capabilities of dense models.
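As a rough illustration of such an ensemble, the snippet below blends normalized sparse and dense scores with a weighted sum; the weights, the min-max normalization and the toy scores are assumptions made for the example, not our production logic.

```python
# Illustrative blend of sparse (e.g. BM25) and dense (e.g. cosine) scores for a
# shared candidate set; weights and normalization are arbitrary choices.
def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / ((hi - lo) or 1.0) for doc, s in scores.items()}

def hybrid_ranking(sparse, dense, alpha=0.5):
    """sparse/dense: {doc_id: score}. Returns docs ranked by the blended score."""
    sparse, dense = minmax(sparse), minmax(dense)
    docs = set(sparse) | set(dense)
    blended = {d: alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0) for d in docs}
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)

bm25_scores = {"item_1": 12.3, "item_2": 9.8, "item_3": 4.1}
cosine_scores = {"item_2": 0.91, "item_3": 0.88, "item_4": 0.75}
print(hybrid_ranking(bm25_scores, cosine_scores, alpha=0.6))
```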

Text embeddings using LLMs

Large Language Models (LLMs) are Machine Learning models from the field of Natural Language Processing that are trained on huge amounts of data to learn billions of parameters. This requires massive computational resources both at training and at inference time.

Most LLMs follow a transformer architecture, which is a deep learning setup that relies on the parallel multi-head attention mechanism. Transformers, in general, can be classified into one of the following types: Encoder only, Encoder-Decoder and Decoder only.

Encoder-Decoder architecture example for translation from: Natural Language Processing with Transformers, Revised Edition

Fundamentally, both encoder- and decoder-style architectures use the same self-attention layers to encode word tokens, and encoder-decoder models share context through that latent space, making them suitable for sequence-to-sequence tasks such as translation. As their name suggests, encoders are designed to encode data (usually unstructured) into a rich, low-dimensional latent space. Decoders, on the other hand, aim at generating text and thus can be used for tasks such as question answering, chatbots, text summarization, and other generative problems.

For the case of dense retrieval, the encoder architecture is the most widely used, since these models are able to better encode the information. With these representations, documents can be compared by defining similarity measures such as the cosine similarity, inner product, L1 or L2 distances.
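Written out with numpy, the measures mentioned above look as follows (the two toy vectors stand in for item embeddings):

```python
# The similarity/distance measures mentioned above, on two toy item embeddings
import numpy as np

a = np.array([0.1, 0.7, 0.2])  # embedding of item A (toy values)
b = np.array([0.2, 0.6, 0.1])  # embedding of item B

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
inner_product = a @ b
l1_distance = np.abs(a - b).sum()
l2_distance = np.linalg.norm(a - b)
print(cosine, inner_product, l1_distance, l2_distance)
```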

Usually, the final layer of encoders produces multiple vectors, each corresponding to a token. To condense this output into a manageable 1D array, pooling is required. Common pooling methods include “mean” or “max” layers, the usage of the “[CLS]” token, or custom functions with a couple of dense layers.

Example of different pooling methods from: Unsupervised Training for Sentence Transformers
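As a minimal sketch of the "mean" strategy, the function below averages token embeddings into a single sentence vector while ignoring padding; the shapes assume the usual (batch, sequence length, hidden size) output of an encoder.

```python
# Mean pooling over token embeddings, masking padding tokens (PyTorch sketch)
import torch

def mean_pooling(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum only real (non-padding) tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens per sentence
    return summed / counts                         # (batch, hidden) sentence embeddings
```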

The SentenceTransformers library for Python facilitates the usage of state-of-the-art text and image embeddings, abstracting the pooling, evaluation, and fine-tuning of embeddings to a few lines of code. Go check it out!
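For example, encoding items and scoring them against a seed takes only a few lines; the checkpoint below is a popular public model, not necessarily the one we use in production.

```python
# Encoding items and comparing them to a seed with SentenceTransformers;
# the model name is an example public checkpoint.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

items = [
    "iPhone 11 64GB black, good condition",
    "Mountain bike 29 inch, barely used",
]
seed = "iphone 11 negro 64 gb"

item_embeddings = model.encode(items, normalize_embeddings=True)
seed_embedding = model.encode(seed, normalize_embeddings=True)
print(util.cos_sim(seed_embedding, item_embeddings))  # similarity of the seed vs each item
```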

Offline model evaluation and fine-tuning

In our offline experiments for selecting a language model for embedding generation, we measured standard embedding evaluation metrics. The evaluation dataset consisted of positive pairs made of items clicked one after the other, while random items served as negative pairs. The models under study included BERT and DistilBERT-like encoder models with pooling heads, encoders trained from scratch using Masked Language Modeling (MLM) or Denoising AutoEncoders (TSDAE) with pooling heads, pre-trained sentence transformer models, and fine-tuned embeddings. In the offline evaluations, pre-trained sentence transformer models yielded the best results, surpassed only slightly by their fine-tuned versions.
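To give an idea of this kind of offline check, the toy snippet below measures how often a clicked-together positive is ranked above randomly sampled negatives (a simple hit rate@k); the data, model and metric here are illustrative, not our exact evaluation setup.

```python
# Toy offline evaluation: is the clicked positive ranked in the top-k against random negatives?
import numpy as np
from sentence_transformers import SentenceTransformer

def hit_rate_at_k(model, pairs, num_negatives=50, k=10, seed=0):
    """pairs: list of (query_text, positive_text); negatives are sampled from other positives."""
    rng = np.random.default_rng(seed)
    pool = [positive for _, positive in pairs]
    hits = 0
    for query, positive in pairs:
        negatives = [t for t in rng.choice(pool, size=num_negatives) if t != positive]
        candidates = [positive] + negatives
        embeddings = model.encode([query] + candidates, normalize_embeddings=True)
        scores = embeddings[1:] @ embeddings[0]   # cosine similarity (vectors are normalized)
        top_k = np.argsort(scores)[::-1][:k]
        hits += int(0 in top_k)                   # index 0 is the clicked positive
    return hits / len(pairs)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
pairs = [("iphone 11 64gb", "iphone 11 black 64 gb like new"),
         ("mountain bike 29", "29 inch mtb hardtail")]
print(hit_rate_at_k(model, pairs, num_negatives=1, k=1))
```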

Bi-encoder architecture schema from: Cross-Encoders — Sentence-Transformers documentation

The best fine-tuning results were obtained using a Bi-Encoder (a siamese BERT-like network) with large batch sizes (512) for in-batch negative sampling and the MultipleNegativesRankingLoss function, while freezing the first hidden layers of the model. The computational efficiency of this domain adaptation approach significantly outperformed the from-scratch methods mentioned above. We found this fine-tuning tutorial to be very helpful.
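A minimal sketch of this fine-tuning setup with the SentenceTransformers fit API is shown below; the checkpoint, the toy positive pairs and the layer-freezing snippet are assumptions for illustration (the exact attribute path for freezing depends on the underlying Hugging Face backbone).

```python
# Sketch of Bi-Encoder fine-tuning with in-batch negatives (MultipleNegativesRankingLoss);
# model name, data and frozen layers are illustrative placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Optionally freeze the first encoder layers (attribute path assumes a BERT-like backbone)
for layer in model[0].auto_model.encoder.layer[:4]:
    for param in layer.parameters():
        param.requires_grad = False

train_examples = [  # positive pairs, e.g. items clicked one after the other
    InputExample(texts=["iphone 11 64gb", "iphone 11 black 64 gb like new"]),
    InputExample(texts=["mountain bike 29", "29 inch mtb hardtail"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=512)  # large batches -> more in-batch negatives
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```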

All evaluation metrics used can be found here. The pre-trained model can be found here.

This architecture was chosen based both on the inherent symmetry of the problem (both queries and documents are items) and on the indexability of Bi-Encoders in a vector index, which distinguishes them from Cross-Encoders or other generative LLMs commonly employed as rerankers.

From idea to production

The main components of our solution are the Transformer model to build our products’ text embeddings, and a Vector Database, which is in charge of indexing those vectors. With the vectors indexed, it is possible to run similarity searches, i.e. ANN searches, to find those that are most similar to a given one.
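For reference, an approximate k-NN query against an index with a knn_vector field looks roughly like the snippet below; the index name, field name, connection details and the toy vector are illustrative placeholders.

```python
# Rough sketch of an ANN query with the OpenSearch k-NN plugin (opensearch-py client);
# index/field names and the seed vector are placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# In practice this would be the encoder output for the seed item's title + description
seed_embedding = [0.12, -0.03, 0.48]

response = client.search(
    index="items",
    body={
        "size": 10,
        "query": {
            "knn": {
                "embedding": {            # knn_vector field holding the item embeddings
                    "vector": seed_embedding,
                    "k": 10,
                }
            }
        },
    },
)
similar_item_ids = [hit["_id"] for hit in response["hits"]["hits"]]
```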

To deliver the solution, a set of pipelines was built for computing all the necessary embeddings and recommendations to showcase the desired items. Although we will not dive deep into the implementation of each of them, as a summary we have:

  • Update pipeline: to compute the embedding vectors for new and edited listings to be included in the Vector Database and to remove the ones already sold, expired, or removed.
  • Catalog pipeline: to perform a bulk insert of current items of our catalog into the Vector Database.
  • Seed pipeline: for computing the embeddings of the seed items in batch rather than at query time, since those are deleted by the update pipeline.

These pipelines ran every day, which meant that the computation of the recommendation seed and the updates on the candidate generation were done in batch and not in real-time.
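As a hypothetical skeleton of the update pipeline, the snippet below embeds new or edited listings and removes sold, expired, or deleted ones from the vector index in a single bulk call; the data-fetching side, index layout and model are placeholders, not our internal implementation.

```python
# Hypothetical daily update step: upsert embeddings for new/edited listings and delete
# sold/expired ones. Index name, field names and inputs are illustrative.
from opensearchpy import OpenSearch, helpers
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def daily_update(new_items, removed_item_ids, index="items"):
    # new_items: [{"id": ..., "title": ..., "description": ...}, ...]
    texts = [f'{item["title"]} {item["description"]}' for item in new_items]
    vectors = model.encode(texts, normalize_embeddings=True)

    upserts = [
        {"_op_type": "index", "_index": index, "_id": item["id"],
         "_source": {"embedding": vector.tolist()}}
        for item, vector in zip(new_items, vectors)
    ]
    deletes = [
        {"_op_type": "delete", "_index": index, "_id": item_id}
        for item_id in removed_item_ids
    ]
    helpers.bulk(client, upserts + deletes, raise_on_error=False)
```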

Results & Learnings

The performance of similar product retrieval was evaluated in an A/B test experiment comparing two approaches: embedding-only cosine similarity retrieval with OpenSearch versus textual relevance search with our current Solr search engine.

For the embeddings variant, the indexed vector field was computed from the title and description of each document, and the query vector from those of the seed. The baseline search retrieval, on the other hand, involved querying the search engine with the title of the seed.

We summarize all the aspects of both variants in the following table:

The results of the experiment indicated that users in the new variant converted from seeing the section to interacting with it (clicking or scrolling) slightly less than those in the baseline. The exact numbers for this metric, ICR (interaction conversion rate), are shown below:

This difference might be explained by the fact that embeddings exclusively relied on the similarity score for retrieval, while the sparse search results benefited from using other relevance factors like item-buyer distance, shippability, upload recency and other indexed fields like category and brands. This resulted in a broader overall catalog coverage (locality), more dynamic content (freshness), and improved ranking of results. Additionally, the decay of click-through rates over time for a given seed revealed that embeddings had a faster decay compared to the search results, which also seems to support our hypothesis of the content on the new variant being less dynamic.

Overall evolution through time of the CTR of the same seed recommendation.
Decay of the CTR through time of the same seed recommendation, i.e. 1 - CTR(t) / CTR(t0)

As a key takeaway, the experiment highlighted that traditional text indexing approaches are robust baselines for retrieving similar items, and that it is important to include other metadata for search relevance alongside the pure similarity score. However, the need to represent unstructured and multimodal data with embeddings, whether dense or sparse, emphasizes the importance of hybrid approaches in the present and future of information retrieval systems.

This is why a natural step towards improving this initiative would be to include ranking stages on top of the dense retrievers, to further include important features of the matching items such as freshness or distance. Another suitable approach would be trying to include the dense representations of the items into the sparse-powered existing search service, to leverage both representations and build hybrid retrievers.

Finally, having already developed a solution that involved embeddings, we want to keep exploring the use of such technology in other scenarios. Some use cases we foresee could bring value in the future include product clustering to reduce the number of unique SKUs, the use of images to generate the embeddings, topic modeling over the catalog, or a customer-support chatbot using retrieval augmented generation (RAG) over Help Center documents, where semantic search would be particularly useful since the corpus is small.

If you enjoyed this article and want to learn more, don’t hesitate to watch the presentation we made at the AWS 2023 Summit alongside the AWS team here.

References:

Compra y Venta de Artículos de Segunda Mano | Wallapop

SentenceTransformers Documentation — Sentence-Transformers documentation

Train and Fine-Tune Sentence Transformers Models

Apache Solr Reference Guide :: Apache Solr Reference Guide

OpenSearch

SPLADE — a sparse bi-encoder BERT-based model achieves effective and efficient first-stage ranking

Dense Vectors: Capturing Meaning with Code

Natural Language Processing with Transformers, Revised Edition

Nearest Neighbor Indexes for Similarity Search

Okapi BM25

tf–idf
