Embedding based retrieval for search & recommendation

Jaideep Ray · Better ML · Jul 13, 2021

TL;DR:

  • Use embedding based retrieval to leverage rich semantic features from the query and documents.
  • A two-tower DNN can be trained and served efficiently in large-scale search and recommendation setups.
  • Blend documents retrieved from the embedding index and the term-based inverted index for the best results.

1. Retrieval as a large-scale classification task

The traditional retrieval system in search and recommendation relies on inverted-index based term matching. This has a few drawbacks: semantic concepts are not used during retrieval, embedding representations cannot be used directly, and so rich information from the content side is left unused.

The retrieval task, i.e., fetching a set of relevant documents for a given input query, can be framed as extreme multi-class classification.

Each document in the corpus represents a class, and the network is trained to identify the right document for a given query.
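Concretely, framing retrieval as extreme classification means modeling the probability of a document given a query with a softmax over the entire corpus. A sketch of this standard formulation, where e(q, d) denotes the query-document score discussed in the later sections:

```latex
% Each document d in the corpus D is one class; e(q, d) is the
% query-document score (here, a dot product of their embeddings).
P(d \mid q) = \frac{\exp\big(e(q, d)\big)}{\sum_{d' \in \mathcal{D}} \exp\big(e(q, d')\big)}
```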

2. Retrieval setup

Embedding based retrieval is built on a nearest neighbor search engine. Given a query embedding (q_e), the engine searches for the K nearest neighbors among the document embeddings (d_e) in the corpus. The similarity score is usually the dot product of q_e and d_e.

In this setup, only q_e needs to be computed in real time. Document embeddings can be computed offline in batch mode. This makes serving from an embedding index extremely fast.
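As a concrete illustration, here is a minimal sketch of this setup using FAISS [3] with an exact inner-product index. The embedding dimension, corpus size, and the doc_embeddings / query_embedding arrays are placeholders:

```python
import numpy as np
import faiss  # nearest neighbor search library, see [3]

d = 128             # embedding dimension (placeholder)
num_docs = 100_000  # corpus size (placeholder)

# Document embeddings are computed offline in batch mode and stored.
doc_embeddings = np.random.rand(num_docs, d).astype("float32")

# IndexFlatIP scores candidates by inner (dot) product, matching the setup above.
index = faiss.IndexFlatIP(d)
index.add(doc_embeddings)

# At serving time, only the query embedding q_e is computed in real time.
query_embedding = np.random.rand(1, d).astype("float32")
k = 10
scores, doc_ids = index.search(query_embedding, k)  # top-K nearest documents
```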

3. How to train query & document embeddings?

The embeddings are trained as part of a two-tower network (one tower for the query and one for the document), and the labels usually come from user feedback such as clicks.

The two-tower DNN architecture computes the logit e(x, y). As shown in the figure below, the left tower and the right tower learn latent representations of the query and the item separately, mapping query features x and item features y into a shared embedding space. The dot product of the two tower outputs produces the logit.

Note that the query embedding and the document embedding are independent of each other and can be used separately during serving.
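Here is a minimal sketch of such a two-tower network in PyTorch. The feature dimensions, hidden layer sizes, and MLP structure are illustrative assumptions, not the exact architectures used in [1] or [2]:

```python
import torch
import torch.nn as nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower sketch: separate query and item towers, dot-product logit."""

    def __init__(self, query_dim: int, doc_dim: int, emb_dim: int = 64):
        super().__init__()
        # Left tower: maps query features x into the shared embedding space.
        self.query_tower = nn.Sequential(
            nn.Linear(query_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )
        # Right tower: maps document features y into the same embedding space.
        self.doc_tower = nn.Sequential(
            nn.Linear(doc_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, query_feats: torch.Tensor, doc_feats: torch.Tensor):
        q_e = self.query_tower(query_feats)  # [B, emb_dim]
        d_e = self.doc_tower(doc_feats)      # [B, emb_dim]
        # The logit e(x, y) is the dot product of the two tower outputs.
        logits = (q_e * d_e).sum(dim=-1)     # [B]
        return logits, q_e, d_e
```

Because the towers share no parameters and interact only through the dot product, d_e can be exported offline for every document while q_e is computed per request.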

4. Why use two-tower networks?

The key design principle of the two-tower architecture is that the query embedding and the item embedding are independent of each other after the model is trained.

[Figure: two-tower training network]

1/ During serving, the two embeddings can be computed separately. All document embeddings can be computed offline to build a document embedding index for fast nearest neighbor search, while the query embedding is computed online to handle arbitrary user queries.

2/ Even though the embeddings are computed separately, the simple dot-product interaction between the query and document towers means the query and document embeddings still lie in the same geometric space.

3/ Retrieval, i.e., finding the K nearest items for a given query embedding, is then equivalent to selecting the K items that minimize the loss (maximize the dot-product score) for that query.

Paraphrased from [2].

5. Challenges in training the two-tower DNN

A big challenge in training such a multi-class classification model with softmax is the training cost when the number of classes is huge (a corpus can contain millions of documents).

In each training step, you would have to compute scores for, and back-propagate the loss through, every document in the corpus (the clicked document is the positive, all others are negatives), and training iterates over the whole training set. Computing scores for all items does not scale, so we use a sampling mechanism and compute scores for only a subset of documents during training.


[Batch negative sampling] A commonly used sampling strategy for two-tower DNN models is random in-batch negative sampling. Batch negative sampling treats the other items in the same training batch as sampled negatives, so the sampling distribution Q follows the unigram distribution based on item frequency in the training data.

This avoids feeding additional negative samples to the right tower and thus saves computation cost. Given B {query, item} pairs in a batch, the features of the B queries and B items go through the left and right towers of the model, and each item embedding is used as a negative example for the other B-1 queries.

[Figure: query embeddings × item embeddings score matrix [1]]
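A minimal sketch of this in-batch softmax loss, continuing the hypothetical TwoTowerModel sketch above: the B×B matrix of query-item dot products is treated as the logits of a B-way classification whose positives lie on the diagonal.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(q_e: torch.Tensor, d_e: torch.Tensor) -> torch.Tensor:
    """q_e, d_e: [B, emb_dim] embeddings of B positive (query, clicked item) pairs."""
    # [B, B] scores: row i is query i scored against all B items in the batch.
    logits = q_e @ d_e.T
    # Row i's positive is item i; the other B-1 items serve as its negatives.
    labels = torch.arange(q_e.size(0), device=q_e.device)
    return F.cross_entropy(logits, labels)
```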

6. Controlling the bias and variance of the gradient estimator during training

Random batch negative sampling has two problems:

  • Selection bias: documents that never appear in the training data (e.g., long-tail or fresh items) can never be sampled as negatives.
  • Lack of control over the proposal distribution Q: it is tied to item frequency in the training data.

A mixed negative sampling strategy: maintain a separate pool of documents sampled uniformly from the whole corpus (the index data). For every training batch, combine the B in-batch negatives with B' negatives drawn from the index data.

This addresses both problems: using B + B' negatives reduces selection bias, especially because B' is sampled from the entire corpus rather than only from the training data.

You can vary B' to control the gap between the sampling distribution and the corpus distribution.
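A minimal sketch of mixed negative sampling on top of the in-batch loss above. index_doc_feats is an assumed tensor of B' documents sampled uniformly from the separate index data, and the logQ correction used in [1] is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def mixed_negative_loss(model, query_feats, doc_feats, index_doc_feats):
    """query_feats, doc_feats: features of B positive (query, item) pairs.
    index_doc_feats: features of B' negatives sampled uniformly from the corpus."""
    q_e = model.query_tower(query_feats)        # [B, emb_dim]
    d_e = model.doc_tower(doc_feats)            # [B, emb_dim]
    extra_e = model.doc_tower(index_doc_feats)  # [B', emb_dim]

    # [B, B + B'] logits: in-batch negatives plus uniformly sampled index negatives.
    all_docs = torch.cat([d_e, extra_e], dim=0)
    logits = q_e @ all_docs.T
    # Positives still sit on the diagonal of the first [B, B] block.
    labels = torch.arange(q_e.size(0), device=q_e.device)
    return F.cross_entropy(logits, labels)
```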

7. Overall retrieval system

  • With recent improvements in nearest neighbor search (e.g., FAISS [3]), embedding based retrieval is practical at scale.
  • One drawback of embedding based retrieval is over-generalization of the query. For navigational and very specific intents (person names, specific entity searches, needle queries with a very definite intent, newsy queries), embeddings don't do a good job. Embedding based retrieval does, however, do a great job at discovery and at leveraging rich semantic features of the query and user context.
  • A simple blend of documents from the embedding index and the inverted index can solve the above problem (see the sketch below). Since there are multiple stages of relevance ranking on top of retrieval, we can include many candidates from both the embedding index and the inverted index.
[Figure: overall retrieval setup]
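A minimal sketch of the blending step from the last bullet above. The embedding_index and inverted_index retrieval interfaces and the k_emb / k_term cutoffs are hypothetical placeholders; a real system would tune these per query segment:

```python
def blend_candidates(query, embedding_index, inverted_index, k_emb=200, k_term=200):
    """Merge candidates from both retrieval sources for the downstream ranking stages."""
    emb_docs = embedding_index.search(query, k_emb)   # semantic / discovery candidates
    term_docs = inverted_index.search(query, k_term)  # exact term-match candidates

    seen, blended = set(), []
    for doc in term_docs + emb_docs:  # keep both sources; later stages do fine-grained ranking
        if doc.id not in seen:
            seen.add(doc.id)
            blended.append(doc)
    return blended
```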

References:

[1] Mixed Negative Sampling for Learning Two-tower Neural Networks in Recommendations : https://research.google/pubs/pub50257/

[2] Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning: https://arxiv.org/pdf/2006.02282.pdf

[3] FAISS: https://github.com/facebookresearch/faiss
