Training State of the Art Semantic Search Models

Zachariah Zhang · Published in NLPlanet · Dec 4, 2022
Stable Diffusion — man searching for a book in an ancient library

TLDR — The key innovation behind recent huge advances in semantic search has primarily been improvements in training methodology. Naive approaches to contrastive learning don’t present a challenging enough problem to force models to learn interesting representations, while erroneously sampling false negative examples greatly hurts model performance. New techniques that combine distillation from larger but less efficient teacher models with hard negative mining have allowed researchers to blow past previous baselines.

Introduction

Search is hugely important for creating a good user experience. It is one of the most commercially successful applications of ML, and it is only getting better as models build a deeper understanding of natural language semantics. With how quickly the field has advanced, many companies are likely not realizing the full value of their data because they are not using the latest methodology.

The last several years have seen an explosion of progress in semantic search, with performance on the MSMARCO benchmark far surpassing what was considered state of the art only recently.

Many other domains have achieved progress through scale, either by increasing model size or by adding more data. However, state-of-the-art semantic search models aren’t much bigger than their previous counterparts. The main difference is their underlying training methodology: a better understanding of contrastive learning, as well as of the strengths and weaknesses of different search models, has led to multiple innovations that greatly improve performance without the need for additional resources.

In this article I will:

  • Give a brief refresher on contrastive learning
  • Discuss negative sampling and its effect on model performance
  • Review the state of the art in training methodology for semantic search

I would highly recommend Nils Reimers’ talk on this topic as well.

Contrastive Learning Refresher

Contrastive learning pulls the embeddings of similar text (positive samples) closer together and pushes those of dissimilar text (negative samples) farther apart. A full tutorial is beyond the scope of this article, but I would refer readers to Understanding Contrastive Learning for more details.

  • Anchor (query) — a point from our dataset, e.g. “Software Eng”. In the context of search, this is the query.
  • Positive sample — an example that is similar to our anchor point, e.g. “backend engineer”.
  • Negative sample — an example that is dissimilar to our anchor point, e.g. “waiter”.

A common formulation is the triplet loss, where we enforce that the positive be at least some margin epsilon closer to the anchor than the negative.
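As a concrete reference, here is a minimal sketch of such a triplet loss on cosine similarity (a PyTorch sketch; the margin value and embedding size are illustrative choices, not taken from any particular paper).

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    # The positive must score at least `margin` (the epsilon above) higher
    # than the negative with respect to the anchor, or we pay a penalty.
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.relu(neg_sim - pos_sim + margin).mean()

# anchor/positive/negative embeddings, e.g. from a bi-encoder
a, p, n = (torch.randn(8, 384) for _ in range(3))
loss = triplet_loss(a, p, n)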

We can often define positive samples for a given query through manual annotation or user click logs. However, we don’t typically label negative documents, as there are far too many in the corpus for this to be practical. This creates an auxiliary task of selecting negative examples from the corpus, and which negatives we choose hugely affects model performance, as we discuss in the next section.

Negative Sampling — Contrastive Learning Volcano

When selecting negative examples from the corpus, there is a tension between selecting negatives that provide meaningful supervision to the model and erroneously sampling false negatives, which I describe with the Contrastive Learning Volcano. The Contrastive Learning Volcano is a function describing the relationship between the value of a sampled negative and its “true” similarity to the query.

Easy vs Hard Negatives

On the long tail of the volcano, we find easy negatives: examples that are semantically very dissimilar from our query and therefore easy to discriminate from positives. Historically, it has been typical to randomly sample these from the corpus, since a random draw is highly unlikely to be relevant to a given query.

One issue with random negatives is that they tend not to challenge the model. Consider the example below.

Ex) Random Negative
Anchor - Software engineer
Positive - SDE
Random Negative - Waiter

This triplet can be solved with very little understanding of the difference between software engineer and SDE: the model only needs to recognize that both titles are engineering roles to know the positive is more relevant than waiter. When training models on this task, it is typical to see the training loss approach 0 while validation metrics stagnate, which indicates that the task is under-constrained.

By contrast, hard negatives are examples that are difficult for the model to discriminate. In the example below, the model must understand that software engineer and SDE share a programming component that quality assurance engineer does not.

Ex) Hard Negative
Anchor - Software engineer
Positive - SDE
Hard Negative - Quality assurance engineer

These examples push the model to learn non-trivial relationships between texts and therefore improve generalization. Such valuable examples are rarely found through random sampling, which has led to a technique called hard negative mining.

False Negatives

Hard negatives help training by forcing the model to learn better representations. However, if a selected negative is actually more similar to the query than the positive example, it introduces noise that can be detrimental to model performance (and we metaphorically fall into the volcano).

Ex) False Negative
Anchor - Software engineer
Positive - SDE
False Negative - Software eng

“Software engineer” is similar to “Software eng”, and we don’t want to push its embedding further away. In the following sections, we describe techniques for sampling negatives that mitigate the issues described above.

In-Batch Negatives

One school of thought on improving training is to sample many random negatives from the corpus. If we sample enough of them, we are likely to include some hard negatives that improve training quality.

If we get a lot of random negatives we will eventually get some good ones in there

However, we can’t scale the number of negatives very far as we are limited by GPU memory. In-batch negatives are a technique in which positive example embeddings are reused as negatives for other examples within the same batch. I show a figure from CLIP which I think illustrates this well.

Contrastive learning using in-batch negatives (figure from CLIP)
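Here is a minimal sketch of the in-batch negatives idea in PyTorch (the temperature value is an illustrative choice; this is not the exact CLIP or RocketQA implementation). Each query's positive document doubles as a negative for every other query in the batch.

import torch
import torch.nn.functional as F

def in_batch_negatives_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch, dim); doc_emb[i] is the positive for query_emb[i]
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # (batch, batch) similarity matrix: the diagonal holds the positives,
    # every off-diagonal entry acts as an in-batch negative
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)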

RocketQA analyzes how the number of negatives sampled impacts performance on MSMARCO. We see a huge difference just from increasing the batch size: results go from fairly uninteresting to nearing the state of the art for dense embeddings. The other major component is hard negative mining, which we introduce in the next section.

RocketQA analysis of the importance of hard negatives vs. in-batch negatives

Hard Negative Mining

Rather than hoping to sample hard negatives by chance, it is common to perform hard negative mining, where we explicitly search for hard negative examples using an existing model.

Hard negatives could come from the structure of your data. Using Pairwise Occurrence Information to Improve Knowledge Graph Completion on Large-Scale Datasets deterministically limits the space for negative sampling to entities that can plausibly hold a given relationship (e.g., (Berlin, is-capital-of, France) is a more plausible negative than (Berlin, is-capital-of, George Orwell)).

Another common technique is to use another search model to select the top K most similar items to the query and use these as negatives. Dense Passage Retrieval was one of the first search papers to do this, using negatives from BM25. ANCE and ColBertQA do this with dense vectors, with the hypothesis that a more powerful search model yields better negatives. The models are trained iteratively, with the previous iteration's model used to sample negatives for the next.
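As a rough sketch of this mining loop (assuming FAISS for the nearest-neighbor search; encode, corpus, corpus_embs, and positives are placeholder names for your own query encoder, documents, document embeddings, and labeled positives):

import numpy as np
import faiss

def mine_hard_negatives(queries, corpus, corpus_embs, encode, positives, k=50):
    # Index the corpus embeddings for inner-product search
    index = faiss.IndexFlatIP(corpus_embs.shape[1])
    index.add(corpus_embs.astype(np.float32))
    hard_negatives = {}
    for query in queries:
        query_emb = encode(query).astype(np.float32)[None, :]
        _, ids = index.search(query_emb, k)
        # Top-ranked documents that are not labeled positives become hard negatives
        hard_negatives[query] = [corpus[i] for i in ids[0] if corpus[i] not in positives[query]]
    return hard_negatives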

As shown below, different retrievers tend to have different biases in which negative examples they surface to the model. In practice, it is often good to ensemble hard negatives from a variety of models (From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective).

https://youtu.be/XHY-3FzaLGc?t=1092

Another technique that is growing in popularity is to dynamically sample negatives during training, allowing the sampling distribution to adapt to the needs of the model (RocketQAv2, AR2).

Cross Encoder De-noising

A fundamental issue with hard negative mining is that the better our seed model is, the more likely we are to sample false negatives. This has been shown to be highly detrimental to overall performance, as it introduces noise into the labels. To alleviate this issue, it is common to add a denoising step that attempts to remove false negative examples.

Ablation from RocketQA. Note that without denoising, performance is significantly worse.

One common technique, proposed in RocketQA and Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation, is to use larger and more powerful cross-encoder models to filter false negatives. Cross-encoders are much better at modeling similarity, but their computational inefficiency has pushed most research toward embedding-based search.

We can score each sampled negative using a cross-encoder and discard examples whose similarity is above a threshold (shown below).

Ex) Cross Encoder Denoising
Anchor - Software engineer
Hard Negative - Quality assurance engineer
False Negative - Software eng
CrossEncoder("[CLS] Software engineer [SEP] Quality assurance engineer") = .55 < Threshold
CrossEncoder("[CLS] Software engineer [SEP] Software eng") = 0.90 > Threshold
Remove "Software eng" as it scores too highly with cross encoder model

This leads to a more complicated training pipeline in which we have to:

  1. Train a bi-encoder with random negatives
  2. Train a cross-encoder on negatives from this model
  3. Train a second bi-encoder using the cross-encoder for denoising

RocketQA training pipeline

This multi-stage training pipeline can be cumbersome to manage. Newer techniques train the cross-encoder and bi-encoder simultaneously, framed as a knowledge distillation problem. RocketQAv2 has two loss components:

  1. Supervised loss for the cross-encoder
  2. List-wise KL divergence loss between the cross-encoder and bi-encoder scores (sketched below)
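Below is a minimal sketch of the second component (not the authors' exact implementation; it assumes both models produce raw relevance scores for the same list of candidates per query):

import torch.nn.functional as F

def listwise_distillation_loss(bi_scores, cross_scores, temperature=1.0):
    # bi_scores, cross_scores: (batch, num_candidates) relevance scores from
    # the bi-encoder (student) and the cross-encoder (teacher) over the same candidates
    student_log_probs = F.log_softmax(bi_scores / temperature, dim=-1)
    teacher_probs = F.softmax(cross_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")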

The authors show that this approach can far outperform the previous state of the art on MSMARCO as well as on cross-domain benchmarks.

Conclusion

Training methodology for semantic search has improved dramatically over the past few years as we have built a better understanding of the importance of data sampling. By incorporating hard negative mining and in-batch negatives, we maximize the number of “hard negatives”, which push models to better understand the semantics of our data. By incorporating cross-encoder-based filtering and distillation, we remove “false negatives”, which are detrimental to training.

By improving the training methodology for semantic search, one can maximize the value of one's data without increasing model size or data volume. These ideas apply more generally to any kind of contrastive learning problem, such as knowledge graph embeddings, joint text-image models like CLIP, or question answering.

Zachariah Zhang

Staff Machine Learning Engineer at Square. I like to write about NLP.