How to Build a State-of-the-art Text Embedding Model

A deep dive into one of the core technologies behind Retrieval Augmented Generation (RAG)

Coauthors: Daniel Campos, Danmei Xu

Consider these two queries: “What is Tom Cruise’s height?” and “How tall is Maverick from Top Gun?” They mean the same thing but share almost no words. How could a computer recognize them as similar? The modern AI solution is embedding functions that transform such pieces of text into mathematically similar vectors, enabling a search system to retrieve the same correct results! Embedding functions are so crucial to domains like information retrieval and generative AI that extensive research has gone into designing and measuring their quality. Popular benchmarks include MS MARCO, BEIR, MTEB, and TREC [3, 4, 5, 6]. At Snowflake, we handle more than 4 billion queries a day [15], and we don’t just use embedding functions; we build them. In this article, we’ll explain how text embedding models are trained, including the five tricks model-builders use to get to the top of the quality rankings. Let’s dive in!

A (far from comprehensive) history of embeddings

Using text embeddings to find relevant information is hardly new — search engines have been turning documents into vectors for decades. Even traditional keyword-based search can be considered a form of vector retrieval, where the vectors have one dimension per possible word and most values are 0. Keyword search has also commonly been recast into vectors of learned or extracted features, as shown in [1]. For example, the below image shows what embedding vectors look like when they’re tied directly to a document and to recognizable semantic features.

A visualization of the relation between keywords and vector representations from page 34 of [1].

However, with the proliferation of language models like BERT, creating text embeddings without a team of researchers has become much more straightforward, and text embedding quality has also improved. Modern language-model-based text embedding methods are pre-trained to consider the context in which words are used (e.g., “fly” can be a verb, noun, and adjective), delivering strong search quality even in cases where keyword-based search flounders. The below image gives a stylized 2d take on modern embeddings in which the features (or dimensions) are opaque (and, in actuality, numerous). Despite the opacity of the individual features, semantic text similarity is preserved by the geometric relationships between embedding vectors.

Language-model-based embeddings have taken over information retrieval

After researchers found it possible to deliver impressive retrieval performance simply by comparing the language-model-based text embedding vectors of questions with the corresponding vectors of documents [2], this line of research quickly went from niche to ubiquitous in information retrieval applications. Countless experiments have since improved how text embedding models are structured, how documents are represented, and what loss functions are used to train models, cementing model-based text embeddings as a cornerstone of the modern search stack.

With all of the developments in recent years, it can seem challenging to keep up with the so-called state of the art (SoTA). However, the goal of keeping up with SoTA is made tractable thanks to standardized public leaderboards and retrieval datasets. High-quality datasets and leaderboards such as MS MARCO [3] (which our own Daniel Campos helped create), BEnchmarking Information Retrieval (BEIR) [4], and the Massive Text Embedding Benchmark (MTEB) [5], along with longstanding research from NIST’s Text REtrieval Conference (TREC) [6], have not only provided a testbed for the quick experimentation behind the significant improvements of the last few years, but also produced a clear quantitative ranking of top-performing methodologies. This standardization of evaluation may also have played a role in the convergence of many leaderboard-topping efforts on the same general training recipe, as top-ranking models such as E5, BGE, GTE, Jina, and Nomic [7, 8, 9, 10, 11] all use roughly the same training approach under the hood.

The evolution of state-of-the-art text embedding systems is full of BERT-based models, as per [12].

The modern state-of-the-art recipe for building embeddings

Though training recipes may vary slightly model-to-model and not all leaderboard-topping models have published their training details, a surprisingly wide swath of the published recipes for top-scoring text embedding models leverage the following tricks of the trade.

Trick 1: Start with a pre-trained general-purpose language model

The first language-model-based text embeddings used pre-trained models like BERT “off the shelf,” treating the output vector corresponding to the special [CLS] token as a semantic representation of the text input — no training “recipe” needed! While not competitive with classical retrieval techniques, this initial approach worked well enough to prompt further research, and soon enough researchers figured out how to achieve excellent performance by fine-tuning BERT specifically for information retrieval. For several years now, the starting point of choice for training a text embedding model has been BERT. Although some groups have tried leveraging much larger (e.g., 7B-parameter) generative LLMs as their backbone architecture, even in 2024 a modest ~100M-parameter BERT backbone can hold its own in text embedding applications.
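As a quick illustration, here is a minimal sketch of that “off the shelf” [CLS] approach. We assume the Hugging Face transformers library here, and the checkpoint and inputs are purely illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

texts = ["What is Tom Cruise's height", "How tall is Maverick from Top Gun?"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# The [CLS] token always sits at position 0 of the sequence output.
cls_embeddings = outputs.last_hidden_state[:, 0]  # shape: (2, 768)

# Cosine similarity between the two query embeddings.
similarity = torch.nn.functional.cosine_similarity(
    cls_embeddings[0], cls_embeddings[1], dim=0
)
print(f"cosine similarity: {similarity.item():.3f}")
```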

Trick 2: Fine-tune for information retrieval with contrastive loss

The current dominant approach to fine-tuning text embedding models employs a contrastive loss function that explicitly considers the downstream use of the text embeddings — scoring the similarity between queries and documents with the cosine similarity metric. A contrastive loss function such as the InfoNCE loss (below) steers models to learn a notion of relevance by minimizing the distance between positive pairs (a query and a relevant document) while maximizing the distance between negative pairs (the same query and an irrelevant document). InfoNCE loss has become a mainstay of the “modern SoTA recipe,” while alternatives, like the AnglE-optimized loss [13], have also appeared. A detailed breakdown of contrastive learning is beyond the scope of this post, but Lilian Weng’s blog offers a great introduction to contrastive losses in [14] for those interested in some additional reading!

InfoNCE loss decreases as the similarity between a query q and the correct document k+ increases, and InfoNCE loss increases as the smooth maximum of the similarity between the query q and all documents k_i increases. When the similarity between q and k+ is high, and the similarity between q and all other k is low, InfoNCE loss approaches its minimum value of 0.
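To make the loss a bit more concrete, here is a minimal PyTorch sketch of InfoNCE for a single query. The temperature term that scales the similarities is a common ingredient in practice but isn’t prescribed by the description above, so treat its value here as a placeholder:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE for a single query (a sketch, not any paper's exact implementation).

    query:     (d,)   embedding of the query q
    positive:  (d,)   embedding of the relevant document k+
    negatives: (n, d) embeddings of irrelevant documents k_i
    """
    # Cosine similarity = dot product of L2-normalized vectors.
    q = F.normalize(query, dim=-1)
    docs = F.normalize(torch.cat([positive.unsqueeze(0), negatives], dim=0), dim=-1)
    logits = q @ docs.T / temperature  # (1 + n,) scaled similarities

    # Cross-entropy with the positive at index 0 is exactly
    # -log( exp(s_+) / sum_i exp(s_i) ), which approaches 0 when the
    # positive similarity dominates all the negatives.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)
```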

Is straightforwardly fine-tuning with a contrastive loss all we need to do to achieve state-of-the-art retrieval performance? The answer here is actually both yes and no. Fine-tuning with contrastive loss is technically all that today’s leaderboard-topping methods do. However, a straightforward application of contrastive training is no longer competitive — several additional tricks are needed! In particular, the state of the art improved dramatically when the E5 model [7] introduced both the “prefix trick” and the modern two-stage contrastive training recipe, with E5 becoming the first language-model-based text embedding method to outperform classical keyword-based retrieval algorithms on the BEIR benchmark.

Trick 3: Prefix your queries

The architecture of backbone language models like BERT offers no direct way to differentiate between queries (which are often short, like “Tom Cruise”) and documents, which are often long (like the Wikipedia article on Tom Cruise). This makes it tricky for these models to properly embed queries and documents into geometrically close vectors. While initial attempts solved this problem with “twin” architectures that used separate models for queries and documents, the modern SoTA recipe solves it simply by prepending a few special keywords to queries to differentiate them from documents. For instance, E5 prefixes all queries with the text “query: ” and all documents with “passage: ” to disambiguate the two during training and inference, while later models like BGE opt for a longer, instruction-style prefix on the query side and no prefix for documents. In practice, research has shown that this small dose of instruction tuning goes a long way towards boosting retrieval performance! [7, 8]
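Here is what the prefix trick looks like at inference time with a public E5 checkpoint. We assume the sentence-transformers library below; the “query: ” and “passage: ” prefixes come from the E5 paper [7], while the example passages are our own made-up snippets:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")

# Queries and documents go through the same encoder, but with different prefixes.
query = "query: How tall is Maverick from Top Gun?"
passages = [
    "passage: Tom Cruise is an American actor who stands about 170 cm tall.",
    "passage: Top Gun is a 1986 American action drama film.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# With normalized vectors, the dot product equals cosine similarity.
scores = p_embs @ q_emb
print(scores)  # the height-related passage should score higher
```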

Trick 4: Scale to large batch sizes to optimally leverage in-batch negatives

The first stage of the modern two-step training process is to leverage a large, weakly supervised dataset constructed from internet crawl data and train with a very large batch size. In this regime, contrastive training is feasible without explicitly labeled negative examples, since as the batch size increases, there are more so-called “in-batch negatives” (negatives derived from the positives belonging to other queries in the batch) to teach the model. At a certain scale, these in-batch negatives do quite a good job teaching on their own, even without any explicitly labeled negatives in the mix. In practice, it is common to use a combination of activation checkpointing [16], multi-GPU training [17], and aggressively truncated sequence lengths [18] to achieve batch sizes of over 10,000 query-document pairs.
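Conceptually, in-batch negative training boils down to computing the similarity matrix between every query and every document in the batch and asking each query to pick out its own positive, so a batch of size B gives each query B - 1 “free” negatives. A minimal PyTorch sketch (again with a placeholder temperature) looks like this:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs, doc_embs, temperature=0.05):
    """Contrastive loss with in-batch negatives (a sketch, not any specific
    model's exact implementation).

    query_embs: (B, d) embeddings of B queries
    doc_embs:   (B, d) embeddings of the B matching positive documents
    Every document in the batch acts as a negative for every other query.
    """
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    logits = q @ d.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```

The bigger the batch, the more negatives each query sees in that (B, B) matrix, which is exactly why the memory-saving tricks above are worth the effort.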

Trick 5: Finish training with some hard negatives

Unfortunately, this first stage of training alone won’t land you on the leaderboard. After the first stage of large-scale training, the modern SoTA recipe calls for a second round of contrastive training using smaller batches of high-quality data that contain labeled “hard” negatives: documents that are particularly tricky for models to score as less relevant than the labeled positives. Common datasets used in this second round of training include MS MARCO, HotpotQA, and NQ. To identify the best “hard” negative examples to train with, it is common to apply the method of Approximate nearest neighbor Negative Contrastive Estimation (ANCE) [19], using checkpoints of the model being trained to identify the most informative negative documents for each of the queries in the training data.
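The mining step can be pictured as follows: embed the queries and candidate documents with a recent checkpoint of the model being trained, then keep the highest-scoring documents that are not labeled as positives. The brute-force helper below is purely illustrative (all names are hypothetical); real ANCE-style pipelines periodically refresh these negatives during training and use approximate nearest neighbor indexes to keep the search tractable:

```python
import numpy as np

def mine_hard_negatives(encode, queries, corpus, positives, top_k=5):
    """ANCE-style hard negative mining, sketched with brute-force search.

    encode:    embedding function from a recent checkpoint of the model
    queries:   list of query strings
    corpus:    list of candidate document strings
    positives: list (one entry per query) of sets of corpus indices labeled relevant
    Returns, per query, the indices of the top_k highest-scoring documents that
    are NOT labeled positive, i.e., the most "confusing" negatives.
    """
    q_embs = np.asarray([encode(q) for q in queries])  # (Q, d)
    d_embs = np.asarray([encode(d) for d in corpus])   # (N, d)

    # Normalize so that dot product equals cosine similarity.
    q_embs /= np.linalg.norm(q_embs, axis=1, keepdims=True)
    d_embs /= np.linalg.norm(d_embs, axis=1, keepdims=True)

    scores = q_embs @ d_embs.T                         # (Q, N)
    hard_negatives = []
    for qi, pos in enumerate(positives):
        ranked = np.argsort(-scores[qi])               # best-scoring docs first
        hard_negatives.append([int(di) for di in ranked if di not in pos][:top_k])
    return hard_negatives
```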

Why does this work?

As numerous teams have shown, if you implement the modern two-stage training recipe carefully, it is possible to transform a pre-trained BERT model into a state-of-the-art text embedding model that ranks highly on BEIR and MTEB. Rather than take this at face value, though, let’s try to develop an intuitive sense of why this two-round training regime works so well. In particular, we’ll examine a side-by-side evaluation of two publicly available checkpoints from the E5 project: e5-base-unsupervised (a checkpoint taken before second-stage training) and e5-base-v2 (a checkpoint taken after second-stage training). In the table below, we see that performance generally increases with the second stage of training, with the most notable gains coming on the same datasets that the second-stage training data is drawn from (HotpotQA, MS MARCO, and NQ in this case). We also notice that the scientific-domain scores (SciDocs and SciFact) decrease slightly with second-stage training, which makes sense given that the CCPairs dataset developed by the E5 authors for large-scale training drew heavily from scientific papers, while the second-stage training datasets generally lack scientific-domain data.

This comparison shows us that large-scale training lets us draw upon large and diverse datasets of weakly labeled data to give the model a strong foundation across domains, while hard-negative training lets us more effectively leverage high-quality datasets like MS MARCO to refine the model and push performance higher than large-scale training alone. As a side note, for practitioners who care about retrieval in a particular domain like technical or legal writing, it may be helpful to dig into the recipe behind popular open models like E5 and consider running your own second-stage training on top of the open-source unsupervised checkpoint!

Here, we evaluated two different checkpoints of the same E5 model using a selection of BEIR datasets. Scores in nDCG@10.
| Model                | climate-fever | dbpedia | fiqa   | hotpotqa | msmarco | nfcorpus | nq     | quora  | scidocs | scifact | touche-2020 |
|:---------------------|--------------:|--------:|-------:|---------:|--------:|---------:|-------:|-------:|--------:|--------:|------------:|
| e5-base-unsupervised |        0.1508 |  0.3487 | 0.4011 |   0.5153 |  0.2615 |   0.3581 | 0.3644 | 0.6282 |  0.2112 |  0.7384 |      0.1284 |
| e5-base-v2           |        0.2617 |  0.4180 | 0.3984 |   0.6804 |  0.4188 |   0.3549 | 0.5390 | 0.6638 |  0.1866 |  0.7194 |      0.2020 |
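If you would like to reproduce this kind of side-by-side comparison, a sketch using the mteb package and the two public E5 checkpoints might look like the code below. Treat it as a starting point rather than a faithful reproduction: the task names and output handling follow mteb’s documented usage and can differ across library versions, and a proper E5 evaluation also needs the “query: ”/“passage: ” prefixes from Trick 3 added to the inputs, which plain SentenceTransformer encoding does not do on its own.

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Two BEIR datasets that are also exposed as MTEB retrieval tasks.
TASKS = ["SciFact", "NFCorpus"]

for checkpoint in ["intfloat/e5-base-unsupervised", "intfloat/e5-base-v2"]:
    model = SentenceTransformer(checkpoint)
    evaluation = MTEB(tasks=TASKS)
    # Writes per-task JSON results (including nDCG@10) to the output folder.
    evaluation.run(model, output_folder=f"results/{checkpoint.split('/')[-1]}")
```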

Conclusion

Phew, that was a lot! Now that you’ve made it this far, you’ve seen how text embeddings are used to power the modern search stack, flown through the world’s most abridged history of the evolution of modern text embedding models, and learned the tricks used by leaderboard-topping models like BGE. We hope you’ve gained a deeper appreciation of how embeddings go beyond keyword match to make retrieval systems smarter, and also hope that following the evolution of text embedding models from plain-old-BERT to specially-fine-tuned systems that leverage contrastive training has deepened your intuition surrounding these tools. After our breakdown of the tricks of the trade, you should now feel more confident in designing your own competitive training recipe, too. Remember, by using a query prefix, running large-scale training to maximize the value of in-batch negatives, and finishing with training on hard negatives, even a humble BERT-base model can be transformed into a specialized embedding system that delivers top-tier performance on leaderboards like BEIR!

Though this post covered a lot, we only touched upon a small slice of the analyses of SoTA text embedding training recipes we’ve been running at Snowflake. Embeddings are an active area of research at Snowflake, as partly evidenced by our recent work on the models behind Snowflake Universal Search, and even as this article lands on our blog, our GPUs are running hot as they crunch through numerous text embedding experiments. We are looking forward to sharing more about this ongoing work as our in-progress projects come to fruition!

Define the future of AI with us

Snowflake’s AI technology supports some of the most consequential enterprise AI workloads in the world. Help us push the state of the art of enterprise AI at http://snowflake.com/careers

Acknowledgements

We would like to thank Adrien Treuille for his helpful feedback while editing this article.

References

[1] An Introduction to Neural Information Retrieval https://www.microsoft.com/en-us/research/publication/introduction-neural-information-retrieval

[2] Dense Passage Retrieval for Open-Domain Question Answering https://arxiv.org/abs/2004.04906

[3] MS MARCO: A Human Generated MAchine Reading COmprehension Dataset https://arxiv.org/abs/1611.09268

[4] BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models https://arxiv.org/abs/2104.08663

[5] MTEB: Massive Text Embedding Benchmark https://arxiv.org/abs/2210.07316

[6] NIST Text REtrieval Conference (TREC) https://trec.nist.gov

[7] Text Embeddings by Weakly-Supervised Contrastive Pre-training https://arxiv.org/abs/2212.03533

[8] C-Pack: Packaged Resources To Advance General Chinese Embedding https://arxiv.org/abs/2309.07597

[9] Towards General Text Embeddings with Multi-stage Contrastive Learning https://arxiv.org/abs/2308.03281

[10] Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents https://arxiv.org/abs/2310.19923

[11] Nomic Embed: Training a Reproducible Long Context Text Embedder https://arxiv.org/abs/2402.01613

[12] MS MARCO: Benchmarking Ranking Models in the Large-Data Regime https://dl.acm.org/doi/10.1145/3404835.3462804

[13] AnglE-optimized Text Embeddings https://arxiv.org/abs/2309.12871

[14] Contrastive Representation Learning https://lilianweng.github.io/posts/2021-05-31-contrastive/

[15] Based on internal Snowflake data between January 1, 2024 and January 31, 2024, we handle 4.2 billion queries a day on customer datasets that sometimes exceed 50 trillion rows.

[16] Activation checkpointing https://pytorch.org/docs/stable/checkpoint.html

[17] Getting Started with Distributed Data Parallel https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

[18] Flag Embedding (BGE) source code https://github.com/FlagOpen/FlagEmbedding/blob/af6ee2d37f1bf4fc20821264e437b6097005f88f/FlagEmbedding/baai_general_embedding/finetune/data.py#L70-L71

[19] Approximate nearest neighbor Negative Contrastive Estimation (ANCE) https://arxiv.org/abs/2007.00808
