Sparse Representations of Text for Search

Zachariah Zhang
8 min read · Oct 18, 2022


TLDR — While dense embeddings are still the most common approach for search, sparse representations of text can achieve competitive performance by leveraging recent advances in contrastive learning. These approaches have the additional benefits of interpretability and offline use for document expansion, and have shown excellent domain transfer performance. In addition, they offer significant benefits when ensembled with dense embeddings.


Issues with Keyword Based Text Matching

Before the rise of deep learning in information retrieval, lexical (keyword-based) matching was the standard for modeling the similarity of queries to documents in a corpus. These approaches, such as BM25, represent text as a sparse vector of token weights and use it to score queries against documents. They are efficient but suffer from two key problems.

Keyword-based matching example
A sparse vector view of keyword matching
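To make the sparse-vector view concrete, here is a minimal sketch of lexical matching as a sparse dot product. The term weights below are invented for illustration; a real system would derive them from corpus statistics such as BM25 or TF-IDF.

```python
# Lexical matching as a dot product of sparse (token -> weight) vectors.
query_vec = {"hot": 1.2, "dogs": 1.5}                      # query: "hot dogs"
doc_vec = {"sausages": 2.1, "grill": 0.8, "mustard": 0.9}  # a relevant document

def sparse_dot(q: dict, d: dict) -> float:
    """Score is the sum of weight products over tokens shared by query and document."""
    return sum(w * d[t] for t, w in q.items() if t in d)

print(sparse_dot(query_vec, doc_vec))  # 0.0: no shared tokens, so this relevant
                                       # document receives no score at all
```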

Vocabulary Mismatch

Vocabulary mismatch is the problem of the same concept being referred to with different language. This often happens with synonyms: for a query about hot dogs, we might also want to return documents about sausages, even though the two may share no tokens. Another issue is that queries tend to be worded differently than documents are written (proper nouns, capitalization, abbreviations, …). Someone is probably more likely to search “how do I take PTO?” than “how do I take personal time off?”, which can make surfacing the correct documents more difficult.

Contextual Importance

Another important shortcoming of keyword-based models is contextual importance. Not all words carry the same semantic weight when scoring. If a user searches “what is the most hot dogs eaten in 10 minutes?”, it is more important that the retrieved documents contain “hot dogs” than “what”. Traditional IR methods like BM25 measure importance with TF-IDF-style statistics, but a lot of context is lost when importance is boiled down to frequency. In addition, linguistic features such as negation and conjunction are not accounted for.

Sparse Models for Search

Query Likelihood Models — Doc2Query

One approach that has been used to address these issues is document expansion. Document expansion adds likely search terms to a document to increase its similarity to the queries it should match. For example, you might add keywords like Python, Kubernetes, and Machine Learning to your resume to maximize its chance of being scored highly by recruiters looking for those specific skills.

Doc2Query approaches this more systematically using machine learning. The authors train a T5 model to generate relevant queries given a document as context. The model can then be used to generate a set of potentially relevant queries, which are appended to the document before indexing, after which a normal BM25 search is performed. Note that this can help alleviate vocabulary mismatch, since the T5 model can generate terms that are not present in the document. In addition, the model implicitly learns a notion of term weight, because important words are generated more frequently.
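A rough sketch of this pipeline using Hugging Face transformers is below. The checkpoint name is an assumption (several doc2query/docTTTTTquery T5 checkpoints have been released publicly); substitute whichever expansion model you actually use.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_name = "doc2query/msmarco-t5-base-v1"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

doc = ("Cloud computing is the on-demand delivery of compute, storage, "
       "and networking resources over the internet.")

inputs = tokenizer(doc, return_tensors="pt", truncation=True)
# Sampling (rather than beam search) gives more diverse expansion terms.
with torch.no_grad():
    outputs = model.generate(
        **inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=5
    )
queries = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Append the generated queries to the document text before BM25 indexing.
expanded_doc = doc + " " + " ".join(queries)
```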

Doc2Query is part of a family of approaches that I will call query likelihood (QL) models (TILDE, HDCT), which treat building sparse representations as a density estimation problem.

The primary shortcoming of the QL approach to sparse representations is that we don’t actually care about the likelihood itself. There is an incentive to add more unique words, even ones that may not be well represented in the data, to increase coverage. What is important is that our representations discriminate good documents from bad ones.

To better illustrate this point, let’s look at an example from doc2query.

We can quickly see a few things about the generated queries:

  1. The generated queries are fairly vanilla. The loss explicitly penalizes choosing terms that aren’t explicitly used in the training queries, so the training procedure is biased towards generating words that are likely across all queries.
  2. Statistically likely words don’t separate documents. We see words like “what, or, is, and, the” generated frequently. These are certainly likely words, but they don’t tell us why a query should be closer to this document than to any other document.

Sparse Bi-Encoders — SPLADEv2

In Pretrained Transformers for Text Ranking: BERT and Beyond we see that contrastive learning is key to achieving good results for search when using dense embeddings. A natural next step would be to apply contrastive learning to our sparse representations as well.

This has led to a family of works that learn sparse representations in this way (EPIC, COIL, TILDEv2, DeepImpact, SPLADE v2). Here we will take a closer look at SPLADE v2, as it achieves the best results and its practices are (to my knowledge) most closely aligned with those of the SOTA in the dense embedding literature.

https://github.com/naver/splade

SPLADE Model

SPLADE first produces a set of logits over all tokens in the vocabulary for each token in the input sequence (in the same way an MLM head does). These logits are passed through a ReLU and a log(1 + x) saturation and then aggregated across the sequence (summed in the original SPLADE, max-pooled in SPLADE-max) to give a single vector of non-negative entries whose dimension is the size of the vocabulary. Text similarity is modeled as the dot product of these representations.
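A minimal sketch of this computation with PyTorch and Hugging Face transformers is below. The checkpoint name is an assumption (the naver/splade repository lists the released weights), and the max-pooling shown corresponds to the SPLADE-max variant.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def splade_encode(text: str) -> torch.Tensor:
    """Return a |V|-dimensional non-negative vector for `text`."""
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits        # (1, seq_len, vocab_size)
    weights = torch.log1p(torch.relu(logits))  # log-saturated ReLU
    vec, _ = weights.max(dim=1)                # max-pool over the sequence
    return vec.squeeze(0)

q = splade_encode("what is cloud computing")
d = splade_encode("Cloud computing delivers compute resources on demand.")
score = torch.dot(q, d)                        # similarity is the dot product
```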

Training

The loss for the model has two components. The first is a contrastive learning loss, in which we maximize the similarity score between a query and a labeled positive document and minimize its similarity to negative documents. The authors also add a FLOPS regularization term, which encourages sparsity in the final representations.
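A minimal sketch of these two components is below, assuming batches of already-encoded query and document vectors; the regularization weights are placeholders rather than the paper’s values, and the full SPLADE v2 recipe additionally uses hard negatives and distillation.

```python
import torch
import torch.nn.functional as F

def splade_loss(q, d_pos, d_negs, lambda_q=1e-3, lambda_d=1e-4):
    """Contrastive loss plus FLOPS sparsity regularization.

    q:      (B, V) query representations
    d_pos:  (B, V) positive document representations
    d_negs: (B, N, V) negative document representations
    """
    pos_scores = (q * d_pos).sum(-1, keepdim=True)      # (B, 1)
    neg_scores = torch.einsum("bv,bnv->bn", q, d_negs)  # (B, N)
    scores = torch.cat([pos_scores, neg_scores], dim=1)
    labels = torch.zeros(q.size(0), dtype=torch.long)   # positive is index 0
    contrastive = F.cross_entropy(scores, labels)

    # FLOPS regularizer: square the mean activation of each vocabulary
    # dimension across the batch, then sum over the vocabulary.
    def flops(x):
        return (x.mean(dim=0) ** 2).sum()

    return contrastive + lambda_q * flops(q) + lambda_d * flops(d_pos)
```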

Results

The authors report results for their proposed approaches on the MSMARCO and TREC benchmarks. They evaluate bi-encoder models (SPLADE-max, DistilSPLADE-max), which apply the SPLADE model to both queries and documents, as well as a document expansion model (SPLADE-doc), which only applies SPLADE to documents at indexing time and simply tokenizes queries to get their representations. The latter is easier to use in a production setting, since it doesn’t require running a BERT model in real time, but it doesn’t have the representational power of bi-encoding.
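As a rough sketch of the SPLADE-doc setup (reusing the splade_encode and tokenizer sketch from above), query-time scoring reduces to summing the precomputed document weights at the query’s token ids, with no model forward pass needed:

```python
import torch

# Documents are encoded offline with the SPLADE model.
doc_vec = splade_encode("Cloud computing delivers compute resources on demand.")

def score_query(query: str, d: torch.Tensor) -> float:
    """Sum the document weights at the query's (unique) token ids."""
    q_ids = tokenizer(query, add_special_tokens=False)["input_ids"]
    return d[list(set(q_ids))].sum().item()

print(score_query("what is cloud computing", doc_vec))
```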

The authors show results that outperform other sparse approaches such as doc2query and are highly competitive with the best dense models.

An important theme in the search literature is how well these models work when applied to a different domain. BEIR is a collection of IR datasets across a variety of domains that is used to benchmark how effective these models are off the shelf. We see that SPLADE significantly outperforms all other approaches when applied to this wider range of domains.

Results on the BEIR benchmark

SPLADEv2 Examples

Let’s look at some examples of SPLADE in action using the model published in the paper (https://github.com/naver/splade).

In the first example, we show SPLADE applied to an article on cloud computing. We show the non-zero components of the SPLADE representation and their corresponding tokens in the vocabulary. Note that SPLADE identifies architecture and cloud as being the most important themes of the paragraph. In addition, SPLADE gives high scores to words like infrastructure, system, and architect that don’t explicitly appear in the context. This also illustrates the advantage of SPLADE representations being interpretable as we can see what is semantically encoded in them.
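As a sketch of what such an inspection can look like (reusing splade_encode and tokenizer from above, with an illustrative cloud-computing passage rather than the article’s exact paragraph):

```python
vec = splade_encode("Cloud computing architecture refers to the components "
                    "and subcomponents required for cloud computing.")

# Map non-zero dimensions back to vocabulary tokens to see what is encoded.
id2tok = {i: t for t, i in tokenizer.get_vocab().items()}
nonzero = vec.nonzero().squeeze(1).tolist()
top = sorted(((id2tok[i], vec[i].item()) for i in nonzero), key=lambda x: -x[1])
print(top[:20])  # tokens like "cloud" and "architecture" should score highest,
                 # along with related terms that never appear in the passage
```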

It has been claimed in Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index that sparse models better represent lexical information such as proper nouns. In the second example, we explore how SPLADE represents these. We see that it successfully understands different variations of New York City and links them to other connected ideas. In addition, given a person’s name, we see that the model can infer ethnicity and gender from the content.

Sparse vs Dense Embeddings

All of this isn’t to say that sparse embeddings are inherently a better approach to representing language than dense embeddings. Dense embeddings still top the leaderboards for many search benchmarks (although this is likely biased, as more researchers work on dense models). In a more practical sense, common tools like Elasticsearch have much more tooling and support for dense embeddings than for sparse embeddings. However, it seems evident that sparse embeddings produce a representation that complements dense embeddings by better modeling specific linguistic features that dense embeddings struggle with.

BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval and A Proposed Conceptual Framework for a Representational Approach to Information Retrieval have shown that hybrid models that ensemble scores from a dense and a sparse model tend to perform significantly better than either approach on its own.
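In its simplest form, this kind of hybrid is a linear interpolation of the two scores. A minimal sketch, where alpha is a hyperparameter tuned on validation data:

```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.5) -> float:
    # The two systems produce scores on different scales, so in practice they
    # are usually normalized (e.g. min-max over the top-k candidates) first.
    return alpha * dense_score + (1.0 - alpha) * sparse_score
```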

These interpolation results (along with some of the examples we have shown) seem to support the hypothesis that dense models are better at representing semantic information, while sparse models are better at preserving lexical information.

Conclusion

Traditional IR techniques, which rely on sparse TF-IDF representations, have largely fallen out of favor over the last few years due to vocabulary mismatch and a lack of contextual understanding. Dense embedding-based search has taken over the literature as the predominant approach.

However, we have seen that, by leveraging some of the advances in models and training methodology, sparse representations can overcome these issues and achieve results competitive with dense embeddings. These sparse embeddings seem to model certain linguistic features better than dense embeddings and can complement them when ensembled together. In addition, they can be applied in an index-only document expansion mode to offer a more computationally friendly option for search. Sparse embeddings have also been shown to be highly effective across domains, which makes them a very attractive option for practitioners who are looking for an off-the-shelf solution and don’t have the time or data to train their own models.

I would like to see more investment in this line of research, as it not only represents a competitive approach but also illustrates the inductive bias that our choice of model imposes on our representation of language.


Zachariah Zhang

Staff Machine Learning Engineer at Square. I like to write about NLP.