Combining Embedding and Keyword Based Search for Improved Performance

Zachariah Zhang
5 min read · Nov 15, 2022


Photo by DeepMind on Unsplash

TLDR — Ensembling keyword and embedding models is one of the quickest and easiest ways to improve search performance over the standard embedding-only paradigm. A large body of evidence in the machine learning literature supports that this helps with in-domain performance, out-of-domain generalization, and multilingual transfer. The reason seems to be that sparse and dense representations capture complementary linguistic qualities of the underlying text.

Introduction

Embedding-based dense retrieval has become a standard approach to search due to the representational power of embeddings demonstrated in early neural IR work (Dense Passage Retrieval for Open-Domain Question Answering). It is now commonplace to reach for out-of-the-box models such as all-MiniLM when developing search solutions.
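As a concrete picture of that baseline, here is a minimal sketch of out-of-the-box dense retrieval with the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint; the corpus and query strings are just placeholders:

    from sentence_transformers import SentenceTransformer, util

    # Off-the-shelf dense retriever: one embedding per text, cosine similarity for ranking.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    corpus = [
        "Led Zeppelin were an English rock band formed in London in 1968.",
        "BM25 is a bag-of-words ranking function built on term statistics.",
    ]
    corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

    query_emb = model.encode("who founded led zeppelin",
                             convert_to_tensor=True, normalize_embeddings=True)

    # Rank documents by cosine similarity to the query.
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    best = int(scores.argmax())
    print(corpus[best], float(scores[best]))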

However, how robustly these models represent text across domains has come under question over the past year. The BEIR benchmark collects search datasets from a wide variety of domains, and experiments on it have shown that keyword-based sparse models such as BM25 outperform many embedding-based approaches in a zero-shot transfer scenario.

BM25 vs. dense retrieval approaches when transferred to new domains

Researchers have since found that a simple and effective approach to increasing performance, both in-domain and cross-domain, is to ensemble scores from dense and sparse models. In this review, we will look at the evidence supporting this claim and discuss some of the linguistic differences between dense and sparse representations for search.

Ensemble Search Model Literature

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index was one of the earlier works to explore combining dense and sparse search models, with the following hypothesis:

Dense vectors are effective for encoding local syntactic and semantic cues leveraging recent advances in contextualized text encoding (Devlin et al., 2019), while sparse vectors are superior at encoding precise lexical information.

Several works have gone on to show that combining models in this way significantly improves in-domain performance on the MS MARCO dataset (BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval, A Proposed Conceptual Framework for a Representational Approach to Information Retrieval).

Dense-sparse hybrid models compared with individual models, from A Proposed Conceptual Framework for a Representational Approach to Information Retrieval

Hybrid dense-sparse models have also been shown to be highly effective at improving cross-lingual search. In Mr. TYDI: A Multi-lingual Benchmark for Dense Retrieval and Towards Best Practices for Training Multilingual Dense Retrieval Models, the authors show that mDPR performs poorly at zero-shot dense retrieval across languages on its own, but offers a significant advantage when combined with BM25.

A hybrid model for cross-lingual information retrieval

Combining Results from Dense and Sparse Models

There are many ways one might combine results from multiple search models. The most common approach in the literature is to linearly interpolate the two models' scores using a hyperparameter alpha.

Linear interpolation of scores between dense and sparse models (BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval)
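In code this is just a weighted sum. A minimal sketch follows; the min-max normalization is my own assumption to put BM25 scores and cosine similarities on a comparable scale, and papers differ in exactly how (or whether) they normalize before mixing:

    def min_max(scores):
        # Squash scores into [0, 1] so dense and sparse values are comparable.
        lo, hi = min(scores), max(scores)
        return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

    def interpolate(dense_scores, sparse_scores, alpha=0.5):
        # Convex combination of the two models' scores for the same candidate list.
        dense_n = min_max(dense_scores)
        sparse_n = min_max(sparse_scores)
        return [alpha * d + (1.0 - alpha) * s for d, s in zip(dense_n, sparse_n)]

The hyperparameter alpha is typically tuned on a held-out validation set.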

For efficiency, we often don't want to compute both scores for every document. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index compares different querying approaches and evaluates their effectiveness:

  • Dense first search — Retrieve candidates using a dense model and rerank using the sparse model
  • Sparse first search — Retrieve candidates using the sparse model and rerank using a dense model
  • Hybrid search — Retrieve candidates from both models and combine the two lists as a post-processing step

The authors find that hybrid search produces the best results. This is supported by BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval, which retrieves the top 2000 results from both BM25 and DPR and interpolates the scores across the two lists to produce a final ranking.
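Sketching the hybrid variant end to end (the retriever objects and their top_k methods are stand-ins for whatever dense index and BM25 implementation you actually use, and backfilling missing candidates with a model's minimum observed score is my own simplification):

    def hybrid_search(query, dense_retriever, sparse_retriever, k=2000, alpha=0.5):
        # Candidate lists from both models as {doc_id: score}.
        dense = dict(dense_retriever.top_k(query, k))
        sparse = dict(sparse_retriever.top_k(query, k))

        # A document missing from one list gets that model's lowest observed score.
        dense_floor = min(dense.values(), default=0.0)
        sparse_floor = min(sparse.values(), default=0.0)

        combined = {}
        for doc_id in set(dense) | set(sparse):
            d = dense.get(doc_id, dense_floor)
            s = sparse.get(doc_id, sparse_floor)
            combined[doc_id] = alpha * d + (1.0 - alpha) * s

        # Final ranking over the union of candidates.
        return sorted(combined.items(), key=lambda item: item[1], reverse=True)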

Dense vs Sparse Representations of Text

Now that we have established how to create hybrid models and seen the performance gains they bring, it is a good time to step back and consider why they do so well. I hypothesize that there are both linguistic and mathematical justifications for this.

Importance of maintaining lexical information for representation

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index claims that dense models capture semantic information better while sparse models capture lexical information better. Match Your Words! A Study of Lexical Matching in Neural Information Retrieval studies this more directly, measuring the Robertson-Sparck Jones (RSJ) weight as a signal of a model's ability to match important query terms. The authors find that dense embedding-based models are particularly bad at matching out-of-training (OOT) terms.

By combining dense and sparse models, we can mitigate these issues for queries where lexical matching is very important. You might imagine that queries containing proper nouns lean heavily on lexical matching, while other queries rely more on semantics, where embeddings thrive.

Loss of Information with Single Embeddings of Long Documents

One limitation of embeddings in semantic search is that reducing an entire document to a single embedding loses a lot of information. While a query typically has a single subject and intent, documents tend to cover a wide variety of information.

Consider generating a single embedding for the Wikipedia page for Led Zeppelin. This page contains information on the band members, a history of the band, names of popular songs, etc. During training, we would learn a single vector for this document which must be close to the wide variety of queries that could be relevant to it. This single vector loses a lot of the individual nuance of the different topics within a longer document by mashing them all together.

To address this, researchers have developed techniques that represent text with multiple embeddings rather than a single vector, such as ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT.
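To make the contrast with single-vector retrieval concrete, ColBERT keeps one embedding per token and scores a document by letting each query token pick its best-matching document token (MaxSim) and summing those matches. A rough sketch of that scoring step, assuming you already have L2-normalized per-token embedding matrices:

    import numpy as np

    def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
        # query_vecs: (num_query_tokens, dim), doc_vecs: (num_doc_tokens, dim),
        # both assumed L2-normalized so the dot product is cosine similarity.
        sim = query_vecs @ doc_vecs.T
        # Each query token keeps only its best document-token match; sum the matches.
        return float(sim.max(axis=1).sum())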

Conclusion

When trying to improve embedding search performance, a lot of investment goes into complicated techniques: elaborate training pipelines, synthetic data generation, multi-stage rankers, and so on. However, one of the simplest improvements to implement is to combine results from embedding search with a sparse model such as BM25. This has been shown to improve in-domain performance as well as generalization to new domains and languages. It also shines a light on some limitations of the embedding-based representations of text that have become standard within NLP.


Zachariah Zhang

Staff Machine Learning Engineer at Square. I like to write about NLP.