Hybrid Search: SPLADE (Sparse Encoder)

Neural retrieval model

Sowmiya Jaganathan
Jul 9, 2023

Recently, Elasticsearch introduced a new machine learning model called ELSER (Elastic Learned Sparse EncodeR). This model provides a semantic search experience while still using an inverted index.

Unlike traditional keyword matching, ELSER focuses on providing search results based on the semantic meaning of the query. It removes the need for a rule-based query expansion pipeline and relieves users from maintaining synonym tables for basic terms.

In this blog, let’s understand the model workflow behind ELSER in layman’s terms, so you can make an informed choice.

In IR, search is a fascinating and subjective area. Hybrid search combines sparse and dense retrieval, so it’s important to understand how each works individually before understanding how they can be effectively combined.

Sparse Retrieval:

Elasticsearch stores the terms and their occurrences in an inverted index. It uses the Okapi BM25 similarity algorithm to calculate the relevance between a query and documents.

  • The term frequency component measures how often a query term appears within a document, giving more weight to documents where the term occurs more often.
  • The inverse document frequency factor considers the overall frequency of a term in the entire collection, penalizing common terms.
  • Document length normalization helps to adjust for variations in document length, ensuring fairness in scoring.

By combining these factors, Okapi BM25 generates a relevance score for each document-query pair. It reduces the search space by indexing terms and their occurrences, enabling efficient retrieval.

Sparse Retrieval
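To make the scoring concrete, here is a minimal from-scratch sketch of the Okapi BM25 formula described above. The k1 and b values are the usual free parameters; real engines like Elasticsearch add further refinements, so treat this only as an illustration:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with a basic Okapi BM25 formula."""
    N = len(corpus)                                      # number of documents
    avgdl = sum(len(d) for d in corpus) / N              # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # penalizes common terms
        freq = tf[term]                                   # term frequency in this doc
        # Length-normalized term frequency
        norm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score

corpus = [["solar", "eclipse", "occurs"], ["lunar", "eclipse", "facts"]]
print(bm25_score(["solar", "eclipse"], corpus[0], corpus))
```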

Dense Retrieval:

Dense vector retrieval leverages a pre-trained language model like BERT to create context-rich embeddings. It then applies a similarity metric such as cosine similarity between the query and document vectors to retrieve the relevant documents.

Dense retrieval aims to capture the semantic relevance between the query and document beyond keyword matches.

Dense Retrieval
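As a hedged sketch of dense retrieval, the snippet below uses the sentence-transformers library with one commonly used checkpoint (all-MiniLM-L6-v2, chosen here only for illustration) to embed a query and two documents and rank them by cosine similarity:

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works; this checkpoint is just an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "what causes a solar eclipse"
docs = [
    "A solar eclipse happens when the Moon passes between the Sun and Earth.",
    "BM25 is a ranking function used by search engines.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Cosine similarity between the query vector and each document vector.
scores = util.cos_sim(query_emb, doc_embs)
print(scores)  # the first document should score higher
```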

Both approaches have their own strengths and characteristics that can be beneficial when combined effectively.

Let’s understand SPLADE, the architecture behind ELSER.

SPLADE

SParse Lexical AnD Expansion Model for First Stage Ranking

The concept behind SPLADE is to leverage pre-trained language models like BERT to capture the semantic and syntactic information embedded within words, aiding in determining their relevance to the retrieval task. This contextual understanding enables term expansion, which involves incorporating alternative relevant terms.

Here, one might ask whether term expansion could be done through a simple rule-based model. The distinction lies in how alternative words are generated: with contextual knowledge from the sentence and neighboring words/terms. This contextual approach ensures that the expanded terms are more relevant to the specific query or document being processed.

Now the sparse vector embedding is enriched with contextual alternative words. The retrieval process in SPLADE involves computing the dot product between two sparse weight vectors, as sketched below.
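A tiny illustration of that scoring step: each sparse vector maps vocabulary terms (including expanded terms) to weights, and relevance is simply the dot product over shared terms. The weights below are made up:

```python
# Hypothetical SPLADE-style sparse vectors: term -> weight
query_vec = {"eclipse": 2.1, "solar": 1.8, "sun": 0.9, "moon": 0.7}
doc_vec = {"eclipse": 1.5, "moon": 1.2, "shadow": 0.8, "astronomy": 0.4}

# Dot product over the terms the two sparse vectors share.
score = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
print(score)  # 2.1*1.5 + 0.7*1.2 = 3.99
```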

Workflow of SPLADE

Let us understand how term expansion is achieved through the lexical understanding of the BERT architecture.

  1. Tokenization: The input sequence is divided into individual words or tokens.
  2. Embedding: The tokenized input sequence is converted into an embedding matrix, which represents each word with a dense vector.
  3. Encoding with Attention: The embedding matrix goes through an encoder, which uses attention mechanisms to learn the relationships between words and capture contextual information. This results in a context-rich or information-rich embedding matrix.
  4. Task Layer: A task-specific layer is added on top of the encoder output to transform the embeddings for the desired task, such as sentiment prediction or classification. Additionally, many models include a Masked Language Model (MLM) step, where random words are masked, and the model is trained to predict the correct word/token.
  5. MLM Output: In the MLM step, the model predicts the probability distribution (logits) across the entire vocabulary, typically consisting of 30,522 tokens (the BERT vocabulary size). The highest activation in this distribution corresponds to the prediction for the masked token position.
  6. Importance Estimation: SPLADE takes the probability distributions from the MLM step and aggregates them into a single distribution called the “Importance Estimation.” This distribution represents the sparse vector, highlighting relevant tokens that may not exist in the original input sequence.

w_j = Σ_{i ∈ t} log(1 + ReLU(w_ij))

Here, w_ij is the MLM logit of vocabulary token j at input position i, and the sum runs over the positions of the input sequence t. SPLADE introduces a minor tweak: log saturation, which prevents some words from dominating. Without it, a few highly frequent terms could dominate the retrieval process to the extent that they overshadow other potentially informative terms.
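Here is a minimal sketch of this importance estimation using the Hugging Face Transformers library. It assumes a model with an MLM head; naver/splade-cocondenser-ensembledistil is one published SPLADE checkpoint, but even bert-base-uncased illustrates the mechanics:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "naver/splade-cocondenser-ensembledistil"  # assumption: any MLM-head model works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

inputs = tokenizer("what causes solar eclipses", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, vocab_size)

# w_j = sum_i log(1 + ReLU(w_ij)), masking out padding positions.
weights = torch.log1p(torch.relu(logits))        # log saturation
weights = weights * inputs["attention_mask"].unsqueeze(-1)
sparse_vec = weights.sum(dim=1).squeeze(0)       # one weight per vocabulary token

# Inspect the expanded terms with the highest weights.
top = torch.topk(sparse_vec, 10)
terms = tokenizer.convert_ids_to_tokens(top.indices.tolist())
print(list(zip(terms, [round(v, 2) for v in top.values.tolist()])))
```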

Second, a negative sampling technique is used for efficient learning.

Third is optimization: with term expansion, the inverted index can become too big and latency will increase. The FLOPS regularisation technique avoids this by penalizing token contributions, learning the importance of each token over a set of queries. Regularization is applied so that tokens with a small contribution are effectively removed, encouraging the model to focus on the most relevant and discriminative terms. This helps to obtain a well-balanced index.

Source: NaverLabs, https://europe.naverlabs.com/blog/splade-a-sparse-bi-encoder-bert-based-model-achieves-effective-and-efficient-first-stage-ranking/

SPLADE V2

Later on (in 2021), SPLADE V2 introduced more efficient and effective retrieval strategies.

Max Pooling Strategy: SPLADE V2 replaces sum pooling with max pooling, keeping only the most dominant activation for each vocabulary term across the sequence positions. This keeps the representation compact and efficient.
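In code, the change from SPLADE to SPLADE V2 pooling is a one-line swap. The tensor below just stands in for the log-saturated MLM activations from the earlier sketch:

```python
import torch

# Illustrative stand-in for log(1 + ReLU(logits)): (batch, seq_len, vocab_size)
weights = torch.rand(1, 12, 30522)

splade_v1 = weights.sum(dim=1)      # SPLADE: sum pooling over positions
splade_v2, _ = weights.max(dim=1)   # SPLADE V2: keep the strongest activation per term
```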

Document Encoder Only: SPLADE V2 can focus expansion on the document side rather than the query side. This saves retrieval time since document term weights can be computed offline.

Distillation technique:

  1. SPLADE provides an initial ranking of documents by capturing the relevance between queries and documents with its sparse representations.
  2. A cross-encoder is an additional model that computes a relevance score or ranking for each query-document pair. Its training process involves positive examples (matching query-document pairs) and negative examples (non-matching query-document pairs), produced through triplet generation.

(query, relevant document, irrelevant document)

Now, the triplet generation can be done in two ways:

  1. Generating triplets from the retrieval dataset directly, where the positive and negative examples can come from manual annotation, random sampling, etc.
  2. Distillation technique: generating triplets using a SPLADE model that is already trained and has some knowledge of relevance ranking. In this method, we can take advantage of harder negatives in the training process to improve the model more efficiently.
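One common way to distil a cross-encoder teacher into a retriever is a margin-MSE style loss, where the student's score margin between the relevant and irrelevant document is pushed toward the teacher's margin. The sketch below uses made-up scores and is only an approximation of how such a distillation step can look:

```python
import torch
import torch.nn.functional as F

# Dummy scores for two triplets (query, relevant doc, irrelevant doc).
student_pos = torch.tensor([12.3, 8.1])   # SPLADE score(query, relevant doc)
student_neg = torch.tensor([7.9, 9.0])    # SPLADE score(query, irrelevant doc)
teacher_pos = torch.tensor([0.95, 0.80])  # cross-encoder score(query, relevant doc)
teacher_neg = torch.tensor([0.10, 0.55])  # cross-encoder score(query, irrelevant doc)

# Push the student's margin toward the teacher's margin.
loss = F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)
print(loss)
```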

Let’s examine the performance of SPLADE

According to the BEIR evaluation leaderboard on EvalAI,

SPLADE+BM25 and SPLADE demonstrate competitive performance compared to plain BM25.

https://eval.ai/web/challenges/challenge-page/1897/leaderboard/4475


Thank you for reading. Stay tuned!
