Better Sparse Retrieval: Extending BM25 with Subwords

Evan Harris
Dec 18, 2023 · 4 min read


The concept of Hybrid Search is having a moment in the wake of the popularity of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). With the recent rise of open-source and as-a-service vector databases, many either natively support hybrid search for text retrieval or offer an explainer on how to manage it (here, here, here, here).

[Image: Information retrieval in action]

Ensembles of sparse and dense retrieval methods are not new, but much of the tooling is. It has also been shown that the BM25 ranking function is an improvement over TF-IDF.
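
For context, BM25 scores each query term with a saturating term-frequency weight plus document-length normalization. Here is a rough sketch of the standard Okapi formula (libraries such as rank_bm25 differ slightly in the IDF smoothing):

import math


def bm25_term_score(
    tf: float,           # term frequency in the document
    df: int,             # number of documents containing the term
    n_docs: int,         # total documents in the corpus
    doc_len: int,        # length of this document in tokens
    avg_doc_len: float,  # average document length in the corpus
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    # Smoothed inverse document frequency
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating term-frequency weight with length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))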

A downside of a naive implementation of BM25 is the lack of attention to subwords. As the rank_bm25 Python package notes:

Note that this package doesn’t do any text preprocessing. If you want to do things like lowercasing, stopword removal, stemming, etc, you need to do it yourself.

The only requirement is that the class receives a list of lists of strings, which are the document tokens.

A query that references the keyword ensembling will not match a potential target containing the word ensemble. Similarly, mismatched plurals, possessives, verb tenses, etc. will not result in retrieval hits as might be expected.
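
To make the failure concrete, here is a minimal sketch with rank_bm25 and whitespace splitting (the corpus is invented for illustration):

from rank_bm25 import BM25Okapi

corpus = [
    "we recommend ensembling sparse and dense retrievers",
    "dense retrievers rely on embedding models",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# "ensemble" never appears verbatim, so the first document gets no credit
# for containing "ensembling"
print(bm25.get_scores(["ensemble"]))  # [0. 0.]: neither document matches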

Common solutions (each sketched in the snippet after this list) are:

  1. Stemming or lemmatization: Top-down algorithms that convert words to their root form to be used as tokens.
  2. Splitting at the subword level: Tokenization techniques that use parts of words as unique tokens.
  3. Character n-grams: Splitting at the character level and using overlapping runs of n characters as tokens.
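
A rough sketch of all three approaches on a single word, using nltk's PorterStemmer and tiktoken as stand-ins for the first two (outputs shown in comments are indicative):

from nltk.stem import PorterStemmer
import tiktoken

word = "ensembling"

# 1. Stemming: collapse the word to a root form
print(PorterStemmer().stem(word))  # "ensembl"

# 2. Subword tokenization: data-driven BPE pieces
encoder = tiktoken.encoding_for_model("gpt-4")
print(encoder.encode(word))  # a short list of subword token ids

# 3. Character n-grams: overlapping runs of n characters
n = 3
print([word[i : i + n] for i in range(len(word) - n + 1)])  # ["ens", "nse", ...]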

Fans of fastText, sklearn, and LLM tokenization may be more inclined to avoid top-down techniques like lemmatization in favor of subwords or character n-grams. In short: throw tokens at the model and let it sort them out; don't let the preprocessing be too opinionated.

Improvement

Here is a quick improvement over naive BM25 that utilizes the tiktoken package from OpenAI:

import tiktoken

from typing import List

encoder = tiktoken.encoding_for_model("gpt-4")


def preprocess_func(text: str) -> List[str]:
    # Lowercase the input text
    lowered = text.lower()

    # Convert the lowered text into tokens
    tokens = encoder.encode(lowered)

    # Stringify the tokens
    return [str(token) for token in tokens]

While this may lose some of the advantages of using the raw text (such as coverage of out-of-vocabulary language), it is trivial to implement and leverages GPT-4's subword-aware tokenization. An added benefit is that GPT-4's tokenizer is multilingual, whereas techniques like lemmatization are typically language-specific.

For example:

>>> preprocess_func("Indemnification")
['485', '336', '77', '2461']

>>> preprocess_func("indemnify")
['485', '336', '77', '1463']

>>> preprocess_func("indemnifies")
['485', '336', '77', '9803']

Three of the four token ids (75%) are shared across these three terms, which would have 0% overlap when split on full words.

Given a user query like "Is there an indemnification clause?", it's reasonable to expect a target like this one to match:

You will defend, indemnify, and hold harmless us, our affiliates and licensors, and each of their respective employees, officers, directors, and representatives from and against any Losses arising out of or relating to any third-party claim concerning: (a) your or any End Users’ use of the Services (including any activities under your AWS account and use by your employees and personnel); (b) breach of this Agreement or violation of applicable law by you, End Users or Your Content; or (c) a dispute between you and any End User.
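
As a sanity check, we can inspect the shared subword ids directly (assuming preprocess_func from above is in scope; target_text holds the clause quoted above, abridged here):

target_text = "You will defend, indemnify, and hold harmless us, our affiliates..."

query_tokens = set(preprocess_func("Is there an indemnification clause?"))
target_tokens = set(preprocess_func(target_text))

# The shared "indemn" subword ids give BM25 something to score on,
# even though "indemnification" never appears verbatim in the target
print(query_tokens & target_tokens)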

Some applications might perform just as well by leaning on the dense, semantic component of a hybrid search system to capture subword information. Nonetheless, this remains a straightforward way to achieve lift in the sparse component of the system.

Experiment

Here are results from a few iterations with a private evaluation set. The evaluation set contains a mix of simple and challenging retrieval tasks with legal contracts (hundreds of examples):

| Method                                              | recall@10 | recall@30 |
|-----------------------------------------------------|-----------|-----------|
| Baseline                                            | 0.57      | 0.73      |
| Lowercasing                                         | 0.68      | 0.82      |
| Lowercasing + tiktoken encoding                     | 0.73      | 0.87      |
| Lowercasing + tiktoken encoding + stop word removal | 0.75      | 0.88      |
This implementation utilizes the BM25Retriever in the LangChain package by passing in a custom preprocess_func. The largest gain comes from lowercasing. We see additional lift by progressively layering on more techniques.
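
For reference, the wiring looks roughly like this (a sketch against LangChain's BM25Retriever API as of late 2023; texts stands in for the evaluation corpus):

from langchain.retrievers import BM25Retriever

texts = ["You will defend, indemnify, and hold harmless us, ..."]  # corpus of passages

retriever = BM25Retriever.from_texts(
    texts,
    preprocess_func=preprocess_func,  # the tiktoken-based function above
    k=10,  # number of passages to return
)

docs = retriever.get_relevant_documents("Is there an indemnification clause?")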

Appendix

As a bonus, the experiment above includes a pass with a multilingual stop word remover built on the stop-words and langdetect packages. Here is the implementation:

from stop_words import get_stop_words
from langdetect import detect


def remove_multilingual_stopwords(text: str) -> str:
    # Detect the language of the text; detection can fail on short
    # or ambiguous input, in which case we leave the text untouched
    try:
        lang = detect(text)
    except Exception:
        return text

    # Get the stop words for the detected language; fall back to the
    # original text if the language is unsupported
    try:
        stop_words = set(get_stop_words(lang))
    except Exception:
        return text

    # Split the text into words and drop the stop words
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]

    # Reconstruct the text from the filtered words
    return " ".join(filtered_words)

