Hybrid Search: Combining BM25 and Semantic Search for Better Results with Langchain

Published in LanceDB · 4 min read · Dec 9, 2023

Have you ever wondered how search engines find exactly what you’re looking for? Most often, they answer user queries with a combination of keyword matching and semantic search. This is known as hybrid search. Let’s see how we can implement a simple hybrid search pipeline for document search.

Understanding BM25:

BM25 is a ranking algorithm used in information retrieval systems to estimate the relevance of documents to a given search query.

What it does: It looks at how often your search words appear in a document and considers the document’s length to provide the most relevant results.
Why it’s useful: It’s perfect for sorting through huge collections of documents, like a digital library, without bias towards longer documents or overused words.

Key elements of BM25:

Term Frequency (TF): This counts how many times your search terms appear in a document.
Inverse Document Frequency (IDF): This gives more importance to rare terms, making sure common words don’t dominate.
Document Length Normalization: This ensures longer documents don’t unfairly dominate the results.
Term Frequency Saturation: This gives diminishing returns to repeated occurrences of a term (controlled by the k1 parameter), so excessively repeated terms don’t skew the results.

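Overall, BM25 scores a document d against a query q by summing over the query terms i:

score(d, q) = ∑_i [ idf(i) * tf(i, d) * (k1 + 1) / (tf(i, d) + k1 * (1 - b + b * (dl / avgdl))) ]

Here tf(i, d) is the frequency of term i in d, idf(i) is the inverse document frequency of term i, dl is the length of d, avgdl is the average document length in the collection, and k1 and b are tuning parameters that control term frequency saturation and document length normalization, respectively.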
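To make the formula concrete, here is a minimal pure-Python sketch of BM25 scoring over a tiny tokenized corpus. The function name `bm25_scores`, the defaults k1=1.5 and b=0.75, the smoothed IDF variant, and the toy corpus are all our own assumptions for illustration; this is not the implementation used by LangChain's BM25Retriever.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        dl = len(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Smoothed IDF (one common variant; an assumption, the text leaves idf undefined)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency with saturation (k1) and length normalization (b)
            denom = tf[term] + k1 * (1 - b + b * dl / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy corpus, already lowercased and split into tokens
corpus = [
    "calcium and vitamin d build strong bones".split(),
    "iron rich food helps prevent anemia".split(),
    "vitamin c supports the immune system".split(),
]
query = "strong bones vitamin".split()
scores = bm25_scores(query, corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

The first document matches all three query terms and ranks highest, the third matches only "vitamin", and the second shares no terms with the query, so its score is zero.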

When is BM25/Keyword Search Ideal?

Large Document Collections: Perfect for big databases where you need to sort through lots of information.
Preventing Bias: Great for balancing term frequency and document length.
General Information Retrieval: Useful in various search scenarios, offering a mix of simplicity and effectiveness.

Practical Application: Building a Hybrid Search System

Imagine you’re crafting a search system for a large digital library. You want it not only to find documents with specific keywords but also to grasp the context and semantics behind each query. Here’s how:

Step 1: BM25 quickly fetches documents with the search keywords.
Step 2: VectorDB digs deeper to find contextually related documents.
Step 3: The Ensemble Retriever runs both systems, combines their findings, and reranks the results to present a nuanced and comprehensive set of documents to the user.

What Exactly is Hybrid Search?

Hybrid search can be imagined as a magnifying glass that doesn’t just look at the surface but delves deeper. It’s a two-pronged approach:

Keyword Search: This is the age-old method we’re most familiar with. Input a word or a phrase, and this search homes in on those exact terms or closely related ones in the database or document collection.
Vector Search: Unlike its counterpart, vector search isn’t content with mere words. It works on semantic meaning, aiming to discern the query’s underlying context. This ensures that even if your words don’t match a document exactly, the document will still be fetched if its meaning is relevant.
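The difference between the two prongs can be sketched with a toy example. The three-dimensional "embeddings" below are hand-picked, hypothetical numbers (a real system would get them from an embedding model), and `keyword_match` and `cosine` are illustrative helpers of our own, not library functions.

```python
import math

def keyword_match(query, text):
    """Return True if any query word appears verbatim in the text."""
    return bool(set(query.lower().split()) & set(text.lower().split()))

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical documents with hand-made embeddings, for illustration only
docs = {
    "dairy products supply calcium": [0.9, 0.1, 0.0],
    "engine oil change intervals": [0.0, 0.1, 0.9],
}
query_text = "milk for bones"
query_vec = [0.8, 0.2, 0.1]  # hypothetical embedding of the query

# Keyword search: no document shares a word with the query
keyword_hits = [text for text in docs if keyword_match(query_text, text)]
# Vector search: pick the document whose embedding is closest to the query
semantic_best = max(docs, key=lambda text: cosine(query_vec, docs[text]))
print(keyword_hits, semantic_best)
```

Here the keyword search returns nothing because no words overlap, while the vector search still surfaces the dairy document because its (made-up) embedding points in nearly the same direction as the query's.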

Follow along with this Colab notebook on Google Colaboratory (colab.research.google.com).

Let’s get to the code snippets. Here we’ll use LangChain with the LanceDB vector store.

# Example of hybrid search using BM25 and LanceDB

from langchain.vectorstores import LanceDB
import lancedb
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader

# Initialize embeddings
embedding = OpenAIEmbeddings()

Next, load a single PDF.

# load single pdf

loader = PyPDFLoader("/content/Food_and_Nutrition.pdf")
pages = loader.load_and_split()

Create BM25 sparse keyword matching retriever

# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(pages)
bm25_retriever.k = 2  # Retrieve top 2 results

Create lancedb vector store for dense semantic search/retrieval.

db = lancedb.connect('/tmp/lancedb')
table = db.create_table("pandas_docs", data=[
    {"vector": embedding.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")


# Initialize LanceDB retriever
docsearch = LanceDB.from_documents(pages, embedding, connection=table)
retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

Now combine both retrievers into an ensemble. Here you can assign a weight to each retriever.

# Initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever_lancedb],
                                       weights=[0.4, 0.6])

# Example customer query
query = (
    "Which foods are needed for building strong bones and teeth? "
    "Which vitamins and minerals are important for this?"
)


# Retrieve relevant documents/products
docs = ensemble_retriever.get_relevant_documents(query)

The ensemble retriever matches the query's keywords, such as "strong bones" and "teeth", against the documents using BM25. At the same time it queries LanceDB, which returns the most similar documents based on embedding similarity.
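To show how the weights influence the final ordering, here is a minimal sketch of weighted Reciprocal Rank Fusion, the kind of rank fusion that LangChain's documentation describes for EnsembleRetriever. The function name `weighted_rrf`, the constant c=60, and the document IDs are assumptions for illustration; this is not the library's own code.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each document accumulates weight / (c + rank)."""
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical top-2 results from each retriever
bm25_ranking = ["doc_a", "doc_b"]
vector_ranking = ["doc_b", "doc_c"]

fused_ranking = weighted_rrf([bm25_ranking, vector_ranking], weights=[0.4, 0.6])
print(fused_ranking)
```

A document returned by both retrievers ("doc_b") collects score from both lists and rises to the top. The remaining documents are ordered by the weight of the retriever that found them, so with weights of 0.4 and 0.6 the vector-only result outranks the BM25-only one.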

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(openai_api_key="sk-yourapikey")

# If you want to use open-source models such as Llama or Mistral, check this:
# https://github.com/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=ensemble_retriever)

query = "What nutrition is needed for pregnant women?"
qa.run(query)

Again, BM25 searches the database for the keywords "nutrition", "pregnant", and "women" and returns the best-matching results, while LanceDB runs a semantic search for the same query at the same time. Combining the two makes the retrieval more powerful than either one alone.

For comparison, answers from traditional RAG are available in our repo. The answers may vary based on different parameters, models, etc.

You can try this on Colab with your own PDF and use case. This is how you can use hybrid search to improve your search quality.

Explore More with Our Resources

Discover the full potential of hybrid search and beyond in our LanceDB repository, offering a setup-free, persisted vectorDB that scales on on-disk storage. For a deep dive into applied GenAI and vectorDB applications, examples, and tutorials, don’t miss our VectorDB-Recipes at https://github.com/lancedb/vectordb-recipes. From advanced RAG methods like Flare, Rerank, and HyDE to practical use cases, our resources are designed to inspire your next project.