Bolstering LangChain’s MemoryVectorStore With Keyword Search

Andrew Nguonly
8 min read · Feb 26, 2024


This is the fifth article in a series of articles about Lumos, a RAG LLM co-pilot for browsing the web. Reading the prior articles is recommended!

ChatGPT 4 prompt: “Generate an image of a magical vector store database that has combined two types of search capabilities: cosine similarity search and keyword fuzziness search. The database merges the results of both search types into a single result. The image should be in a futuristic cyberpunk style.”

Always Check Defaults! ✅

I previously highlighted Lumos’s online, in-memory RAG implementation and suggested that the approach was underrated and potentially useful for certain use cases. Day to day, Lumos seemed to work fine for summarizing content and extracting details from webpages. However, there were plenty of instances where the app simply couldn’t produce the desired response. Every so often, simple questions about the content on a page would return low-quality or flat-out wrong answers. Users reported the issue, and I experienced it myself while testing the extension.

There are many opportunities to optimize Lumos’s RAG workflow, but understanding which ones to pursue requires a bit more digging. After logging the retrieved documents from the vector store and “eyeballing” the search results, I was surprised to learn that only 4 documents were being returned by default. This is the default number (K) set in LangChain’s VectorStoreRetriever. Additionally, the retrieved documents didn’t always align with the prompt that was issued (again, just “eyeballing”). If 2 of the 4 documents were not related to answering the question in the prompt, then the RAG pipeline effectively retrieved only half of K, which further reduced the chances of a high-quality response from the LLM.

A user brought the quality issue to my attention and proposed combining classical search techniques with the cosine similarity search already performed by MemoryVectorStore. This seemed like a reasonable approach, and together with adjusting the document retrieval count, it felt like low-hanging fruit with the most bang for the buck.

Dynamic K ⚡

Why four?

A default value of 4 seems reasonable at first, but after more consideration, my thinking has changed. The number of documents to retrieve depends on factors such as chunk size, total document count, the LLM’s context window, and most importantly, the type of content and the use cases of the RAG application. The combination of these factors is simply too complex to rely on a static number. Because of this, my recommendation is to always explicitly set a value for K. If possible, the value should be dynamic based on the current request.
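
As a minimal illustration, K can be set explicitly when constructing the retriever (vectorStore here stands in for any LangChain vector store; the hybrid example later in this article passes the same option to EnhancedMemoryVectorStore):

// don't rely on the default of 4: set K explicitly
const retriever = vectorStore.asRetriever({ k: 8 });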

For an online, in-memory RAG implementation, where the average webpage’s content is relatively small, doubling K from 4 to 8 is not out of the question. But even that is a static choice, and more flexibility is needed in this context.

Scaling Square Roots 🌱

To improve the breadth and range of LLM responses, the number of documents to retrieve should increase as the total number of documents in the vector store grows. However, the final number should be constrained enough so that token limits are not exceeded.

A simple yet elegant solution:

const computeK = (documentsCount: number): number => {
  return Math.ceil(Math.sqrt(documentsCount));
};

K is set to the ceiling of the square root of the total number of documents. If a typical Medium article is split into 20 chunks, the retriever returns 5 documents. If a Wikipedia article is split into 100 chunks, then 10 documents are returned. The approach scales well across all factors (e.g. chunk size and overlap, total number of documents) and constraints (e.g. token limits). The solution is surprisingly effective.
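
For concreteness, a few sample values from computeK:

computeK(20);   // ceil(sqrt(20))   = 5
computeK(100);  // ceil(sqrt(100))  = 10
computeK(1000); // ceil(sqrt(1000)) = 32, growth stays sublinear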

After testing the change to K, I noticed immediate and obvious improvements in the LLM responses. It’s challenging to test this change scientifically, but it seems unlikely that the responses would be degraded when adding more context (i.e. documents) to the prompt. When setting K, it’s important to be thoughtful and deliberate.

Keyword Search with MemoryVectorStore 🔑

Lumos occasionally struggled with basic Q&A interactions. I realized that documents that contained the answer to a question in a prompt were not being returned by the RAG pipeline. It seemed like the default cosine similarity search was not consistent enough across a broad range of prompts.

It was worth exploring whether adding keyword search capability to the vector store would make a difference. The technique is more direct, so it can capture the low end of search use cases where nuance isn’t required.

EnhancedMemoryVectorStore 💪

EnhancedMemoryVectorStore is a direct subclass of MemoryVectorStore. The implementation is inspired by Supabase’s hybrid search retriever (SupabaseHybridSearch). Keyword search and hybrid search are added so the developer can determine which type of search is appropriate for their use case. The existing cosine similarity search feature is inherited from the parent.

The following code samples are snippets of the original source code:

export type EnhancedMemoryRetrieverInput<V extends EnhancedMemoryVectorStore> =
  BaseRetrieverInput &
    (
      | {
          vectorStore: V;
          k?: number;
          filter?: V["FilterType"];
          searchType?: "similarity";
        }
      | {
          vectorStore: V;
          k?: number;
          filter?: V["FilterType"];
          searchType: "keyword";
        }
      | {
          vectorStore: V;
          k?: number;
          filter?: V["FilterType"];
          searchType: "hybrid";
        }
      | {
          vectorStore: V;
          k?: number;
          filter?: V["FilterType"];
          searchType: "mmr"; // this search is not implemented, but is required for type compatibility
        }
    );

export class EnhancedMemoryRetriever<
  V extends EnhancedMemoryVectorStore,
> extends BaseRetriever {

  async _getRelevantDocuments(
    query: string,
    runManager?: CallbackManagerForRetrieverRun,
  ): Promise<DocumentInterface[]> {
    if (this.searchType === "similarity") {
      return this.vectorStore.similaritySearch(
        query,
        this.k,
        this.filter,
        runManager?.getChild("vectorstore"),
      );
    } else if (this.searchType === "keyword") {
      return this.vectorStore.keywordSearch(query, this.k, this.filter);
    } else {
      // hybrid search
      return this.vectorStore.hybridSearch(
        query,
        this.k,
        this.filter,
        runManager?.getChild("vectorstore"),
      );
    }
  }
}

export class EnhancedMemoryVectorStore extends MemoryVectorStore {
  _vectorstoreType(): string {
    return "enhanced-memory";
  }

  async keywordSearch(
    query: string,
    k: number,
    filter?: this["FilterType"],
  ): Promise<DocumentInterface[]> {
    const results = await this.keywordSearchWithScore(query, k, filter);
    return results.map((result) => result[0]);
  }

  async keywordSearchWithScore(
    query: string,
    k: number,
    filter?: this["FilterType"],
  ): Promise<[DocumentInterface, number][]> {
    ...
  }

  async hybridSearch(
    query: string,
    k: number,
    filter?: this["FilterType"],
    _callbacks?: Callbacks,
  ): Promise<DocumentInterface[]> {
    ...
  }

  asRetriever(
    kOrFields?: number | Partial<EnhancedMemoryRetrieverInput<this>>,
    filter?: this["FilterType"],
    callbacks?: Callbacks,
    tags?: string[],
    metadata?: Record<string, unknown>,
    verbose?: boolean,
  ): EnhancedMemoryRetriever<this> {
    ...
  }
}

EnhancedMemoryVectorStore is used in combination with a custom retriever called EnhancedMemoryRetriever (a subclass of BaseRetriever). The retriever routes a request to the desired search function (similaritySearch(), keywordSearch(), or hybridSearch()) and returns relevant documents like any other retriever. All interfaces for both subclasses are maintained so that EnhancedMemoryVectorStore is a drop-in replacement for MemoryVectorStore.

Example usage:

vectorStore = new EnhancedMemoryVectorStore(
  new OllamaEmbeddings({
    baseUrl: options.ollamaHost,
    model: options.ollamaModel,
    keepAlive: DEFAULT_KEEP_ALIVE,
  }),
);

// retriever is of type EnhancedMemoryRetriever
const retriever = vectorStore.asRetriever({
  k: computeK(documentsCount),
  searchType: "hybrid",
  callbacks: [new ConsoleCallbackHandler()],
});

Note: There are several challenges with this design, and tradeoffs were made to allow EnhancedMemoryVectorStore to be a drop-in replacement for MemoryVectorStore. First, the implementation does not follow LangChain’s suggested guidelines for creating custom vector stores and custom retrievers. For beginners, I don’t recommend subclassing MemoryVectorStore directly. Second, there are type constraints in LangChain’s implementation that make subclassing existing non-abstract classes difficult or undesirable (e.g. the interface for MMR search). Lastly, it’s not entirely clear that “keyword” search functionality belongs in a “vector store” abstraction: the name “vector store” constrains the abstraction to vector search only. Incidentally, if MemoryVectorStore did not store the original documents directly, then keyword search (and other types of retrieval) would not be possible.

There’s a lot to think about here, but I’ll save the topic for later.

Fuzziness with Fuse 🥝

Fuse.js is a fuzzy-search library that Lumos uses to perform keyword searches. The library has document scoring capability, which makes it possible to rank and merge documents from keyword searches with documents from cosine similarity searches. The Fuse.js API is trivial to integrate.

async keywordSearchWithScore(
  query: string,
  k: number,
  filter?: this["FilterType"],
): Promise<[DocumentInterface, number][]> {
  // filter documents (MemoryVector interface is not exported...)
  const filterFunction = (memoryVector: (typeof this.memoryVectors)[0]) => {
    if (!filter) {
      return true;
    }

    const doc = new Document({
      metadata: memoryVector.metadata,
      pageContent: memoryVector.content,
    });
    return filter(doc);
  };
  const filteredDocRecords = this.memoryVectors
    .filter(filterFunction)
    .map((memoryVector) => {
      return {
        metadata: memoryVector.metadata,
        pageContent: memoryVector.content,
      };
    });

  // keyword (fuzzy) search
  // fuse options: https://www.fusejs.io/api/options.html#options
  const fuseOptions = {
    includeScore: true,
    ignoreLocation: true,
    keys: ["pageContent"],
  };
  const fuse = new Fuse(filteredDocRecords, fuseOptions);
  const result = fuse.search(query, { limit: k });

  // return documents and scores
  return Promise.all(result).then((fuseResult) => {
    return fuseResult.map((fuseResult) => {
      return [new Document(fuseResult.item), fuseResult.score as number];
    });
  });
}

Blending Scores 🌪️

Combining search results from both search types requires normalizing the document scores across them. To be clear, the scoring techniques of the two search types are completely unrelated; the normalization process is an arbitrary method for decreasing the probability that one search type dominates the other.

async hybridSearch(
  query: string,
  k: number,
  filter?: this["FilterType"],
  _callbacks?: Callbacks,
): Promise<DocumentInterface[]> {
  const similarity_search = await this.similaritySearchWithScore(
    query,
    k,
    filter,
    _callbacks,
  ).then((docTuples) => {
    return docTuples.map((docTuple) => {
      // normalize cosine similarity score between 0 and 1
      const score = docTuple[1];
      const normalizedScore = (score + 1) / 2;
      docTuple[1] = normalizedScore;
      return docTuple;
    });
  });
  console.log("Similarity search results: ", similarity_search);

  const keyword_search = await this.keywordSearchWithScore(
    query,
    k,
    filter,
  ).then((docTuples) => {
    return docTuples.map((docTuple) => {
      // manually downscale Fuzziness score to better blend with cosine similarity score
      const score = docTuple[1];
      const normalizedScore = score * 0.7;
      docTuple[1] = normalizedScore;
      return docTuple;
    });
  });
  console.log("Keyword search results: ", keyword_search);

  return (
    Promise.all([similarity_search, keyword_search])
      .then((docTuples) => docTuples.flat())
      .then((docTuples) => {
        const picks = new Map<number, [Document, number]>();

        docTuples.forEach((docTuple: [Document, number]) => {
          const id = docTuple[0].metadata.docId;
          const nextScore = docTuple[1];
          const prevScore = picks.get(id)?.[1];

          if (prevScore === undefined || nextScore > prevScore) {
            picks.set(id, docTuple);
          }
        });

        return Array.from(picks.values());
      })
      // sort by score
      .then((docTuples) => docTuples.sort((a, b) => b[1] - a[1]))
      // select top k
      .then((docTuples) => docTuples.slice(0, k))
      // get documents
      .then((docTuples) => docTuples.map((docTuple) => docTuple[0]))
  );
}

First, cosine similarity scores are normalized to a value between 0 and 1 (cosine similarity can be negative). Second, fuzziness scores are down-weighted by a factor of 0.7 (like the default K of 4, this value is arbitrary and worth further experimentation and testing). Down-weighting the fuzziness scores was required so that the keyword search would not dominate the cosine similarity search. The approach produced decent results in my limited testing.
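
As a quick worked example of the blending arithmetic (the input scores are made up for illustration):

const similarityScore = (0.4 + 1) / 2; // 0.70 after normalization
const keywordScore = 0.9 * 0.7; // 0.63 after down-weighting
// the merged tuples are sorted by score in descending order,
// so the 0.70 similarity result ranks ahead of the 0.63 keyword result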

Lumos console logs: Document scores from hybrid search.

To rank and merge search results, IDs must be assigned to each document. LangChain’s Document interface contains a metadata attribute for storing this type of information (the hybrid search above reads metadata.docId).
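
For example, a docId can be stamped into each chunk’s metadata before the documents are added to the store. The sketch below is illustrative: the docId field matches the one read in hybridSearch(), but the chunks variable and the incrementing-index scheme are assumptions, not the exact Lumos implementation.

// assign a docId to each chunk so hybridSearch() can deduplicate and merge
// results coming from the similarity search and the keyword search
const documents = chunks.map(
  (chunk, index) =>
    new Document({
      pageContent: chunk.pageContent,
      metadata: { ...chunk.metadata, docId: index },
    }),
);
await vectorStore.addDocuments(documents);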

Better But Inconclusive 🤔

For now, the results of the changes are inconclusive. But anecdotally, there seems to be a massive improvement in the responses from the LLM. Logically, it makes sense that adding keyword search would only improve results and not detract from them; LLMs are good at ignoring context when appropriate. Moving forward, keyword search deserves more consideration in RAG. It’s well-known, effective, and probably matches existing user expectations for search-like functionality. It’s natural to ask a question and expect the retrieved content to contain bits and pieces of it. In fact, I’d bet that this is already the expectation of many users of RAG applications.

References

  1. Lumos (GitHub)
  2. Local LLM in the Browser Powered by Ollama (Part 1)
  3. Local LLM in the Browser Powered by Ollama (Part 2)
  4. Let’s Normalize Online, In-Memory RAG! (Part 3)
  5. Supercharging If-Statements With Prompt Classification Using Ollama and LangChain (Part 4)
  6. A Guide to Gotchas with LangChain Document Loaders in a Chrome Extension (Part 6)
