Augmenting LLM Applications with Database Access

Neel Phadnis · Published in Notes and Points · Jul 6, 2023

This post extends the approach described in the previous post, Adding Similarity Search, for incorporating similarity search in a database. Specifically, to stay within an LLM’s context window limit, a large document must be split into smaller chunks, and the scheme must be adapted to handle chunks. The post then points out a piece of low-hanging fruit: documents stored in a database can be made available to LLM applications by integrating with the document loader interface in LangChain (and similar LLM frameworks), without any changes to the database.

Motivation

Many LLMs are pre-trained on finite public data, and naturally have no knowledge of events after their training cutoff. External data access allows information that an LLM has not seen to be included as prompt context. This improves the currency, accuracy, and relevance of responses, and alleviates hallucinations.

Another issue is that the context window of an LLM is limited (typically a few thousand tokens). For question-answering (QA) or summarization tasks involving larger documents, the documents must be split into smaller chunks and processed over multiple LLM requests.

Extending the similarity search scheme

The previous post, Adding Similarity Search, discusses an architecture for adding similarity search to a database by partitioning the document space into sub-clusters and adding an index for sub-cluster membership.

The context window sizes supported by LLMs continue to increase, and for many tasks a large proportion of documents may not need chunking. For example, the average length of a Wikipedia article is about 650 words, and the “optimal” length for an online blog post is thought to be around 1–2K words. Nevertheless, there will always be documents larger than the window supported by the LLM in use, so, as noted above, stored documents must be chunked to an appropriate size.

Two changes are needed to extend the scheme to multiple document chunks (a minimal sketch follows the list):

  1. Compute sub-clusters on document chunks instead of whole documents, and maintain a sub-cluster -> [chunk-id] index.
  2. Maintain a doc-id -> [chunk-id] mapping.
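As a concrete illustration, here is a minimal sketch of the two mappings in Python. The class and method names are hypothetical and stand in for whatever index structures the database provides:

from collections import defaultdict

class ChunkIndex:
    """Hypothetical sketch of the two chunk-level mappings."""

    def __init__(self):
        # sub-cluster-id -> [chunk-id]: each chunk is assigned to the
        # sub-cluster whose centroid is nearest to the chunk's embedding
        self.cluster_to_chunks = defaultdict(list)
        # doc-id -> [chunk-id]: lets query processing aggregate or
        # deduplicate chunk-level results at the document level
        self.doc_to_chunks = defaultdict(list)

    def add_chunk(self, doc_id, chunk_id, cluster_id):
        self.cluster_to_chunks[cluster_id].append(chunk_id)
        self.doc_to_chunks[doc_id].append(chunk_id)

    def candidate_chunks(self, cluster_ids):
        # Union of chunks in the sub-clusters nearest to the query
        # embedding; candidates are then re-ranked by exact similarity
        return [c for cid in cluster_ids for c in self.cluster_to_chunks[cid]]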

It should be easy to see how the data distribution and query processing described in that post can be adapted to these changes. The scheme can also accommodate other similarity index types, such as locality-sensitive hashing (LSH) and hierarchical navigable small world (HNSW) graphs, without significant changes to the architecture.

Leveraging LLM frameworks

The low-hanging fruit is to make documents in a database accessible to LLM applications through the document loader interface of LLM frameworks.

We look at two LLM frameworks, LangChain and LlamaIndex, describe the work involved, and point to the specifics in the respective documentation. The goal is to highlight the possibility and the ease with which it can be achieved. Database developers can decide whether the effort is worthwhile, depending on how much their LLM applications need to access the documents for QA, summarization, and other tasks.

It is useful to point out that two levels of integration are possible. A low-effort integration, as described here, provides access to documents stored in the database but does not use the database for storing embeddings or indexes, or for executing similarity searches. With a deeper integration, the database can be used as a vector store that stores embeddings and indexes and executes similarity searches.

LangChain document loader interface

LangChain is an orchestration framework for LLM applications with the following key capabilities:

  • A common interface and utilities for working with LLMs
  • Chains, which are sequences of calls to an LLM or a utility
  • Many integrations with tools

The Document abstraction includes text and associated metadata, and a document loader exposes a “load” method for loading data as documents from a configured source. The document loader interface is defined in the BaseLoader class in the LangChain repository. It requires the following methods to be implemented:

def load(self) -> List[Document]:
    """Load data into Document objects."""

def load_and_split(self, text_splitter: Optional[TextSplitter] = None) -> List[Document]:
    """Load Documents and split into chunks. Chunks are returned as Documents."""

def lazy_load(self) -> Iterator[Document]:
    """A lazy loader for Documents."""

The GitHub repository has many examples of existing implementations, and creating a similar implementation to support a new database is relatively straightforward.
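For instance, a minimal loader for a hypothetical database might look like the sketch below. Only BaseLoader and Document come from LangChain; DatabaseClient and the record fields are assumptions for illustration. Implementing load and lazy_load suffices here, since load_and_split has a usable default implementation in BaseLoader.

from typing import Iterator, List

from langchain.docstore.document import Document
from langchain.document_loaders.base import BaseLoader

class YourLoader(BaseLoader):
    """Load documents from a hypothetical database via a query string."""

    def __init__(self, query: str):
        self.query = query

    def lazy_load(self) -> Iterator[Document]:
        # DatabaseClient is a placeholder for your database's client API,
        # covering connection setup and authentication
        client = DatabaseClient()
        # Each record is assumed to carry the document text and its id
        for record in client.query(self.query):
            yield Document(page_content=record["text"],
                           metadata={"id": record["id"]})

    def load(self) -> List[Document]:
        return list(self.lazy_load())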

After the database-specific loader YourLoader is implemented, it can be used simply as follows:

from langchain.document_loaders import YourLoader

loader = YourLoader('doc-query')
documents = loader.load()

As mentioned above, long documents must be split into chunks to fit into the context window. A variety of splitters are available, but splitting at arbitrary boundaries may not produce optimal results; a splitter that takes the boundaries of specific topics and subtopics into consideration is preferred for accurate embeddings and query results, so the application may need to provide a custom document splitter.
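As an example, LangChain’s built-in recursive character splitter can be passed to load_and_split; the chunk sizes below are illustrative:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prefers splitting on paragraph, then sentence, then word boundaries
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = loader.load_and_split(text_splitter=splitter)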

LlamaIndex data connector interface

LlamaIndex is a data framework that allows you to augment LLM applications with external data. It has the following key concepts:

  • Data connectors: Allow access to data in a data source, similar to LangChain’s document loaders. Data connectors are published on LlamaHub.
  • Structures: The framework provides index and graph structures over the data that can easily be used in LLM tasks.
  • Retrieval and query interfaces: To get prompt context and augmented output, respectively (a minimal usage sketch follows this list).
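The sketch below shows these concepts in the core flow; the directory reader and the query string are placeholders, and any data connector can stand in for the reader:

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Data connector: load documents from a source
documents = SimpleDirectoryReader("./docs").load_data()

# Structure: build an index over the documents
index = VectorStoreIndex.from_documents(documents)

# Retrieval and query: fetch relevant context, then generate an
# augmented answer
query_engine = index.as_query_engine()
response = query_engine.query("What does the document say about chunking?")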

LlamaIndex also provides tools for easy integration with other frameworks, including LangChain and ChatGPT.

The database connector interface provides the basic integration of a database. A connector implementation consists of functions for (a sketch follows the list):

  • Server initialization and authentication.
  • Executing a query and returning documents from the specified database instance. The query is database specific and can take one or more forms, such as a query string, a list of document ids, or a predicate, along with the fields containing the document text and id.
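A minimal reader sketch, assuming the BaseReader interface and a recent LlamaIndex version; the client object and record fields are hypothetical:

from typing import List

from llama_index import Document
from llama_index.readers.base import BaseReader

class YourDatabaseReader(BaseReader):
    """Read documents from a hypothetical database."""

    def __init__(self, client):
        # `client` is a placeholder wrapping server initialization
        # and authentication for your database
        self.client = client

    def load_data(self, query) -> List[Document]:
        # The query form is database specific: a query string, a list
        # of document ids, or a predicate
        return [Document(text=record["text"], metadata={"id": record["id"]})
                for record in self.client.query(query)]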

The specific instructions for adding a loader are in the LlamaHub documentation.

The available data connectors are published on LlamaHub.

Beyond simple document retrieval

LangChain offers not only document loaders, but also VectorStore and Retriever abstractions. These interfaces provide storage and query capabilities and are defined in the GitHub repository.
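As an illustration of the two abstractions with a built-in vector store (FAISS here, standing in for a deeper database integration; the query string is illustrative):

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# Embed the loaded chunks and index them in the vector store
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# The Retriever abstraction returns documents relevant to a query
retriever = vectorstore.as_retriever()
relevant_docs = retriever.get_relevant_documents("context window limits")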

LlamaIndex similarly provides greater control over indexes and allows the database to be used as a vector store. You can find out from the GitHub repository how your database can be integrated into LlamaIndex as a provider of the following (a wiring sketch follows the list):

  • Document stores: Store document nodes.
  • Index stores: Store index metadata.
  • Vector stores: Store and query embeddings.
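A sketch of wiring such providers into LlamaIndex via its storage context; each Your...Store class is a hypothetical implementation of the corresponding interface, backed by your database:

from llama_index import StorageContext, VectorStoreIndex

# Hypothetical store implementations backed by your database
storage_context = StorageContext.from_defaults(
    docstore=YourDocumentStore(),
    index_store=YourIndexStore(),
    vector_store=YourVectorStore(),
)
index = VectorStoreIndex.from_documents(documents,
                                        storage_context=storage_context)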
