LlamaIndex in-depth practice: How to build a reliable storage subsystem with MongoDB Atlas

Luoning Nici
8 min read · Sep 3, 2023


I am writing this article because I ran into a number of problems while using LlamaIndex to build a chatbot on top of a private knowledge base. Some of these issues are with LlamaIndex and LlamaHub themselves, and I have already filed issues on GitHub. Others came down to understanding the tools and documentation, and resolving them cost me a considerable amount of time.

In this article (or series of articles), readers will see the following problems solved:

  1. How to use MongoDBAtlasVectorSearch as the vector database implementation. If you only follow the code and examples in the official documentation (as of version 0.8.5), you will find that they do not actually get your program running.
  2. How to correctly add and update documents to avoid duplicate content in the database.

1. Introduction

Personally, building a chatbot based on a personal knowledge base is a choice I have made out of necessity. I don’t want to rely on online services because personal digital assistant services haven’t matured enough to form standard service models yet, although this may change with the introduction of products like OpenAI’s. I also haven’t found suitable tools on GitHub yet. Quivr may be a solution in the future, but it’s not quite there yet. Lastly, I have a personal constraint as my location requires accessing OpenAI through a VPN, which adds complexity to my chatbot’s network configuration.

Nevertheless, this process has been enjoyable. I have been gradually delving into understanding the communication with LLM, especially ChatGPT. It feels unique, like learning how to converse with a future system.

2. LlamaIndex’s Storage System

If you are not familiar with LlamaIndex, please visit their official website, especially to understand the RAG (Retrieval-Augmented Generation) pattern. For LlamaIndex’s storage system, there is a great article on Medium that provides a detailed introduction, from which I also gained a lot of knowledge. In short, LlamaIndex uses the following pattern to store document information:

  1. The text of a document is divided into several Nodes, also known as “chunks”;
  2. Using the document ID as the primary key, the objects representing each document, mainly metadata such as file name and hash, are stored in the Document Store, along with the list of Nodes for that document;
  3. Using the Node ID as the primary key, the Node’s embedding is stored in the Vector Store.
(Figure: LlamaIndex storage structure)
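
To make the document-to-node split concrete, here is a minimal sketch of the chunking step, using the 0.8.x node parser API (the text and IDs are placeholders):

from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

doc = Document(text="A long document body ...", doc_id="doc-1")
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents([doc])
for node in nodes:
    # each chunk gets its own ID and keeps a reference to its source document
    print(node.node_id, node.ref_doc_id)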

So, where does the actual text reside? This question is somewhat complex. In theory, the text should be stored in the Document Store as an attribute field of the Node. However, in practice, this depends on the specific implementation of the Vector Store. If a particular implementation of the Vector Store, such as ChromaVectorStore, declares the property “stores_text” as True, then the text of the Node will be stored in the Vector Store together with the embedding, rather than in the Document Store. This design, originally intended to simplify read operations, can result in data inconsistency and lead to serious consequences. One of these consequences will be described in detail later in the document.
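
You can check this behavior directly, since stores_text is a plain attribute on every vector store implementation (a two-line sketch, assuming a vector_store instance like the one constructed later in this article):

# True means node text is written into the vector store alongside the embedding,
# instead of into the Document Store
print(vector_store.stores_text)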

In addition to the aforementioned Document Store and Vector Store, LlamaIndex also has an Index Store, which is used to store query-oriented index information, and a Graph Store, which is used to store knowledge graph information. These two parts are not relevant to the content of this article.

3. Why MongoDB Atlas?

Using LlamaIndex can be quite simple: only 5 lines of code are needed to query documents in a local directory, as the sketch below shows. By default, LlamaIndex provides an in-memory storage implementation that is enough to run the basic examples. However, if our system is expected to store a large number of documents, respond quickly to queries and interactions, and remain constantly available, we need a reliable and scalable storage infrastructure.
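
For reference, here is the canonical minimal example, roughly as the official documentation shows it (it assumes a ./data directory of documents and an OPENAI_API_KEY in the environment):

from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))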

Therefore, migrating storage to the cloud would be a reasonable solution. In LlamaIndex, the Document Store and Vector Store are implemented separately mainly because the Vector Store requires database support for vector similarity retrieval, a feature that is not present in most traditional databases.

Personally, I prefer not to use two separate cloud databases with corresponding API keys, as it would make the system unnecessarily cumbersome. So when I discovered that LlamaIndex has MongoDB implementations for the Index Store, Document Store, and Vector Store, I chose MongoDB Atlas as my preferred option.

First, register for MongoDB Atlas. Rest assured, the free tier is more than sufficient for the initial needs of your project. Only when your system’s data volume keeps growing will you need to consider either paying for MongoDB Atlas services or running your own database server.

4. Actually building our knowledge base

Before actually starting to code, it is necessary to plan the structure of the file system. Below is the directory structure that I am using:

.                      ------------------ home
./data/                ------------------ your documents here
./logs/                ------------------ logs directory
./private_config.ini   ------------------ configuration file

I strongly recommend putting all sensitive information, such as API keys, and personalized settings, such as database and collection names, into a configuration file, and making sure that this file is never committed to version control. (You can commit an example configuration file, such as “sample_config.ini”, as long as the code does not actually use it.)
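
As an illustration, a sample_config.ini matching the keys read by the code below might look like this (all values are placeholders):

[default]
INDEX_ID = my_index

[mongodb]
URI = mongodb+srv://<user>:<password>@<cluster>.mongodb.net
DB_NAME = llama_index
VECTOR_COLLECTION = vectors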

Build the StorageContext

In LlamaIndex, the StorageContext assembles a complete storage solution from concrete instances of each Store. The three MongoDB implementations of the Stores are:

  1. MongoDocumentStore
  2. MongoIndexStore
  3. MongoDBAtlasVectorSearch

As implied by these names, they are not designed in a unified manner, so their initialization also varies slightly.

import pymongo
from configparser import ConfigParser
from llama_index import StorageContext
from llama_index.storage.docstore import MongoDocumentStore
from llama_index.storage.index_store import MongoIndexStore
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch


def get_mongo_storage() -> StorageContext:
    # load settings from private_config.ini (in my code this sits behind a
    # module-level helper, global_light, in myutils)
    config_parser = ConfigParser()
    config_parser.read('private_config.ini')
    uri = config_parser.get('mongodb', 'URI', fallback=None)
    db_name = config_parser.get('mongodb', 'DB_NAME', fallback=None)
    collection_name = config_parser.get('mongodb', 'VECTOR_COLLECTION', fallback=None)
    assert uri is not None, 'no db uri specified!'
    assert db_name is not None, 'no db name specified!'
    assert collection_name is not None, 'no vector collection name specified!'

    # the vector store takes a pymongo client; the other two stores accept a URI
    client = pymongo.MongoClient(uri)
    vector_store = MongoDBAtlasVectorSearch(client,
                                            db_name=db_name,
                                            collection_name=collection_name)
    index_store = MongoIndexStore.from_uri(uri=uri, db_name=db_name)
    doc_store = MongoDocumentStore.from_uri(uri=uri, db_name=db_name)
    storage_context = StorageContext.from_defaults(
        docstore=doc_store,
        index_store=index_store,
        vector_store=vector_store
    )
    return storage_context

Initializing the database

Here, a fixed placeholder record is used to initialize the collections behind each Store.

import logging
import os

import openai
from llama_index import Document, VectorStoreIndex

import myutils

# global_light holds the parsed private_config.ini (a helper in myutils)
global_conf = myutils.global_light.config
INDEX_ID = global_conf.get('default', 'INDEX_ID')

assert os.getenv("OPENAI_API_KEY") is not None, "please set openai key!"
assert INDEX_ID, "no index id set!"

openai.api_key = os.getenv("OPENAI_API_KEY")
# ======= end of init ===============
# drop the current mongodb collections if there are any
myutils.clear_mongo_stores()

storage_context = myutils.get_mongo_storage()
documents = [Document(text="the first document.", doc_id="##Zero")]
# create the new stores around a single placeholder document
index = VectorStoreIndex.from_documents(documents=documents,
                                        storage_context=storage_context)
index.set_index_id(INDEX_ID)
storage_context.persist()
logging.info("llama indices initialized as '%s'.", INDEX_ID)

Loading documents into the database

from llama_index import SimpleDirectoryReader, VectorStoreIndex

import myutils

filepath = './data'  # your documents directory
extractors = None    # optional map of file suffix -> custom reader; None uses the defaults
documents = SimpleDirectoryReader(input_dir=filepath,
                                  file_extractor=extractors,
                                  filename_as_id=True).load_data()
storage_context = myutils.get_mongo_storage()
index = VectorStoreIndex.from_documents(documents=documents,
                                        storage_context=storage_context)

Why is there no answer in the query?

According to the official documentation, running just the following two lines of code (plus an import) should let you leverage ChatGPT’s query capabilities over a personal knowledge base.

from llama_index import load_index_from_storage

engine = load_index_from_storage(storage_context).as_query_engine()
print(engine.query("What did the author do growing up?"))

But if you actually do that, you will find that every attempt to retrieve the top-k embeddings from MongoDB returns nothing:

DEBUG:llama_index.vector_stores.mongodb:Running query pipeline: [{'$search': {'index': 'default', 'knnBeta': {'vector': [0.010108518414199352, ...]}}}]
DEBUG:llama_index.vector_stores.mongodb:Result of query: VectorStoreQueryResult(nodes=[], similarities=[], ids=[])
DEBUG:llama_index.indices.utils:> Top 0 nodes:

Why is this happening? It took me a while to figure out that the knnVector index type has not been officially released in the standard MongoDB tooling yet. The official pymongo library (my version is 4.4.1) therefore cannot create such indexes, and the queries come back empty. Unfortunately, the LlamaIndex documentation for MongoDBAtlasVectorSearch does not mention this. Even the Atlas indexing UI for collections offers no direct support for it.

Creating a knn index

Fortunately, Atlas provides a handy JSON Editor for creating indexes on collections. Follow the steps below to create a knn index on the field that stores the embeddings, and to set the similarity function:

  1. Log in to your Atlas account and locate the collection corresponding to the Vector Store. In my case, it is “llama_index/vectors”. If you are using the default names, it should be “default_db/default_collection”.
  2. Create a query index for the collection.
  3. Then, navigate to the page for modifying the query index and switch to the “JSON Editor” mode. Keep the index name as “default” (the name the query pipeline expects) and update the definition with the following content:
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  }
}

4. Wait for a few seconds to let the new index take effect.
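
As an aside: newer pymongo releases (4.5 and later) add Collection.create_search_index, so the same index can also be created programmatically instead of through the Atlas UI. A sketch, assuming my database and collection names and the connection URI from the configuration file:

import pymongo
from pymongo.operations import SearchIndexModel

client = pymongo.MongoClient(uri)
collection = client['llama_index']['vectors']
# the index name must stay 'default', which is what the query pipeline expects
model = SearchIndexModel(
    definition={
        'mappings': {
            'dynamic': True,
            'fields': {
                'embedding': {
                    'dimensions': 1536,
                    'similarity': 'cosine',
                    'type': 'knnVector',
                }
            }
        }
    },
    name='default',
)
collection.create_search_index(model)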

Now, we can let GPT truly answer our questions!

DEBUG:openai:message='Request to OpenAI API' method=post path=https://api.openai.com/v1/completions
DEBUG:urllib3.connectionpool:https://api.openai.com:443 "POST /v1/completions HTTP/1.1" 200 None
Growing up, the author wrote short stories, experimented with programming on an IBM 1401, nagged his father to buy a TRS-80 computer, wrote simple games, a program to predict how high his model rockets would fly, and a word processor. He also studied philosophy in college, switched to AI, and worked on building the infrastructure of the web. He wrote essays and published them online, had dinners for a group of friends every Thursday night, painted, and bought a building in Cambridge.

5. Continuous Updating of the Database

In the loading step above, we used VectorStoreIndex.from_documents, a function that processes every file in the directory each time and rebuilds the Vector Store from scratch.

Obviously, we need an update mechanism that processes and incorporates only the files that have been newly added or changed.

Here is the implementation code for this feature:

import logging

from llama_index import SimpleDirectoryReader
from llama_index.indices.base import BaseIndex


def refresh_documents(filepath: str,
                      index: BaseIndex):
    # extractors = {'.pdf': LocalPDFReader(concat_pages=True, hash_as_file=False)}
    extractors = None  # optional custom readers; None uses the defaults
    documents = SimpleDirectoryReader(input_dir=filepath,
                                      file_extractor=extractors,
                                      filename_as_id=True).load_data()
    # show whether each document already exists in the index
    for doc in documents:
        doc_id = doc.get_doc_id()
        logging.debug(f"doc {doc_id} exists? {index.docstore.get_document_hash(doc_id)}")
    updated = index.refresh_ref_docs(documents)
    logging.debug("updated %s with status %s", filepath, updated)

In the commented-out extractors line, I use a LocalPDFReader to handle PDF files, because the two PDF parsing classes provided by LlamaIndex both had issues: the default PDFReader treats each page of a PDF file as a separate document, and the CJKPDFReader in LlamaHub had a bug that caused some PDF files to be repeatedly re-indexed as if their text had changed. I have reported the issue to LlamaHub and implemented LocalPDFReader as a workaround, but the details of that workaround are beyond the scope of this article.

6. Conclusion

I realize that interacting with LLM systems like ChatGPT is very different from traditional programming. We cannot fully control the output of LLM and we expect it to produce outputs beyond our own capabilities. It is this aspect that makes LLM systems seem “intelligent.”

However, before diving deeper into applications of LLM, we need to have the basic ability to communicate with LLM. LlamaIndex is an excellent tool that abstracts the application patterns of LLM, thus reducing the complexity of directly programming with the LLM API. But when we want to control and extend our applications, understanding the underlying implementation of tools like LlamaIndex becomes necessary.

Therefore, my plan is to start from the storage aspect and replace the default implementation of LlamaIndex with my preferred storage model. This way, I can gain a deeper and more complete understanding of its internal structure. Fortunately, I have encountered some issues and had the opportunity to solve them, which has allowed me to truly comprehend those details.

I cannot anticipate who the readers of this article will be, but I hope it can be helpful to those who share similar thoughts and aspirations.

7. References

1. LlamaIndex official website

2. MongoDB Atlas home

3. LlamaIndex: Comprehensive guide on storage
