The aRt of RAG Part 1: Textract, MongoDB Atlas and LlamaIndex
Introduction
With the advent of OpenAI’s ChatGPT and Facebook’s release of Llama, there has been an explosion in the development and use of Large Language Models (LLMs). LLMs are great, but from an enterprise perspective, if you can’t use enterprise data with them, they quickly become a novelty and not much else. Because an LLM’s knowledge is fixed at training time, Retrieval Augmented Generation (RAG) has become the go-to method for giving these models access to unseen data. This in turn has led to wider use of vector stores and semantic search to find relevant data and feed it to LLMs.
Most published examples of RAG rely on a PDF reader of some description to extract text, which is then chunked and loaded into a vector store. However, this does not work for PDFs whose text is locked inside images, such as scanned pages. In that case the text has to be extracted from the image before it can be used.
Also, at the time of writing, MongoDB released an upgrade to Atlas that allows it to be used as a vector store. So, in this article I’ll cover combining an OCR engine (Textract) with MongoDB Atlas as our vector store for RAG. The glue that brings all this together is LlamaIndex.
Why LlamaIndex? I found working with LlamaIndex easier than LangChain. It is more focused and includes some really nice features that make it, in my mind, the go-to tool right now for RAG.
Although RAG is conceptually quite straightforward, the ‘R’ part is quite challenging to master. In this article I want to focus on the ‘R’ of RAG. This is the first in a series of articles I intend to write exploring and elaborating on building advanced RAG systems.
Text extraction
Textract has been around for quite a while now. It was initially a bare-bones API call but has evolved over time to include more and more features. AWS followed this up by building a utility library, called Textractor, which makes using Textract quite easy. The only other thing you will need is an AWS account to set up an S3 bucket and the Textract service. Once you have these you’re good to go. Sample data has been sourced from DocBank. Let’s begin with basic document text extraction. Here is a cropped section of the DocBank image:
Using the following code snippet as a scaffold we can turn this into machine readable text:
import json
import trp
from textractcaller.t_call import call_textract
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo
docname = "s3://bucketname/10.tar_1701.04170.gz_TPNL_afterglow_evo_8.jpg"
# Plain text detection only: no extra features (tables, forms) are requested
textract_json = call_textract(input_document=docname, features=[])
# Save the raw response so the image does not have to be processed again
file = '10.tar_1701.04170.gz_TPNL_afterglow_evo_8.json'
with open(file, 'wt') as handle:
    json.dump(textract_json, handle)
with open(file, 'rt') as handle:
    doc = json.load(handle)
# Re-order the blocks by their position on the page, then collect the text per page
t_doc = TDocumentSchema().load(doc)
ordered_doc = order_blocks_by_geo(t_doc)
trp_doc = trp.Document(TDocumentSchema().dump(ordered_doc))
texts = [page.text for page in trp_doc.pages]
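As a rough illustration of what the geometric re-ordering step is doing, the sketch below sorts Textract LINE blocks by the position of their bounding boxes. It is a simplified stand-in, not the implementation of order_blocks_by_geo. The function name lines_in_reading_order and the sample values are made up for the example, and a plain top-to-bottom sort like this will not cope with multi-column pages.

```python
from typing import Any, Dict, List


def lines_in_reading_order(textract_json: Dict[str, Any]) -> List[str]:
    """Return the text of LINE blocks sorted top-to-bottom, then left-to-right."""
    lines = [b for b in textract_json.get("Blocks", []) if b.get("BlockType") == "LINE"]
    # Round Top so that lines on (almost) the same baseline are ordered by Left
    lines.sort(key=lambda b: (round(b["Geometry"]["BoundingBox"]["Top"], 2),
                              b["Geometry"]["BoundingBox"]["Left"]))
    return [b["Text"] for b in lines]


# Tiny hand-made response in the shape Textract returns (all values are invented)
sample = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "second line",
     "Geometry": {"BoundingBox": {"Top": 0.20, "Left": 0.10}}},
    {"BlockType": "LINE", "Text": "first line",
     "Geometry": {"BoundingBox": {"Top": 0.10, "Left": 0.10}}},
]}
print(lines_in_reading_order(sample))
```

For real documents the library routine is the better choice; the sketch is only meant to show why ordering by geometry matters before the text is chunked.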
Vector database
MongoDB recently added the ability to use Atlas as a vector data store and search engine. This means you can use any embedding engine you like to create your vectors, store them and then use them for search. The nice thing about this is that you have one integrated system for data and search: less to deploy, manage and maintain. To get started, all we need to do is set up a MongoDB account. The free tier gives you access to Atlas.
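To make "store vectors, then search them" concrete, here is a minimal brute-force sketch of a nearest-neighbour search using Euclidean distance. It is only an illustration of the idea: Atlas uses its own search index rather than scanning every document, and the function name euclidean_top_k, the toy three-dimensional vectors and the field names are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_top_k(query: Sequence[float],
                    docs: List[Dict],
                    k: int = 2,
                    path: str = "embedding") -> List[Tuple[float, str]]:
    """Rank stored documents by Euclidean distance to the query vector."""
    scored = [(math.dist(query, d[path]), d["text"]) for d in docs]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]


# Toy "collection": each document carries its text and a made-up embedding
docs = [
    {"text": "cosmic rays", "embedding": [0.9, 0.1, 0.0]},
    {"text": "magnetic fields", "embedding": [0.7, 0.3, 0.1]},
    {"text": "cooking pasta", "embedding": [0.0, 0.2, 0.9]},
]
hits = euclidean_top_k([1.0, 0.0, 0.0], docs, k=2)
print([text for _, text in hits])
```

Real embeddings have hundreds of dimensions, which is why a dedicated index is needed once the collection grows.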
LlamaIndex
The glue. Once we have our text extracted and our MongoDB account created, we can start building our vector database. To generate the text chunks to insert into MongoDB, I used LlamaIndex, which provides some nice functions for chunking text. To demonstrate its flexibility I’ve also used Hugging Face sentence transformers for generating the embeddings. Combining the chunking and embeddings, we can populate our index:
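# Imports follow the pre-0.10 llama_index package layout (the ServiceContext era)
from llama_index import StorageContext, VectorStoreIndex
from llama_index.storage.index_store import MongoIndexStore
from llama_index.vector_stores import MongoDBAtlasVectorSearch
def atlas_llamaindex(client, db_name, collection_name, sentence_nodes, index_name, service_context):
    '''
    Embed the text nodes and store them in a MongoDB Atlas collection.
    reference
    https://gpt-index.readthedocs.io/en/latest/module_guides/storing/index_stores.html
    search index Mongo definition
    {
      "fields":[
        {
          "numDimensions": 768,
          "path": "embedding",
          "similarity": "euclidean",
          "type": "vector"
        }
      ]
    }
    Parameters
    ----------
    client : pymongo.MongoClient
        Client connected to the Atlas cluster.
    db_name : str
        Database that holds the vectors and the index store.
    collection_name : str
        Collection the text nodes and their embeddings are written to.
    sentence_nodes : list
        Text nodes (chunks) produced by the sentence splitter.
    index_name : str
        Name of the Atlas search index used at query time.
    service_context : ServiceContext
        Carries the embedding model used to embed the nodes.
    Returns
    -------
    index : VectorStoreIndex
        Index backed by the Atlas collection.
    '''
    mongodb_client = client
    # mongo_uri is not a parameter: it must be defined at module level (the Atlas connection string)
    index_store = MongoIndexStore.from_uri(uri=mongo_uri, db_name=db_name)
    # Use the index_name argument rather than a hard-coded name
    vector_store = MongoDBAtlasVectorSearch(mongodb_client, db_name=db_name, collection_name=collection_name, index_name=index_name)
    storage_context = StorageContext.from_defaults(vector_store=vector_store, index_store=index_store)
    index = VectorStoreIndex(sentence_nodes, storage_context=storage_context, service_context=service_context)
    return index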
Logging into MongoDB, you should now see 42 documents in your database. These are the 42 text nodes (chunks) generated by the sentence splitter.
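The sentence splitter that produced those nodes is not shown in this article. As a rough sketch of the idea, assuming a simple regex-based sentence boundary and a character budget per chunk, the code below groups whole sentences into overlapping chunks. The function chunk_sentences and its parameters are hypothetical and are not LlamaIndex's implementation, which works on tokens and handles many more edge cases.

```python
import re
from typing import List


def chunk_sentences(text: str, max_chars: int = 200, overlap: int = 1) -> List[str]:
    """Group whole sentences into chunks of roughly max_chars, sharing `overlap` sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last `overlap` sentences forward so chunks share context
            current = current[-overlap:] if overlap else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


page_text = ("Textract returns the text of the page. The text is split into sentences. "
             "Sentences are grouped into chunks. Each chunk becomes one document.")
for chunk in chunk_sentences(page_text, max_chars=80, overlap=1):
    print(repr(chunk))
```

Each chunk would then be embedded and written as one document, which is why the number of documents in the collection matches the number of nodes.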
Search
Now that we have our data in our database, we can search it. To do that we have to manually create a search index in MongoDB (currently this is the only way to construct these search indexes). Go to “Database”, then select “Search”.
There you can create an index. Make sure you associate it with your collection, in this case db_name='medium01', collection_name='test_collection', and index_name='test_index'. Then add in the mapping definition as shown below.
Note: in the above image the index name is “default”. For this example remember to change the name to “test_index”. Once it has finished creating the index, you should see it listed as shown below.
Update: MongoDB has now deprecated the knnVector type. The JSON should now look like this:
{
  "fields": [
    {
      "numDimensions": 768,
      "path": "embedding",
      "similarity": "euclidean",
      "type": "vector"
    }
  ]
}
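The index can only match vectors stored under the configured path and with the configured number of dimensions, so a mismatch between the stored documents and the definition is an easy way to end up with empty results. The sketch below is a small sanity check written for this article's definition. The helper check_documents is hypothetical, and the documents are plain dictionaries standing in for what you would read back from the collection.

```python
from typing import Any, Dict, List

# Same definition as in the article, expressed as a Python dict
index_definition = {
    "fields": [
        {"numDimensions": 768, "path": "embedding",
         "similarity": "euclidean", "type": "vector"}
    ]
}


def check_documents(docs: List[Dict[str, Any]], definition: Dict[str, Any]) -> List[str]:
    """Return a list of problems that would stop documents matching the index."""
    field = definition["fields"][0]
    path, dims = field["path"], field["numDimensions"]
    problems = []
    for i, doc in enumerate(docs):
        if path not in doc:
            problems.append(f"doc {i}: no '{path}' field")
        elif len(doc[path]) != dims:
            problems.append(f"doc {i}: {len(doc[path])} dimensions, expected {dims}")
    return problems


# Stand-in documents: one correct, one with a misspelt field, one from a smaller model
docs = [
    {"text": "ok", "embedding": [0.0] * 768},
    {"text": "wrong field", "embeddings": [0.0] * 768},
    {"text": "wrong size", "embedding": [0.0] * 384},
]
for problem in check_documents(docs, index_definition):
    print(problem)
```

If you swap in a different embedding model, numDimensions in the index definition has to change to match that model's output size.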
The final step is to query our index. To do that we use the following function:
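# Imports follow the pre-0.10 llama_index package layout (the ServiceContext era)
from llama_index import StorageContext, load_index_from_storage
from llama_index.storage.index_store import MongoIndexStore
from llama_index.vector_stores import MongoDBAtlasVectorSearch
def atlas_search(client, db_name, collection_name, index_name, service_context, similarity_top_k, query):
    '''
    Retrieve the text nodes most similar to the query from MongoDB Atlas.
    Parameters
    ----------
    client : pymongo.MongoClient
        Client connected to the Atlas cluster.
    db_name : str
        Database that holds the vectors and the index store.
    collection_name : str
        Collection the text nodes and their embeddings were written to.
    index_name : str
        Name of the Atlas search index created for the collection.
    service_context : ServiceContext
        Carries the embedding model used to embed the query.
    similarity_top_k : int
        Number of nodes to return.
    query : str
        Text to search for.
    Returns
    -------
    response : list
        Retrieved nodes with their similarity scores.
    '''
    mongodb_client = client
    # mongo_uri is not a parameter: it must be defined at module level (the Atlas connection string)
    index_store = MongoIndexStore.from_uri(uri=mongo_uri, db_name=db_name)
    store = MongoDBAtlasVectorSearch(mongodb_client, db_name=db_name, collection_name=collection_name, index_name=index_name)
    storage_context = StorageContext.from_defaults(vector_store=store, index_store=index_store)
    mindex = load_index_from_storage(storage_context=storage_context, service_context=service_context)
    # Honour the similarity_top_k argument instead of a hard-coded 4
    base_retriever = mindex.as_retriever(similarity_top_k=similarity_top_k)
    response = base_retriever.retrieve(query)
    return response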
We call the function with the following parameters and we’re done :)
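# Assumption: HuggingFaceEmbeddings is the LangChain wrapper; ServiceContext comes from llama_index
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext
# client (a pymongo.MongoClient) and mongo_uri must already be defined
db_name = 'medium01'
collection_name = 'test_collection'
index_name = 'test_index'
embed_model = HuggingFaceEmbeddings(model_name='distilbert-base-nli-stsb-mean-tokens')
# llm=None: we only retrieve here, no answer is generated
service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
query = "Our Monte Carlo model predicts that protons are easily accelerated beyond the knee \
in the cosmic ray gy density as the plasma expands downstream from the spectrum; the high magnetic fields"
similarity_top_k = 4
response = atlas_search(client, db_name, collection_name, index_name, service_context, similarity_top_k, query)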
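To look at what came back, a small helper along these lines can print each hit's score next to a preview of its text. It assumes the retriever returns objects that expose .score and .node.get_content(), which is how NodeWithScore behaves in the LlamaIndex versions I am aware of, so check this against your installed version. The helper summarize_hits and the FakeNode and FakeHit stand-ins are hypothetical and exist only so the example runs without a database.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple


def summarize_hits(response: Iterable[Any], max_chars: int = 80) -> List[Tuple[float, str]]:
    """Return (score, text preview) pairs for each retrieved node."""
    rows = []
    for hit in response:
        # Flatten newlines so each preview fits on one line
        text = hit.node.get_content().replace("\n", " ")
        rows.append((hit.score, text[:max_chars]))
    return rows


# Stand-ins that mimic the attributes used above (not LlamaIndex classes)
@dataclass
class FakeNode:
    text: str

    def get_content(self) -> str:
        return self.text


@dataclass
class FakeHit:
    node: FakeNode
    score: float


fake_response = [
    FakeHit(FakeNode("protons are accelerated\nbeyond the knee"), 0.91),
    FakeHit(FakeNode("the high magnetic fields"), 0.84),
]
for score, preview in summarize_hits(fake_response):
    print(f"{score:.2f}  {preview}")
```

With the real retriever you would pass the response from atlas_search instead of fake_response.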
Conclusion
MongoDB’s vector database is still quite new. There are a few things missing from the API, but I’ve no doubt that those missing elements will be added over time. I’m very impressed with how fast MongoDB is, and how quickly LlamaIndex has added support for it.
Stay tuned for the next in the series.