Augmenting Large Language Models with Verified Information Sources: Leveraging AWS SageMaker and OpenSearch for Knowledge-Driven Question Answering

Arun Shankar
17 min read · Apr 9, 2023


TLDR: This article offers a detailed guide on how to enhance the accuracy of large language models (LLMs) by incorporating custom domain knowledge into the text generation process. It provides a high-level architecture blueprint along with code samples for building a prototype web application for retrieval-augmented question answering. The prototype uses two LLMs from the SageMaker JumpStart Foundation Model Hub: a text embedding model for semantic search and a text generation model for generative question answering (QA). It uses Amazon OpenSearch Service to store and index data and to perform vector search. The post covers essential concepts such as the differences between closed- and open-domain QA, reading comprehension, dense passage retrieval (DPR), and retrieval-augmented question answering. The ultimate goal of this article is to enable readers to create a working prototype for factual generative question answering for any domain at scale. By following this guide, you will learn how to integrate your own custom domain knowledge into LLMs to generate factual, precise responses and reduce the likelihood of hallucinations. You can find a GitHub repository containing example notebooks for the prototype solution at the following location.

Knowledge-augmented LLMs
Image generated with Stable Diffusion

Large language models (LLMs), commonly referred to as foundation models, are engineered to understand and produce text that closely mimics human language. These foundation models undergo training on vast quantities of textual data, enabling them to perform a broad array of natural language processing (NLP) tasks, such as summarization, question answering, paraphrasing, and numerous others, with exceptional precision and effectiveness.

Although LLMs offer many benefits for NLP tasks, they may not always provide factual or precisely relevant responses for specific domain use cases. This limitation is especially significant for enterprise customers with vast amounts of proprietary data who require highly precise, domain-specific answers. For organizations seeking to improve LLM performance on their custom domains, the difficulty lies in effectively integrating their proprietary domain information into the LLM.

In this article, we will provide a walk-through of a high-level solution that demonstrates how to augment LLMs with domain-specific verified information sources.

Types of Question Answering

Question answering (QA) is a key task in NLP that aims to automatically answer questions posed in natural language. There are various types of QA systems, but they can generally be grouped into two main categories: (i) open-domain QA and (ii) closed-domain QA.

Open-domain QA systems are designed to answer questions on any topic and rely on vast amounts of unstructured text data available publicly on the internet e.g., Wikipedia. In contrast, closed-domain QA systems are specialized for specific domains, such as healthcare or legal, and require a structured knowledge base for accurate responses. Another approach to QA is information retrieval (IR)-based QA, which involves retrieving relevant documents from a corpus and extracting answers from them. This method can be used in both open and closed-domain QA systems, depending on the type of corpus used.

In addition to open- and closed-domain QA, there are two other types of QA tasks: extractive and abstractive. Extractive QA extracts answers directly from a given text or corpus in the form of spans, while abstractive QA generates answers using natural language generation techniques, often paraphrasing or summarizing information from the text.

Pre-training vs Domain Adaptation vs Task-specific Fine-tuning

In the realm of closed-domain QA systems, three common approaches can be used to accurately answer user queries based on a large corpus of textual information.

The first approach involves fine-tuning an LLM using factual knowledge represented as question-answer (prompt-completion) pairs. This fine-tuning process, known as prompt-based learning, is supervised and normally involves updating the model through gradient descent. The format of the data used for fine-tuning may be specific to the task, the neural architecture of the model, and the tokenization strategy and special tokens used. Typically, this approach does not require a large amount of data and is run for a small number of epochs.

The second approach is domain adaptation, which differs from fine-tuning (see figure above). In this approach, the original base model is further pre-trained in a self-supervised manner on domain-specific unlabeled data, again updating the model through gradient descent. This approach usually requires a larger amount of data and often a custom vocabulary and tokenizer. It modifies the LLM to be more aligned with the domain, producing responses that are domain-centric. This topic was extensively discussed in our previous articles.

The third and final approach involves augmenting LLMs with external custom domain knowledge through information retrieval (IR). In this approach, a knowledge base containing domain-specific documents is used together with an IR mechanism that retrieves relevant pieces of information, such as passages or sections, referred to as “context.” This context retrieval can be facilitated through various mechanisms, such as keyword-based or fuzzy string matching techniques, term frequency-inverse document frequency (TF-IDF) based retrieval, or the probabilistic information retrieval model known as Okapi BM25. A more advanced approach leverages vector search, wherein textual data is transformed into semantically rich, contextualized embeddings via a text embedding model, enabling efficient and accurate information retrieval.
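To make the keyword-based option concrete, below is a minimal, illustrative sketch of Okapi BM25 retrieval using the open-source rank_bm25 package; the toy corpus and query are hypothetical placeholders and are not part of the solution built later in this article.

from rank_bm25 import BM25Okapi

# Toy corpus of passages (hypothetical placeholders)
corpus = ['the court dismissed the appeal',
          'the defendant was convicted of battery',
          'the contract was deemed void']
tokenized_corpus = [passage.split() for passage in corpus]

bm25 = BM25Okapi(tokenized_corpus)

# Rank passages against a keyword query and return the best match
query = 'definition of battery'
print(bm25.get_top_n(query.split(), corpus, n=1))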

Types of Question Answering (Lewis et al., 2020)

The figure above illustrates the third approach (IR-based QA) in a left-to-right orientation. Let’s dive a little deeper into this paradigm.

On the right, the most popular method involves leveraging an LLM directly through prompt engineering and performing in-context learning. Given that an LLM is trained on a vast amount of knowledge from the internet, the model captures all types of open-domain knowledge within its weights. This knowledge is parametric, meaning it is encoded in the parametric brain (weights) of the LLM. However, drawbacks of this method include potential hallucinations and responses that lack depth, especially in specialized domains such as legal or biomedical.

The leftmost architecture utilizes an external knowledge base, which can contain open-domain information like crawled Wikipedia articles or closed-domain knowledge such as specification documents from your organization’s internal data lake. This approach leverages a retrieval mechanism to index and query this data, using either keyword-based or vector-based methods (ranking functions like BM25 versus similarity metrics like cosine similarity, depending on the methodology). When dense vectors are employed, the top relevant passages (context) are retrieved via a process known as Dense Passage Retrieval (DPR). Subsequently, a reader component selects the answer span, without the involvement of an LLM. This technique is referred to as reading comprehension.

DPR is an advanced technique in NLP and IR that enhances the performance of traditional document retrieval systems. It involves generating dense vector representations of each passage or sentence within a document, and using these representations to rank and retrieve relevant passages based on a given query. DPR typically leverages an LLM (a text embedding model without the pre-training head).
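At its core, dense retrieval boils down to encoding the query and the passages into vectors and ranking passages by a similarity metric such as cosine similarity. The sketch below illustrates this idea with random vectors standing in for real embeddings; in the actual solution, these come from the GPT-J text embedding endpoint described later.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Random stand-ins for real embeddings: 1,000 passages x 4096 dimensions
rng = np.random.default_rng(0)
passage_embeddings = rng.normal(size=(1000, 4096))
query_embedding = rng.normal(size=4096)

# Score every passage against the query and keep the indices of the top 5
scores = np.array([cosine_similarity(query_embedding, p) for p in passage_embeddings])
top_k = np.argsort(scores)[::-1][:5]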

The middle architecture integrates elements from both the leftmost and rightmost methodologies, creating a balanced approach that combines a retriever utilizing DPR and a generator powered by an LLM. This method demonstrates enhanced resilience when faced with noisy or incomplete input data. Due to its ability to extract pertinent information from a corpus, the system can provide sensible responses even when confronted with imprecise or fragmented input data. This characteristic is especially advantageous in real-world scenarios where data quality can often be compromised or incomplete. Our article will primarily concentrate on this hybrid approach.

Solution Overview

High-level architecture

The solution for knowledge augmentation follows the high-level architecture illustrated in the figure above, comprising several components. Our focus is on building a closed-domain question-answering (QA) solution in an enterprise setting, where we can leverage custom domain knowledge in the form of documents stored in an S3 bucket. We will rely primarily on two AWS services for our solution: (i) Amazon SageMaker JumpStart and (ii) Amazon OpenSearch Service.

SageMaker JumpStart is a feature within Amazon SageMaker, a fully managed machine learning platform. JumpStart serves as a model hub (or model zoo), encapsulating a broad array of deep learning models for both text and vision. With over 500 models, its hub comprises both public and proprietary models from AWS partners like AI21, Cohere, Lyra, and Stability AI. It also hosts foundation models developed solely by Amazon, such as AlexaTM.

SageMaker JumpStart Foundation Model Hub

Amazon OpenSearch Service is a managed service for OpenSearch, a distributed, community-driven, Apache 2.0-licensed, open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. OpenSearch is powered by the Apache Lucene search library, and it supports a number of search and analytics capabilities such as k-nearest neighbors (k-NN) search, SQL, anomaly detection, Machine Learning Commons, full-text search, and more.

Step 1: Create SageMaker endpoints

The first step of our proposed architecture involves taking the corpus of documents in S3 and encoding them using a text embedding model to create contextualized embeddings from these documents. For our proposed solution, we utilize the GPT-J 6B Embedding FP16 model from SageMaker JumpStart. This model is hosted as a SageMaker endpoint, allowing for synchronous real-time inference.

This is a transformer-based model without a text generation head. It takes a text string as input and produces an embedding vector with 4096 dimensions. While a typical transformer-based model produces an embedding for every token in the input sequence, this model uses mean pooling (an element-wise average) to aggregate the token embeddings into a single sequence embedding.
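A minimal sketch of what mean pooling does, using random values as stand-ins for the model's actual token embeddings:

import numpy as np

# Hypothetical token embeddings for a 12-token input sequence (12 x 4096)
token_embeddings = np.random.default_rng(0).normal(size=(12, 4096))

# Mean pooling: element-wise average across tokens yields one 4096-dim sequence embedding
sequence_embedding = token_embeddings.mean(axis=0)
print(sequence_embedding.shape)  # (4096,)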

This model loads a half-precision (FP16) version of the original weights by specifying the dtype torch.float16. By using half precision, the model consumes less GPU memory and performs faster inference than the full-precision version. For more information, please see the Hugging Face documentation on FP16 optimization.
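For intuition, loading the same GPT-J weights in half precision with the Hugging Face transformers library looks roughly like the snippet below; this is an illustrative sketch, not the code used inside the JumpStart inference container.

import torch
from transformers import AutoModel, AutoTokenizer

# Load the FP16 weights branch to roughly halve GPU memory usage
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-j-6B')
model = AutoModel.from_pretrained('EleutherAI/gpt-j-6B',
                                  revision='float16',
                                  torch_dtype=torch.float16,
                                  low_cpu_mem_usage=True)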

Dataset:

We will be utilizing a legal dataset from OpenAIRE that comprises Supreme Court case documents from India and the United Kingdom (U.K.). The dataset contains approximately 8,000 legal judgment documents.

To facilitate the demonstration, we have provided a sub-sample of this dataset in the accompanying GitHub repository associated with this article. You can download the entire dataset from here. An example notebook that demonstrates how to create a SageMaker endpoint with the text embedding model can be found here. You can also pick other embedding models from the SageMaker JumpStart Foundation Model Hub as per your requirements. The endpoints can also be created in a no-code fashion with a single click via the SageMaker JumpStart UI inside SageMaker Studio, as shown below.

No-code interface for easily deploying LLMs within SageMaker JumpStart

A snapshot of the code sample to deploy the text embedding model as a SageMaker endpoint is shown below. SageMaker endpoints are a scalable, highly available, low-latency, customizable, and secure solution for deploying LLMs. They optimize inference performance, integrate with other AWS services, and provide built-in security features. With SageMaker endpoints, you can scale both vertically and horizontally to meet your SLAs.

...
from sagemaker.model import Model
from sagemaker.predictor import Predictor

MODEL_ID = 'huggingface-textembedding-gpt-j-6b-fp16'
MODEL_VERSION = '*'
INSTANCE_TYPE = 'ml.g5.2xlarge'
INSTANCE_COUNT = 1
IMAGE_SCOPE = 'inference'
MODEL_DATA_DOWNLOAD_TIMEOUT = 3600  # in seconds
CONTAINER_STARTUP_HEALTH_CHECK_TIMEOUT = 3600
CONTENT_TYPE = 'application/json'

# deploy_image_uri, model_uri, ROLE, endpoint_name and env are resolved in the
# elided setup above (see the linked notebook for the full code)
model = Model(image_uri=deploy_image_uri,
              model_data=model_uri,
              role=ROLE,
              predictor_cls=Predictor,
              name=endpoint_name,
              env=env)

model.deploy(initial_instance_count=INSTANCE_COUNT,
             instance_type=INSTANCE_TYPE,
             endpoint_name=endpoint_name,
             model_data_download_timeout=MODEL_DATA_DOWNLOAD_TIMEOUT,
             container_startup_health_check_timeout=CONTAINER_STARTUP_HEALTH_CHECK_TIMEOUT)
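As noted above, the endpoint can also scale horizontally. One way to automate this, shown as a sketch with illustrative capacity values and assuming the default AllTraffic variant name, is to register the endpoint variant with Application Auto Scaling and attach a target-tracking policy on invocations per instance.

import boto3

autoscaling = boto3.client('application-autoscaling')
resource_id = f'endpoint/{endpoint_name}/variant/AllTraffic'

# Allow the variant to scale between 1 and 4 instances
autoscaling.register_scalable_target(ServiceNamespace='sagemaker',
                                     ResourceId=resource_id,
                                     ScalableDimension='sagemaker:variant:DesiredInstanceCount',
                                     MinCapacity=1,
                                     MaxCapacity=4)

# Scale out when invocations per instance exceed the (illustrative) target value
autoscaling.put_scaling_policy(PolicyName='embedding-endpoint-scaling',
                               ServiceNamespace='sagemaker',
                               ResourceId=resource_id,
                               ScalableDimension='sagemaker:variant:DesiredInstanceCount',
                               PolicyType='TargetTrackingScaling',
                               TargetTrackingScalingPolicyConfiguration={
                                   'TargetValue': 100.0,
                                   'PredefinedMetricSpecification': {
                                       'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'}})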

We also want to select a large language model (LLM) for text generation and deploy it as an Amazon SageMaker endpoint. This endpoint will be used to generate relevant answers when prompted with the top matching context information. To accomplish this, we will utilize a 6 billion parameter LLM from AWS partner Cohere, which is available through the JumpStart Foundation Model Hub, as shown below.

Access Cohere text generation model from SageMaker JumpStart Foundation Model Hub

The Cohere text generation model can be deployed as an endpoint using either the user interface (UI) or programmatic access, similar to what we have seen previously. An example notebook containing the necessary code can be found via the provided link here. A snapshot of the code for creating the endpoint is shown below:

...
from sagemaker import ModelPackage

model_package_map = {
    'us-east-1': 'arn:aws:sagemaker:us-east-1:865070037744:model-package/cohere-gpt-medium-v1-4-825b877abfd53d7ca65fd7b4b262c421',
    'eu-west-1': 'arn:aws:sagemaker:eu-west-1:985815980388:model-package/cohere-gpt-medium-v1-4-825b877abfd53d7ca65fd7b4b262c421'
}
MODEL_PACKAGE_ARN = model_package_map[region]

# ROLE, session, region and MODEL_NAME are defined in the elided setup above
model = ModelPackage(role=ROLE,
                     model_package_arn=MODEL_PACKAGE_ARN,
                     sagemaker_session=session,
                     name=MODEL_NAME)

NUM_INSTANCES = 1
INSTANCE_TYPE = 'ml.g5.xlarge'
model.deploy(NUM_INSTANCES,
             INSTANCE_TYPE,
             endpoint_name=MODEL_NAME)

Text Segmentation

Legal documents tend to be lengthy, while text generation models have a limited context window, that is, a maximum number of tokens they can process, covering both the prompt and the completion. It is therefore essential to segment these documents into smaller passages. It is important to note here that the number of tokens does not directly correspond to the number of words in a chunk. Therefore, the appropriate tokenizer must be used to encode the text into tokens before segmenting the document into passages.

For this purpose, we recommend using tiktoken, a fast Byte Pair Encoding (BPE) tokenizer designed by OpenAI. From the library, we use the cl100k_base encoding to encode the documents and divide them into chunks of 768 tokens each. Below is a sample code for text segmentation. The complete notebook can be found here.

...
import tiktoken
from tqdm import tqdm

encoding = tiktoken.get_encoding('cl100k_base')
...
# CHUNK_SIZE (768 tokens), doc_iterator, DOC_DIR_PATH and the n_docs/n_passages
# counters are defined in the elided setup above
for doc_name, doc in tqdm(doc_iterator(DOC_DIR_PATH)):
    doc_id = doc_name.split('.')[0]
    tokens = encoding.encode(doc)
    chunks = []
    chunk_id = 1
    n_docs += 1
    for i in range(0, len(tokens), CHUNK_SIZE):
        chunk_tokens = tokens[i: i + CHUNK_SIZE]
        if not len(chunk_tokens) < 512:  # skip trailing chunks shorter than 512 tokens
            chunk = encoding.decode(chunk_tokens)
            with open(f'./data/chunks/{doc_id}_{chunk_id}', 'w') as f:
                f.write(chunk)
            chunk_id += 1
            n_passages += 1

Step 2: Create and index embeddings

Next, we want to take the text chunks and utilize the text embedding endpoint to generate contextualized embeddings for the individual passages. Once the passages are encoded, we will ingest these embeddings alongside the original passage and metadata into AWS OpenSearch for indexing.

Before creating our index, we need to set up a domain in OpenSearch Service. For the deployment type, select “Development and Testing,” and for the version, choose 7.10 (as shown below). The instance type you select will depend on the volume of documents you want to index. In our demo application, we have opted for the r6g.large.search instance type with three data nodes. This instance type requires EBS storage, so we have chosen a 10 GB volume size.

Ensure that you enable the “Auto-Tune” feature for the cluster. This will automatically make node-level adjustments that do not require downtime, such as tuning queues and cache sizes.

Create domain

To facilitate the demonstration, we will configure our development cluster for public access with precise control over user permissions. Additionally, we will establish a master user account with a unique username and password (as shown below) that we will utilize in our notebook at a later stage.

Network and Access control

Additionally, ensure that the domain access policy is set to only use fine-grained access control.

Access policy

After creating the domain and setting the necessary credentials, you can update the config.yaml file in your solution repository with the corresponding credentials, endpoint, and index name. Please see the example below for reference.

credentials:
  username: xxxxxxxx
  password: xxxxxxxx
domain:
  endpoint: https://semantic-search-xxxxxxxx.us-east-1.es.amazonaws.com
  index: legal-passages

Below is a sample code snippet demonstrating how to define the index mapping with a k-NN vector field and create the index accordingly. Once the index is created, passages (chunks) can be encoded using the SageMaker JumpStart GPT-J text embedding model and ingested into OpenSearch. For the complete notebook, please refer to this link.

...
TEXT_EMBEDDING_MODEL_ENDPOINT_NAME = 'huggingface-textembedding-gpt-j-6b-fp16-1680825746'
...
mapping = {
    'settings': {
        'index': {
            'knn': True  # Enable k-NN search for this index
        }
    },
    'mappings': {
        'properties': {
            'embedding': {  # k-NN vector field
                'type': 'knn_vector',
                'dimension': 4096  # Dimension of the vector
            },
            'passage_id': {
                'type': 'long'
            },
            'passage': {
                'type': 'text'
            },
            'doc_id': {
                'type': 'keyword'
            }
        }
    }
}
...
# URL points to the index endpoint on the domain (configured in the elided setup)
response = requests.put(URL, auth=HTTPBasicAuth(es_username, es_password), json=mapping)
...
# i is a running document counter (maintained in the elided code)
for chunk_name, chunk in tqdm(chunk_iterator(CHUNKS_DIR_PATH)):
    doc_id, chunk_id = chunk_name.split('_')
    payload = {'text_inputs': [chunk]}
    payload = json.dumps(payload).encode('utf-8')

    # Encode the passage into a 4096-dim embedding using the GPT-J endpoint
    response = sagemaker_client.invoke_endpoint(EndpointName=TEXT_EMBEDDING_MODEL_ENDPOINT_NAME,
                                                ContentType='application/json',
                                                Body=payload)

    model_predictions = json.loads(response['Body'].read())
    embedding = model_predictions['embedding'][0]

    # Ingest the passage, its metadata, and the embedding into OpenSearch
    document = {
        'doc_id': doc_id,
        'passage_id': chunk_id,
        'passage': chunk,
        'embedding': embedding}

    response = requests.post(f'{URL}/_doc/{i}', auth=HTTPBasicAuth(es_username, es_password), json=document)

For our experiment, it takes approximately 4.5 hours to complete the entire process of encoding, ingesting, and indexing ~55,000 passages. At the end of the indexing process, the store.size was measured to be around 10.6 GB, while the pri.store.size was around 5.3 GB. These metrics are useful for monitoring and capacity planning purposes, as they indicate the amount of storage space used by the full index (including replicas) and by its primary shards, respectively.

Stats after indexing passages
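These figures can be pulled from the cluster at any time via the _cat/indices API; for example, using the credentials configured earlier and an assumed es_host variable holding the domain endpoint:

import requests
from requests.auth import HTTPBasicAuth

# es_host is the OpenSearch domain endpoint from config.yaml
response = requests.get(f'{es_host}/_cat/indices/legal-passages?v&h=index,docs.count,store.size,pri.store.size',
                        auth=HTTPBasicAuth(es_username, es_password))
print(response.text)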

To improve indexing time in OpenSearch, consider using bulk indexing to reduce the overhead of opening and closing connections for each document. You can also optimize the mapping to disable indexing on non-searchable fields, increase hardware resources, use faster hardware such as SSDs, and follow shard sizing and allocation best practices. Properly sizing and distributing shards across multiple nodes helps ensure that each shard is around 20–50 GB in size and allows multiple nodes to process indexing requests simultaneously. Implementing these tips can significantly improve indexing performance.
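As an illustration, the per-document POSTs in the ingestion loop above could be replaced with calls to the _bulk API, batching a few hundred documents per request. The helper below is a sketch; the batch size, es_host variable, and docs structure are illustrative and not taken from the accompanying notebook.

import json
import requests
from requests.auth import HTTPBasicAuth

def bulk_index(docs, batch_size=200):
    # docs: list of dicts with doc_id, passage_id, passage, and embedding fields
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            lines.append(json.dumps({'index': {'_index': 'legal-passages'}}))
            lines.append(json.dumps(doc))
        body = '\n'.join(lines) + '\n'  # NDJSON payload must end with a newline
        requests.post(f'{es_host}/_bulk',
                      data=body,
                      headers={'Content-Type': 'application/x-ndjson'},
                      auth=HTTPBasicAuth(es_username, es_password))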

We can also scale the text embedding SageMaker endpoint horizontally by adding more instances so as to increase the throughput of the encoding process. Distributing the workload across multiple machines allows for the simultaneous processing of more data, leading to faster encoding times.
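For example, assuming the default AllTraffic variant name, the desired instance count of the running endpoint can be increased with a single control-plane call; the target count below is illustrative.

import boto3

# Control-plane client (distinct from the runtime client used for invocations)
sm_client = boto3.client('sagemaker')
sm_client.update_endpoint_weights_and_capacities(
    EndpointName=TEXT_EMBEDDING_MODEL_ENDPOINT_NAME,
    DesiredWeightsAndCapacities=[{'VariantName': 'AllTraffic',
                                  'DesiredInstanceCount': 3}])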

Step 3, 4, 5: Perform vector search and find top-k matches

Once the index has been created, retrieving the top-k matching passages for a given question becomes a straightforward process. First, the question is passed through the same text embedding endpoint to encode it into a 1 x 4096 vector. This embedding is then used to query the OpenSearch cluster against the passage embeddings within the index. The similarity scores for the top-k matches are computed and returned as a result. In the scientific literature, these steps are referred to as Dense Passage Retrieval (DPR). The code sample illustrating this process is provided below.

...
TEXT_EMBEDDING_MODEL_ENDPOINT_NAME = 'huggingface-textembedding-gpt-j-6b-fp16-1680825746'
TEXT_GENERATION_MODEL_ENDPOINT_NAME = 'cohere-medium-1680827379'
...
# Encode the question into a 4096-dim embedding using the text embedding endpoint
prompt = 'What is the definition of crime of battery?'
payload = {'text_inputs': [prompt]}
payload = json.dumps(payload).encode('utf-8')
response = sagemaker_client.invoke_endpoint(EndpointName=TEXT_EMBEDDING_MODEL_ENDPOINT_NAME,
                                            ContentType='application/json',
                                            Body=payload)
body = json.loads(response['Body'].read())
embedding = body['embedding'][0]
...
# k-NN query against the passage embeddings; URL points to the index search endpoint
K = 5
query = {
    'size': K,
    'query': {
        'knn': {
            'embedding': {
                'vector': embedding,
                'k': K
            }
        }
    }
}
response = requests.post(URL, auth=HTTPBasicAuth(es_username, es_password), json=query)
response_json = response.json()
hits = response_json['hits']['hits']

Step 6: Augment text generation with top matched contexts

After retrieving the top-k matches, we can extract context from the hits collection and use it to engineer prompts for abstractive question answering, as demonstrated in the provided code sample. Once the prompt has been engineered for question answering, we can invoke the second endpoint, which hosts the Cohere text generation model, to obtain the generated answer.

To make the response more aligned with our custom domain, we can append reference metadata such as the original legal document, the originating passage, and the similarity score to the generated answer. This approach enhances the overall quality of the response, making it more relevant and informative for the end user. Steps 3 to 6 are covered in this notebook.

...
for hit in hits:
    score = hit['_score']
    passage = hit['_source']['passage']
    doc_id = hit['_source']['doc_id']
    passage_id = hit['_source']['passage_id']
    # Engineer the QA prompt by prepending the retrieved passage as context
    qa_prompt = f'Context={passage}\nQuestion={prompt}\nAnswer='

    # Generate an answer from the engineered prompt using the Cohere endpoint
    response = cohere_client.generate(prompt=qa_prompt,
                                      max_tokens=512,
                                      temperature=0.25,
                                      return_likelihoods='GENERATION')

    answer = response.generations[0].text.strip().replace('\n', '')
    logger.info(f'Answer:\n{answer}')
    logger.info(f'Reference:\nDocument = {doc_id} | Passage = {passage_id} | Score = {score}')

Alternative Architecture — Using only SageMaker

Alternative Architecture — Using only SageMaker

If you have limited documents to index and don’t want to use an external search engine like OpenSearch, you can use Amazon SageMaker to create and host the indices in-memory behind a real-time inference endpoint. In this alternative architecture, you index the embeddings of your corpus using the SageMaker built-in K-Nearest Neighbors (KNN) algorithm. The KNN algorithm runs a training job that builds the embedding index, using Faiss under the hood.

After creating the embedding index with SageMaker training and saving it to S3, it can be hosted in-memory on a SageMaker endpoint. The endpoint is responsible for taking the query embedding as input and returning the top K nearest indexes of the documents. The KNN training job uses a feature matrix represented by an N by P matrix, where N is the number of documents in the knowledge corpus, P is the embedding dimension (4096), and each row represents the embedding of a document. The labels are ordinal integers starting from 0. During inference, the endpoint retrieves the labels of the top K nearest documents with respect to the query and uses them as indexes to retrieve the corresponding textual documents.
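A rough sketch of what this setup might look like via the SageMaker Python SDK is shown below; the instance types, k value, and the randomly generated feature matrix are illustrative placeholders (see the linked notebook for the actual configuration).

import numpy as np
from sagemaker import KNN

# Hypothetical placeholder for the N x 4096 matrix of passage embeddings
train_features = np.random.default_rng(0).normal(size=(1000, 4096)).astype('float32')
labels = np.arange(train_features.shape[0], dtype='float32')  # ordinal ids 0..N-1

knn = KNN(role=ROLE,                      # the SageMaker execution role used earlier
          instance_count=1,
          instance_type='ml.m5.2xlarge',
          k=5,                            # number of neighbours returned at inference
          sample_size=train_features.shape[0],
          predictor_type='classifier',
          index_type='faiss.Flat')        # exact (brute-force) Faiss index

knn.fit(knn.record_set(train_features, labels))
predictor = knn.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')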

This alternative architecture differs from the original one in that it involves an additional training step to create the embedding index. Furthermore, an extra endpoint is required to host the embedding index in addition to the existing endpoints required to host the text embedding and text generation models. The notebook demonstrating this setup can be found here.

Results:

Now that we have seen all the steps in detail, we can put together a web application as a working demo. We will be using Streamlit for this, and the application will run on your localhost. Instructions for setting up the application and the supporting script can be found here.

Below is a screenshot of the running application where we ask the system a question. The question is first encoded into an embedding using the text embedding endpoint, and this embedding vector is then searched against the embedding index of our passages to retrieve the top three matching passages. These passages are combined with the question and passed on to the LLM endpoint to generate answers. The reference to the source legal document, the passage, and the similarity score computed during vector search are also displayed. The default space type used by OpenSearch for k-NN search is L2, which ranks results by the Euclidean distance between the query vector and the indexed vectors; OpenSearch converts this distance into a score, so a higher score indicates a more relevant document. Other similarity functions, such as inner product or cosine similarity, can also be used by configuring a different space type.

Knowledge augmented Q&A — Streamlit App
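For instance, to rank passages by cosine similarity instead of L2, the knn_vector field can be declared with a different space type when the index is created. Below is a sketch of such a mapping; the field and index settings mirror the ones shown earlier, and only the method block is new.

# Alternative index mapping using cosine similarity as the k-NN space type
mapping_cosine = {
    'settings': {'index': {'knn': True}},
    'mappings': {
        'properties': {
            'embedding': {
                'type': 'knn_vector',
                'dimension': 4096,
                'method': {
                    'name': 'hnsw',               # approximate k-NN via HNSW graphs
                    'space_type': 'cosinesimil',  # rank by cosine similarity instead of L2
                    'engine': 'nmslib'
                }
            }
        }
    }
}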

Final Remarks

In conclusion, retrieval augmentation has emerged as a significant area of research for improving the performance of LLMs in various domains. Related methodologies such as RETRO, Atlas, In-Context RALM, RAG, and REALM can further enhance the capability of LLMs by grounding them in external knowledge sources. The present article focused on building an end-to-end pipeline for augmenting LLMs with external knowledge sources in a closed-domain setting, specifically the legal domain. However, the methodology can be extended to any domain-specific corpus, providing aligned responses through the incorporation of custom domain knowledge. With this approach, LLMs can become more factually accurate and better suited to their intended applications.

Thank you for taking the time to read and engage with this article. Your support in the form of following me and clapping for the article is highly valued and appreciated. If you have any queries or doubts about the content of this article or the shared notebooks, please do not hesitate to reach out to me via email at arunprsh@amazon.com or shankar.arunp@gmail.com. You can also connect with me on https://www.linkedin.com/in/arunprasath-shankar/

I welcome any feedback or suggestions you may have. If you are an individual passionate about ML at scale and NLP/NLU and are interested in collaboration, I would be delighted to connect with you. Additionally, if you are an individual, part of a startup, or an enterprise looking to gain insights on Amazon SageMaker and its applications in NLP/ML, I would be happy to assist you. Do not hesitate to reach out to me.

