A Simple RAG Developer's Guide

Gidon Ucko
CyberArk Engineering

As a software developer diving into a new project, I was lucky enough to be tasked with researching the field of AI chatbots using Retrieval-Augmented Generation (RAG). Initially, I knew a bit about Large Language Models (LLMs), but RAG was entirely foreign to me. This article is your quick and easy guide to implementing RAG alongside LLM APIs for a simple proof-of-concept using Python.

Before RAG

When asking the LLM a specific question about Bangkok Airways' financial statements, notice that it cannot provide an answer. It responds that it can't provide the requested data and recommends searching company records directly.

This is not the behavior we want. We want the LLM to be a master of the company’s financial records and capable of giving us detailed answers about them.

Question:
What was Bangkok Airways financial statements for the year of 2017?

Answer:
Unfortunately, I do not have access to Bangkok Airways' specific financial statements for the year 2017.
As an AI assistant without direct access to a company's internal financial records,
I cannot provide the detailed financial data you are requesting.

After RAG

This is where RAG comes in handy. Before diving into the theory and the developer's guide, let's note the difference between the before and after LLM responses. Though we may still need a lot of fine-tuning to get a near-perfect response, the potential of RAG is immediately apparent here.

Question:
What was Bangkok Airways financial statements for the year of 2017?

Answer:
Based on the provided context, Bangkok Airways Public Company Limited reported the following financial statements for the year ended 31 December 2017:
From the Statement of Comprehensive Income:
- Consolidated revenue of 24,931 million baht
- Consolidated profit before tax of 1,318.4 million baht
- Consolidated net profit of 846.4 million baht
- Separate (company only) profit before tax of 1,107.3 million baht
- Separate net profit of 802.7 million baht
From the Statement of Financial Position:
- Consolidated total assets of 57,941.6 million baht
- Separate (company only) total assets of 53,690.5 million baht
The context provides details from Bangkok Airways' audited consolidated and separate financial statements for the year 2017, including the income statement, statement of financial position, cash flow statement and related notes.

TLDR: Code First

This Python code implements the basic concepts of RAG and LLM API usage. For those of you who have an AWS account and want a quick POC, just copy, paste, and modify the variables below and give it a go…

import boto3
import json
from pathlib import Path

from langchain.prompts import PromptTemplate
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import TokenTextSplitter
from pypdf import PdfReader

PDF_FOLDER_PATH = ''
DB_NAME = ''


def convert_pdf_to_documents(folder_path: str):
    # Extract the text from every PDF in the folder and split it into overlapping token chunks
    documents = []
    pdf_search = Path(folder_path).glob("*.pdf")
    pdf_files = [str(file.absolute()) for file in pdf_search]
    for pdf_file in pdf_files:
        reader = PdfReader(pdf_file)
        text = ""
        for page in reader.pages:
            text += page.extract_text() + "\n"

        text_splitter = TokenTextSplitter.from_tiktoken_encoder(
            encoding_name="cl100k_base", chunk_size=200, chunk_overlap=30
        )

        chunks = text_splitter.create_documents(texts=[text])
        documents.extend(chunks)

    return documents


def get_embeddings():
    bedrock_embeddings = BedrockEmbeddings(model_id="cohere.embed-english-v3", region_name="us-east-1")
    return bedrock_embeddings


def create_db():
    # Embed the document chunks and persist the FAISS index locally
    docs = convert_pdf_to_documents(PDF_FOLDER_PATH)
    embeddings_model = get_embeddings()
    db = FAISS.from_documents(docs, embeddings_model)
    db.save_local(DB_NAME)


def read_db():
    # Reload the local FAISS index using the same embedding model
    embeddings_model = get_embeddings()
    db = FAISS.load_local(DB_NAME, embeddings_model)
    return db


def invoke_claude_3_with_text(prompt: str):
    # Initialize the Amazon Bedrock runtime client
    client = boto3.client(
        service_name="bedrock-runtime", region_name="us-east-1"
    )

    # Invoke Claude 3 with the text prompt
    model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

    try:
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1024,
                    "messages": [
                        {
                            "role": "user",
                            "content": [{"type": "text", "text": prompt}],
                        }
                    ],
                }
            ),
        )

        # Process and print the response
        result = json.loads(response.get("body").read())
        output_list = result.get("content", [])

        print(f"- The model returned {len(output_list)} response(s):")
        for output in output_list:
            print(output["text"])

        return result

    except Exception as err:
        print(f"Couldn't invoke Claude 3 Sonnet. Here's why: {err}")
        raise


def main():
    if not Path(DB_NAME).exists():
        create_db()

    db = read_db()

    template = """
    CONTEXT:
    {context}
    QUERY:
    {human_input}
    Use the provided context to answer the provided user query.
    Only use the provided context to answer the query.
    If you do not know the answer, respond with "I don't know".
    """

    prompt = PromptTemplate(input_variables=["context", "human_input"], template=template)

    print('-' * 60)
    while True:
        question = input("Ask a question: ")
        if question == 'quit':
            break

        context = [doc.page_content for doc in db.similarity_search(question, k=10)]
        invoke_claude_3_with_text(prompt.format(context=context, human_input=question))
        print('-' * 60)


if __name__ == "__main__":
    main()

For Those of You Who Like Reading…

In this article we explore the world of Retrieval-Augmented Generation (RAG). But before we dive in, let's go over LLMs, RAG, text extraction and splitting, embeddings, vector databases, and AWS Bedrock.

What are LLMs?

LLMs are pre-trained language models with millions (or billions) of parameters.

They learn patterns, context, and semantics from vast amounts of text data. Examples of LLM models are GPT (OpenAI), Claude (Anthropic), and Gemini (Google).

Use cases for LLMs include:

  • Text generation (e.g., chatbots, content creation).
  • Sentiment analysis, translation, question answering.
  • Understanding context and context-based responses.

What Is RAG?

RAG is an AI framework for retrieving facts from an external knowledge base to ground LLMs on the most accurate and up-to-date information.

Simply put: add your external information to the LLM by sending it along with your input and telling the LLM to answer your query accordingly (a sketch of this pattern follows the lists below).

RAG combines two steps:

  • Retrieval: Retrieve relevant information (passages, documents).
  • Generation: Enhance LLM-generated responses using retrieved content.

Why RAG Matters:

  • LLMs alone may lack context or external knowledge.
  • RAG leverages existing information for more accurate and context-aware responses.
  • It's a fast way to teach the LLM new information without retraining.
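
To make this concrete, here is a minimal sketch of the pattern, assuming the retrieved chunks are already available as plain strings (the build_prompt helper is illustrative, not a library function):

def build_prompt(retrieved_chunks: list[str], user_question: str) -> str:
    # Paste the retrieved text into the prompt, then instruct the model to rely on it
    context = "\n\n".join(retrieved_chunks)
    return (
        "CONTEXT:\n"
        f"{context}\n\n"
        f"QUERY:\n{user_question}\n\n"
        "Use only the provided context to answer the query."
    )

The TL;DR script above does exactly this, with the retrieval step backed by a FAISS vector database.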

PDF Text Extraction and Splitting

We start by getting the data we want to enhance the LLM with. In this case, that means extracting the text from a large PDF file.

This function is responsible for extracting the text from the file and splitting the text into chunks.

There are many strategies for chunking text, and fine-tuning the process can take time and a lot of research.

In the example below, notice the chunk_size and chunk_overlap parameters. The overlap ensures that relevant text near a chunk boundary is not lost, since the end of one chunk is repeated at the start of the next.

def convert_pdf_to_documents(folder_path: str):
    documents = []
    pdf_search = Path(folder_path).glob("*.pdf")
    pdf_files = [str(file.absolute()) for file in pdf_search]
    for pdf_file in pdf_files:
        reader = PdfReader(pdf_file)
        text = ""
        for page in reader.pages:
            text += page.extract_text() + "\n"

        text_splitter = TokenTextSplitter.from_tiktoken_encoder(
            encoding_name="cl100k_base", chunk_size=200, chunk_overlap=30
        )

        chunks = text_splitter.create_documents(texts=[text])
        documents.extend(chunks)

    return documents
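
If you want to see the overlap in action, split some sample text and compare the tail of one chunk with the head of the next (a small sketch; the sample string and parameters are made up):

splitter = TokenTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=50, chunk_overlap=10
)
docs = splitter.create_documents(texts=["Bangkok Airways reported revenue of ... " * 200])
# The last tokens of one chunk reappear at the start of the next chunk
print(docs[0].page_content[-60:])
print(docs[1].page_content[:60])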

A Word About Embeddings

Embeddings are high-dimensional vectors that capture the semantic meaning of words, sentences, or even entire documents. The process of converting text into embeddings allows LLMs to perform various NLP tasks.

To create, populate, and then query our vector database, we need to embed our text, meaning convert it into vectors with the embedding model we are using.

In this example we are using Amazon Bedrock, an AWS service that gives us access to the multiple LLMs and embedding models it currently supports.

bedrock_embeddings = BedrockEmbeddings(model_id="cohere.embed-english-v3", region_name="us-east-1")
vector = bedrock_embeddings.embed_query("thanks for reading this")

# len(vector) == 1024
# [0.015563965, -0.012359619, -0.0011510849, ...]

FAISS

Faiss is a library for efficient similarity search and clustering of vectors.

Given a set of vectors, we can index them using Faiss. Then using another vector (the query vector), we search for the most similar vectors within the index.

Here is the code that is responsible for creating and loading Faiss. All the data is stored locally in the project, making this a very simple and easy vector database to start with.

def get_embeddings():
    bedrock_embeddings = BedrockEmbeddings(model_id="cohere.embed-english-v3", region_name="us-east-1")
    return bedrock_embeddings


def create_db():
    docs = convert_pdf_to_documents(PDF_FOLDER_PATH)
    embeddings_model = get_embeddings()
    db = FAISS.from_documents(docs, embeddings_model)
    db.save_local(DB_NAME)


def read_db():
    embeddings_model = get_embeddings()
    db = FAISS.load_local(DB_NAME, embeddings_model)
    return db
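
A minimal usage sketch, assuming PDF_FOLDER_PATH and DB_NAME have been filled in (note that newer LangChain versions may require passing allow_dangerous_deserialization=True to FAISS.load_local):

if not Path(DB_NAME).exists():
    create_db()        # embed the PDFs and write the FAISS index to disk
db = read_db()         # reload the index with the same embedding model
docs = db.similarity_search("total assets in 2017", k=3)
print(docs[0].page_content)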

How to Retrieve Similar Documents

In similarity searching, we often have a query vector that we want to compare against a database of vectors. The goal is to retrieve a set of database vectors that are similar to the query vector.

Similarity between vectors is commonly measured with the dot product (or its normalized form, cosine similarity):

  1. Imagine you have two vectors (think of them as arrows) in space. These vectors have both magnitude (length) and direction.
  2. The dot product is a way to combine these vectors and measure how much they align with each other (a small sketch follows).
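
Here is the idea in a few lines of plain Python (purely illustrative; FAISS does this efficiently across the whole index):

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

query_vec = [0.2, 0.7, 0.1]
doc_vecs = [[0.1, 0.8, 0.0], [0.9, 0.0, 0.3]]

# A higher score means the document vector points in a more similar direction
scores = [dot(query_vec, vec) for vec in doc_vecs]
print(scores)  # roughly [0.58, 0.21]; the first document aligns better with the query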

In the code, k=10 means that the 10 most similar vectors (chunks of text) are returned. After fine-tuning, however, this might not be the best value for your own use case.

context = [doc.page_content for doc in db.similarity_search(question, k=10)]

Interacting with Claude Using Amazon Bedrock

Amazon Bedrock is a fully managed service that makes it easy to use foundation models from Amazon and third-party providers. We use boto3 with Bedrock to invoke a chosen model (e.g., Claude) and print the result. This function was adapted from the AWS examples.

def invoke_claude_3_with_text(prompt: str):
    # Initialize the Amazon Bedrock runtime client
    client = boto3.client(
        service_name="bedrock-runtime", region_name="us-east-1"
    )

    # Invoke Claude 3 with the text prompt
    model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

    try:
        response = client.invoke_model(
            modelId=model_id,
            body=json.dumps(
                {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 1024,
                    "messages": [
                        {
                            "role": "user",
                            "content": [{"type": "text", "text": prompt}],
                        }
                    ],
                }
            ),
        )

        # Process and print the response
        result = json.loads(response.get("body").read())
        output_list = result.get("content", [])

        print(f"- The model returned {len(output_list)} response(s):")
        for output in output_list:
            print(output["text"])

        return result

    except Exception as err:
        print(f"Couldn't invoke Claude 3 Sonnet. Here's why: {err}")
        raise


response = invoke_claude_3_with_text("my query")
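
The returned dictionary is the parsed Bedrock response body. If you want the generated text itself rather than the printed output, it lives inside the content list (a small sketch):

answer = response["content"][0]["text"] if response.get("content") else ""
print(answer)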

A 5-Step Method for Implementing RAG

Since the title of this article contains the word SIMPLE, let’s sum this up simply:

1. Choose a relevant data source according to your use case. This could be a collection of text documents, articles, or any relevant content. Embed and save this data to a vector database.

2. Each piece of information is represented as a numerical vector (an embedding).

3. When a user inputs a question, query the vector database for similar vectors. These vectors serve as context.

4. Armed with the context, pass it along with the user’s question to the LLM.

5. The LLM generates a response based on the provided context.
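
Mapped onto the code above, the five steps look roughly like this (a sketch, assuming PDF_FOLDER_PATH and DB_NAME are set; the example question is illustrative):

# Steps 1-2: extract, chunk, embed, and store the documents (run once)
if not Path(DB_NAME).exists():
    create_db()
db = read_db()

# Step 3: retrieve the chunks most similar to the user's question
question = "What was Bangkok Airways' net profit in 2017?"
chunks = [doc.page_content for doc in db.similarity_search(question, k=10)]

# Steps 4-5: send the context along with the question to the LLM and let it answer
invoke_claude_3_with_text(
    f"CONTEXT:\n{chunks}\n\nQUERY:\n{question}\n\nUse only the provided context to answer the query."
)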

Recommended research for next steps:

PDF Reference — bangkokAirways
