Basics of RAG (Retrieval-Augmented Generation): Create Your Own Custom GPT Prototype

Victor Moreno
7 min read · Nov 21, 2023


[Image: A futuristic scene of what absurdly optimistic people might imagine an AI-powered future might be]

Introduction

You might question why you’d build your own custom GPT on AWS when OpenAI now offers the same feature. The answer is relatively straightforward:

  • Firstly, the recent turmoil at OpenAI illustrates that no company is entirely stable, and it might not be wise to put all your eggs in the OpenAI basket.
  • Secondly, security is an issue. OpenAI custom GPTs are either completely private or completely public. If you want real user management, or you want to operate as a platform, creating a distinct custom GPT for each client isn’t feasible. You’d rather have clients create their own on your platform.
  • Lastly, the key reason is the freedom to inject more expertise when you control your own custom GPT. For instance, if you wish to build a custom GPT as a marketing coach, you can instill your knowledge into training the LLM to be an effective marketing consultant. Your value-add would be your expertise at training an LLM into being a better marketer. Customers would provide you with their audience and product data, and your GPT would give them personalized responses based on their context.

You can’t solely be a wrapper for OpenAI. If you only develop a custom GPT with OpenAI’s platform and some cookie-cutter content, it’ll typically perform well only when everyone is asking similar queries. For instance, a diet coach would probably work pretty well: there’s only a finite number of known strategies to cut calories and increase caloric expenditure. But in other areas, like marketing, answers are too context-dependent for generic advice. What works in B2B marketing doesn’t work in marketing a product for teens, what works for teens doesn’t work for boomers, what works for boomers doesn’t work for marketing supplements, and so on. You have the chance to capture more value by learning how to inject that knowledge into an LLM. You become the LLM’s context-specific meta-intelligence.

Prototype Version

The prototype version is super simple. The end goal of this system is as follows:

Prerequisite (Step 0): Have a DB that contains your knowledge base, broken up into chunks and converted into embeddings.

  1. User gives you some prompt/question
  2. You turn the prompt into embeddings
  3. Go to your DB and find similar embeddings. Then get the original text sentences associated with the similar embeddings. These will serve as the context you inject into your prompt.
  4. Inject the context into your prompt, and ask ChatGPT to answer the user’s original prompt

Of course, before we can do that, we need that DB from step 0, where we store the embeddings with the original text, indexed for fast similarity search.
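
Before diving in, here’s the whole flow compressed into a sketch. The function names (embed, find_similar_chunks, ask_gpt) are placeholders I made up for illustration; the real implementations appear in the sections below:

def answer_question(question):
    question_embeddings = embed(question)               # step 2: turn the prompt into a vector
    context = find_similar_chunks(question_embeddings)  # step 3: kNN search against your knowledge base
    prompt = f"Context: {context}\n\nQuestion: {question}"
    return ask_gpt(prompt)                              # step 4: completion with injected context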

BTW, embeddings are vector representations of the concepts/words behind sentences. The point of these vector representations is that you can quantify the similarity between sentences. It works by way of some dark magic.
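
For intuition, here’s a tiny demonstration of that “quantify similarity” part. This is just a sketch using numpy for cosine similarity and the same ada-002 embeddings the loader below uses; the embed wrapper is my own illustrative helper:

import numpy as np
import openai

def embed(text):
    # ask OpenAI for the 1536-dimensional embedding of a piece of text
    return openai.Embedding.create(input=text, model="text-embedding-ada-002")['data'][0]['embedding']

def cosine_similarity(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# related sentences score noticeably higher than unrelated ones
print(cosine_similarity(embed("Anger is a brief madness"), embed("Rage makes men lose their minds")))
print(cosine_similarity(embed("Anger is a brief madness"), embed("The quarterly report is due on Friday")))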

To walk us through the prototype, I’ll be using Seneca’s Morals of a Happy Life to train my custom GPT. You can use whatever you want. You can concatenate multiple books, YouTube video transcripts, podcast transcripts, or whatever you want into one text file, which becomes your custom GPT’s knowledge base.

Creating an OpenSearch Cluster

Basically just follow the instructions here: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html

Here are all my cluster settings in case you have any doubts:

{
  "DomainStatus": {
    "ARN": "xx",
    "AccessPolicies": "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\",\"Principal\":{\"AWS\":\"*\"},\"Action\":\"es:*\",\"Resource\":\"arn:aws:es:_______:_____:domain/your_cluster_name/*\"}]}",
    "AdvancedOptions": {
      "indices.fielddata.cache.size": "20",
      "indices.query.bool.max_clause_count": "1024",
      "override_main_response_version": "false",
      "rest.action.multi.allow_explicit_index": "true"
    },
    "AdvancedSecurityOptions": {
      "AnonymousAuthDisableDate": null,
      "AnonymousAuthEnabled": false,
      "Enabled": true,
      "InternalUserDatabaseEnabled": false,
      "SAMLOptions": null
    },
    "AutoTuneOptions": {
      "ErrorMessage": null,
      "State": "ENABLED",
      "UseOffPeakWindow": false
    },
    "ChangeProgressDetails": null,
    "ClusterConfig": {
      "ColdStorageOptions": {
        "Enabled": false
      },
      "DedicatedMasterCount": null,
      "DedicatedMasterEnabled": false,
      "DedicatedMasterType": null,
      "InstanceCount": 1,
      "InstanceType": "m6g.large.search",
      "MultiAZWithStandbyEnabled": false,
      "WarmCount": null,
      "WarmEnabled": false,
      "WarmStorage": null,
      "WarmType": null,
      "ZoneAwarenessConfig": null,
      "ZoneAwarenessEnabled": false
    },
    "CognitoOptions": {
      "Enabled": false,
      "IdentityPoolId": null,
      "RoleArn": null,
      "UserPoolId": null
    },
    "Created": true,
    "Deleted": false,
    "DomainEndpointOptions": {
      "CustomEndpoint": null,
      "CustomEndpointCertificateArn": null,
      "CustomEndpointEnabled": false,
      "EnforceHTTPS": true,
      "TLSSecurityPolicy": "Policy-Min-TLS-1-0-2019-07"
    },
    "DomainId": "_________/your_cluster_name",
    "DomainName": "your_cluster_name",
    "EBSOptions": {
      "EBSEnabled": true,
      "Iops": 3000,
      "Throughput": 125,
      "VolumeSize": 10,
      "VolumeType": "gp3"
    },
    "EncryptionAtRestOptions": {
      "Enabled": true,
      "KmsKeyId": "________"
    },
    "Endpoint": "__________.es.amazonaws.com",
    "Endpoints": null,
    "EngineVersion": "OpenSearch_2.5",
    "IPAddressType": "ipv4",
    "LogPublishingOptions": null,
    "NodeToNodeEncryptionOptions": {
      "Enabled": true
    },
    "OffPeakWindowOptions": {
      "Enabled": true,
      "OffPeakWindow": {
        "WindowStartTime": {
          "Hours": 0,
          "Minutes": 0
        }
      }
    },
    "Processing": false,
    "ServiceSoftwareOptions": {
      "AutomatedUpdateDate": 0.0,
      "Cancellable": false,
      "CurrentVersion": "OpenSearch_2_5_R20230308-P4",
      "Description": "A newer release OpenSearch_2_5_R20230928-P2 is available.",
      "NewVersion": "OpenSearch_2_5_R20230928-P2",
      "OptionalDeployment": true,
      "UpdateAvailable": true,
      "UpdateStatus": "ELIGIBLE"
    },
    "SnapshotOptions": {
      "AutomatedSnapshotStartHour": null
    },
    "SoftwareUpdateOptions": {
      "AutoSoftwareUpdateEnabled": false
    },
    "UpgradeProcessing": false,
    "VPCOptions": null
  }
}
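
This JSON should match the output of the AWS CLI’s describe-domain call, so once your cluster is up you can dump your own settings and diff them against mine:

aws opensearch describe-domain --domain-name your_cluster_name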

Now that you have a cluster, you can create multiple OpenSearch indexes in it. You probably want a larger cluster than the one I created; mine is super small.

Loading the Data Into the Cluster

The code below has its entry point at load_document_main. All it does is:

  • Create an OpenSearch index in the cluster you set up above, if none exists
  • Read a text file (this is your knowledge base) and break it up into chunks
  • For each chunk: convert it into an embedding and save it to the DB

import json
import os
import re
from time import sleep

import openai
import requests
from langchain.text_splitter import RecursiveCharacterTextSplitter

from helpers import get_aws_auth

openai.api_key = os.environ.get('OPENAI_API_KEY')

def load_file_and_split_text(filename):
    with open(filename) as file:
        sanitized_text = file.read()
    # some sanitization code specific to my use case, you might not need this
    sanitized_text = re.sub(r'\n{2,}', '####', sanitized_text)
    sanitized_text = re.sub(r'\n{1}', ' ', sanitized_text)
    sanitized_text = re.sub(r'####', '\n\n', sanitized_text)
    sanitized_text = re.sub(r' {1,}', ' ', sanitized_text)
    sanitized_text = re.sub(r'“', '"', sanitized_text)
    sanitized_text = re.sub(r'”', '"', sanitized_text)
    sanitized_text = re.sub(r'[^a-zA-Z0-9?.,;\-!\s\n\'"]', '', sanitized_text)

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2000,
        chunk_overlap=500,
        length_function=len,
        add_start_index=True,
    )
    text_segments = text_splitter.create_documents([sanitized_text])
    text_segments = [t.page_content for t in text_segments]
    return text_segments

def get_embeddings_for_segments_and_store(text_segments, index_name, opensearch_endpoint_url, aws_region, start_idx=0):
    """
    Each of the text segments comes from your text splitter. In here, we get the embeddings
    from OpenAI and save the docs to our DB.
    """
    for text_segment in text_segments:
        embeddings = get_embeddings_with_retry(text_segment)
        if embeddings:
            doc = {'text': text_segment, 'embeddings': embeddings}
            load_doc_to_open_search(doc, index_name, opensearch_endpoint_url, aws_region)

def get_embeddings_with_retry(text):
    # this is the same function the driver program imports from helpers
    ctr = 0
    while ctr < 4:
        ctr += 1
        try:
            embedding = openai.Embedding.create(
                input=text,
                model="text-embedding-ada-002"
            )['data'][0]['embedding']
            return embedding
        except Exception:
            sleep(1)
    print("Unsuccessful at getting embeddings for:", text)
    return None

def load_doc_to_open_search(text_to_embeddings_dict, index_name, opensearch_endpoint_url, region):
    auth = get_aws_auth(region)
    endpoint_url = f'{opensearch_endpoint_url}/{index_name}/_doc'
    headers = {'Content-Type': 'application/json'}
    json_doc = json.dumps(text_to_embeddings_dict)
    put_doc_response = requests.post(endpoint_url, headers=headers, data=json_doc, auth=auth)
    print(f"loaded doc to OpenSearch with status: {put_doc_response.status_code}")
    if put_doc_response.status_code > 299:
        print("Error posting document:", put_doc_response.text)

def create_opensearch_index_if_not_exists(index_name, opensearch_endpoint_url, region):
    """
    Prerequisite: create an OpenSearch cluster in AWS.
    You can just do it through click-ops as detailed here:
    https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gsgcreate-domain.html
    Except I recommend you pick cheaper settings (unless you wanna show this to customers):
    - No standby
    - Single node
    - 10 GB volume
    - cheaper instance type (I picked m6g.large)
    """
    index_url = f"{opensearch_endpoint_url}/{index_name}"
    auth = get_aws_auth(region)
    # check if index exists
    index_exists_response = requests.head(
        index_url,
        headers={"Content-Type": "application/json"},
        auth=auth
    )

    if index_exists_response.status_code == 200:
        print(f'Index "{index_name}" already exists. Skipping creation.')
        return
    else:
        print(f'Index "{index_name}" not found, creating...')

    mapping_properties = {
        'text': {'type': 'text'},
        'embeddings': {
            'type': 'knn_vector',
            'dimension': 1536,  # text-embedding-ada-002 produces 1536-dimensional vectors
            'method': {
                'name': 'hnsw',
                'space_type': 'cosinesimil',
                'engine': 'nmslib'
            }
        }
    }
    index_settings = {
        'settings': {
            'index.knn': True
        },
        'mappings': {
            'properties': mapping_properties
        }
    }
    create_result = requests.put(
        index_url,
        headers={"Content-Type": "application/json"},
        auth=auth,
        json=index_settings
    )

    if create_result.status_code == 200:
        print(f'Index "{index_name}" successfully created.')
    else:
        print(f'Error creating index: {create_result.content}')
        exit()


def load_document_main(index_name, opensearch_endpoint_url, aws_region):
    create_opensearch_index_if_not_exists(index_name, opensearch_endpoint_url, aws_region)

    text_segments = load_file_and_split_text("./Trimmed-Seneca-Morals-of-a-Happy-Life-Benefits-Anger-and-Clemency.txt")

    get_embeddings_for_segments_and_store(
        text_segments,
        index_name,
        opensearch_endpoint_url,
        aws_region
    )

def delete_index_if_exists(index_name, opensearch_endpoint_url, aws_region):
    index_url = f"{opensearch_endpoint_url}/{index_name}"
    auth = get_aws_auth(aws_region)
    # check if index exists
    index_exists_response = requests.head(
        index_url,
        headers={"Content-Type": "application/json"},
        auth=auth
    )

    if index_exists_response.status_code != 200:
        return

    delete_result = requests.delete(
        index_url,
        headers={"Content-Type": "application/json"},
        auth=auth
    )
    print("DELETE result: ", delete_result.status_code, delete_result.content)

The Driver Program

Finally, we’re ready to interact with our Seneca Gippidy. To that end, we need to write a program that:

  • Takes some prompt from the user
  • Converts this prompt to embeddings
  • Finds similar embeddings in our DB
  • Creates a big prompt with the text from our DB and the user’s query, then sends that to OpenAI’s completions API.

import json
import os
from helpers import get_aws_auth, get_embeddings_with_retry
import requests
import openai

openai.api_key = os.environ.get('OPENAI_API_KEY')

def get_text_context_from_opensearch(prompt_embeddings, index_name, opensearch_endpoint_url, region, max_hits):
    auth = get_aws_auth(region)
    query = {
        "size": max_hits,
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": "embeddings",
                        "query_value": prompt_embeddings,
                        "space_type": "cosinesimil"
                    }
                }
            }
        }
    }
    query_json = json.dumps(query)
    search_url = f'{opensearch_endpoint_url}/{index_name}/_search'
    headers = {'Content-Type': 'application/json'}

    # Send the POST request to the search endpoint with the query
    response = requests.post(search_url, headers=headers, data=query_json, auth=auth)
    if response.status_code != 200:
        print("Error getting similar embeddings: ", response.text)
        exit()

    response_json = json.loads(response.content)
    hits = response_json['hits']['hits']

    context_text = "\n".join([
        hit['_source'].get('text', '') for hit in hits[:max_hits]
    ])
    return context_text

def handle_question_with_added_context(question, index_name, opensearch_endpoint_url, region, max_context_hits=3):
    prompt_embeddings = get_embeddings_with_retry(question)
    augmented_prompt_context = get_text_context_from_opensearch(prompt_embeddings, index_name, opensearch_endpoint_url, region, max_context_hits)
    gpt_response = openai.Completion.create(
        prompt="".join([
            """Answer the question based on the context below, and if the question can't be answered based on the context, say "I don't know"\n\nContext: """,
            augmented_prompt_context,
            "\n\n---\n\nQuestion: ",
            question,
            '\nAnswer: '
        ]),
        temperature=0.7,
        max_tokens=2500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None,
        model='text-davinci-003',
    )
    print(json.dumps(gpt_response, indent=2))
    return gpt_response['choices'][0]['text'].strip()



# Program entry point
index_name = 'seneca-morals-of-a-happy-life'
opensearch_endpoint_url = 'https://search-____________.es.amazonaws.com'
aws_region = 'us-east-1'  # or wherever

prompt = input("enter your question for Seneca:\n")
result = handle_question_with_added_context(prompt, index_name, opensearch_endpoint_url, aws_region)
print(f"prompt: {prompt}, \nanswer:", result)

Conclusion

None of this stuff is prod-ready. Making it prod-ready would require, among other things:

  1. A UI to create new knowledge bases
  2. A worker to run load_document_main (this can’t be a regular http endpoint since it can run for several minutes depending on the size of the knowledge base)
  3. An http endpoint to trigger the load_document_main worker from the UI
  4. An http endpoint that wraps handle_question_with_added_context and streams the response back to the client
  5. Authentication, authorization, and permissions management: only certain users should be able to use certain knowledge bases

If you’re interested in all of that stuff, hit me up on LinkedIn. I could be convinced to create an Educative course about it.
