[Machine Learning] Unleashing the Power of RAG: Crafting an “Olympic-Ready” Knowledge Bot with OpenAI’s LLMs

Keshav Singh
6 min read · Sep 11, 2023


Design

Generative Pre-trained Transformers (GPT), in the context of large language models (LLMs), are versatile artificial neural networks designed to handle a wide range of natural language understanding and generation tasks. They are typically pre-trained on a massive corpus of text from the internet in an unsupervised manner; during pre-training, the model learns to predict the next word in a sentence or to fill in masked-out words. After pre-training, the model can be fine-tuned on specific downstream tasks such as text classification, translation, summarization, or question answering.

Once trained, these models serve general purposes and are immensely powerful.

On top of a base LLM, Retrieval-Augmented Generation (RAG) systems are designed to ground and contextualize responses, with the necessary prompt engineering, for purposes such as customer service, search/research, enterprise intelligence, and legal, financial, or medical analysis. Tasks such as recommendation, search, and association can then be performed effectively against highly contextual knowledge.

To briefly describe a RAG solution, it consists of:

  1. Smart Retriever — the retriever acts as a consultant over the knowledge base, returning ranked, query-relevant documents. Doing so overcomes the knowledge-cutoff limitation that leads to LLM hallucination.
  2. Generator — the retrieved contextual knowledge is then forwarded, together with the query, to the generator model, so the LLM has both the knowledge and the query it needs to answer desirably.
  3. Grounding (Prompt Engineering) — grounding builds on both of the above to handle queries within the desired window of context; it also helps considerably with the safety and bias of the LLM. A minimal sketch of this pipeline follows below.
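To make the three stages concrete, here is a minimal, runnable sketch in Python. It is purely illustrative: the helper names (retrieve_top_k, build_grounded_prompt), the keyword-overlap retriever, and the toy documents are all hypothetical stand-ins, and a real system would use vector search and an LLM generator, as we build next.

# Toy RAG pipeline sketch. retrieve_top_k and build_grounded_prompt are
# hypothetical helpers; a real system would use vector search and an LLM.
def retrieve_top_k(knowledge_base: list, query: str, k: int) -> list:
    # 1. Smart Retriever: rank documents by naive word overlap with the query.
    query_words = set(query.lower().split())
    ranked = sorted(knowledge_base,
                    key=lambda doc: -len(query_words & set(doc.lower().split())))
    return ranked[:k]

def build_grounded_prompt(system: str, context: list, question: str) -> str:
    # 2. Grounding: constrain the model to the retrieved context and scope.
    return system + "\nContext:\n" + "\n".join(context) + "\nQuestion: " + question

docs = ["Athlete A won a gold medal in ice hockey in 2002.",
        "Athlete B won a bronze medal in curling in 2002."]
prompt = build_grounded_prompt(
    "You answer questions only about Olympics and nothing else.",
    retrieve_top_k(docs, "Who won gold in ice hockey?", 1),
    "Who won gold in ice hockey?")
# 3. Generator: in a real system, this prompt is sent to the LLM to answer.
print(prompt)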

To illustrate the concept, let's build a smart Olympic knowledge bot on a dataset spanning 120 years of Olympic history. Imagine we are from the Olympic committee and need an efficient enterprise bot explicitly built to handle natural-language queries about Olympics knowledge.

Embeddings: Text embeddings are numerical representations of text data that capture the semantic meaning of, and relationships between, words, phrases, or documents. These embeddings are typically derived from pre-trained deep learning models such as Word2Vec, GloVe, or, more recently, Transformers like BERT and GPT. These models map words or pieces of text into high-dimensional vectors where similar words or texts are closer together in the vector space. We will leverage Chroma DB, an open-source AI text embedding database, to store the Olympics dataset and its knowledge with semantic understanding, matching, and search ranking. Ref: https://docs.trychroma.com/. By default, Chroma DB uses the all-MiniLM-L6-v2 vector embedding model to create embeddings, but in this case we will leverage OpenAI's "text-embedding-ada-002" model to generate the embeddings and limit Chroma DB to serving merely as an in-memory storage and retrieval collection.
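As a quick, self-contained illustration of semantic closeness in the embedding space, here is a sketch assuming the same pre-v1.0 openai SDK and "text-embedding-ada-002" model used throughout this article, with OPENAI_API_KEY set in the environment; the sample sentences are arbitrary:

# Sketch: semantically related texts yield higher cosine similarity.
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    # Returns the ada-002 embedding vector for a piece of text.
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = embed("Athlete who won a gold medal in ice hockey")
b = embed("Olympic ice hockey gold medalist")
c = embed("Recipe for chocolate cake")
# Expect cosine(a, b) to be noticeably higher than cosine(a, c).
print(cosine(a, b), cosine(a, c))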

In the code below, we prepare the Olympics dataset, transforming each record into text for embedding generation. Once generated, the embeddings form the basis for retrieving and ranking the documents relevant to a user's prompt query, up to the retrieval cutoff number.

Simple Illustration of Embedding on OpenAI

For any user query, the smart retriever first transforms the query into embeddings using OpenAI's "text-embedding-ada-002" model. Based on these embeddings, the retriever searches for the closest-matching top N (5) documents. The prompt and the documents are then passed to OpenAI's "gpt-3.5-turbo" model for text generation and response. The response generation is safeguarded with a system safety message: {"role": "system", "content": "You answer questions only about Olympics and nothing else."}

The code illustrates the solution.

# Dataset: https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
import pandas as pd
import openai
import chromadb
from chromadb.utils import embedding_functions
import os
from dotenv import load_dotenv

load_dotenv()

class prepareContextData:
    def __init__(self):
        self.contextDataPath = os.getenv("CONTEXT_DATA_PATH")

    def transform(self):
        # Load the dataset and restrict it to the 2002 Olympics.
        df = pd.read_csv(self.contextDataPath)
        df = df.loc[df['Year'] == 2002]
        df = df.astype(str)
        # Normalize columns into human-readable phrases.
        df['Gender'] = ["Male" if x == 'M' else "Female" for x in df['Sex']]
        df['Medal'] = ["but did not win any medal" if x == 'nan' else "won a " + x + " medal" for x in df['Medal']]
        df['Height'] = ["Unknown" if x == 'nan' else x for x in df['Height']]
        df['Weight'] = ["Unknown" if x == 'nan' else x for x in df['Weight']]

        # Compose one natural-language sentence per athlete record.
        df['text'] = df['Name'] + ' a ' + df['Gender'] + ', aged ' + df['Age'] + ', height ' + df['Height']\
            + ' centimeters, Weight ' + df['Weight'] + ' kilograms, from team ' + df['Team']\
            + ' , participated in olympic games held in the year ' + df['Year'] + ' , for the '\
            + df['Season'] + ' season, hosted in the city ' + df['City'] + ' in the ' + '"' + df['Sport'] + '"'\
            + ' sporting category'\
            + ' for the event ' + '"' + df['Event'] + '"' + ' ' + df['Medal']
        df = df.head(100)
        docs = df["text"].tolist()
        docs = [item.replace('"', '') for item in docs]
        docs = [item.replace("'", '') for item in docs]
        ids = [str(x) for x in df.index.tolist()]
        return (docs, ids)

class oaiContextDataEmbedding:

    def __init__(self, documents: list, ids: list):
        self.documents = documents
        self.ids = ids
        self.openai_api_key = os.getenv("OPENAI_API_KEY")

    def OpenAI_ContextData_Embedding(self):
        # Use OpenAI's ada-002 model for embeddings instead of Chroma's default.
        openai_embeddingFunction = embedding_functions.OpenAIEmbeddingFunction(
            api_key=self.openai_api_key,
            model_name="text-embedding-ada-002")

        # Chroma serves purely as the in-memory storage and retrieval collection.
        client = chromadb.Client()
        collection = client.get_or_create_collection("olympics", embedding_function=openai_embeddingFunction)
        collection.add(documents=self.documents, ids=self.ids)
        return collection


class retrievalAugmentedGeneration:

    def __init__(self, prompt: str, num_results: int):
        self.prompt = prompt
        self.num_results = num_results

    def generate_prompt_embedding(self, collection):
        # Embed the user prompt, then retrieve the closest-matching documents.
        response = openai.Embedding.create(model="text-embedding-ada-002", input=self.prompt)
        results = collection.query(
            query_embeddings=[response["data"][0]["embedding"]],
            n_results=self.num_results,
            include=["documents"])
        res = "\n".join(str(item) for item in results['documents'][0])
        # Prepend the retrieved context to the original prompt.
        augmented_prompt = res + " " + self.prompt
        return augmented_prompt

    def getChatResponse(self, collection):
        # The system message scopes the bot to Olympics questions only.
        messages = [
            {"role": "system", "content": "You answer questions only about Olympics and nothing else."},
            {"role": "user", "content": self.generate_prompt_embedding(collection)}]
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            temperature=0
        )
        response_message = response["choices"][0]["message"]["content"]
        return response_message

if __name__ == "__main__":
    prepRagContext = prepareContextData()
    documents, ids = prepRagContext.transform()
    promptQuery = "Provide the details of the athlete who won a Gold Medal in Olympics 2002?"
    # "Which actor played Maverick in the movie Top Gun?"
    # "Which individual won the Bronze Medal for Olympics Ice Hockey Mens Ice Hockey in 2002?"
    numberOfResults = 5
    collection = oaiContextDataEmbedding(documents, ids).OpenAI_ContextData_Embedding()
    rag = retrievalAugmentedGeneration(promptQuery, numberOfResults)
    print(rag.getChatResponse(collection))

GitHub Code Reference [https://github.com/keshavksingh/oai-rag-olympics]

We test the solution locally with a prompt query: “Which individual won the Bronze Medal for Olympics Ice Hockey Mens Ice Hockey in 2002?”

Olympic Query

We also test the safety of the bot by checking its behavior on an irrelevant query: “Which actor played Maverick in the movie Top Gun?” Rightfully, its response is precise and exactly what we would expect.

Response Safety

Use case: Search and Query

We are now ready to translate our solution into an API using FastAPI, with two endpoints:

a. Query — answers an Olympics-related question based on the prompt.

b. Search — returns the N most relevant documents for the prompt via similarity search.

# Import Uvicorn & the necessary modules from FastAPI
import uvicorn
from fastapi import FastAPI
# Import the RAG module built above
import RAG_BOT_OLYMPICS as rbo
from dotenv import load_dotenv

# Load the environment variables from the .env file into the application
load_dotenv()
# Initialize the FastAPI application
app = FastAPI()

# Build the embedding collection once at startup so every request can reuse it.
prepRagContext = rbo.prepareContextData()
documents, ids = prepRagContext.transform()
collection = rbo.oaiContextDataEmbedding(documents, ids).OpenAI_ContextData_Embedding()

@app.post("/queryOlympics")
async def oaiChatbot(promptQuery: str):
    # RAG answer: retrieve context, then generate a grounded response.
    numberOfResults = 5
    rag = rbo.retrievalAugmentedGeneration(promptQuery, numberOfResults)
    return {"result": str(rag.getChatResponse(collection))}

@app.post("/searchOlympics")
async def oaiSearch(promptQuery: str):
    # Similarity search: return the augmented prompt (top-N documents + query).
    numberOfResults = 5
    rag = rbo.retrievalAugmentedGeneration(promptQuery, numberOfResults)
    return {"result": str(rag.generate_prompt_embedding(collection))}

if __name__ == '__main__':
    # FastAPI apps are served by an ASGI server such as Uvicorn.
    uvicorn.run(app, host="0.0.0.0", port=8000)
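To exercise the endpoints locally, here is a minimal client sketch. It assumes the API file above is saved as main.py and served on port 8000 (e.g., via uvicorn main:app --port 8000); the file name and port are assumptions, not part of the original code.

# Sketch: call the two endpoints; promptQuery is passed as a query parameter.
import requests

query = "Which individual won the Bronze Medal for Olympics Ice Hockey Mens Ice Hockey in 2002?"
answer = requests.post("http://localhost:8000/queryOlympics", params={"promptQuery": query})
print(answer.json()["result"])

docs = requests.post("http://localhost:8000/searchOlympics", params={"promptQuery": query})
print(docs.json()["result"])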

Endpoint Validation.

Validate Query

The search endpoint returns the top 5 documents from the Olympics 2002 dataset for the prompt query.

Validate Search

With this simple use case, we have illustrated two significant capabilities: a highly contextual RAG-based bot and a search/document retriever. Both could be expanded to diverse use cases and applications such as customer service, similarity search, ranking, recommendation, clustering, classification, association, and enterprise knowledge systems.

I hope you find this insightful!

Leaving you with a thought: it's great to be an explorer, intensely hungry and on the bleeding edge; as with LLMs, staying grounded and disciplined makes the real difference in value outcomes.


Keshav Singh

Principal Engineering Lead, Microsoft Purview Data Governance PG | Data ML Platform Architecture & Engineering | https://www.linkedin.com/in/keshavksingh