Simple, GenAI Powered, RAG based Chat application for PubMed Medical Articles using Elasticsearch and Azure OpenAI

Pratik Parate
11 min read · Jun 24, 2024


Introduction

In this fast-growing field of Generative AI, this is my attempt to show how we can build a simple, context-aware, Retrieval Augmented Generation (RAG) based chat application for medical articles, using Elasticsearch and Azure OpenAI to generate human-like responses over contextual or private data.

I have used Azure’s ‘text-embedding-ada-002’ model to create vector embeddings of abstracts from PubMed articles. The chat application is built over the ‘COVID-19 Research Papers Dataset’ from Kaggle and uses the generation capability of the ‘GPT-3.5’ model hosted on Azure.

Flow Diagram

Fig. 1: RAG Based Chat application for PubMed Medical Articles

What are LLMs (Large Language Models)

A large language model is a transformer-based deep neural network used for text-based tasks. So-called “Foundational” models are trained on massive amounts of unlabeled data, which is publicly available information from websites like GitHub, Wikipedia, etc. Training on such humongous data teaches the network the meaning of words and their semantics and makes it context aware, which helps it generate new responses, summarize provided content correctly, or answer questions from the user. These capabilities are utilized in different fields, where LLMs serve as chatbots, translators, personalized assistants, etc.

Foundation models can be used as base models and trained further on custom data for downstream tasks. An efficient way to utilize an LLM’s capability of generating responses over custom data is the RAG methodology, used in chat applications or information-retrieval applications.

What is Retrieval Augmented Generation (RAG)?

For chat applications, the RAG component takes the query from the user and retrieves the associated information from a private database. The external information corresponding to the use case is stored in that database along with its vector embeddings and is not exposed to the LLM during training. The user’s query is first converted to a vector embedding and compared with the data in the vector database to find the best possible match. The input prompt to the LLM is then built from the combination of the user query and the information retrieved from the private database. The LLM utilizes this data, together with the public dataset on which it was trained, to generate more accurate and relevant responses. Refer to Fig. 1 above for the complete flow.

Why Elasticsearch?

Elasticsearch is an open-source, distributed, document-based NoSQL database and a popular search and analytics engine. It is capable of storing, indexing, searching, and analyzing high volumes of data in near-real time, returning results in milliseconds.

It provides robust security features such as Role-Based Access Control (RBAC) and encryption. Thanks to its simple REST-based APIs, plain HTTP interface, and extensive client support across programming languages, it is very easy to use.

Compared to other vector databases, Elasticsearch has flexible search capabilities, which can yield more accurate results. It allows vector embeddings to be stored in dense_vector fields. It provides efficient kNN (k-Nearest Neighbors) search over dense_vector fields and allows ranking documents by a script_score. It supports storing vectors of up to 4096 floating-point values. Elasticsearch provides different similarity metrics, such as cosine (the default), l2_norm, dot_product, and max_inner_product, to perform kNN search.
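As an illustrative sketch of these features (field names here mirror this article’s schema, but the explicit `similarity` setting and the `knn` search body are generic Elasticsearch 8.x options, not the exact queries used later in this article):

```python
# A dense_vector mapping with an explicit similarity metric.
mapping = {
    "mappings": {
        "properties": {
            "abstract": {"type": "text"},
            "ada_embedding": {
                "type": "dense_vector",
                "dims": 1536,
                "index": True,
                "similarity": "cosine",  # default; alternatives: l2_norm, dot_product, max_inner_product
            },
        }
    }
}

# Approximate kNN search body over the vector field (Elasticsearch 8.x `knn` option).
knn_query = {
    "field": "ada_embedding",
    "query_vector": [0.0] * 1536,  # replace with a real query embedding
    "k": 5,                        # number of nearest neighbors to return
    "num_candidates": 50,          # candidates examined per shard
}
# resp = es_client.search(index="pubmed_index", knn=knn_query)
```

This article uses an exact script_score query instead (shown in Step 4); the `knn` option trades a little accuracy for speed on large indices.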

With the above features considered, Elasticsearch proves to be an appropriate choice for storing vector embeddings alongside the documents themselves.

Azure OpenAI

Azure OpenAI Service is Azure’s OpenAI offering. It provides REST APIs to access OpenAI’s language models, including the GPT-3.5-Turbo and text-embedding-ada-002 models that we have used in this application.

For detailed information, refer to What is Azure OpenAI Service?

Implementation

For the implementation, I have used the ‘COVID-19 Research Papers Dataset’ from Kaggle.

Step 1: Data cleaning and Embeddings Generation

After initial cleaning of the CSV dataset obtained from Kaggle, this is what the schema of the cleaned data looks like:

+----------------+------------------+
|Column Name | Column Data Type |
+----------------+------------------+
|pmid | integer |
|doi | text |
|journal | text |
|country | text |
|title | text |
|authors | text |
|abstract | text |
|citation_count | integer |
|published_at | text |
+----------------+------------------+

The next step is to create vector embeddings over the ‘abstract’ column. To create the vector embeddings, I have used the ‘text-embedding-ada-002’ model from Azure OpenAI.

Add the vector field ‘ada_embedding’ using the get_embedding() function below. Set AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT to the name of your embeddings deployment (here, ‘text-embedding-ada-002’). The Azure OpenAI client expects the below two environment variables to be set.

AZURE_OPENAI_ENDPOINT='https://<AZURE_OPENAI_ENDPOINT>.openai.azure.com/'
AZURE_OPENAI_API_KEY='<AZURE_OPENAI_API_KEY>'

Refer openai-api-key-azure-ai-studio to find OpenAI Endpoint and API Key.
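As a minimal sketch of wiring these up from Python itself (the placeholder values and the API version string are assumptions; use the values from your own Azure resource):

```python
import os

# Hypothetical placeholder values; substitute the values from your Azure resource.
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<AZURE_OPENAI_ENDPOINT>.openai.azure.com/"
os.environ["AZURE_OPENAI_API_KEY"] = "<AZURE_OPENAI_API_KEY>"
# The openai SDK also reads an API version from OPENAI_API_VERSION:
os.environ["OPENAI_API_VERSION"] = "2024-02-01"  # assumed version string; check your deployment

# from openai import AzureOpenAI
# client = AzureOpenAI()  # picks up all three variables automatically
```

In practice, prefer exporting these in your shell or a .env file rather than hard-coding them in source.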

from openai import AzureOpenAI
import pandas as pd
import os

client = AzureOpenAI()

def get_embedding(text, model=os.environ.get("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT", "text-embedding-ada-002")):
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding

df_token = pd.read_json("papers_small_cleaned.json")
df_token["ada_embedding"] = df_token["abstract"].apply(get_embedding)
df_token.to_json(os.path.join(os.getcwd(), "data", "papers_small_embedding.json"),
                 lines=True, orient="records")

The function above adds a column ‘ada_embedding’ by converting each ‘abstract’ into a vector of 1536 floating-point numbers.
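Before indexing, it is worth sanity-checking that every row actually received a well-formed vector; a small helper (the function name is mine, not part of the original code) could look like:

```python
import math

def validate_embedding(vec, expected_dims=1536):
    """Sanity-check one embedding before indexing: correct length
    and all values finite floats (no NaN/inf)."""
    return (len(vec) == expected_dims
            and all(isinstance(x, float) and math.isfinite(x) for x in vec))

# Example over the DataFrame from the previous step:
# assert df_token["ada_embedding"].apply(validate_embedding).all()
```

A malformed vector would otherwise surface later as a mapper error during bulk indexing, where it is harder to trace back to the source row.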

The resulting document, with the additional vector field, is as below.

+----------------+------------------+
|Column Name | Column Data Type |
+----------------+------------------+
|pmid | integer |
|doi | text |
|journal | text |
|country | text |
|title | text |
|authors | text |
|abstract | text |
|citation_count | integer |
|published_at | text |
|ada_embedding | [float]1536 |
+----------------+------------------+

Step 2: Creating Elasticsearch Index

Create the Elasticsearch (ES) Python client as below. Replace <your_es_endpoint> and <your_api_key> with the endpoint and API key of your Elasticsearch cluster.

from elasticsearch import Elasticsearch

es_client = Elasticsearch(
    "https://<your_es_endpoint>:443",
    api_key='<your_api_key>'
)

es_client.info()
ObjectApiResponse({'name': 'instance-0000000001', 'cluster_name': 'chuw2o1ibreqop18bepraxehi052afre', 'cluster_uuid': 'RvWLPyaU7uswegrapPETdc', 'version': {'number': '8.13.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '16cc90cd2d08a3147ce02b07e50894bc060a4cbf', 'build_date': '2024-04-05T14:45:26.420424304Z', 'build_snapshot': False, 'lucene_version': '9.10.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

Create the PubMed index as below.

# Create PubMed data index
index_mapping = {
    'mappings': {
        'properties': {
            'pmid': {'type': 'integer'},
            'doi': {'type': 'text'},
            'journal': {'type': 'text'},
            'country': {'type': 'text'},
            'title': {'type': 'text'},
            'authors': {'type': 'text'},
            'abstract': {'type': 'text'},
            'citation_count': {'type': 'integer'},
            'published_at': {'type': 'text'},
            'num_tokens': {'type': 'integer'},
            'ada_embedding': {
                'type': 'dense_vector',
                'dims': 1536,
            },
        },
    },
}

try:
    es_client.indices.create(index="pubmed_index", body=index_mapping,
                             timeout="30s", ignore=[400, 404])
except Exception as e:
    print(f"ERROR: Could not create index mapping: {e}")

Step 3: Ingesting documents to Elasticsearch

Use the Bulk API to index multiple documents into Elasticsearch. The Bulk API expects each document to be preceded by an action line like the JSON below, where ‘_id’ must be a unique value. In our case, I have used ‘pmid’ as the unique identifier.

{"index": {"_index": "pubmed_index", "_id": 34013297}}

Use the below code to produce the expected format.

import json
import os

with open(os.path.join(os.getcwd(), "data", "papers_small_embedding_index.json"), 'w') as fw:
    with open(os.path.join(os.getcwd(), "data", "papers_small_embedding.json"), 'r') as fr:
        for line in fr:
            json_line = json.loads(line)
            index_dict = {"index": {"_index": "pubmed_index", "_id": json_line.get("pmid")}}
            fw.write(f'{json.dumps(index_dict)}\n')
            fw.write(line)

Each Bulk API request payload should stay under 100MB, Elasticsearch’s default maximum HTTP request size. So, if the total size of the documents to be indexed into ES is larger, preprocessing is required: divide the NDJSON file so that each chunk is under 100MB, and iterate over the chunks to ingest them into ES.
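A minimal sketch of such chunking (the helper name and byte budget are illustrative), which keeps each action line paired with its document line while splitting payloads at the size limit:

```python
def chunk_bulk_lines(lines, max_bytes=100 * 1024 * 1024):
    """Split NDJSON bulk lines (alternating action line, document line)
    into payload strings, each under max_bytes."""
    chunks, current, size = [], [], 0
    # lines alternate: action metadata, then document source
    for action, doc in zip(lines[0::2], lines[1::2]):
        pair_size = len(action.encode()) + len(doc.encode())
        if current and size + pair_size > max_bytes:
            chunks.append("".join(current))  # flush the full payload
            current, size = [], 0
        current += [action, doc]
        size += pair_size
    if current:
        chunks.append("".join(current))
    return chunks

# Usage sketch:
# with open("papers_small_embedding_index.json") as fp:
#     for payload in chunk_bulk_lines(fp.readlines()):
#         es_client.bulk(operations=payload)
```

Never split between an action line and its document; the Bulk API would reject the orphaned half.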

Index the articles into Elasticsearch using the below code.

# Bulk index documents to ES.
import json
import os

with open(os.path.join(os.getcwd(), "data", "papers_small_embedding_index.json"), 'r') as fp:
    content = fp.read()
es_client.bulk(operations=content, pipeline="ent-search-generic-ingestion")

Step 4: Developing the Chat Application

Now that we have our data cleaned and prepared, it is time to develop our chat application.

Let us consider the below question from a user.

USER_Question = "What is Sarcoidosis?"

Get the vector embedding for the above question using the get_embedding() function defined in Step 1.

question_embeddings = get_embedding(USER_Question)

Apply a cosine-similarity query to get the most relevant documents from the ES PubMed index by comparing ‘question_embeddings’ with the ‘ada_embedding’ field in the index.

Elasticsearch returns the retrieved documents in descending order of similarity score.

# Using cosine similarity, get the relevant documents from Elasticsearch
script_query = {
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.query_vector, 'ada_embedding') + 1.0",
            "params": {"query_vector": question_embeddings}
        }
    }
}
resp = es_client.search(index='pubmed_index', query=script_query)

We can verify that the retrieved documents are in descending order (i.e., the most relevant document first) by checking the ‘_score’ field in the response. For the above ‘USER_Question’, the document with pmid=33996868 is retrieved as the best match.

hits_score = [x['_score'] for x in resp.body['hits']['hits']]
print(hits_score)

Get the ‘abstract’ of the best-matched document.

best_match = resp.body['hits']['hits'][0]['_source']
print(best_match['abstract'], best_match["pmid"])

Use the below Azure OpenAI API client.

from openai import AzureOpenAI
oai_client = AzureOpenAI()

Create a prompt with the ‘USER_Question’, augment it with the retrieved document’s abstract, and utilize the generation capability of the Azure OpenAI deployment endpoint (gpt-3.5-turbo) to generate the answer.

messages = [
    {"role": "system", "content": "You are an AI assistant that helps with AI questions. Use the 'content' if provided to answer the 'Question'"},
    {"role": "user", "content": f'{best_match["abstract"]} Question:{USER_Question}'}
]
print(f"Prompt = {messages}")
response = oai_client.chat.completions.create(
    model=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    temperature=0.7,
    max_tokens=800,
    messages=messages
)
print(response.choices[0].message.content)
Sarcoidosis is an immune-mediated chronic inflammatory disorder that can affect different organs in the body, such as the lungs, skin, eyes, and lymph nodes. It is characterized by the formation of non-caseating granulomas, which are clusters of immune cells that form nodules in affected tissues. The cause of sarcoidosis is unknown, but it is believed to be related to an abnormal immune response. Symptoms can vary depending on the organs involved and can include cough, shortness of breath, skin rash, and fatigue. Treatment depends on the severity of the disease and can include medications to suppress the immune system.

We can see in the above response that the LLM used the provided private data, i.e., the ‘abstract’, to generate a response that is more relevant to us. If the same question is sent to the LLM without augmenting it with the retrieved abstract, it generates a response based only on the public data the model was trained on. Below are two such test cases.

In the results below, we can see that the responses generated by the LLM without RAG are generic and less context aware, while for the same questions, the responses generated by the LLM with RAG are more relevant and in context with the private data.

In the case of Question 2, the LLM did not have the answer, since it was not trained on the private data. But it generated a context-aware response when augmented with that data.
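To reproduce this with/without-RAG comparison, the prompt construction can be factored into a small helper (a sketch; the helper name is mine, and the system prompt mirrors the one used in Step 4):

```python
def build_messages(question, abstract=None):
    """Build the chat payload. Pass abstract=None for the no-RAG
    baseline, or a retrieved abstract to ground the answer."""
    system = ("You are an AI assistant that helps with AI questions. "
              "Use the 'content' if provided to answer the 'Question'")
    content = f"{abstract} Question:{question}" if abstract else f"Question:{question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": content},
    ]

# Without RAG: oai_client.chat.completions.create(..., messages=build_messages(q))
# With RAG:    ... messages=build_messages(q, best_match["abstract"])
```

Running both variants for the same question makes the difference in the generated answers easy to compare side by side, as in the results below.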

Results

Question 1

What is Sarcoidosis?

Best matched article for above question in Database

33996868 --> Sarcoidosis is an immune mediated chronic inflammatory disorder that is best characterized by non-caseating granulomas found in one or more affected organs. The COVID-19 pandemic poses a challenge for clinicians caring for sarcoidosis patients who may be at increased risk of infection compared to the general population. With the recent availability of COVID-19 vaccines, it is expected that clinicians raise questions regarding efficacy and safety in sarcoidosis. However, studies examining safety and efficacy of vaccines in sarcoidosis are lacking. In this review, we examine the current literature regarding vaccination in immunocompromised populations and apply them to sarcoidosis patients. The available literature suggests that vaccines are safe and effective in patients with autoimmune disorders and in those taking immunosuppressive medications. We strongly recommend the administration of COVID-19 vaccines in patients with sarcoidosis. We also present a clinical decision algorithm to provide guidance on vaccination of sarcoidosis patients against COVID-19.

Generated response With RAG

Sarcoidosis is an immune-mediated chronic inflammatory disorder that can affect different organs in the body, such as the lungs, skin, eyes, and lymph nodes. It is characterized by the formation of non-caseating granulomas, which are clusters of immune cells that form nodules in affected tissues. The cause of sarcoidosis is unknown, but it is believed to be related to an abnormal immune response. Symptoms can vary depending on the organs involved and can include cough, shortness of breath, skin rash, and fatigue. Treatment depends on the severity of the disease and can include medications to suppress the immune system.

Generated response Without RAG

Sarcoidosis is a disease that causes inflammation in different parts of the body, such as the lungs, lymph nodes, skin, and eyes. It occurs when abnormal growths called granulomas form in these areas. The exact cause of sarcoidosis is unknown, but it is believed to be related to the body's immune response. Symptoms can range from mild to severe and may include fatigue, fever, weight loss, joint pain, and shortness of breath. Treatment typically includes medications to manage inflammation and relieve symptoms.

Question 2

Tell me the case of 72-year-old female with past medical history of hypertension and diabetes mellitus?

Best matched article for above question in Database

33968285 --> Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused the ongoing global pandemic. It can manifest a wide range of complications depending upon the severity of infection and comorbidities of the patient. Vaccines are very important measure to provide protection against COVID-19. We report a case of 72-year-old female with past medical history of hypertension and diabetes mellitus who underwent imaging with positron emission tomography (PET) scan imaging for staging of her small cell urinary bladder cancer and was found to have hypermetabolic uptake in the deltoid muscle of the left shoulder and hypermetabolic left axillary and pectoral lymph nodes due to mRNA BNT-162b2 (Pfizer-BioNTech COVID-19 vaccine) vaccine administrated 3 days ago prior to PET scan.

Generated response With RAG

The case of the 72-year-old female with past medical history of hypertension and diabetes mellitus is that she underwent imaging with positron emission tomography (PET) scan imaging for staging of her small cell urinary bladder cancer and was found to have hypermetabolic uptake in the deltoid muscle of the left shoulder and hypermetabolic left axillary and pectoral lymph nodes due to mRNA BNT-162b2 (Pfizer-BioNTech COVID-19 vaccine) vaccine administrated 3 days ago prior to PET scan. This indicates that the patient had received the Pfizer-BioNTech COVID-19 vaccine recently and it was detected on the PET scan.

Generated response Without RAG

I'm sorry, but I am an AI language model and I don't have access to specific patient medical records. Additionally, as an AI assistant, I am not qualified to provide medical advice or diagnosis. If you have a specific question or concern regarding a medical case, I recommend consulting a licensed healthcare professional.

Enhancements

There are certainly numerous others, but below are some of the ways in which the application can be enhanced.

  1. Use article text in chunks instead of only the ‘abstract’ to provide more context to the LLM
  2. Provide content from multiple articles from the database instead of only the top match, e.g., from the 5 best-matched articles
  3. Limit the number of words in the response, etc.
  4. Chain responses by adding previous answers to the context and using them to provide context-aware answers with the LangChain Python module
  5. Use more advanced GPT models
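As a sketch of enhancement 2 (the helper name and character budget are illustrative), abstracts from the top-k hits could be concatenated into one context string instead of using only the best match:

```python
def build_context(hits, top_k=5, max_chars=6000):
    """Concatenate abstracts from the top-k retrieved hits into one
    context string, stopping early if a rough character budget
    (a crude stand-in for a token limit) would be exceeded."""
    parts, used = [], 0
    for hit in hits[:top_k]:
        abstract = hit["_source"]["abstract"]
        if used + len(abstract) > max_chars:
            break
        parts.append(abstract)
        used += len(abstract)
    return "\n\n".join(parts)

# Usage sketch, with `resp` from the search in Step 4:
# context = build_context(resp.body["hits"]["hits"], top_k=5)
# messages[1]["content"] = f"{context} Question:{USER_Question}"
```

The character budget matters because gpt-3.5-turbo has a fixed context window; a proper implementation would count tokens (e.g., with tiktoken) rather than characters.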

Conclusion

The application above is generalized and can be used over any type of data. The more data it has to retrieve from, the more relevant and higher-quality the responses.

This chat application has been built using Elasticsearch as the vector database. There are numerous other vector databases, such as Azure Cosmos DB and Chroma DB, that can be used to store and query vector embeddings.

