HealthData.ai: Health + Life Sciences Data Aggregation & Augmentation using DataZone + Amazon Bedrock

Madhur Prashant
17 min read · Nov 7, 2023


Purpose

The purpose of this blog is to explore a use case where data aggregation, coupled with augmentation, is key. We will go over a product framework for a business use case and establish some product and business goals. We will then examine the main pain points of the use case and the user segments in question, prioritize those users and pain points, and offer a broad variety of solutions, along with some innovative extensions that could be added to the product as we move forward with designing a solution. Without further ado, take a look at some of the prerequisites that might be helpful to review before reading this blog:

Prerequisites:

  1. RAG/Human Aligned Models: https://medium.com/@madhur.prashant7/some-dynamic-rag-implementation-non-hallucinating-fine-tuned-models-ed13b46f6a6d
  2. DataZone use case within Finance & Trading: https://medium.com/aws-in-plain-english/financial-trading-for-investment-bankers-generative-ai-utilizing-amazon-datazone-agent-llms-a44d54811b03
  3. RAG @ Bedrock + Containerizations: https://medium.com/@madhur.prashant7/multi-tenant-products-rag-bedrock-amazon-native-kubernetes-eks-codellama-7b-walkthrough-7da2ca955ed7
  4. Claude & Titan Prompt Engineering Use Cases: https://medium.com/@madhur.prashant7/demo-walkthrough-claudev2-q-a-on-bedrock-using-rag-langchain-best-prompting-practices-for-8b4047d2644e

NOTE: I work at AWS, but the thoughts, ideas and implementations on these blogs are my own.

Before we get started, note that this blog is twofold: first, we will walk through the entire business use case and product framework, ending with a set of prioritized solutions to implement; second, we will go over the architectural pieces of the solution and propose some code walkthroughs and an AWS solution architecture. Now, without further ado, let’s get started:

Product Use Case: Data Aggregation For HealthCare + Life Sciences

Our product use case is to develop a medical health product that streamlines the collection, integration, and enrichment of healthcare and life sciences data to empower medical researchers, pharmaceutical companies, and healthcare professionals. The goal of this product is to efficiently surface accurate responses and valuable insights from all of the data, to promote medical advancements and reduce the manual overhead required to prepare, load, and access data from a central data lake.

Vision of our Product

The vision of our product is threefold. First, our medical health product focuses on accuracy and aggregation: all patient data and comprehensive research data should be accurately represented in an aggregated fashion. Second, we want our product to be scalable, so that the more data and the more domains/sources of data there are, the more easily it scales without any major tweaks. Third, we want to value medical researchers' time by giving them efficient access to data from a central data lake. For the purpose of this blog, we will focus on scaling geographically within one region, prioritizing accuracy, aggregation, scalability, and lastly accessibility of data. Now, without further ado, let's establish some business and product oriented goals before moving forward with user segments, pain points, and solutions.

Business Goal

The main business goal for this product is to act as a leading central medical data aggregation service, with a specific focus on augmentation for better data accessibility. This involves creating partnerships with healthcare institutions, research organizations, and pharmaceutical companies to expand outreach and maintain a reliable flow of data to and from our product platform. This brings us to our main goal: user acquisition and engagement. For the purpose of this blog, we want our users (medical professionals, researchers, and so on) to adopt the product, measured by certain metrics we track, while we drive engagement on the platform. User engagement for our product centers on building reliability and trust, so it is important to treat driving engagement as a core business goal. Once we hit these targets, we can focus on revenue/monetization as a goal to scale and grow.

Product Goal

The product goal is threefold. First, we want the data to be readily and easily accessible to our user base; for that, our product should offer a unified, interactive, and user-friendly platform that seamlessly aggregates and augments healthcare and life sciences data. Second, we want the data to be trusted as well as easily accessible, making HealthData.ai more reliable for our users. Third, we want to enable users to access a vast variety of datasets, conduct research, and make data-driven decisions, promoting health and medical care services. We envision a product built on two main components: a generative AI pipeline with coding tools and containers, and data-driven services that couple together to provide real-time accessibility, insights, and so on.

Now that we have defined the vision of our product (scalability, reliability, accessibility, and accuracy of data), our business goals (promoting medical and health care services while driving user acquisition and engagement), and our product goal (an interactive platform whose readily available data supports diverse analytical tests and research), we can focus on which users really fit this use case. Without further ado, let's get into it:

User Segments

For our HealthData.ai product, we can take a look at a variety of user segments that we want to target to create a product that makes data easily accessible and augments workflows for these users:

  1. Medical Researchers/Scientists: These professionals require access to vast datasets to conduct studies, perform data analysis, and make breakthroughs in medical research. They rely heavily on data from various sources being accurate and reliable to drive decisions in their research experiments and papers. This is a potential user segment that might benefit from data aggregation and augmentation.
  2. Pharmaceutical Companies/Organizations: Pharmaceutical companies rely on comprehensive data to develop new drugs, conduct clinical trials, and gain insights into drug efficacy and safety. Data aggregation and augmentation are vital to their drug development processes. This segment drives medical innovation with the data available to them; HealthData.ai could save them time and let them direct more value toward their end goals and users rather than collecting data from various sources.
  3. Public Health Organizations: Organizations focused on public health initiatives need aggregated data to track disease outbreaks, monitor health trends, and make informed decisions. This segment depends on timely, accurate data for public health planning that can be critical to end users and their health, so making data readily and easily accessible is of the highest priority. HealthData.ai might make that a possibility.
  4. Various Healthcare Providers: Hospitals, clinics, and healthcare professionals need access to patient records, medical histories, and treatment data to provide the best possible care. This data might be spread across various tables and sources, and data analysts and IT staff might spend days of data preparation to make it accessible. That requires time, effort, and human intervention, which makes the process prone to errors, so HealthData.ai might be a helpful solution for this use case as well.
  5. Emerging HealthTech Startups: Emerging companies in the health tech space need data to develop innovative solutions and improve healthcare delivery. These startups are always looking for more data to test their MVPs and to make regular, accurate adjustments to their data analytics pipelines, and for this HealthData.ai might be a relevant service to use.

Prioritizing the User Segmentation: To prioritize user segments, we will consider two main metrics: the level of impact on users and the market size/scalability of the user base. This will outline the two main metrics to look out for when trying to drive acquisition for the users who can get a positive impact from our product, as well as a scalability perspective to grow and launch new feature sets with a dedicated set of users.

Based on these metrics, we will prioritize the following two user segments:

  • Medical Researchers: Medical researchers have a high level of impact, as they play a crucial role in advancing medical knowledge and healthcare practices. Providing them with accurate and accessible data can significantly accelerate medical research and innovation. Additionally, the market for medical researchers is vast, as research institutions and professionals worldwide rely on data for their work; the estimated market size of this segment in the United States alone is around 160,000 users, on top of the impact they provide to their end users and the respective markets they serve.
  • Pharmaceutical Companies: Pharmaceutical companies have a substantial impact on users, as their research and drug development efforts directly influence public health and well-being. Access to comprehensive, high-quality data is essential for their operations. Furthermore, the pharmaceutical industry is a massive market with significant scalability potential, making it an attractive target for our product's expansion.

Now that we have narrowed down to two sets of user segments: Medical Researchers and Pharmaceutical Companies based on the level of impact and scale, we can go ahead and define some of the concerning pain points these users might face in aggregating this data and completing their user journeys.

Pain Points

Now that we have prioritized our user segments down to medical researchers/scientists and pharmaceutical companies, let's talk about the relevant pain points these users might face in the world of data and manual user actions:

1. Pain Points for Medical Researchers:

  • Data Fragmentation: Medical researchers often deal with fragmented healthcare and life sciences data from various sources, including hospitals, research institutions, and public health organizations. A main struggle for these users is to be able to store this data into some sort of a central repository and still have each of them be different and well organized.
  • Data Quality: Ensuring the quality and accuracy of data is a significant challenge. Medical researchers require high-quality, error-free data to draw reliable innovations and conclusions from their studies. This directly impacts the end users who are treated based on data-driven decisions; we want to make medical decisions backed by data far more accurate and reliable.
  • Time-Consuming Data Preparation: Data preparation can be a time-consuming and labor-intensive process involving data cleaning, transformation, and integration. Researchers can spend considerable time first collecting the data, then preparing it, and then storing it for access. This eats into the valuable time they could spend on data-driven decisions and diverts it toward merely preparing and storing this intensive amount of data.
  • Data Accessibility: Researchers may face difficulties in accessing specific datasets, particularly if the data is spread across different sources, formats, or is not readily available. Data accessibility is crucial for their work.
  • Data Security, Governance and Compliance: Handling sensitive patient data comes with strict security and compliance requirements. Researchers want to not only make the data accessible but segregated, like containers in an ECS cluster, open to scale but separate from each application. In the same way, we want to be able to promote data security, compliance and governance, and be more in control of who can access it and who cannot.
  • Scalability of Existing Solutions Requires Major Changes/Additions: As research projects grow, researchers need scalable solutions to handle increasingly large datasets efficiently, which is currently a pain point. These users might be working with tightly coupled architectures/pipelines that make changes to data repositories difficult, so we want to make this loosely coupled, more accessible, and independent of the surrounding data sources.

2. Pain Points for Pharmaceutical Organizations:

  • Data Silos: Pharmaceutical companies often encounter data silos, where data is stored in different locations, siloed from each other, making it a challenge to be able to retrieve this data quickly and in an efficient manner.
  • Regulatory Compliance: The pharmaceutical industry is highly regulated, and companies must adhere to strict data compliance and reporting requirements. Ensuring data is compliant can be a significant challenge.
  • Clinical Trial Data Management: Pharmaceutical companies conduct clinical trials that generate vast amounts of data. Managing, analyzing, and aggregating this data for decision-making is complex and time-consuming.
  • Drug Discovery: Data is crucial in drug discovery and development. Companies require comprehensive data to identify potential drug candidates, assess their effectiveness, and monitor safety concerns.

Prioritizing the Pain Points
Let’s focus on prioritizing two pain points per user segment as follows:

Prioritizations for medical scientists:

  • Data Fragmentation: Data fragmentation is a significant challenge for medical researchers. Researchers often have to collect data from various healthcare sources, making it challenging to aggregate and use effectively. They need a solution that can centralize and integrate data from multiple sources for seamless analysis.
  • Time-Consuming Data Preparation: Data preparation can be time-consuming and tedious, involving cleaning, transformation, and integration. Researchers would greatly benefit from tools or platforms that automate these processes, allowing them to focus more on the actual research and analysis rather than data wrangling.

Prioritizations for pharmaceutical organizations:

  • Data Silos: Pharmaceutical companies often face data silos where valuable data is stored in different systems or departments. These silos hinder data accessibility and collaboration, making it challenging to gain a holistic view of information. They require a solution that breaks down these data silos and facilitates seamless data sharing.
  • Regulatory Compliance: Maintaining compliance with stringent regulations is a constant challenge for pharmaceutical organizations. The complexity of regulatory requirements and the potential legal consequences of non-compliance make this a high-priority pain point. They need tools and systems that can automate and streamline compliance processes.

Solutions: HealthData.ai

Now that we have focused on the two user segments, defined the product, business goals and have prioritized pain points, let’s focus on some of the solutions that might be potential feature sets of our product as we launch our MVP:

Solutions for Medical Researchers:

  • Unified Data Integration Platform (GLUE + DATAZONE): Develop a comprehensive data integration platform that allows medical researchers to connect, collect, and centralize healthcare and life sciences data from various sources. This platform should include connectors to popular healthcare data systems and sources, for different groups on the platform to be able to subscribe to data, access it and be able to use it for their personal use cases.
  • Generative AI Data Preparation (BEDROCK + SAGEMAKER): Implement automated data preparation tools that can clean, transform, and structure the collected data. Here, we can implement various fine-tuned machine learning models or RAG-powered Bedrock models such as Claude-V2.
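As a rough sketch of the Bedrock half of the second bullet, the snippet below shows how a Claude-V2 text-completion call might normalize a messy record during automated data preparation. The helper names, the JSON field names, and the region are illustrative assumptions, not a finished pipeline:

```python
import json


def build_cleanup_prompt(raw_record: str) -> str:
    """Wrap a raw record in a Claude v2 text-completion prompt that asks
    the model to normalize it into structured JSON (hypothetical schema)."""
    return (
        "\n\nHuman: Normalize the following patient record into JSON with "
        "the keys 'name', 'age', and 'diagnosis'. Record:\n"
        f"{raw_record}\n\nAssistant:"
    )


def clean_record(raw_record: str, region: str = "us-east-1") -> str:
    """Send one raw record to Claude v2 on Bedrock and return the model's text."""
    import boto3  # imported here so the prompt helper stays dependency-free
    client = boto3.client("bedrock-runtime", region_name=region)
    body = json.dumps({
        "prompt": build_cleanup_prompt(raw_record),
        "max_tokens_to_sample": 512,
        "temperature": 0.0,  # keep output deterministic for data preparation
    })
    response = client.invoke_model(
        modelId="anthropic.claude-v2",
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["completion"]
```

In a real pipeline this would run per-record inside a Glue or SageMaker Processing job; here it is just the minimal shape of the Bedrock call.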

Solutions for Pharmaceutical Organizations:

  • Data Silo Breakdown (DATAZONE): Create a platform that integrates data from different departments within pharmaceutical organizations, breaking down data silos. This platform should provide role-based access controls in order to ensure governance and security/authority of the data.
  • Regulatory Compliance Management (LAKE FORMATION): Develop a compliance management module that automates regulatory reporting and ensures data is always in compliance with industry regulations.

Other Relevant Features for Both User Segments for HealthData.ai MVP:

  • Advanced Analytics and Insights: Provide advanced analytics and visualization tools that enable users to derive meaningful insights from the aggregated data. This can include machine learning models for predictive analytics and data-driven decision-making.
  • User-Friendly Interface: Create a user-friendly and interactive web interface that allows users to easily navigate, search, and access the data. Intuitive data exploration and reporting tools should be part of the interface.
  • APIs and Integration: Offer APIs that allow seamless integration with existing healthcare or pharmaceutical systems. This makes it easy for users to connect HealthData.ai with their existing workflows.
  • Data Augmentation with Generative AI: Leverage generative AI models, such as RAG (Retrieval-Augmented Generation), to automate the augmentation of data. This can assist in generating missing data points or simulating scenarios for research and development.
  • Scalability and Performance: Design the system to be highly scalable, capable of handling large datasets as researchers or organizations grow. Utilize cloud-based infrastructure to ensure performance and scalability.
  • Support and Training: Provide comprehensive support, documentation, and training resources to assist users in maximizing the value of HealthData.ai.
  • Feedback Iteration/Loop: Develop a feedback loop with the end users of our product so we can communicate and make regular updates to the design and the HealthData.ai product infrastructure.

Solution Walkthrough: Code + AWS Architecture

Approach: AWS Solution Service Breakdown

We will approach this by breaking the solution down into the major services that address the pain points HealthData.ai aims to solve, followed by minor microservice-based architectures, a full architecture, some innovations in how services could be used out of the box with generative AI, and lastly some code samples.

Approach: Creating a Unified Data Aggregation/Integration Platform Utilizing Amazon DataZone — Optimizing for a Data Silo Breakdown

  • DataZone’s Role: DataZone can help in creating the foundation for a unified data integration platform by serving as a central data repository. It provides a secure and scalable data lake where data from various sources can be stored, organized, and made accessible. DataZone’s ability to handle healthcare and life sciences data, including those from different formats and sources, is instrumental in aggregating and centralizing the data. It provides a dedicated space to subscribe to, access, and use data for different user groups, facilitating a unified platform.
  • Data Silo Breakdown DataZone: DataZone is instrumental in breaking down data silos within pharmaceutical organizations. It offers data consolidation and storage capabilities, which allow data from different departments to be integrated into a central repository. With role-based access controls provided by DataZone, governance and security are ensured, as different roles within the organization can access data with appropriate permissions.
Now, we can take a look at some of it in action.

STEP 1: Create multiple data accessibility projects integrated with your data lake/different sources:

  • Create a DataZone domain for the portal where you can access all of the data in different sectors for medical research material:

Once you have created the domain, open the data portal:

  • Here we can browse and create a catalogue with all of the metadata from various data sources for aggregation purposes, giving us a central repository data lake solution for our HealthData.ai use case.
  • Create separate project categories for various medical data types. Assign rules and have specific users in different sectors of the medical organization subscribe and get approved to use only specific data based on their use case (much like IAM roles within AWS).
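The per-data-type project setup above could also be scripted. The sketch below assumes a recent boto3 with the `datazone` client and an already-created domain; the naming convention and function names are hypothetical:

```python
def project_names_for(domain_name: str, data_types: list) -> dict:
    """Map each medical data type to a DataZone-friendly project name
    (hypothetical naming convention)."""
    return {
        dt: f"{domain_name}-{dt.lower().replace(' ', '-')}"
        for dt in data_types
    }


def create_medical_projects(domain_id: str, data_types: list) -> None:
    """Create one DataZone project per medical data type (sketch only;
    requires appropriate IAM permissions and an existing domain)."""
    import boto3  # lazy import so the naming helper runs without AWS deps
    datazone = boto3.client("datazone")
    for dt, name in project_names_for("healthdata", data_types).items():
        datazone.create_project(
            domainIdentifier=domain_id,
            name=name,
            description=f"Project for {dt} data",
        )
```

For example, `project_names_for("healthdata", ["Clinical Trials", "Genomics"])` yields one project name per data type, which keeps subscriptions and approvals scoped per use case.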

Let's imagine we have different data sources linked to different kinds of projects based on the type of medical use case.

  • Once you have gone over all of this, you can view the data that has been subscribed to and the data that is approved by specific medical organizations within the company:

Feel free to open and double-check the metadata regarding medical research information in Athena, as given below:

Generative AI Potential Extension

To make this scalable and even more readily accessible, we can in the future focus on creating a multi-tenant chatbot using Amazon DataZone, considering:

  1. Each project acts as a knowledge base, specifically related to different sorts of medical data
  2. An agent (Bedrock or LangChain) can be attached to a specific project, getting access to the data coming in → these agents might have certain tasks to perform or roles to follow within a given project
  3. All agents report back to a common interface acting as a multi-tenant chatbot, which requires credentials so that specific users gain access only to the data they have subscribed to and been approved for. This creates a secure generative AI pipeline that uses RAG (Retrieval-Augmented Generation) on each sort of metadata in each project, converts it into embeddings, and surfaces it based on the customer query.
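The access-control piece of step 3 can be sketched in plain Python: a hypothetical router that dispatches a query to a project's agent only when the user's approved subscriptions include that project (the agent callables here stand in for Bedrock/LangChain agents):

```python
class AccessDenied(Exception):
    """Raised when a user queries a project they are not approved for."""


def route_query(user_subscriptions, project, query, agents):
    """Dispatch `query` to the agent backing `project`, enforcing a
    DataZone-style subscription approval check first.

    user_subscriptions: set of project names the user is approved for
    agents: mapping of project name -> callable(query) -> answer
    """
    if project not in user_subscriptions:
        raise AccessDenied(f"Not subscribed/approved for project '{project}'")
    return agents[project](query)
```

In the real system, `agents` would map each DataZone project to its RAG chain, and `user_subscriptions` would be derived from the portal's approved subscriptions rather than passed in directly.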

A quick rough structure of what this solution might look like is as follows:

TRIGGER EVENT BRIDGE: LLM FOR EVENTS WITHIN DATAZONE

For now, you can trigger an event pattern through EventBridge as a rule that gets invoked whenever there is an event within the DataZone portal. This can be used to track the number of subscribers to a project, how many times someone has unsubscribed, and events such as the following:

  • Domain Creation Failed
  • Domain Deletion Succeeded
  • Domain Deletion Failed
  • Space Deployment Started
  • Space Deployment Completed
  • Space Deployment Failed
  • Project Creation Succeeded
  • Project Member Addition Succeeded
  • Project Member Removal Succeeded
  • Project Member Role Change Succeeded
  • Space Deployment Customer Workflow Initiated
  • Subscription Request Created
  • Subscription Request Accepted
  • Subscription Request Rejected
  • Subscription Request Deleted
  • Subscription Created
  • Subscription Revoked
  • Subscription Cancelled
  • Subscription Grant Requested
  • Subscription Grant Completed
  • Subscription Grant Failed
  • Subscription Grant Revoke Requested
  • Subscription Grant Revoke Completed
  • Subscription Grant Revoke Failed
  • Asset Added To Inventory
  • Asset Added To Catalog
  • Asset Schema Changed
  • Business Name Generation Succeeded
  • Business Name Generation Failed
  • Data Source Created
  • Data Source Updated
  • Data Source Run Triggered
  • Data Source Run Succeeded
  • Data Source Run Failed
  • Asset publishing succeeded
  • Asset publishing failed

View: https://docs.aws.amazon.com/datazone/latest/userguide/working-with-events-and-notifications.html

To create a chatbot that tracks events within the DataZone portal and stays up to date with project and data activity, first create an event rule as follows:

{
  "source": ["aws.datazone"],
  "detail-type": [{
    "anything-but": ["AWS API Call via CloudTrail", "AWS Service Event via CloudTrail"]
  }]
}
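This event pattern can also be registered programmatically. Below is a sketch using the EventBridge `put_rule`/`put_targets` APIs; the rule name, target ID, and Lambda ARN are placeholders, and you would additionally grant EventBridge permission to invoke the function (e.g. with `lambda.add_permission`):

```python
import json

# Same pattern as above: match all DataZone events except CloudTrail noise.
DATAZONE_EVENT_PATTERN = {
    "source": ["aws.datazone"],
    "detail-type": [{
        "anything-but": ["AWS API Call via CloudTrail",
                         "AWS Service Event via CloudTrail"]
    }],
}


def register_datazone_rule(lambda_arn: str, rule_name: str = "datazone-events") -> None:
    """Create the EventBridge rule and point it at our Lambda (sketch only)."""
    import boto3  # lazy import: the pattern itself needs no AWS dependency
    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(DATAZONE_EVENT_PATTERN),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "datazone-to-s3", "Arn": lambda_arn}],
    )
```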

Once this rule fires, invoke a Lambda function to store the events in an S3 bucket:

import json
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    try:
        # Extract data from the EventBridge event
        domain_name = event['detail']['domainName']
        # Define the S3 bucket and object key
        bucket_name = 'datafromdatazone'
        object_key = f'events/{domain_name}/{context.aws_request_id}.json'
        # Upload event data to S3
        s3.put_object(Bucket=bucket_name, Key=object_key, Body=json.dumps(event))
        return {
            'statusCode': 200,
            'body': 'Event data saved to S3 successfully.'
        }
    except Exception as e:
        print(f"Error: {str(e)}")
        return {
            'statusCode': 500,
            'body': 'Error saving event data to S3.'
        }

Once you have stored all of the events from EventBridge in S3, you can create a chatbot using Amazon Lex and Amazon Kendra to go over all of the data you have stored. This can be helpful for keeping track of activity across an organization's use of DataZone. Some of the code for invoking an LLM (using Bedrock) is as follows:

1. Prepare the event data from DataZone and create embeddings:

import numpy as np
import boto3
from langchain.embeddings import BedrockEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFDirectoryLoader

# Bedrock embeddings client (Titan embeddings by default)
bedrock_embeddings = BedrockEmbeddings(client=boto3.client("bedrock-runtime"))

# Directory containing the exported event documents
loader = PyPDFDirectoryLoader("./event_data/")
documents = loader.load()
# In our testing, recursive character splitting works better with this data set
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)

2. Inspect a sample embedding:

sample_embedding = np.array(bedrock_embeddings.embed_query(docs[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

3. Store the event data in a vector DB:

from langchain.vectorstores import FAISS
from langchain.indexes.vectorstore import VectorStoreIndexWrapper

vectorstore_faiss = FAISS.from_documents(
    docs,
    bedrock_embeddings,
)
wrapper_store_faiss = VectorStoreIndexWrapper(vectorstore=vectorstore_faiss)
vectorstore_faiss.save_local("faiss_index")

Prompt-engineer Bedrock to return details on event data from DataZone:

from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA

prompt_template = """You will be acting as an event-driven notifier. Use the following [instructions] to answer the [question] based on the [context] provided. Don't answer if the answer is not present in the [context]. Follow the [output_format] given below while responding.
context = {context}  // context is the event data extracted
instructions = Use the following instructions to answer the question above. Make sure to include the following [attributes] in your answer as applicable.
- List all events given in the context
- List the events that occurred the most number of times within the portal
- List the events that occurred the least number of times within the portal
question = {question}
output_format = Provide your output as a detailed text paragraph that contains all [attributes] and follows all [instructions] above
- Each and every sentence is complete, ending with a full stop
answer: """

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
# llm_titan is a Bedrock Titan text LLM initialized earlier
qa = RetrievalQA.from_chain_type(
    llm=llm_titan,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(
        search_type="similarity", search_kwargs={"k": 9}
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)
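With the chain in place, querying it is a one-liner. The small helper below is hypothetical (not part of the LangChain API); it accepts any callable with the `RetrievalQA` input/output shape, which also makes it easy to test with a stub:

```python
def ask_event_bot(qa_chain, question: str):
    """Query a RetrievalQA-style chain and return the generated answer plus
    how many event chunks were retrieved. `qa_chain` is any callable taking
    {"query": ...} and returning {"result": ..., "source_documents": [...]}."""
    result = qa_chain({"query": question})
    return result["result"], len(result.get("source_documents", []))
```

For example, `ask_event_bot(qa, "Which subscription events occurred most often?")` would return the model's paragraph summary of the stored DataZone events along with the number of retrieved chunks.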

Conclusion

The extent of data aggregation and generative AI is not limited, and as technologies evolve, aggregation of data (whether structured, unstructured, files, and so on) combined with generative AI will remain a strong component. An extension of this might be: how can we automate this process with dynamically changing data from a data aggregator?

LinkedIn: https://www.linkedin.com/in/madhur-prashant-781548179/
Github: https://github.com/madhurprash
