GenAI assisting with healthcare coding extraction to calculate appropriate patient risk scores for better reimbursements

Snowflake Cortex AI to uncover healthcare coding insights

In collaboration with Mihir Bhojani, Senior Solutions Engineer-Snowflake & Brian Baillod, Senior Sales Engineer-Snowflake

Business Context
In healthcare, accurately calculating patient risk scores is critical for better reimbursements. Providers must balance high-quality care with financial stability, making precise risk scores essential. These scores, derived from demographic and clinical factors, determine a patient’s financial risk. Accurate assessment leads to better decisions on resource allocation, treatment plans, and care management, improving patient outcomes and resource efficiency.

Coding extraction plays a vital role by ensuring that all relevant patient data is captured and properly coded. With widespread electronic health records (EHRs) and a focus on data interoperability, providers and payers have access to vast information. However, effective data extraction and interpretation require specialized knowledge and tools. Accurate coding helps healthcare organizations identify key risk factors and assign appropriate risk scores.

Opportunity

By leveraging GenAI techniques, healthcare organizations can streamline the coding extraction process and improve the accuracy of patient risk score calculations.

This blog explores how GenAI can extract codes from unstructured patient transcripts and surface missing diagnosis codes, together with their corresponding HCC codes, for human coders to review and qualify for patient risk score calculations. The implementation uses Snowflake Cortex AI and Streamlit.

Implementation Overview

Figure1: GenAI assisting in Coding Extraction: Process Flow

Step 1: Parse the medical transcripts, extract their text, and load it into a Snowflake table

As a first step, the medical transcripts need to be parsed, split into smaller chunks, and loaded into a Snowflake base table. The detailed steps are as follows.

Load the medical transcripts into a cloud storage bucket. In this case, scrubbed medical transcripts downloaded from publicly available sources on the internet are loaded into an S3 bucket, and a storage integration for that S3 bucket is then created in Snowflake. Please refer to the Snowflake documentation on how to create an S3 storage integration.

Create an external stage in Snowflake (steps here). We can now access these PDFs in Snowflake as shown in Figure2.

Figure2: PDF Documents in External Storage accessed in Snowflake

Now that the medical transcripts are in an external stage and accessible within Snowflake, we have to read those PDFs, extract their contents, and break the contents into smaller chunks if necessary. For that, we can create a Python function that uses PyPDF2 to extract the contents and a LangChain text-splitting function to chunk them as needed. Deploy this Python function as a User Defined Table Function (UDTF) on Snowflake that takes a document URL as input and returns the document URL and the extracted text.
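To make the extraction step more concrete, here is a minimal sketch of what such a UDTF handler could look like. It is not the code used for this blog: the class name PdfTextChunker, the helper chunk_text, and the chunk sizes are all hypothetical, and a simple fixed-size character splitter stands in for the LangChain splitter so that the example stays self-contained.

```python
import io
from typing import Iterator, List, Tuple


def chunk_text(text: str, chunk_size: int = 4000, overlap: int = 400) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters.

    Assumption: this plain splitter is a stand-in for a LangChain text splitter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks


class PdfTextChunker:
    """Hypothetical UDTF handler: yields one (document URL, text chunk) row per chunk."""

    def process(self, file_url: str) -> Iterator[Tuple[str, str]]:
        # Imported here because these packages are only needed in the
        # environment where the UDTF actually runs.
        from PyPDF2 import PdfReader
        from snowflake.snowpark.files import SnowflakeFile

        # Read the staged PDF fully into memory, then parse it.
        with SnowflakeFile.open(file_url, "rb") as f:
            reader = PdfReader(io.BytesIO(f.readall()))

        # extract_text() may return nothing for image-only pages.
        text = " ".join((page.extract_text() or "") for page in reader.pages)
        for chunk in chunk_text(text):
            yield (file_url, chunk)
```

The chunking logic is kept in a separate function so it can be tried locally without a Snowflake session, for example chunk_text("abcdefghij", chunk_size=4, overlap=1). Registering the handler as a UDTF and producing the document URLs it receives are left out of this sketch.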

As a next step, create a table docs_and_text, run the UDTF for all the documents in the external stage, and load the output into docs_and_text. Figure3 shows the contents of that table.

Figure3: Medical Transcript PDFs extracted into a column in Snowflake table

Please note that unstructured patient data is typically gathered from multiple sources, and a data product approach that collates all patient data, including unstructured data, is foundational to deploying AI/ML capabilities at scale. The medical transcripts covered here are one of many sources of a patient’s unstructured health data from which meaningful diagnosis codes for risk scoring can be extracted.

Step 2: Use Snowflake Cortex AI to extract diagnosis codes and suggest HCC codes

Let us put AI to use now! Pass the extracted text content from the medical transcripts as input to the Snowflake Cortex Complete function and ask the LLM to identify ICD10 diagnosis codes. A few considerations:

  1. Model size: pick a model size that performs fairly well at coding extraction
  2. Context window length: unstructured clinical notes and transcripts often run to multiple pages, so choosing a model with a larger context window helps avoid chunking the transcripts
  3. Prompt engineering: experiment with the prompt until the LLM returns the codes in a format that requires little parsing after extraction

For the list of models available in Snowflake Cortex and their context windows, please visit this page. In this case, we picked “reka-flash”, which has a context window of 100K tokens and was found to perform fairly well at extracting ICD10 codes and HCC codes.
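As a rough illustration of the context window consideration, the sketch below estimates whether a transcript plus its prompt is likely to fit into a 100K-token window. The four-characters-per-token ratio is only a rule of thumb and an assumption here, not a property of reka-flash. The function names and the reserved output budget are hypothetical as well.

```python
import math

CONTEXT_WINDOW_TOKENS = 100_000  # context window quoted above for reka-flash
CHARS_PER_TOKEN = 4              # assumption: rough rule of thumb, not model-specific


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(transcript: str, prompt: str,
                    reserved_output_tokens: int = 1000,
                    window: int = CONTEXT_WINDOW_TOKENS) -> bool:
    """Return True if transcript + prompt + reserved output budget fit the window."""
    needed = estimate_tokens(transcript) + estimate_tokens(prompt) + reserved_output_tokens
    return needed <= window
```

A transcript that fails this check is a candidate for chunking. For exact numbers, Snowflake Cortex documents a COUNT_TOKENS function, which would be the more reliable choice wherever it supports the selected model.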

Create a new table with the identified ICD10 codes and HCC codes using SQL like the statement below. Here, we wanted to push the LLM and asked it to identify HHS-based HCC codes as well as CMS-based HCC codes, based on the latest version.

-- Ask the LLM for ICD10 codes per transcript, then for CMS and HHS HCC codes per ICD10 result
create or replace table docs_hcc_coding as
select relative_path,
doc_text,
SNOWFLAKE.CORTEX.COMPLETE('reka-flash', doc_text || ' Given this medical transcript, list major ICD10-CM diagnosis code in this format ONLY: X##.# (Diagnosis description). Do not provide explanation') as AI_ICD10_Code,
-- AI_ICD10_Code from the line above is reused as input for the two HCC prompts
SNOWFLAKE.CORTEX.COMPLETE('reka-flash', AI_ICD10_Code || ' List the CMS HCC code version 28 for this ICD10 code in format: HCC-XYZ (HCC Category Description). Do not provide explanation') as AI_CMS_HCC_Code,
SNOWFLAKE.CORTEX.COMPLETE('reka-flash', AI_ICD10_Code || ' List the HHS HCC code version 28 for this ICD10 code in format: HCC-XYZ (HCC Category Description). Do not provide explanation') as AI_HHS_HCC_Code
from docs_and_text
;

The resulting table looks like this:

Figure4: Output of extracted codes using GenAI in Snowflake Table

Please note that the accuracy of these extracted codes has not been verified by coders. The intention here is to have GenAI do the heavy lifting of extracting the codes and present them to coders, who verify them and qualify the right ones for risk score calculation. This is a human-in-the-loop application of GenAI, geared towards increasing the productivity of the coders who identify these missing codes from patient data.
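Before the codes reach a coder, the free-text LLM response usually has to be turned into structured rows. The following is a minimal sketch of that parsing, assuming the model follows the "X##.# (Diagnosis description)" format requested in the prompt. The names ExtractedCode and parse_icd10_response are hypothetical, and the regular expression only checks the general shape of an ICD-10-CM code, not whether the code actually exists.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExtractedCode:
    code: str
    description: str


# Shape check only: a letter, a digit, an alphanumeric character, then an
# optional dot with 1-4 alphanumerics, followed by "(description)".
ICD10_PATTERN = re.compile(
    r"\b([A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?)\s*\(([^()]+)\)"
)


def parse_icd10_response(response: str) -> List[ExtractedCode]:
    """Extract unique (code, description) pairs from an LLM response, in order."""
    seen = set()
    results = []
    for code, description in ICD10_PATTERN.findall(response):
        if code not in seen:
            seen.add(code)
            results.append(ExtractedCode(code, description.strip()))
    return results


sample = (
    "1. E11.9 (Type 2 diabetes mellitus without complications)\n"
    "2. I10 (Essential hypertension)\n"
    "3. E11.9 (Type 2 diabetes mellitus without complications)"
)
for item in parse_icd10_response(sample):
    print(item.code, "-", item.description)
```

Anything the pattern does not match is silently dropped, so a real pipeline would probably also keep the raw response alongside the parsed rows for the coder to inspect.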

Step 3: Build and deploy a Streamlit app that presents the identified ICD10 diagnosis codes and HCC codes for coders to review, so they can qualify missing diagnosis codes for risk score calculation

Once the extracted codes are loaded into a Snowflake table as described in the previous step, the next step is to create a Streamlit app that lets coders view the extracted ICD10 codes and HCC codes and do any additional analysis required to validate them. As part of the validation process, coders can be allowed to ask the LLM further questions in the context of a particular patient transcript to help them reach a decision. Once a decision is made, coders can either submit the validated codes as input to the risk scoring model or reject them and document the reason, which can be captured in a backend table for future reference and auditing.
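The accept-or-reject step described above could be sketched roughly as follows. This is an illustration rather than the app shown in this blog: ReviewDecision, build_review_decision, and render_review_form are hypothetical names, and writing the decision to a Snowflake table is deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ReviewDecision:
    relative_path: str   # transcript the code was extracted from
    icd10_code: str
    decision: str        # "accepted" or "rejected"
    reason: str
    reviewer: str
    reviewed_at: str     # ISO 8601 timestamp in UTC


def build_review_decision(relative_path: str, icd10_code: str, accept: bool,
                          reviewer: str, reason: str = "") -> ReviewDecision:
    """Create an audit record; a rejection must carry a documented reason."""
    if not accept and not reason.strip():
        raise ValueError("A rejection needs a documented reason for auditing.")
    return ReviewDecision(
        relative_path=relative_path,
        icd10_code=icd10_code,
        decision="accepted" if accept else "rejected",
        reason=reason.strip(),
        reviewer=reviewer,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )


def render_review_form(relative_path: str, icd10_code: str,
                       reviewer: str) -> Optional[ReviewDecision]:
    """Show a review form for one code; return the decision once submitted."""
    import streamlit as st  # imported here so the logic above runs without Streamlit

    st.subheader(f"Review {icd10_code} from {relative_path}")
    with st.form(key=f"review-{relative_path}-{icd10_code}"):
        choice = st.radio("Decision", ["Accept", "Reject"])
        reason = st.text_area("Reason (required when rejecting)")
        submitted = st.form_submit_button("Submit")

    if not submitted:
        return None
    try:
        decision = build_review_decision(
            relative_path, icd10_code, choice == "Accept", reviewer, reason
        )
    except ValueError as err:
        st.error(str(err))
        return None
    st.success(f"Code {icd10_code} {decision.decision}.")
    return decision
```

Keeping the validation in build_review_decision, separate from the UI code, means the rule that every rejection carries a reason can be checked without running the app.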

Please refer to the documentation here for further information about Streamlit and how to stand up a simple Streamlit application like the one below.

Figure5: Streamlit App showing the extracted ICD10 codes using Snowflake Cortex & reka-flash LLM
Figure6: Streamlit app showing HCC codes for extracted ICD10 codes

Conclusion

Integrating GenAI into healthcare coding extraction has the potential to substantially improve patient risk score calculations. By identifying and extracting ICD10 and HCC codes from unstructured patient transcripts, with human-in-the-loop validation, healthcare providers can work towards more precise reimbursements and better patient care. This approach streamlines the coding process, can help reduce errors, and benefits both providers and patients. As GenAI continues to evolve, its impact on healthcare is likely to include improved efficiency, cost savings, and better patient outcomes.
