Processing HL7 V2.x Messages using Snowflake

Contributors: Venkat Sekar, Murali Gandhirajan, Tushar Sarde, Karuna Nadadur

In the previous blog, we discussed architectural patterns for ingesting and processing the different HL7 standards. This blog outlines how to process HL7 V2.x messages natively on Snowflake using its data programmability and extensibility features. We will use Snowflake user-defined functions (UDFs) for Java and Python to parse HL7 V2.x messages and load the parsed data into tables for consumption.

Why Snowflake for HL7 V2.x processing?

Java or Python functions let you transform and augment your data using custom logic that runs right next to the data, with no need to manage a separate compute cluster or move the data out for processing. Users can access the functions as if they were built into Snowflake. More importantly, administrators can rest easy, as data never leaves Snowflake.

The picture below encapsulates the overall flow of the HL7 V2.x message-parsing data pipeline. The open source libraries HAPI and HL7apy are used inside Java UDFs/UDTFs and Python UDFs, respectively, to parse the HL7 V2.x messages.

HL7 V2.x Parsing Using Java UDTF

Tabular UDFs, also called table UDFs or, most frequently, UDTFs (user-defined table functions), are user-defined functions that return a set of zero, one, or many rows, each with one or more columns. Using Java, you can build a UDTF that leverages the open source HAPI libraries to parse HL7 V2.x messages. Snowflake's functionality for processing unstructured data using Java UDFs and UDTFs is in public preview. HL7 messages can be accumulated into a file, read in batch mode, and parsed by the Java UDTF into JSON that is loaded into VARIANT columns. The overall flow of the process is depicted in the picture below.

The Java UDTF code for parsing HL7 V2.x messages can be found on the Snowflake Labs GitHub. A detailed step-by-step implementation of the above process is outlined in section 5 of the quickstart guide Getting Started — Processing HL7 V2 Messages with Snowflake. Similar to the Java UDTF, a Python UDTF can be created by leveraging open source libraries such as HL7apy to parse HL7 V2.x messages. Unstructured file access from Python UDFs and UDTFs is currently in private preview and coming soon.
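
To make the Python side concrete, below is a minimal, hypothetical sketch of a Python UDTF handler that reads a staged HL7 batch file and emits one JSON document per message using HL7apy. It is not the quickstart's implementation; the handler name, the crude message splitting, and the simplified JSON shape are assumptions, and it relies on the file-access feature and on the hl7apy package being available to the UDTF environment.

```python
# Hypothetical Python UDTF handler: one staged HL7 batch file in,
# one JSON document per HL7 message out. Assumes hl7apy is available
# and that Python UDTF file access (private preview) is enabled.
import json

from hl7apy.parser import parse_message
from snowflake.snowpark.files import SnowflakeFile


class Hl7FileParser:
    def process(self, file_url: str):
        # Read the whole batch file from the stage.
        with SnowflakeFile.open(file_url, "r") as f:
            content = f.read()

        # HL7 V2.x messages start with an MSH segment; split the batch on it
        # (crude, for illustration) and normalize segment separators to
        # carriage returns, which hl7apy expects.
        raw_messages = ["MSH" + chunk for chunk in content.split("MSH") if chunk.strip()]
        for raw in raw_messages:
            msg = parse_message(raw.replace("\n", "\r").strip())
            # Simplified JSON: segment name -> ER7 text. A real parser would
            # expand every field and handle repeated segments (e.g., multiple OBX).
            parsed = {seg.name: seg.to_er7() for seg in msg.children}
            yield (json.dumps(parsed),)
```

Registered with CREATE FUNCTION ... RETURNS TABLE, LANGUAGE PYTHON, and a HANDLER pointing at the class, such a UDTF could then be called over a list of staged file URLs, mirroring the Java UDTF flow.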

HL7 V2.x Parsing Using Java UDF & Python UDF

Similar to the Java UDTF, Java or Python UDFs can be created to parse HL7 V2.x messages. Unlike a UDTF, however, the number of messages in a file can affect the result: a scalar UDF returns a single value, so if the output exceeds 16 MB the call may fail with an error or return a truncated, unusable result. In addition, even if parsing succeeds without errors, you face an extra step of splitting the parsed HL7 messages out of the single JSON array document the UDF returns.

Our recommendation: if you have a continuous stream of HL7 messages to ingest, use the ingestion mechanism of your choice (Kafka Connect, Kinesis, etc.), load each raw HL7 message into a table in Snowflake as shown below, and then apply a Java or Python UDF to it.
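
As an illustration of this table-based approach, here is a minimal, hypothetical parsing function that takes one raw HL7 message from a column and returns a JSON string, so the 16 MB limit applies per message rather than per file. The function name and the simplified JSON shape are assumptions, and hl7apy is assumed to be available in the UDF environment; a Java UDF using HAPI would follow the same pattern.

```python
# Hypothetical scalar UDF body: one raw HL7 V2.x message in, one JSON string out.
# Assumes the hl7apy package is available in the UDF environment.
import json

from hl7apy.parser import parse_message


def parse_hl7_message(raw_message: str) -> str:
    # hl7apy expects carriage-return segment separators.
    msg = parse_message(raw_message.replace("\n", "\r").strip())

    parsed = {}
    for segment in msg.children:
        # Keep repeated segments (e.g., multiple OBX) as a list of ER7 strings.
        parsed.setdefault(segment.name, []).append(segment.to_er7())
    return json.dumps(parsed)
```

Once registered as a UDF, it can be applied directly to the raw-message column, with the result cast to VARIANT via PARSE_JSON for downstream querying.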

The overall process flow for the Java or Python UDF would look like the picture below.

The Java UDF code and the Python UDF code are also available on the Snowflake Labs GitHub. Note that this UDF implementation does not access files, so you do not have to wait for the unstructured file access feature for Python UDFs and UDTFs, which is currently in private preview. A detailed step-by-step implementation of the above process is outlined in sections 5 and 6 of the quickstart guide Getting Started — Processing HL7 V2 Messages with Snowflake.

HL7 V2.x Parsing Using Snowpark Python

Snowpark allows you to write code in Java, Scala, or Python and execute it directly within Snowflake. Snowpark is built on programming constructs familiar to developers, which often makes it easier to express, complete, and debug data pipelines. You can not only code pipelines with Snowpark, but also create new functions or register existing ones, bringing in custom and third-party libraries. Most importantly, you can execute that code directly within Snowflake without setting up additional systems or spinning up clusters for these workloads. This way you get the performance of Snowflake's elastic engine, the ease of use and security of the Snowflake platform, and the connectivity of the Data Cloud.

HL7 V2.x parsing using Snowpark Python is very similar to the Python UDF implementation. The main difference is that the UDF can be defined programmatically using Snowpark. The high-level steps (sketched in code after this list) are:

  1. Define a Python UDF that reads the file from a stage and parses it using the HL7apy library. The UDF can also be made callable from SQL.
  2. Snowpark then uses the programmatically created UDF to parse the file from the stage.
  3. The UDF output can be saved into a table for further processing.
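
A minimal Snowpark Python sketch of these steps, under stated assumptions, might look like the following. The stage, table, and function names are illustrative, the directory table on the stage is assumed to be enabled, and the hl7apy package is assumed to be resolvable for the UDF; this is not the quickstart's exact code.

```python
# Hypothetical Snowpark sketch: register a file-parsing UDF programmatically
# and persist its output. Stage, table, and function names are illustrative;
# hl7apy availability and Python UDF file access are assumed.
import json

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, parse_json
from snowflake.snowpark.types import StringType

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "role": "<role>", "warehouse": "<warehouse>",
    "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()


def parse_hl7_file(file_url: str) -> str:
    # Imported here so these modules are only required inside the UDF sandbox.
    from hl7apy.parser import parse_message
    from snowflake.snowpark.files import SnowflakeFile

    with SnowflakeFile.open(file_url, "r") as f:
        content = f.read()
    # Crude split on MSH for illustration; real batch handling is more careful.
    messages = ["MSH" + c for c in content.split("MSH") if c.strip()]
    parsed = [
        {s.name: s.to_er7() for s in parse_message(m.replace("\n", "\r").strip()).children}
        for m in messages
    ]
    return json.dumps(parsed)  # one JSON array per file


# Step 1: define the UDF programmatically (it is also callable from SQL by name).
parse_udf = session.udf.register(
    parse_hl7_file,
    name="PARSE_HL7_FILE",
    return_type=StringType(),
    input_types=[StringType()],
    packages=["hl7apy"],  # assumes the package can be resolved for the UDF
    replace=True,
)

# Step 2: have Snowpark apply the UDF to each staged file.
files = session.sql(
    "SELECT BUILD_SCOPED_FILE_URL(@HL7_STAGE, relative_path) AS FILE_URL "
    "FROM DIRECTORY(@HL7_STAGE)"
)
parsed_df = files.with_column("PARSED", parse_json(parse_udf(col("FILE_URL"))))

# Step 3: save the output to a table for further processing.
parsed_df.write.save_as_table("HL7_PARSED", mode="append")
```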

Driving Value from HL7 V2.x Clinical Data

Now that HL7 V2.x clinical data is processed and persisted in a Snowflake table, we can:

  1. Parse each segment in an HL7 V2.x message, store the segments in structured tables, and consolidate this clinical data with other patient data to enable a Patient 360 view.
  2. Parse HL7 message segments and map the data attributes to a common data model such as OMOP for patient cohort analytics and research to improve health outcomes.
  3. Leveraging the AI/ML capabilities within Snowflake through Snowpark, we can build and train machine learning models using stored procedures, fully utilizing Snowflake's compute power. Since the model training workload is pushed down to Snowflake, data never leaves the Snowflake environment, which helps ensure better performance, security, and governance. After training, the model is saved as a model file (e.g., modelname.joblib) in a Snowflake internal stage. The model training and validation output can be persisted in a Snowflake table to measure model performance over time. Using Streams and Tasks, the model can also be retrained on a schedule of your choice. Finally, using the model trained in Snowpark, we can run inference by building a user-defined function (UDF) that accesses the model file in the internal stage. Once the UDF is registered in Snowflake, any external application or analytics interface connected to the Snowflake account can call it through simple SQL queries to get predictions. The picture below outlines this flow, and a minimal code sketch follows this list.
  4. Streamlit, a recent acquisition of Snowflake, is an open-source Python framework built to simplify and accelerate the creation of data applications. Data scientists and machine learning engineers can use Streamlit to build apps the same way they write Python code; no front-end development experience is necessary. With Streamlit, you can turn data scripts into shareable web apps in minutes, all in pure Python. This allows you to quickly prototype and iterate, so you can build better apps on HL7 data faster.
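
To make item 3 concrete, here is a minimal, hypothetical sketch of that train-then-infer pattern with Snowpark: a stored procedure trains a scikit-learn model on a feature table, saves it as a joblib file to an internal stage, and a UDF loads the file for inference. The FEATURES table, MODEL_STAGE stage, column names, and function names are all illustrative assumptions, not the quickstart's code.

```python
# Hypothetical train-then-infer sketch. The FEATURES table, MODEL_STAGE stage,
# column names, and function names are assumptions for illustration only.
from snowflake.snowpark import Session
from snowflake.snowpark.types import FloatType, StringType

session = Session.builder.configs(connection_parameters).create()  # as in the earlier sketch


def train_model(session: Session) -> str:
    # Runs inside Snowflake as a stored procedure, so training data never leaves.
    import joblib
    from sklearn.linear_model import LogisticRegression

    df = session.table("FEATURES").to_pandas()
    model = LogisticRegression().fit(df[["AGE", "LAB_VALUE"]], df["LABEL"])
    joblib.dump(model, "/tmp/model.joblib")
    # Persist the trained model file to an internal stage.
    session.file.put("/tmp/model.joblib", "@MODEL_STAGE", auto_compress=False, overwrite=True)
    return "trained"


def predict_risk(age: float, lab_value: float) -> float:
    # Runs inside Snowflake as a UDF; loads the staged model from the import directory.
    import sys
    import joblib

    import_dir = sys._xoptions["snowflake_import_directory"]
    model = joblib.load(import_dir + "model.joblib")
    return float(model.predict_proba([[age, lab_value]])[0][1])


session.sproc.register(
    train_model, name="TRAIN_MODEL", return_type=StringType(),
    packages=["snowflake-snowpark-python", "scikit-learn", "joblib"], replace=True,
)
session.call("TRAIN_MODEL")

session.udf.register(
    predict_risk, name="PREDICT_RISK", return_type=FloatType(),
    input_types=[FloatType(), FloatType()],
    imports=["@MODEL_STAGE/model.joblib"],
    packages=["scikit-learn", "joblib"], replace=True,
)
# Once registered, any SQL client connected to the account can call:
#   SELECT PREDICT_RISK(AGE, LAB_VALUE) FROM FEATURES;
```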

Example Use Case

Once the data is parsed and persisted in Snowflake, we can use Snowflake's JSON handling capabilities to work with individual segments of the HL7 message and leverage the information in each segment for effective patient-level analytics. For example, the OBX (Observation) segment in HL7 holds valuable patient information (see the picture below). A typical patient report text in the OBX segment's Observation Value element¹ contains the following components:

1. Clinical History — at the report level
2. Nature of Specimen — at the specimen level
3. Gross Pathology (~ blocks; AP; size, shape, color) — at the specimen level
4. Microscopic Pathology (FISH; cell details) — at the specimen level
5. Text Diagnosis — at the report level
6. ICD9 code — at the report level
7. Comment Section — at the report level

Source: https://www.naaccr.org/wp-content/uploads/2016/11/Standards-Volume-V-Summary-Table-for-HL7-Version-2.3.1-January-2014.pdf

We can read the patient report text using Snowflake's JSON handling capabilities and run Python NLP models natively on Snowflake to extract key entities from these texts, identify additional diagnosis values that point to potential comorbid conditions, flag gaps in coding for better financial reimbursement, populate disease registries, and identify the potential next best action based on patients' clinical information.
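
As a small illustration of that pattern, the sketch below pulls the report text out of the parsed OBX segment and registers a toy entity extractor in place of a full NLP model. The HL7_PARSED table, the JSON path to OBX-5, and the regex-based extract_icd9_codes function are all assumptions; in practice the path depends on how your parser structured the JSON, and a real NLP model would replace the regex.

```python
# Hypothetical sketch: read OBX-5 report text from the parsed JSON and extract
# candidate ICD-9 codes. The table name, JSON path, and regex are illustrative.
import re
from typing import List

from snowflake.snowpark import Session
from snowflake.snowpark.functions import col
from snowflake.snowpark.types import ArrayType, StringType

session = Session.builder.configs(connection_parameters).create()  # as in the earlier sketch


def extract_icd9_codes(report_text: str) -> List[str]:
    # Stand-in for an NLP model: pick up ICD-9-looking codes such as 250.00.
    return re.findall(r"\b\d{3}\.\d{1,2}\b", report_text or "")


extract_udf = session.udf.register(
    extract_icd9_codes, name="EXTRACT_ICD9_CODES",
    return_type=ArrayType(StringType()), input_types=[StringType()], replace=True,
)

obx = (
    session.table("HL7_PARSED")
    # The JSON path is an assumption about the parser's output shape.
    .select(col("PARSED")["OBX"][0]["OBX_5"].cast(StringType()).alias("REPORT_TEXT"))
    .with_column("CANDIDATE_CODES", extract_udf(col("REPORT_TEXT")))
)
obx.write.save_as_table("OBX_ENTITIES", mode="overwrite")
```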

Summary

In summary, Snowflake’s data programmability and extensibility features enables customers to easily process HL7 V2.x messages using their language of choice such as Java or Python. Once the Clinical data lands inside Snowflake, you can derive the value of data by leveraging AI/ML capabilities of Snowflake and extend it to build data applications. In the next blog, we will go deeper into some of the new Snowflake features such as Native Apps framework and discuss how it can be leveraged for HL7 data formats processing and extend the capabilities beyond organizations through Snowflake Data Collaboration.

Source:
¹ HL7 v2.3 — OBX — Observation segment

