Snowflake
Published in

Snowflake

HL7 Ingestion and Processing Architectural Patterns with Snowflake

This blog outlines the architectural patterns of how Snowflake can be leveraged to ingest and process HL7 messages. Health Level Seven International (HL7) is the universal healthcare IT language spoken by the majority of providers, pharmacies, and payers. HL7 provides an interoperability framework and standards for healthcare. Over the period of time, HL7 has released multiple standards and revisions to them. Some of the prominent standards are HL7 V2.x, HL7 CDA (Clinical Document Architecture), and HL7 FHIR (Fast Healthcare Interoperability Resources).

The latest HL7 standard — FHIR is a REST implementation of HL7 with a choice of JSON, XML or RDF for data representation, compatible with the older standards, and easier to use. Though FHIR is gaining adoption in the healthcare industry, HL7 V2.x is still the most widely used and adopted. According to HL7 International, 95% of US healthcare organizations use HL7 V2.x and more than 35 countries have HL7 V2.x implementations¹. Snowflake support for structured, semi-structured, and unstructured data gives healthcare organizations options to ingest and process the different HL7 standard messages simplifying the overall architecture.

Snowflake support for all Healthcare data
Snowflake provides native support for semi-structured data formats, such as JSON, Avro, ORC, Parquet, and XML. HL7 FHIR JSON data representation messages can be natively ingested and stored into Snowflake using VARIANT column data type. Using the FLATTEN function you can query or extract the JSON objects and keys you plan to use and store into separate tables. Snowflake native support for semi-structured data formats includes:

  • Flexible-schema data types (VARIANT) for loading semi-structured data without transformation.
  • Automatic conversion of data to optimized internal storage format.
  • Database optimization for fast and efficient SQL querying.

Unstructured data is information that does not fit into a predefined data model or schema. Snowflake unstructured data support functionality (currently in public preview), makes it possible for organizations to store, govern, catalog, process, and share unstructured data directly within Snowflake. HL7 V2 and HL7 CDA messages are unstructured data categories for Snowflake. Customers can set up governed access to the unstructured data files and can share the raw data securely within the organization or externally.

Ingestion and Processing of HL7 V2.x messages
With the support for unstructured data in Snowflake, customers no longer need to run a processing pipeline of HL7 V2.x outside Snowflake. Now customers can directly store the HL7 V2.x messages in Snowflake, process them using Snowpark to parse HL7 V2.x data, and run analytics on the fly. Snowpark is a new developer framework for Snowflake. It allows data engineers, data scientists, and data developers to code in their familiar way with their language of choice, and execute pipeline, ML workflow, and data apps faster and more securely, in a single platform.

Snowpark opens up data programmability and extensibility. It brings deeply integrated, DataFrame-style programming to the languages developers like to use, and functions to help you expand more data use cases easily, and now all these can be executed inside of Snowflake using the elastic performance engine. Snowpark support starts with Scala API, Java UDFs, and External Functions, and expands to Java & Python.

The below data pipeline architecture diagram illustrates the options of how HL7 V2.x messages can be ingested and processed using Snowflake or Partner tools. Open source API libraries such as HAPI and HL7apy can be leveraged inside Snowpark as Java UDF’s or External functions.

Ingestion and Processing of HL7 FHIR messages
It is very easy and so simple to ingest and process HL7 FHIR files with JSON data representation. HL7 FHIR files can be loaded into a VARIANT data type column in a table and use the FLATTEN function to extract the objects and keys you plan to query into a separate table. The below data pipeline architecture diagram illustrates the options of how you can ingest the files in near real-time using the Kafka Connector or using Snowpipe to auto-ingest from cloud object stores (S3, ADLS, GCS) or Partner tools.

Below is an example of ingesting and processing of a Sample Patient (FHIR Specification v4.0.1) using Snowflake.

-- Create Database,Warehouse,Schema
CREATE DATABASE FHIRDEMO;
USE DATABASE FHIRDEMO;
CREATE WAREHOUSE IF NOT EXISTS FHIR_WH WAREHOUSE_SIZE = 'SMALL' INITIALLY_SUSPENDED = TRUE AUTO_SUSPEND=60;
CREATE SCHEMA RAW;
-- Create External S3 Stage and import the Sample Patient FHIR message
CREATE OR REPLACE STAGE MY_S3_STAGE URL='S3://*****/FHIR/' CREDENTIALS=(AWS_KEY_ID='*****' AWS_SECRET_KEY='******') FILE_FORMAT = JSON_FORMAT;
-- Create Table
create table patient_fhir (json_string variant);
-- Load the FHIR data to the table
copy into patient_fhir from @my_s3_stage;
select * from patient_fhir;-- Access the FHIR JSON data representation with simple SQL commands
select json_string:birthDate from patient_fhir;
select value:value from patient_fhir, lateral flatten(input => json_string:telecom);select value:text::string from patient_fhir, lateral flatten(input => json_string:address);select
json_string:birthDate::string,
TELECOM.value:value::string,
ADDRESS.value:text::string
from patient_fhir,
lateral flatten(input => json_string:telecom) TELECOM,
lateral flatten(input => json_string:address) ADDRESS;

This example clearly demonstrates the beauty of Snowflake's semi-structured data support. One key aspect to observe is that HL7 FHIR specifications change and the underlying JSON data representation changes.

What about ETL code when the specification and the underlying data representation changes?
Well, there is no ETL as you saw in the above example, so there is nothing to break.

What about existing reports?
They keep working too. The queries underlying the reports will work fine. If you want to see the new column, then the SQL needs to be refactored to account for the change. For more information please refer to the two-part blog series written by my colleague Kent Graziano.

Endless Possibilities
Once the HL7 data lands inside Snowflake, the power of the Snowflake Data Cloud opens up the options to address a wide variety of Healthcare Provider Analytics use cases. Whether you are building a new data lake or complementing an existing Data Lake, Snowflake’s cross-cloud platform breaks down silos and enables your data lake strategy.

Snowflake Healthcare Provider Reference Architecture

The raw HL7 V2.x or FHIR messages and the conformed data can be easily and securely shared internally with an organization or beyond the organization boundaries using Snowflake’s Data Sharing. Access to the third-party live data through Snowflake Data Marketplace without the need to copy or transform data enables cross-functional teams to build scalable data preparation, and accelerate Data Science workflows. Snowflake Distributed Data Clean rooms enable Healthcare customers to collaborate without revealing the underlying data and meeting the stricter Healthcare regulations and guidelines.

In summary, Snowflake’s unstructured data capabilities combined with Snowpark data programmability enables customers to ingest, process HL7 messages, and combine that with internal/external data, opening the door for addressing a wide variety of Healthcare use cases. In the next blog coming soon, I will go more deeply into how to process HL7 V2.x messages using HAPI and Snowpark with Java UDF’s.

References:
1 — http://www.hl7.org/implement/standards/product_brief.cfm?product_id=185

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store