Using Azure Cognitive Services with Snowflake for Healthcare NLP

Azure Cognitive Services are a great resource for getting insights from pre-built, complex models in short order. If you’re not familiar with Cognitive Services, they are, effectively, pre-trained models that can be deployed in Azure; you pass data to them via an API call and get back insights or inferences. Azure Cognitive Services fall into one of four buckets: speech, vision, language, or decision. You can read more about them here.

BUSINESS CASE

Free-text physician’s notes are rich with information on cases and patients; however, it takes time to extract the information from the notes, appropriately label the text, and present it for insights. With many Snowflake customers storing information-rich text data in their cloud databases, I thought I would show how to use text data stored in Snowflake to deliver text analytics insights. This blog focuses on Text Analytics for Health to demonstrate how the Language service can be used to extract and organize those insights.

ARCHITECTURE

Before diving into the specifics of the demo let’s take a look at the high level architecture.

  1. Using the Snowflake Python connector, read data from Snowflake into a notebook and parse it.
  2. Pass the data to Cognitive Services for insights, then transform the insights into a data frame.
  3. Write the results to an Azure Storage stage, where they are automatically copied into a Snowflake table.
  4. Build a dashboard on the results data.

Pre-requisites

  1. Azure subscription
  2. Snowflake account
  3. Azure storage account that is set up as an external stage. Details can be found here.

Setup

SNOWFLAKE

Create two tables in Snowflake. The first holds two columns: a patient_id and the free-text physician notes. The second is an empty results table:

create or replace TABLE TEXTANALYTICS.PUBLIC.MYTABLE (
    PAT_ID NUMBER(38,0),
    NOTES VARCHAR
);

create or replace TABLE TEXTANALYTICS.PUBLIC.RESULTS (
    DOC_ID VARCHAR,
    ENTITY VARCHAR,
    CATEGORY VARCHAR,
    NORMALIZED_TEXT VARCHAR,
    PAT_ID VARCHAR,
    NOTES VARCHAR
);

AZURE

Provision a Language service in Azure and make sure you’re using the Standard pricing tier so you can take advantage of all the features of the Language service. You can check out the pricing on the “Pricing Tier” blade to make sure you’re not running up a big Azure bill :-). Head to the “Keys and Endpoint” blade and copy Key 1 and the Endpoint.

Provision a storage account (this can be Blob Storage or ADLS Gen2) and establish a Snowpipe so that when a file lands in the storage account, the data is automatically piped into a Snowflake table. Full details on establishing a Snowpipe can be found here.

create pipe taforhealth_pipe
    auto_ingest = true
    integration = notification_int
    as
    copy into results
    from @azure_text_results_stage
    ON_ERROR = CONTINUE
    file_format = (type = 'csv');
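
Once the pipe exists it’s worth a quick health check. The snippet below assumes the Snowflake connection object con created later in this post; system$pipe_status is a built-in Snowflake function whose output includes whether the pipe is running and listening for notifications:

# returns a JSON string; executionState should read RUNNING
status = con.cursor().execute(
    "select system$pipe_status('taforhealth_pipe')"
).fetchone()[0]
print(status)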

Lastly, provision a resource for writing Python code; you can use Azure Functions, Databricks, or Synapse Spark. For this demo I spun up a Synapse Spark pool.

Pulling the data from Snowflake

This demo leans on the Snowflake Python connector to read and write data from Snowflake tables. The complete Python notebook can be found here. Below are the packages that need to be imported; the commented-out lines show the packages required to set up the environment.

#%pip install "snowflake-connector-python>=2.6.2,<2.7"
#%pip install "snowflake-connector-python[pandas]"
#%pip install snowflake-sqlalchemy
#%pip install azure-ai-textanalytics
import snowflake.connector
import pandas as pd
import requests
import json
import time

From there establish a connection to your Snowflake table and query the data:

# connect to the Snowflake database
con = snowflake.connector.connect(
    user='mmarzillo',
    password='PASSWORD',
    account='sn00102.central-us.azure',
    warehouse='SMALL_WH',
    database='TEXTANALYTICS',
)
# query the table into a pandas data frame
sql = "select * from PUBLIC.MYTABLE"
df_notes = pd.read_sql(sql, con)

# add a doc_id column; the Language service assigns string ids that match these row indices
df_notes['doc_id'] = df_notes.index
df_notes.head()
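
One quick hardening note: avoid hardcoding credentials in the notebook. Here is a minimal sketch that reads them from environment variables instead (the variable names are just examples):

import os

# read credentials from the environment instead of hardcoding them
con = snowflake.connector.connect(
    user=os.environ['SNOWFLAKE_USER'],
    password=os.environ['SNOWFLAKE_PASSWORD'],
    account='sn00102.central-us.azure',
    warehouse='SMALL_WH',
    database='TEXTANALYTICS',
)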

Then extract the NOTES column as a list:

# pull out list of text 
documents = df_notes["NOTES"].tolist()
documents

Establish a connection to your Language Cognitive Service and pass the text data to the service for insights that will be returned as the ‘docs’ object:

key = "KEY"
endpoint = "https://matt-marz-language.cognitiveservices.azure.com/"

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
        endpoint=endpoint,
        credential=ta_credential)
    return text_analytics_client

client = authenticate_client()
# pass docs to service
poller = client.begin_analyze_healthcare_entities(documents)
result = poller.result()

docs = [doc for doc in result if not doc.is_error]
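
Two practical notes on this call. First, the hosted API caps how many documents a single request can carry (25 at the time of writing, but check the current service limits for your tier). Second, it’s worth logging the documents the service rejected rather than silently dropping them. Below is a hedged sketch covering both that can replace the call above for larger tables; the batch size is an assumption, and the explicit ids keep doc.id aligned with the data frame index across batches:

# batch the request to stay under the per-request document limit,
# and log any documents the service could not process
BATCH_SIZE = 25  # assumed limit; confirm against current service docs
docs = []
for i in range(0, len(documents), BATCH_SIZE):
    # pass explicit ids so doc.id still lines up with df_notes['doc_id']
    batch = [{"id": str(i + j), "text": text}
             for j, text in enumerate(documents[i:i + BATCH_SIZE])]
    poller = client.begin_analyze_healthcare_entities(batch)
    for doc in poller.result():
        if doc.is_error:
            print(f"document {doc.id} failed: {doc.error}")
        else:
            docs.append(doc)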

Next, reshape the results to pull out several entity attributes, like normalized_text and category. From the output below you can see that for PAT_ID 123456 the service labels the term “pink eye” as a SymptomOrSign and normalizes it to “Conjunctivitis”. There are many more insights that can be leveraged from Text Analytics for Health, such as relation extraction and entity linking. More can be found here.

# reshape the results into a pandas data frame
ent = []
normalized_text = []
category = []
doc_id = []
source = []
for idx, doc in enumerate(docs):
    for entity in doc.entities:
        doc_id.append(doc.id)
        ent.append(entity.text)
        normalized_text.append(entity.normalized_text)
        category.append(entity.category)
        source.append(entity.data_sources)

d = {'doc_id':doc_id, 'entity': ent, 'category': category, 'normalized_text': normalized_text}
df = pd.DataFrame(data = d)
# the service ids are stringified row indices, so cast to int to join back to the notes
df['doc_id'] = df['doc_id'].astype(int)
df_out = pd.merge(df, df_notes)
df_out
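
One check worth doing before writing the CSV: copy into with no column list loads CSV columns positionally, so the data frame’s column order must match the RESULTS table definition. The merge above happens to produce the right order, but pinning it explicitly avoids a silent mismatch:

# copy into maps CSV columns by position, so pin the order to the RESULTS table
df_out = df_out[['doc_id', 'entity', 'category', 'normalized_text', 'PAT_ID', 'NOTES']]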

Lastly, write the file to our stage storage account, and Snowpipe loads the results into the Snowflake table automatically!

# write the results file to the stage; Snowpipe picks it up from there
currentDateTime = time.strftime('%d-%m-%Y-%H-%M-%S')  # avoid ':' and spaces in the blob name
df_out.to_csv('abfss://stage@STORAGENAME.dfs.core.windows.net/results'+currentDateTime+'.csv', index=False, header=False)
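
If you’d rather load the results without the stage-and-Snowpipe hop, the connector also ships a write_pandas helper that writes a data frame straight to a Snowflake table. A minimal sketch; note that write_pandas matches columns by name, so upper-case them to match the RESULTS definition first:

from snowflake.connector.pandas_tools import write_pandas

# write_pandas matches data frame columns to table columns by name
df_out.columns = [c.upper() for c in df_out.columns]
success, nchunks, nrows, _ = write_pandas(con, df_out, 'RESULTS', schema='PUBLIC')
print(f"loaded {nrows} rows into RESULTS")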

CONCLUSION

With the insights in tabular format, you can now use this data for downstream work like building dashboards or predictive models. I’m hoping this guide gets you started using Azure Cognitive Services with Snowflake. Please check out the GitHub repo with the full Python notebook, sample data, and a deck with the architecture.
