An Example of Solving a Healthcare Identity Management Task with Snowflake and the OpenAI API

Discover how a healthcare identity problem can be efficiently addressed by seamlessly integrating Snowflake User-Defined Functions (Snowpark UDFs) and the OpenAI API.

OpenAI logo
Photo by Andrew Neel on Unsplash

Problem description

One of the numerous challenges in healthcare data analytics is identity management. The healthcare sector is a vast environment encompassing various systems, some of which are quite outdated and, unfortunately, lack standardization. Let’s look into one of the issues that healthcare is grappling with: identity management.

You may have had the opportunity to submit a claim, such as when you’ve undergone a treatment for which you’ve made a payment and now wish to seek reimbursement from your insurance. The claim process involves several steps: selecting the patient, specifying the treatment, choosing the treatment provider, and outlining the details of the expenses.

When selecting the provider, you must provide the name, phone number, and postal code. However, all of these elements are flawed. Names can vary, including the inclusion of middle names or alternate spellings like Kris or Chris. Additionally, a provider may have multiple phone numbers and operate in various locations. Claims can be submitted online or by paper, which might introduce more places for spelling mistakes and different variations of the same provider name. Consequently, when processing a claim with provider details, the system may need to match it to records stored in other systems with some flexibility, assigning a score to the match. A low score may prompt manual auditing for accuracy.

Healthcare identity management example

Solution description

Naturally, all these challenges could be effectively addressed if each provider were assigned a unique ID for a specific location and the services they offer, or if there were a standard for storing identity information within the system. However, the current reality is different: unique IDs and standardization are not in place.

Until now, the approach has involved using an application with a predefined set of rules. When an identity is queried, these rules are applied to determine the relevant information.

For instance, when seeking a provider by searching for similar names while accounting for common spelling errors, we can implement fuzzy string matching in Python, use various string-distance algorithms, or employ the following SQL query (if our database engine supports phonetic functions such as SOUNDEX):

SELECT *
FROM providers_table
WHERE SOUNDEX(name) = SOUNDEX('Alla Ozeroda');
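As a sketch of the rule-based approach, here is what a fuzzy string match could look like in plain Python using the standard library’s difflib (the names and thresholds are purely illustrative):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A close spelling variant scores high; an unrelated name scores low.
print(name_similarity("Alla Ozeroda", "Ala Ozeroda"))
print(name_similarity("Alla Ozeroda", "John Smith"))
```

Ratios like these are what rule-based systems threshold on; nicknames, transliterations, and reordered names are exactly where such rules start to fall short.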

However, configuring and adjusting all these predefined rules can be labor-intensive. Moreover, the outcomes may not meet the desired accuracy, and beyond that, the system cannot learn and improve over time. What we aspire to explore is the possibility of leveraging Large Language Models (LLM) for identity searches. Furthermore, we aim to determine whether this can be seamlessly integrated into our existing data storage, specifically our Snowflake data warehouse.

LLM

LLM stands for “Large Language Model,” an artificial intelligence model designed to understand and generate human-like text at a large scale. OpenAI is an artificial intelligence research laboratory consisting of the for-profit OpenAI LP and its non-profit parent, OpenAI Inc. OpenAI is focused on developing and advancing artificial general intelligence (AGI) in a safe and beneficial manner, and has released several versions of its language models. GPT-3 (Generative Pre-trained Transformer 3) is one of the most advanced and well-known large language models developed by OpenAI.

We can first test our intention by prompting ChatGPT with the question we are trying to solve:

ChatGPT prompt example

Important: Security

Sending Personally Identifiable Information (PII) over the web to a third party, such as OpenAI, is not advisable, as it poses the risk of exposing sensitive data, like your list of clients. If the data is non-sensitive, utilizing the OpenAI API may be suitable. However, for handling sensitive information, a more secure solution would involve deploying the Large Language Model (LLM) directly within Snowflake.

In the context of a Proof of Concept (PoC), I have chosen to conduct a quick assessment using the OpenAI API to gauge the quality of results. For this purpose, dummy values will be used. If the outcome proves satisfactory, the next step would be to deploy the language model directly within Snowflake, ensuring a more secure environment for handling sensitive data.

Prerequisites

If you’re interested in conducting a Proof of Concept (PoC) based on the demonstration I’ll outline below, you’ll need a Snowflake account (a trial account is sufficient), an OpenAI user account, and an API key. I upgraded my OpenAI user account to paid status; when adding your card information to the OpenAI account, you can set a usage limit of $5 USD.

Preparing the POC

First, let’s add some data to our Snowflake.
In my code, I usually prefix my objects with EYLON_ as I am using a shared Snowflake account. You can either drop the prefix or change it to your name. Then, let’s create a table with two columns of names: in some rows the names are identical, in others the same name appears in a different variation (nickname, spelling mistake, etc.), and in others the names do not match.

-- Create Database
CREATE DATABASE IF NOT EXISTS EYLON_IDENTITY_DB;
-- Use Database
USE DATABASE EYLON_IDENTITY_DB;
-- Create Schema
CREATE SCHEMA IF NOT EXISTS EYLON_IDENTITY_SCHEMA;
-- Use Schema
USE SCHEMA EYLON_IDENTITY_SCHEMA;
-- Create Table
CREATE TABLE IF NOT EXISTS names_table (
name_a STRING,
name_b STRING
);
TRUNCATE TABLE names_table;
INSERT INTO names_table VALUES
('John Doe', 'Johnnie Doe'),           -- Same name
('Alice Smith', 'Alicia Smyth'),       -- Same name
('Michael Johnson', 'Mike Johnston'),  -- Same name
('Emily Brown', 'Emilia Braun'),       -- Same name
('Christopher Lee', 'Chris Li'),
('Jessica Taylor', 'Jessie Taylor'),   -- Same name
('Andrew Miller', 'Andy Mueller'),     -- Same name
('Olivia White', 'Olive Whyte'),       -- Same name
('Daniel Harris', 'Dan Harrison'),     -- ???
('Sophia Clark', 'Sophie Clarke'),     -- Same name
('David Wilson', 'David Wilson'),      -- Same name
('Ella Turner', 'Ella Turner'),        -- Same name
('William Baker', 'Walter Baker'),     -- Different names
('Grace Mitchell', 'Graham Mitchell'), -- Different names
('Natalie Martin', 'Nicholas Martin'), -- Different names
('Isaac Young', 'Ivy Young'),          -- Different names
('Emma Johnson', 'Ethan Johnson'),     -- Different names
('Lily Garcia', 'Leo Garcia'),         -- Different names
('Victoria Davis', 'Vincent Davis'),   -- Different names
('Sophie Moore', 'Sebastian Moore');   -- Different names

Manually reviewing an extensive list of names can be labor-intensive. Therefore, we seek a mechanism to distinguish between identical and distinct names. Let’s explore the use of the OpenAI API to facilitate this process.

Let’s start by establishing the integration with OpenAI:

CREATE OR REPLACE SECRET eylon_open_ai_api
TYPE = GENERIC_STRING
SECRET_STRING = 'take your secret from openai API keys and put it here';
CREATE OR REPLACE NETWORK RULE eylon_openai_apis_network_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.openai.com');
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION eylon_openai_external_access_int
ALLOWED_NETWORK_RULES = (eylon_openai_apis_network_rule)
ALLOWED_AUTHENTICATION_SECRETS = (eylon_open_ai_api)
ENABLED = true;

Our User-Defined Function (UDF) will leverage OpenAI by sending a prompt to assess the probability of a name match. If the resulting score surpasses 0.8, we’ll consider it the same name; a lower score indicates a likely mismatch.

CREATE OR REPLACE FUNCTION CHECK_NAMES_GPT(name_a varchar, name_b varchar)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.9
HANDLER = 'complete_me'
EXTERNAL_ACCESS_INTEGRATIONS = (eylon_openai_external_access_int)
SECRETS = ('openai_key' = eylon_open_ai_api)
PACKAGES = ('openai')
AS
$$
import _snowflake
from openai import OpenAI

# Authenticate with the API key stored in the Snowflake secret
client = OpenAI(
    api_key=_snowflake.get_generic_secret_string('openai_key')
)

prompt = "I will give you two names. \
Please reply with a number that represents the probability that both are likely to represent the same name. You will provide explanations. \
The format of your reply should be a json in which the first field is the probability and the second field is the explanation. \
Here are the two names I want you to compare:"

def complete_me(NAME_A, NAME_B):
    messages = [
        {'role': 'user', 'content': f"{prompt} {NAME_A} and {NAME_B}"}
    ]
    # temperature=0 makes the output as deterministic as the model allows
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0
    )
    return response.choices[0].message.content.strip()
$$;

OK, now let’s give our code a try:

SELECT
    name_a, name_b, CHECK_NAMES_GPT(name_a, name_b) AS names_match_result
FROM
    names_table
ORDER BY
    name_a;

And we will get a result showing which names match and which do not.

Table with probability of identical names column

For example:

‘Andrew Miller’ and ‘Andy Mueller’ got a score of 0.8, so there is a good chance it’s the same name.

‘David Wilson’ and ‘David Wilson’ got 1.0, since the names are identical.

‘Grace Mitchell’ and ‘Graham Mitchell’ got 0.2, so, probably different names.

We requested GPT to provide the result in JSON format for convenient integration into our application.
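On the application side, that JSON reply can be unpacked with a few lines of Python. A minimal sketch (parse_match_result is a hypothetical helper, and it assumes the model honored the prompt and returned well-formed JSON with the two requested fields):

```python
import json

def parse_match_result(raw: str) -> tuple[float, str]:
    """Split the UDF's JSON reply into (probability, explanation).

    Assumes the model returned well-formed JSON with 'probability'
    and 'explanation' fields, as the prompt requests.
    """
    result = json.loads(raw)
    return float(result["probability"]), str(result["explanation"])

raw = '{"probability": 0.8, "explanation": "Same surname, similar first names."}'
probability, explanation = parse_match_result(raw)
print(probability)  # 0.8
```

In production you would also want to handle the case where the model returns something other than valid JSON, since that is never guaranteed.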

It’s important to note that results may vary across different runs.

While the majority of results appear satisfactory, there’s room for refinement in certain cases. For instance, the comparison between ‘Victoria Davis’ and ‘Vincent Davis’ generated a response with a probability of 0.8:

{
  "probability": 0.8,
  "explanation": "Both names have the same last name 'Davis' which suggests a higher probability of being related. However, the difference in the first names 'Victoria' and 'Vincent' suggests a lower probability of being the same name."
}

To enhance accuracy, we aim for a lower score in scenarios where the first names significantly differ, indicating a reduced likelihood of being the same name.

Now it’s time for prompt engineering.

Prompt engineering refers to the process of carefully crafting or designing the prompts given to language models, like GPT-3, to achieve specific desired outputs. In the context of models like GPT-3, which are based on large language models, prompt engineering involves formulating queries, instructions, or inputs in a way that guides the model to generate the desired responses.

To enhance the accuracy of our results, we plan to modify the prompt within our User-Defined Function (UDF). However, before doing so, let’s make our table more diverse and intriguing.

TRUNCATE TABLE names_table;
INSERT INTO names_table VALUES
('John Doe', 'Johnnie Doe'),
('Alice Smith', 'Alicia Smyth'),
('Michael Johnson', 'Mike Johnston'),
('Emily Brown', 'Emilia Braun'),
('Christopher Lee', 'Chris Li'),
('Jessica Taylor', 'Jessie Taylor'),
('Andrew Miller', 'Andy Mueller'),
('Olivia White', 'Olive Whyte'),
('Daniel Harris', 'Dan Harrison'),
('Sophia Clark', 'Sophie Clarke'),
('David Wilson', 'David Wilson'), -- Same name
('Ella Turner', 'Ella Turner'), -- Same name
('William Baker', 'Walter Baker'), -- Different names
('Grace Mitchell', 'Graham Mitchell'), -- Different names
('Natalie Martin', 'Nicholas Mask'), -- Different names
('Isaac Young', 'Ivy Young'), -- Different names
('Emma Johnson', 'Ethan Johnson'), -- Different names
('Lily Garcia', 'Leo Garcia'), -- Different names
('Victoria Davis', 'Walter Baker'), -- Different names
('Victoria Davis', 'Vincent Davis'), -- Different names
('Sophie Moore', 'Moore Sebastian'); -- Different names

And now, let’s change our prompt to:

prompt = "I will give you two names. \
Please reply with a number that represents the average between the probability that both are likely to represent the same name, \
and the probability that both first names are likely to represent the same first name, \
and the probability that both surnames are likely to represent the same surname. \
You will provide explanations. \
The format of your reply should be a json in which the first field is the probability and the second field is the explanation. \
Here are the two names I want you to compare:"
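Arithmetically, the new prompt asks the model for a simple mean of three component probabilities, which can be sketched as follows (the component values below are purely illustrative, not actual model output):

```python
def averaged_probability(full_name_p: float,
                         first_name_p: float,
                         surname_p: float) -> float:
    """Mean of the three component probabilities the new prompt asks for."""
    return (full_name_p + first_name_p + surname_p) / 3

# Identical surnames but clearly different first names pull the score down.
print(round(averaged_probability(0.5, 0.25, 1.0), 2))  # 0.58
```

The intent of the averaging is exactly this: a strong surname match alone can no longer carry the overall score to 0.8.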

Let’s have a look at the new results:

Table with probability of identical names column

We can see now that ‘Victoria Davis’ and ‘Vincent Davis’ generated a response with a probability of 0.75, which is more like what we would like to see.

We now have the opportunity to fine-tune our results by either examining specific cases where we anticipate a different score or by providing more guidance and rules to the model through adjustments in the prompt.

With just under 100 lines of code, we’ve established a system capable of comparing names, providing scores for the comparisons, and offering explanations. Now, we can set a threshold: for example, treating names with scores of 0.8 or higher as identical, scores below 0.75 as non-identical, and scores in between as candidates for human review. This streamlined approach facilitates efficient decision-making in the comparison process.
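That decision rule is trivial to express in code; a minimal sketch (the threshold values are the illustrative ones from the text, not a recommendation):

```python
def classify_match(score: float) -> str:
    """Map a match probability to a decision using illustrative thresholds."""
    if score >= 0.8:
        return "identical"
    if score < 0.75:
        return "non-identical"
    return "human_review"

print(classify_match(0.9))   # identical
print(classify_match(0.75))  # human_review
print(classify_match(0.2))   # non-identical
```

In practice, the thresholds would be tuned against a labeled sample of claims, since the cost of a false match and a false mismatch is rarely symmetric.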

Things to consider

When considering the implementation of language models like GPT-3, several crucial challenges come into play:

1. Privacy Concerns:
— Issue: Handling Personally Identifiable Information (PII) raises concerns about transmitting sensitive data over the internet to external services like OpenAI.
— Consideration: Hosting a Language Model (LLM) on your own server is a solution, but it introduces complexities such as environment deployment, model training, and ongoing maintenance, incurring both time and cost.

2. Non-Deterministic Results:
— Issue: The model may yield non-deterministic results, leading to variations in responses for similar inputs.
— Consideration: This poses challenges if your business logic requires consistent and deterministic replies, necessitating careful consideration of model behavior.

3. Auditing Requirements:
— Issue: Business scenarios demanding a traceable model decision path may require logging prompts and model parameters.
— Consideration: Implementing robust auditing mechanisms becomes crucial to track how the model arrived at specific answers and ensure transparency.

4. Bias in Models:
— Issue: Models trained on specific datasets may exhibit biases, impacting the fairness of responses.
— Consideration: Establishing procedures to identify and mitigate bias in model outputs is essential to align with ethical and fairness standards.

5. Prompt Engineering Challenges:
— Issue: While certain tasks, like comparing names, might yield satisfactory results, more complex tasks, such as comparing addresses, may present difficulties.
— Consideration: Addressing challenges in prompt engineering requires substantial effort to refine the model’s performance for specific use cases.

Conclusion

While the OpenAI API can provide a straightforward solution for certain problems, it’s crucial to be mindful of associated considerations. In this post, I illustrated a use case involving the comparison of names, a task that was relatively simple to implement. However, when extending the application to compare addresses, I encountered challenges, and even after investing several hours in prompt engineering, the results did not meet the desired level of accuracy. This highlights the importance of understanding the limitations and nuances of the chosen solution, especially when dealing with more complex tasks.

Summary

In this blog post, we explored the creation of a straightforward Snowflake User-Defined Function (UDF) and incorporated the OpenAI API to solve a real use-case scenario from the healthcare domain.

To stay updated on more Snowflake-related posts, follow me at my Medium profile: Eylon’s Snowflake Articles.

I’m Eylon Steiner, Engineering Manager for Infostrux Solutions. You can follow me on LinkedIn.

Subscribe to Infostrux Medium Blog at https://blog.infostrux.com for the most interesting Data Engineering and Snowflake news. Follow Infostrux’s open-source efforts through GitHub.
