STATE-OF-THE-ART TECH

How Therapists and Data Scientists are Protecting your Privacy

Ilan Kahan
Published in Eleos Health
9 min read · Nov 3, 2022


Let’s say you have a document that you want to keep, but it contains a client’s personal information. You don’t need that information, and the rest of the document is valuable to you, so you erase it, store the file, and move on. But what happens if you have dozens of documents? Could you repeat this process? What if your company has millions of them? In this post, I’ll show you how Data Science can solve this problem!

Data Privacy and De-Identification 🔒

Nobody wants their personal and confidential data leaked, especially to ill-intentioned people. As technology becomes an ever larger part of our lives, the number of companies that can access our data grows, and so does the concern about data leaks. These companies invest a lot of time, money, and effort in cybersecurity to avoid leaks, but as we see on the news, they still happen. Largely because of that, data privacy is ever more present in public discourse, as consumers, companies, regulatory authorities, and everyone in between gain a better understanding of the issue.

One way organizations can provide their users with data privacy and minimize the impact of a data leak is through data de-identification. As the name suggests, its purpose is to remove information that may link data to a person, making it harder to identify that person from the data.

HIPAA📚

If you are familiar with HIPAA, you are welcome to skip this section.

Some industries face a higher regulatory standard around data than others. For instance, in the US, the Health Insurance Portability and Accountability Act of 1996 (HIPAA for short) dictates what type of information must be safeguarded in a healthcare setting. HIPAA specifies 18 categories of Protected Health Information (PHI), which is “any health information that can be tied to an individual.” This is a perfect example of a space where de-identification can be applied, as removing that information from health-related documents protects it from being linked to the individual in case of a leak. The PHI categories are:

  • Names
  • All geographical identifiers smaller than a state
  • Dates (other than year) directly related to an individual
  • Phone Numbers
  • Fax numbers
  • Email addresses
  • Social Security numbers
  • Medical record numbers
  • Health insurance beneficiary numbers
  • Account numbers
  • Certificate/license numbers
  • Vehicle identifiers (including serial numbers and license plate numbers)
  • Device identifiers and serial numbers
  • Web Uniform Resource Locators (URLs)
  • Internet Protocol (IP) address numbers
  • Biometric identifiers
  • Full-face photographs and comparable images
  • Any other unique identifying number, characteristic, or code (except a unique code assigned by the investigator to code the data)

HIPAA applies to, among others, doctors, clinics, psychologists, dentists, chiropractors, nursing homes, and pharmacies.

Eleos Health ⭐️

Clinicians spend a significant chunk of their time during sessions taking notes to record the key moments and track the patient’s progress. Eleos Health aims to reduce this operational burden by using Augmented Intelligence. Eleos suggests key moments and generates insights, allowing clinicians to focus solely on the patient and their treatment, instead of on note-taking. That empowers the clinician while unlocking objective insights into evidence-based care.

A popular Data Science branch called Natural Language Processing (NLP) powers this technology. NLP combines different statistical, machine learning, and deep learning methods to understand language structure, derive meaning from text, and even infer its sentiment and intent. By combining the transcription of a given session with NLP models, we can derive many of the insights that empower clinicians.

To give patients as much data privacy as possible, these transcripts need to be de-identified according to HIPAA standards, which poses a few challenges. Sessions usually last around an hour and cover a wide variety of subjects, making manual de-identification impractical. Also, the sessions are transcribed by a speech-recognition system, which can introduce typos, mis-transcriptions, and other particularities that increase the complexity of a de-identification pipeline.

So how can we de-identify therapy session transcripts effectively? With Data Science!

1:1 representation of a Data Scientist at work

NLP can produce impressive results in many applications. However, NLP is a broad term, so we need to understand which of its areas can help us achieve our goals.

Named Entity Recognition 🔎

De-Identification can be seen as an application of a popular NLP task called Named Entity Recognition (NER). NER is a subset of Information Extraction, which consists of extracting relevant information from text based on its meaning. By understanding the semantics of a text, we can recognize its entities, that is, identify the words that represent things such as locations, names, and organizations. Once these entities are detected, we can select the relevant ones and remove them from the text, effectively de-identifying it.

Example of NER using the Spacy library in Python

NER comes with its challenges, one of the main ones being ambiguity. As we see in the above example with Washington, the same word can have multiple meanings depending on its context, possibly affecting the NER category it belongs to. Another form of ambiguity can happen with entities containing multiple words, such as George Washington: if we consider each word separately, both their meaning and entity category might change.

Some commonly used Python libraries for NER are NLTK, Spacy, and Flair, all of which have their own set of benefits and disadvantages.

Building a De-Identification model 👷️️

After understanding the objective of de-identification and the NLP task that can help us achieve it, the next step was to build a model that achieves satisfactory results. Our initial hypothesis was that publicly available models would be enough to accomplish our goals. But before testing that, we needed tagged data so we could measure performance. Our team went through 100 therapy sessions, tagging each occurrence of a relevant entity. Not every PHI category was present in these sessions, so tests for emails, URLs, and others happened later.

The graph below shows the distribution of entities per category. The categories are not a one-to-one reflection of the HIPAA PHI categories, as we grouped some of them. For example, the Numerical category groups all categories composed of numerical values. Even though HIPAA doesn’t require safeguarding medical conditions, we added a Medical category that includes those types of entities.

Categories NAME and GEO represent 92% of the labeled data.

As we are dealing with sensitive data, the main concern was identifying all relevant entities, even if that meant the model would identify more entities than necessary. In other words, initially, we were measuring and focusing on the recall of the model and were less concerned with the precision. The goal was to achieve a recall of at least 90%.
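As a refresher, both metrics can be computed from simple counts over the tagged entities (the numbers below are illustrative, not our evaluation results):

```python
def recall(true_positives, false_negatives):
    """Share of the tagged entities the model actually found."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Share of the flagged entities that were actually relevant."""
    return true_positives / (true_positives + false_positives)

# Illustrative counts: 98 of 100 tagged entities found, 15 extra flagged
recall(98, 2)       # 0.98
precision(98, 15)   # ≈ 0.867
```

Optimizing for recall means accepting some of those extra flags: over-redacting a harmless word is cheaper than leaking a name.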

There are many open-source Python libraries available for NER tasks. We tested models individually but concluded that the best approach would be to combine them. We opted for an ensemble composed of the publicly available libraries Spacy, Flair, and SciSpacy, combined with Regex for some use cases. Regex, or Regular Expressions, is a language for searching and matching text patterns, which was very useful for the Names category, for example. Names are sometimes transcribed in lowercase by the speech-to-text model, and we noticed that the NER models were missing some lowercased versions of names. We used Regex to identify all occurrences of a name, regardless of casing, based on the ones identified by Spacy and Flair.

The version of each model that was used, and the categories it identifies

In the example below, we can see that Spacy didn’t identify the lowercase version of Eric.

Eric’s pool is cool

By using Regex we can search the sentence for all occurrences of Eric, regardless of casing.

Eric’s Laos house also has a cool pool
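A minimal sketch of this casing-insensitive search with Python’s `re` module (the sentence and name are illustrative):

```python
import re

def find_name_spans(text, name):
    """Find every occurrence of an already-detected name, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(text)]

find_name_spans("Eric's pool is cool, eric said.", "Eric")
# → [(0, 4), (21, 25)]
```

`re.escape` keeps names containing punctuation from being misread as Regex syntax, and the `\b` word boundaries avoid matching the name inside a longer word.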

After some testing and evaluation, we adjusted some aspects of the models. They identify NER categories that aren’t relevant to this task, such as WORK_OF_ART and LAW, so we removed those from the pipeline. Spacy and Flair also tagged many words in the DATE category that don’t need to be de-identified, such as tomorrow, Wednesday, and old, so we manually created exceptions for that category. These changes increased precision while having minimal impact on recall.
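One simple way to implement such exceptions is an allowlist check applied after NER runs (the terms below are an illustrative subset, not our full exception list):

```python
# Illustrative subset of DATE terms that don't need de-identification
DATE_EXCEPTIONS = {"tomorrow", "yesterday", "today", "wednesday", "old"}

def needs_redaction(date_entity):
    """Redact a DATE entity only if it isn't a harmless everyday term."""
    return date_entity.lower() not in DATE_EXCEPTIONS

needs_redaction("Wednesday")   # False — left in the transcript
needs_redaction("June 3rd")    # True — would be replaced
```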

Having made these adjustments, we evaluated the performance of the models on our data. The recall was 98% across all categories, well above the goal of 90%. Despite not being the focus, the global precision was also high at 87%.

Our test dataset didn’t contain examples of emails, passwords, or URLs. At first glance, identifying these entities appears to be a simple task for Regex, as we could search for emails with the pattern “name@domain.com”, for example. However, since our data consists of transcriptions, these categories are much less consistent and may appear in sentences such as “my email is name at domain dot com”. Taking this into account, we generated Regex patterns that look for phrases such as “dot com”, “at gmail dot com”, “password is”, and “www dot com”, among others.
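A hedged sketch of one such pattern for spelled-out emails (the pattern and top-level domains are illustrative, not the production Regex):

```python
import re

# Hypothetical pattern for emails spoken aloud: "name at domain dot com"
SPOKEN_EMAIL = re.compile(r"\b\w+ at \w+ dot (?:com|org|net)\b", re.IGNORECASE)

match = SPOKEN_EMAIL.search("my email is name at domain dot com")
match.group()  # → "name at domain dot com"
```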

Applying the De-Identification model 🤖

With the de-identification model ensemble ready, the next step was to move it to production. Given the nature and format of our data, each transcription needs to be de-identified sentence by sentence. Hence, the code loops through each sentence, identifying and storing its entities. It is often the case that multiple models identify the same entity, so only the one with the largest span is kept. The Regex for identifying names runs last, taking each name identified by the other models and searching for it, regardless of casing, in all sentences.

The sentences are de-identified by replacing the entities with their category:

Transcript of a simulated conversation between [NAME] and [NAME] de-identified
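Putting it together, here is a sketch of this replacement step (the entity spans and categories are made up; in practice they come from the model ensemble):

```python
def redact(sentence, entities):
    """Replace each entity span with its category label.
    `entities` is a list of (start, end, category) tuples; overlapping
    spans are resolved by keeping the largest one, as described above."""
    # Prefer larger spans when two detections overlap
    entities = sorted(entities, key=lambda e: e[1] - e[0], reverse=True)
    kept = []
    for start, end, cat in entities:
        if all(end <= s or start >= e for s, e, _ in kept):
            kept.append((start, end, cat))
    # Replace right-to-left so earlier offsets stay valid
    for start, end, cat in sorted(kept, reverse=True):
        sentence = sentence[:start] + f"[{cat.upper()}]" + sentence[end:]
    return sentence

redact("Hi, I'm Eric from Boston", [(8, 12, "name"), (18, 24, "geo")])
# → "Hi, I'm [NAME] from [GEO]"
```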

In Brief📋

This post has presented the research that went into building a De-Identification model. Along the way, it explored and defined some key terms:

  1. Data Privacy
  2. De-Identification
  3. HIPAA and PHI
  4. NLP
  5. NER
  6. What it takes to make a De-Identification model

I hope you enjoyed the post and learned something new from it!

If you would like to know more about Eleos Health, our technology, and how we contribute to the behavioral health environment, visit us here!
