De-identification of Electronic Health Records using NLP

How to protect privacy when processing medical text.

Jan Trienes
Nedap
Jul 15, 2020 · 8 min read


At Nedap, we develop electronic health records (EHRs) to simplify and improve care. As such, we regularly process large amounts of privacy-sensitive health data. This data holds great potential to improve care and advance research, but at the same time we need to ensure that patient privacy is protected.

We were looking into ways to facilitate the use of EHR data without compromising patient privacy. One challenge we quickly faced was the anonymization of free-text documents. The challenge with this unstructured data is that protected health information (PHI) first has to be reliably detected before it can be removed or masked. This process is known as de-identification (see the figure below for an example). Ideally, after de-identification, it should be difficult to establish a link between an individual and the data.

In this blog post we compare three natural language processing (NLP) methods for de-identification. We look at both effectiveness (i.e., how well it detects PHI) and efficiency (i.e., how fast it is) of a method. We share all code to train, evaluate and use the de-identification methods in our GitHub repo. For more details, you can also take a look at our paper.

Overview of the de-identification process.

De-identification Approaches

De-identification is often framed as a named entity recognition (NER) problem where the entity types are the categories of PHI we want to mask. Three NLP techniques are commonly applied to this problem:

  1. Rule-based extraction. In a rule-based system, domain experts assemble a list of heuristics to identify PHI. For example, names can be found by checking whether a word is preceded by a title (e.g., Dr. <potential_name>). When the context of a word does not give such strong cues, lookup lists can be used (e.g., lists of the most common first/last names, street names, and organization names). Fuzzy string matching can help to account for spelling variants. More structured PHI such as dates, email addresses, and phone numbers is identified by pattern matching (regexes); see the sketch after this list. Rule-based systems are often a strong baseline and can be useful in practice because of their high precision and transparency. However, recall is often limited, as it is hard to create exhaustive lookup lists and to anticipate all possible formatting and spelling variants of PHI.
  2. Feature-based machine learning. An alternative to hand-crafted rules are sequence-labeling methods. These systems encode each token and its surrounding tokens as a set of features and let a machine learning model such as a conditional random field (CRF) assign a label to each token indicating the presence of PHI. Example features are orthographic (e.g., whether a token is capitalized or a digit) and linguistic (e.g., the part-of-speech tag of a token). The quality of the method largely depends on how well the feature set is engineered. As we'll see below, these systems are cheap during training and inference and provide good results.
  3. Neural methods. State-of-the-art NER results are achieved with neural sequence labeling architectures such as a BiLSTM-CRF or transformers. Aside from using a more complex model, they replace the hand-crafted features with (often pre-trained) word embeddings that better capture the linguistic properties of a word. However, these methods are significantly more expensive during training and inference than the approaches mentioned above.

Below, we are going to implement and compare the three approaches. For more detail on the methods, I recommend reading Chapter 18 of Speech and Language Processing by Daniel Jurafsky & James H. Martin, 2019 (link).

Data

To develop and test de-identification methods, we need a dataset that matches our domain of healthcare (elderly care, mental care, and disabled care) and document language (Dutch). Because the few openly available de-identification datasets matched neither our language nor our domain requirements, we decided to use data from our own systems [1].

We created our dataset as follows. From our EHR database, we sampled 1260 documents with a total of ~450,000 tokens. We then annotated 16 PHI types (e.g., name, initials, age, profession). To ensure that no PHI was missed, we had two annotators read and annotate each document independently and in parallel. In a review stage, we checked and merged the two annotations. The process took about 80 hours of annotation time and 20 hours of review, a rate of 12.6 documents per hour. Clearly, manual de-identification is infeasible on a large scale. Finally, we automatically replaced each PHI instance with an artificial but realistic alternative before using the data any further, a process known as surrogate generation (see the sketch below).
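Conceptually, surrogate generation replaces each annotated span with a realistic value of the same PHI type. A toy sketch, assuming (start, end, type) character-offset annotations; the surrogate lists here are stand-ins, not our production vocabulary:

```python
import random

# Toy surrogate lists; a real pipeline draws from large, realistic vocabularies.
SURROGATES = {
    'Name': ['Pieter Bakker', 'Sanne Visser'],
    'Date': ['3 maart 1998', '21 november 2005'],
}

def replace_phi(text, phi_spans):
    """Replace each annotated PHI span with a random surrogate of the same type.

    phi_spans: list of (start, end, phi_type) character offsets, sorted by start.
    """
    out, cursor = [], 0
    for start, end, phi_type in phi_spans:
        out.append(text[cursor:start])
        out.append(random.choice(SURROGATES[phi_type]))
        cursor = end
    out.append(text[cursor:])
    return ''.join(out)

print(replace_phi('Jan de Vries kwam op 15 juli.', [(0, 12, 'Name'), (21, 28, 'Date')]))
# e.g. 'Sanne Visser kwam op 3 maart 1998.'
```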

Our annotated dataset has a total of ~17,000 PHI instances among the 16 categories. The figure below shows that the class distribution is quite skewed. PHI types like names and dates occur extremely often while other types such as social security numbers (SSN) and professions are quite scarce. This is a challenge for a machine learning method as it needs to generalize from fewer examples.

Distribution of PHI tags in the dataset.

Model Implementation in Python

We implement the three de-identification approaches as follows:

  1. Rule-based system. An excellent rule-based system for Dutch EHRs is DEDUCE. It was developed for nursing notes and treatment plans in psychiatric care. We slightly adapt the method to our annotation scheme and include a lookup list of 1200 institutes common in our domain. DEDUCE is available as a Python package (code and paper by Vincent Menger).
  2. Feature-based CRF. We re-implement a token-based CRF and the feature set by Liu et al. (2015), which achieved good results on English de-identification benchmarks. As CRF implementation, we use the Python sklearn-crfsuite binding to CRFsuite with elastic-net regularization. We optimize the two regularization coefficients of the L1 and L2 norms with a random search (see the training sketch after this list).
  3. BiLSTM-CRF. As a neural de-identification method, we implement a BiLSTM-CRF with contextual string embeddings (Flair embeddings), which recently provided state-of-the-art results in NER. We concatenate the pre-trained Dutch contextual string embeddings with Dutch fastText embeddings. We set the hyperparameters to the Flair defaults (a minimal Flair sketch follows the repository link below).
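To make item 2 concrete, here is a minimal sklearn-crfsuite training sketch. The feature set below is a small illustrative subset, not the full Liu et al. (2015) feature set, and the toy data stands in for our corpus:

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Orthographic and contextual features for token i of a tokenized sentence."""
    token = sent[i]
    features = {
        'token.lower': token.lower(),
        'token.istitle': token.istitle(),  # orthographic: capitalization
        'token.isdigit': token.isdigit(),  # orthographic: digit token
        'token.suffix3': token[-3:],
    }
    # Context window of one token to the left and right.
    features['prev.lower'] = sent[i - 1].lower() if i > 0 else '<BOS>'
    features['next.lower'] = sent[i + 1].lower() if i < len(sent) - 1 else '<EOS>'
    return features

# Toy training data: token lists with per-token BIO labels.
train_sents = [['Dhr.', 'Jansen', 'belde', 'op', '15-7-2020', '.']]
train_labels = [['O', 'B-Name', 'O', 'O', 'B-Date', 'O']]

X_train = [[token_features(s, i) for i in range(len(s))] for s in train_sents]

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',  # L-BFGS training with elastic-net regularization
    c1=0.1,             # L1 coefficient; we tuned c1/c2 with a random search
    c2=0.1,             # L2 coefficient
    max_iterations=100,
)
crf.fit(X_train, train_labels)
print(crf.predict(X_train))
```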

All three methods have a common preprocessing routine: we use spaCy for sentence segmentation and tokenization. For the machine learning methods, we label each token according to the BIO tagging scheme.
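A minimal sketch of this preprocessing step, assuming PHI annotations as character offsets; nl_core_news_sm is one possible Dutch spaCy pipeline, not necessarily the exact one we used:

```python
import spacy

nlp = spacy.load('nl_core_news_sm')  # any spaCy pipeline with a tokenizer works

def to_bio(text, phi_spans):
    """Tokenize with spaCy and assign BIO labels from (start, end, type) spans."""
    doc = nlp(text)
    tagged = []
    for token in doc:
        label = 'O'
        for start, end, phi_type in phi_spans:
            if token.idx == start:
                label = f'B-{phi_type}'   # token opens a PHI span
            elif start < token.idx < end:
                label = f'I-{phi_type}'   # token continues a PHI span
        tagged.append((token.text, label))
    return tagged

print(to_bio('Jan de Vries belde op 15 juli.', [(0, 12, 'Name')]))
# [('Jan', 'B-Name'), ('de', 'I-Name'), ('Vries', 'I-Name'), ('belde', 'O'), ...]
```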

For more details, you can take a look at our code on GitHub: https://github.com/nedap/deidentify.
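To complement the pointer above, here is a minimal Flair sketch of the BiLSTM-CRF setup from item 3. The data folder, file names, and column format are assumptions, the calls follow Flair's 0.x interface (the API has changed across versions), and the repository contains the exact setup:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-tagged data in CoNLL column format: one token and tag per line (assumed files).
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'},
                      train_file='train.txt', dev_file='dev.txt', test_file='test.txt')

# Dutch contextual string embeddings concatenated with Dutch fastText embeddings.
embeddings = StackedEmbeddings([
    FlairEmbeddings('nl-forward'),
    FlairEmbeddings('nl-backward'),
    WordEmbeddings('nl'),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=corpus.make_tag_dictionary(tag_type='ner'),
    tag_type='ner',
    use_crf=True,  # CRF output layer on top of the BiLSTM
)

ModelTrainer(tagger, corpus).train('models/bilstm-crf', max_epochs=100)
```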

Results

To evaluate the de-identification methods, we look at entity-level precision, recall, and F1 scores (the standard evaluation approach for NER systems). We split our dataset into training, validation, and test sets with a 60/20/20 ratio.
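For illustration, entity-level scores can be computed with the seqeval package. This toy example is not our exact evaluation script (that one is in the repository):

```python
from seqeval.metrics import classification_report, f1_score

# Gold and predicted BIO sequences, one inner list per sentence (toy data).
y_true = [['B-Name', 'I-Name', 'O', 'B-Date'], ['O', 'B-Name', 'O']]
y_pred = [['B-Name', 'I-Name', 'O', 'O'],      ['O', 'B-Name', 'O']]

# An entity counts as correct only if both its span and its type match exactly.
print(f'Entity-level F1: {f1_score(y_true, y_pred):.2f}')  # 0.80
print(classification_report(y_true, y_pred))
```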

Effectiveness: De-identification Quality

Let us first take a look at the effectiveness of the three methods in terms of the aggregated evaluation metrics in the table below.

Summary of evaluation results: micro-averaged scores are shown for each de-identification method.

Some observations can be made:

  • Both machine learning methods (CRF and BiLSTM-CRF) outperform the rule-based system DEDUCE by a large margin.
  • The BiLSTM-CRF provides a substantial improvement of 10 percentage points in recall over the traditional CRF method, while maintaining precision.
  • The precision of the rule-based method is fairly high at 0.81, which shows that many of the rules are well designed. However, as we’ll see below, we found that the lookup lists and pattern matching of the rule-based method don’t work well for PHI with a lot of variability (e.g., organization names, dates, addresses).

We now take a closer look at the effectiveness per PHI category to better understand what types of PHI we can expect an automatic de-identification method to capture (see figure below).

Model effectiveness per PHI tag for the BiLSTM-CRF and DEDUCE.

We can see the following:

  • The neural method performs at least as well as the rule-based method for all PHI categories.
  • Initials, IDs, and professions are the hardest types to detect, even for the machine learning methods. There are two issues at hand. First, initials and IDs can be hard to distinguish from abbreviations and medical measurements. Second, the profession PHI category has a high variability (think of all the different ways you could describe your job title) and therefore requires a lot of training data to be picked up by a machine learning model.
  • The “Other” PHI category can’t be captured by any of the automatic de-identification approaches. This has to be kept in mind when using de-identified records downstream: the records may still contain information which can be directly identifying (possibly in combination with background knowledge).

Impact of Training Set Size

We have seen that both machine learning methods outperform the rule-based method by a fair margin. As it is expensive to get annotated training data for the ML methods, we were curious how the size of the training set influences model quality.

Surprisingly, we find that both machine learning methods outperform DEDUCE with as little as 10% of the training data (see figure below). This suggests that the machine learning methods are to be preferred whenever training data is available or can be obtained.

Micro-averaged F1-score on the test set for varying training set sizes. The full training set (100%) consists of all training and validation sentences in our dataset (34,714).
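The experiment behind this figure is easy to sketch: subsample the training sentences, re-train, and score on the fixed test set. Here, train_crf and evaluate_f1 are hypothetical helpers standing in for the training and evaluation code sketched earlier:

```python
import random

def learning_curve(train_sents, test_sents, fractions=(0.1, 0.25, 0.5, 0.75, 1.0)):
    """Re-train on random subsets of the training data; the test set stays fixed."""
    scores = {}
    for fraction in fractions:
        subset = random.sample(train_sents, int(fraction * len(train_sents)))
        model = train_crf(subset)                          # hypothetical trainer
        scores[fraction] = evaluate_f1(model, test_sents)  # hypothetical entity-level scorer
    return scores
```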

Efficiency

Finally, we take a look at the inference efficiency of each method [2]. We observe that the BiLSTM-CRF is up to 6 times slower than DEDUCE and the CRF. We were able to train a cheaper variant of the neural method (BiLSTM-CRF fast) that uses a smaller Flair embedding layer with 1024 instead of 2048 hidden units and no pooling. This significantly reduced the number of model parameters from ~158M to ~20M while only having a minor impact on quality (F1 0.8999 vs. 0.8904).

Processing duration to de-identify 5000 sentences of the Dutch CoNLL03 corpus (lower is better).
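Such a benchmark boils down to timing a tagger over a fixed batch of sentences. A minimal sketch, where deidentify is a hypothetical stand-in for any of the three methods:

```python
import time

def sentences_per_second(deidentify, sentences):
    """Wall-clock inference throughput over a fixed batch of sentences."""
    start = time.perf_counter()
    deidentify(sentences)  # hypothetical: runs one of the three methods
    return len(sentences) / (time.perf_counter() - start)
```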

Conclusion

In this blog post, we compared three NLP approaches to de-identification and we evaluated their effectiveness and efficiency. In short, we found the following.

  • Rule-based methods can provide a strong baseline and are computationally cheap.
  • Machine learning methods (CRF and BiLSTM-CRF) are preferable whenever (even small) training datasets are available.
  • Machine learning methods lack the explainability/transparency of rule-based methods and are more expensive from a computational perspective.

There are other interesting aspects of de-identification such as generalizability and robustness across different domains of healthcare. Generalizability is an ongoing NLP research topic, so we expect more results and techniques on that in the future. Lastly, we have seen that none of the de-identification methods can provide perfect de-identification quality. This makes it especially challenging to deploy de-identification methods in practice. Other (non-technical) processes are needed to mitigate the privacy risks of exposing PHI.

Notes

  • [1] Among the few available de-identification benchmark datasets are the English i2b2/UTHealth and nursing notes corpora.
  • [2] Our benchmark system has a GeForce RTX 2080 Ti GPU and an Intel Xeon Gold 6126 CPU with ~400GB of RAM.
