Techniques for De-Identification of Unstructured Data

Vamsi Adari
Published in mfine-technology · Jul 3, 2019

De-identification is the process used to prevent a person’s identity from being connected with information. This post is about recognising entities in unstructured data and de-identifying them. Chat data is a good example of unstructured data, and de-identification in this context means the identification and removal of names, occupations and addresses from text conversations.

What is the difference between structured and unstructured data?

Structured data is data that has been organised into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis. Figure 1 shows an illustration of structured data.

Ex: names, dates, addresses, credit card numbers, stock information, geolocation, and more.

Figure 1. Illustration of structured data

Unstructured data is most often categorised as qualitative data, and it cannot be processed and analysed using conventional tools and methods.

Ex: Text, video, audio, mobile activity, social media activity, surveillance imagery — the list goes on and on.

For structured data, we can easily de-identify the columns that contain personal information. For unstructured data, however, identifying which parts of the text contain personal details is a much harder task.

De-identifying unstructured data:

The task of automatic de-identification falls under Named Entity Recognition (NER). De-identification systems can be broken down into the following three categories:

  1. Rule-Based Systems
  2. Machine Learning Systems
  3. Deep Learning Systems

1. Rule-Based Systems:

Rule-based systems make heavy use of pattern matching, such as dictionaries (or gazetteers), regular expressions and other patterns. They need extensive modification of rules to get better accuracy, and they lack robustness. Critically, these systems cannot handle context: blindly removing every pattern match could render a medical text unreadable.
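To make this concrete, here is a minimal sketch of a rule-based de-identifier. The regular expressions and the tiny name dictionary are illustrative assumptions, not from any production system:

```python
import re

# Illustrative patterns only -- a real system would use far more
# extensive dictionaries (gazetteers) and regular expressions.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{10}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

# A tiny dictionary (gazetteer) of known names -- purely for illustration.
NAME_GAZETTEER = {"john", "priya", "ramesh"}

def rule_based_deidentify(text):
    # Replace every regex match with its category placeholder.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Replace gazetteer hits word by word.
    words = [
        "[NAME]" if w.lower().strip(".,!?") in NAME_GAZETTEER else w
        for w in text.split()
    ]
    return " ".join(words)

print(rule_based_deidentify("John called 9876543210 on 03/07/2019"))
# -> "[NAME] called [PHONE] on [DATE]"
```

Note how a new name or a new phone format immediately requires a new rule; this is exactly the maintenance burden described above.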

2. Machine Learning Systems:

To address the lack of robustness in rule-based systems, researchers turned to machine learning based approaches. These methods take a sentence as input and predict a class for each word in the sentence. Support vector machines (SVMs), conditional random fields (CRFs), and random forests are some of the commonly used algorithms.
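As a sketch, a CRF-based word tagger using the sklearn-crfsuite library might look like the following; the hand-crafted features and the toy training data are illustrative assumptions:

```python
import sklearn_crfsuite

def word_features(sent, i):
    # Hand-crafted features -- exactly the feature engineering
    # that ML-based systems depend on.
    word = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

# Toy training data: one tokenised sentence with per-word labels.
sentences = [["John", "lives", "in", "Mumbai"]]
labels = [["NAME", "O", "O", "ADDRESS"]]

X = [[word_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))  # per-word label predictions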

The drawback of ML-based systems is that, because classification is a supervised learning task, the algorithms maximise the likelihood of the training data, so rare patterns are not predicted reliably. These systems also rely heavily on feature engineering, and they do not generalise well: a model may achieve high test accuracy on one dataset yet perform poorly on a different type of dataset.

3. Deep Learning Systems:

With the disadvantages of both approaches to building a de-identification system in mind, the current state-of-the-art systems employ deep learning techniques to achieve better results than machine learning systems while also not requiring the time-consuming process of feature engineering. Deep learning is a subset of machine learning that uses multiple layers of Artificial Neural Networks (ANNs) and has been very successful at most Natural Language Processing (NLP) tasks. Recent advances in deep learning and NLP, especially with regard to named entity recognition, have allowed these systems to achieve better results.

Preparing the Input Sentence for de-identification:

Figure 2. Flow diagram to prepare the input sentence

Figure 2 shows the process flow of preparing input sentences for de-identification. The sentence is first passed through a Preprocessor block, which splits any given input sentence into words and removes stop words and punctuation.

Then, the Tagger block assigns each word a part-of-speech (POS) tag and a capital tag (whether the first character is uppercase or not), as sketched below.
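A minimal sketch of the Preprocessor and Tagger blocks using NLTK (assuming the punkt tokeniser, stop-word list and POS tagger models have been downloaded):

```python
import string
import nltk
from nltk.corpus import stopwords

# One-time downloads (assumed already done):
# nltk.download("punkt"); nltk.download("stopwords")
# nltk.download("averaged_perceptron_tagger")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(sentence):
    # Preprocessor block: split into words, drop stop words and punctuation.
    tokens = nltk.word_tokenize(sentence)
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and t not in string.punctuation]

def tag(words):
    # Tagger block: POS tag plus a capital tag for each word.
    return [(word, pos, word[0].isupper())
            for word, pos in nltk.pos_tag(words)]

print(tag(preprocess("John is a teacher living in Mumbai")))
# e.g. [('John', 'NNP', True), ('teacher', 'NN', False), ...]
```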

The output from the Tagger block then passes through Word2Vec, which converts each word into a vector based on a selected pre-trained corpus. The GloVe embeddings, for example, can be used for this task.
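As a sketch, pre-trained GloVe vectors can be loaded through gensim's downloader; the model name below is one of gensim's published pre-trained models, chosen here as an illustrative assumption:

```python
import gensim.downloader

# Load pre-trained 100-dimensional GloVe embeddings (trained on Wikipedia).
glove = gensim.downloader.load("glove-wiki-gigaword-100")

vector = glove["teacher"]  # 100-dimensional numpy vector for the word
print(vector.shape)        # (100,)
print(glove.most_similar("teacher", topn=3))
```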

In parallel, the output from the Tagger block passes through One-Hot Encoding, which converts the given labels (during training) into one-hot vectors.
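A minimal sketch of the one-hot encoding step; the label set here is an illustrative assumption:

```python
import numpy as np

# Illustrative label set for de-identification.
LABELS = ["O", "NAME", "OCCUPATION", "ADDRESS"]
LABEL_TO_INDEX = {label: i for i, label in enumerate(LABELS)}

def one_hot(labels):
    # Convert a sequence of string labels into one-hot vectors.
    encoded = np.zeros((len(labels), len(LABELS)))
    for i, label in enumerate(labels):
        encoded[i, LABEL_TO_INDEX[label]] = 1.0
    return encoded

print(one_hot(["NAME", "O", "ADDRESS"]))
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]]
```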

Finally, the processed outputs pass through a deep learning architecture.

Deep Learning Architecture:

Figure 3. Deep learning architecture

The most important part of the architecture is the Bi-directional LSTM layer, a variant of the bidirectional RNN composed of two independent LSTM layers: one network is fed the input in the normal time direction while the other is fed the input in the reverse time direction. This allows the model to uncover more patterns, since the amount of input information is increased: the model considers not only the tokens after a token of interest but also those before it. The outputs of the two networks are combined by concatenation.
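A minimal Keras sketch of such an architecture; the vocabulary size, sequence length, layer sizes and tag set are illustrative assumptions, not values from this post:

```python
import tensorflow as tf

VOCAB_SIZE = 10000   # illustrative values only
EMBED_DIM = 100      # e.g. matching 100-d GloVe vectors
MAX_LEN = 50
N_TAGS = 4           # O, NAME, OCCUPATION, ADDRESS

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    # Word-vector layer (could be initialised with GloVe weights).
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Two LSTMs, forward and backward; outputs concatenated by default.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)),
    # One softmax over the tag set for every token in the sequence.
    tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(N_TAGS, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```

Keras's Bidirectional wrapper concatenates the forward and backward outputs by default, matching the concatenation described above.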

Examples:
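For instance, a chat message such as “Hi, I am John, a teacher from Mumbai” would be de-identified to “Hi, I am [NAME], a [OCCUPATION] from [ADDRESS]” (an illustrative example).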

Further Improvements:

The architecture of Figure 3 can be improved further by adding CNN or CRF modules before the prediction layer to achieve state-of-the-art results (https://arxiv.org/pdf/1810.01570.pdf).
