Personally Identifiable Information (PII) extraction using Watson NLP Library

Sahil Desai
Towards Generative AI
5 min read · Jul 25, 2023

Preserving Confidentiality: A Comprehensive PII Extraction Tutorial with Watson NLP

Personally identifiable information (PII) extraction is the process of detecting and retrieving personal information from sources such as websites, databases, and documents. PII is any data that can identify an individual, including but not limited to their name, address, phone number, email address, social security number, driver’s license number, and credit card number.

This blog shows how to extract PII entities from text using pre-trained models. It also shows how to fine-tune those models on custom PII entities, so that they detect specific types of PII that the pre-trained models may miss.

Pre-trained models have already been trained on large datasets to recognize common PII entities. With them, you can get results quickly and with good accuracy, without training anything yourself.

The pre-trained models may not cover every type of PII entity, and this is where fine-tuning helps. Fine-tuning means training a model further on a smaller dataset that contains the specific PII entities you want to identify. This can improve the accuracy of PII entity recognition for your use case.

PII extraction using pre-trained models

There are two ways to identify and extract PII entities from text. The first is a rule-based model. It is ideal for pattern-like entity types such as phone numbers, email addresses, and other numeric identifiers, and it uses predefined rules to recognize and extract these entities from text.

The second model is suited to more complicated entity types such as organizations, locations, and people. This model is trained on labeled data and uses deep learning to learn the patterns and connections between words and their corresponding entity types, which it then uses to recognize and extract entities from text.

To extract PII, we use pre-trained Watson NLP models. You can download and load these models with the following steps:
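
To make the idea of predefined rules concrete, here is a minimal sketch of a rule-based extractor built on Python's standard re module. It is not how the Watson NLP rule-based model is implemented. The RULES table, the extract_pii function, and the two deliberately simple, US-centric patterns are all assumptions made for illustration.

```python
import re

# Hypothetical rule set: one regular expression per entity type.
# The patterns are intentionally simple and will miss many real-world formats.
RULES = {
    "EmailAddress": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PhoneNumber": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def extract_pii(text):
    """Return (begin, end, entity_type, matched_text) tuples sorted by position."""
    mentions = []
    for entity_type, pattern in RULES.items():
        for match in pattern.finditer(text):
            mentions.append((match.start(), match.end(), entity_type, match.group()))
    return sorted(mentions)


sample = "Contact Lori at lori.gross@example.com or 555-123-4567."
for mention in extract_pii(sample):
    print(mention)
```

Rules like these are cheap to write and easy to audit, which is why they fit entities with a fixed surface format. They have no notion of context, so names, organizations, and locations are better left to a trained model.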

import watson_nlp

# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
# Load the pre-trained BiLSTM PII model
bilstm_model = watson_nlp.load(watson_nlp.download('entity-mentions_bilstm_en_pii'))
# Load the pre-trained rule-based (RBR) PII model
rbr_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_multi_pii'))

After loading the models, we can call the run() function of each Watson NLP model to extract entities. This lets us see various kinds of entities such as persons, credit card numbers, and social security numbers.

BiLSTM Model Result

The above results show that the pre-trained BiLSTM model identifies “Lori Gross” as a person’s name. The next result shows that the RBR model detects credit card numbers and social security numbers.
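
As an illustration of the kind of check a rule-based detector could apply to numeric PII, the sketch below pairs a digit pattern with the Luhn checksum for card numbers and uses a plain format pattern for US social security numbers. This is an assumption about how such rules might look, not a description of the Watson NLP RBR model. The function names and patterns are hypothetical, and 4111 1111 1111 1111 is a widely used test card number.

```python
import re

# Hypothetical patterns: 13 to 19 digits with optional single separators,
# and the common NNN-NN-NNNN layout for US social security numbers.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk from the rightmost digit and double every second one.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_numeric_pii(text):
    """Return (entity_type, matched_text) pairs for SSNs and checksum-valid card numbers."""
    mentions = []
    for match in SSN_PATTERN.finditer(text):
        mentions.append(("SocialSecurityNumber", match.group()))
    for match in CARD_PATTERN.finditer(text):
        if luhn_valid(match.group()):
            mentions.append(("CreditCardNumber", match.group()))
    return mentions


text = "Card 4111 1111 1111 1111, SSN 123-45-6789, order 4111 1111 1111 1112."
print(find_numeric_pii(text))
```

The checksum filters out digit runs that merely look like card numbers, such as the order number in the sample text, which a pattern alone would accept.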

RBR Model Result

The notebook includes practical code that you can use to try out this feature.

PII extraction using fine-tuned models

To get started, use the Faker library to generate values for the custom PII types, then build training sentences that incorporate those values. So that the model learns to recognize the PII entities and assign the right labels, annotate each entity by passing its index location in the sentence along with its label.
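
The labelling step can be sketched as follows. To keep the example self-contained, it draws values from small hard-coded pools with the standard random module instead of calling Faker. The pools, the sentence template, the make_example function, and the output layout (a text plus a list of begin, end, and type entries) are assumptions for illustration and may differ from the format the notebook uses.

```python
import random

# Hypothetical value pools standing in for Faker-generated data.
NAMES = ["Lori Gross", "Amit Shah", "Maria Lopez"]
DEGREES = ["Bachelor", "Master", "PhD"]


def random_employee_id(rng):
    """Build a made-up employee id such as EMP followed by six digits."""
    return "EMP" + "".join(rng.choice("0123456789") for _ in range(6))


def make_example(rng):
    """Build one training sentence and the character spans of its PII values."""
    fields = [
        ("Name", rng.choice(NAMES)),
        ("employee_id", random_employee_id(rng)),
        ("degree_level", rng.choice(DEGREES)),
    ]
    template = "{} has employee id {} and holds a {} degree."
    text = template.format(*(value for _, value in fields))

    mentions = []
    cursor = 0
    for label, value in fields:
        # Search from the end of the previous mention so spans stay in order.
        begin = text.index(value, cursor)
        end = begin + len(value)
        mentions.append({"begin": begin, "end": end, "type": label})
        cursor = end
    return {"text": text, "mentions": mentions}


rng = random.Random(42)
example = make_example(rng)
print(example["text"])
for mention in example["mentions"]:
    print(mention["type"], "->", example["text"][mention["begin"]:mention["end"]])
```

Because the values are inserted by the generator, their positions are known exactly, so no manual annotation is needed. Swapping the pools for Faker providers changes only how the values are produced.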

Fine-Tune BiLSTM Model for PII Extraction

Once you have created a labelled training dataset with samples of PII entities, you can fine-tune a BiLSTM model for PII extraction using the following code.

# Fine-tune the BiLSTM model using custom PII
bilstm_custom = bilstm_model.train(train_iob_stream,
                                   dev_iob_stream,
                                   embedding=glove_model.embedding,
                                   num_train_epochs=5,
                                   learning_rate=0.005,
                                   lstm_size=16)

In the fine-tuning call above, train_iob_stream is the training data, dev_iob_stream is the held-out data of 1,000 sentences used for testing, and glove_model.embedding is the GloVe embedding used to encode the text.
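
The stream names refer to IOB tagging, in which each token is marked as the beginning of an entity (B), inside one (I), or outside any entity (O). The sketch below shows the scheme on a single sentence with a simple regex tokenizer. It illustrates the general tagging idea only, not Watson NLP's internal data stream, and the to_iob function and its tokenizer are assumptions.

```python
import re


def to_iob(text, mentions):
    """Tag each token as B-<label>, I-<label>, or O.

    `mentions` is a list of (begin, end, label) character spans.
    Tokens are runs of word characters or single punctuation marks.
    """
    tagged = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for begin, end, label in mentions:
            if begin <= match.start() < end:
                # The first token of a span gets B-, later tokens get I-.
                tag = ("B-" if match.start() == begin else "I-") + label
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Lori Gross has SSN 123-45-6789."
name, ssn = "Lori Gross", "123-45-6789"
mentions = [
    (text.index(name), text.index(name) + len(name), "Name"),
    (text.index(ssn), text.index(ssn) + len(ssn), "SocialSecurityNumber"),
]
for token, tag in to_iob(text, mentions):
    print(f"{token}\t{tag}")
```

Note that the tokenizer splits the social security number at its hyphens, so one entity covers five tokens. The B and I prefixes are what let a sequence model reassemble them into a single mention.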

BiLSTM Fine-Tune Result

As the above result shows, the fine-tuned BiLSTM model can identify all of the custom PII entity types it was trained on: SocialSecurityNumber, CreditCardNumber, Name, employee_id, degree_level, filed_of_study, and salary.

Fine-Tune SIRE Model for PII Extraction

The same kind of labeled training dataset with samples of PII entities can be used to fine-tune a SIRE model for PII extraction.

# Fine-tune SIRE using custom PII
sire_custom = watson_nlp.blocks.entity_mentions.SIRE.train(train_iob_stream,
                                                           'en',
                                                           mentions_train_template,
                                                           feature_extractors=[default_feature_extractor])

In the fine-tuning call above, train_iob_stream is the training data, which includes 10,000 sentences, and en is the language code for English. mentions_train_template is the SIRE entity mention template that we loaded at the beginning. It is the base training template for the entity mentions SIRE block, which uses the CRF algorithm.

SIRE Fine-Tune Result

As the above result shows, the fine-tuned SIRE model can identify all of the custom PII entity types it was trained on: SocialSecurityNumber, CreditCardNumber, Name, employee_id, degree_level, filed_of_study, and salary.

The notebook includes practical code that you can use to try out this feature.

Conclusion

Watson NLP models, which you can both use as-is and fine-tune, make it easier to detect personally identifiable information (PII) in text. The pre-trained models are good at recognizing the most common types of PII. If you fine-tune them with industry-specific data, they can do an even better job on the specific PII types and language used in that industry. With these models, you can simplify PII extraction, improve the accuracy of what you detect, and support compliance with privacy and data security rules.

