Text anonymization with Presidio and Faker

Oleg Litvinov
6 min read · Sep 13, 2021


Photo by Matthew Henry on Unsplash

Introduction

On one of my projects I faced the problem of dealing with Personally Identifiable Information (PII). To share our data with third parties, we decided to add an anonymization step to the preprocessing. In this article I will describe an example of data anonymization using two awesome libraries: Presidio and Faker.

Agenda:

  1. Presidio analyzer for finding sensitive data
  2. Presidio anonymizer
  3. Faker for generating diverse synthetic entities
  4. Final pipeline for text anonymization

The desired result will look like the following:

The link to the full notebook with examples will be shared in the end of the article.

Analyzer

At first, we need to install the library:

!pip install presidio-analyzer

Below you will find a brief diagram of how it works. Generally, it uses NER, regex, and rules for detecting personal information.

Presidio Analyzer diagram from the documentation

Presidio supports both spaCy and Stanza as its internal NLP engine for finding named entities. The details may be found here. I prefer spaCy, and that’s why I will download one of its models.

!python -m spacy download en_core_web_md

Then we need to explicitly select this model, because by default the analyzer uses the spaCy en_core_web_lg model for the English language:

# Create configuration containing engine name and models
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"}],
}

With this configuration in place, we can create the analyzer:

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Create NLP engine based on configuration
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine = provider.create_engine()

# The languages are needed to load country-specific recognizers
# for finding phones, passport numbers, etc.
analyzer = AnalyzerEngine(nlp_engine=nlp_engine,
                          supported_languages=["en"])

Let us analyze one example:

example_text = "Hi. My name is Oleg. I was born in Saint-Petersburg, Russia in 1997. Some random phone number: 51-855-831-2384. Yesterday I ate soup. Send something there helpline@lgbt.foundation. IBAN example AT483200000012345864"

# language is a required parameter. So if you don't know
# the language of each particular text, use a language detector
results = analyzer.analyze(text=example_text,
                           language='en')
for res in results:
    print(res)

Here we see entities, their start and end indices, and a confidence score. If you want to analyze only a limited set of entities, pass a list with their names to the entities parameter of the analyze function.
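
For example, a sketch restricting the analysis to persons and email addresses (both are built-in Presidio entity types):

results_subset = analyzer.analyze(text=example_text,
                                  entities=["PERSON", "EMAIL_ADDRESS"],
                                  language='en')
for res in results_subset:
    print(res)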

And here is the same text with the entities marked, each coloured with a random non-black colour:

We see that there is a collision at the end as a domain is a part of an email address.

From the documentation:

As the input text could potentially have overlapping PII entities, there are different anonymization scenarios:

• No overlap (single PII) — a single PII over a text entity; uses a given or default anonymizer to anonymize and replace the PII text entity.
• Full overlap of PIIs — when one text has several PIIs, the PII with the higher score will be taken. Between PIIs with identical scores, the selection will be arbitrary.
• One PII is contained in another — the anonymizer will use the PII with the larger text.
• Partial intersection — both will be returned concatenated.

Also, not all entities are captured well: the phone number is cut. For such cases, Presidio supports custom recognizers tailored to your particular patterns. These may be used, for example, for finding URLs, uncommon symbol sequences, or specific phrases.
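
As a sketch, here is a custom pattern recognizer that captures the full phone format from the example above; the pattern name and regex are purely illustrative:

from presidio_analyzer import Pattern, PatternRecognizer

# A simplified pattern matching numbers like 51-855-831-2384
phone_pattern = Pattern(name="intl_phone",
                        regex=r"\d{2}-\d{3}-\d{3}-\d{4}",
                        score=0.5)
phone_recognizer = PatternRecognizer(supported_entity="PHONE_NUMBER",
                                     patterns=[phone_pattern])

# Register it so it runs alongside the built-in recognizers
analyzer.registry.add_recognizer(phone_recognizer)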

Finally, the decision for each entity may be explained; see the details on the decision process.
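
A sketch of requesting the explanation: the analyze function accepts a return_decision_process flag, and each result then carries an analysis_explanation.

results_explained = analyzer.analyze(text=example_text,
                                     language='en',
                                     return_decision_process=True)
for res in results_explained:
    print(res.analysis_explanation)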

Anonymizer

The anonymizer is the second pillar of the Presidio library.

Again, the library:

!pip install presidio-anonymizer

From the diagram below we see that on the input side the anonymizer expects the original text and the detected PII from the analyzer.

Presidio Anonymizer diagram from the documentation

Like the analyzer, the anonymizer has an engine. Let us launch it!

from presidio_anonymizer import AnonymizerEngine

anonymizer = AnonymizerEngine()
anonymized_text = anonymizer.anonymize(text=example_text,
                                       analyzer_results=results).text
print(anonymized_text)

We can see that by default the entities are replaced with their entity name. Quite well. But can we make it more flexible? Of course! Presidio has operators for this.

from presidio_anonymizer.entities.engine import OperatorConfig

operators = {"PERSON": OperatorConfig(operator_name="replace",
                                      params={"new_value": "REPLACED_NAME"}),
             "LOCATION": OperatorConfig(operator_name="mask",
                                        params={'chars_to_mask': 10,
                                                'masking_char': '*',
                                                'from_end': True}),
             "DEFAULT": OperatorConfig(operator_name="redact")}

anonymized_text = anonymizer.anonymize(text=example_text,
                                       analyzer_results=results,
                                       operators=operators).text
print(anonymized_text)

We have masked locations, replaced persons with a pre-defined value, and removed (the redact operator) all other entities found. In addition to that, you may use hash, encrypt, and custom operator names. The latter is the most valuable, from my perspective. With a custom operator we can, for example, apply custom (surprisingly) logic to the original entity, select randomly from a set of pre-defined values, or even generate a new anonymized value from scratch!

operators={"PERSON": OperatorConfig(operator_name="custom", 
params={"lambda": lambda x: random.choice(['Neo', 'Paul'])}),
"DEFAULT": OperatorConfig(operator_name="custom", params={"lambda": lambda x: x[::-1]})}
anonymized_text = anonymizer.anonymize(text=example_text,
analyzer_results=results,
operators=operators).text
print(anonymized_text)

If you run the cell several times, you may notice that PERSON entities are sometimes replaced with Neo and sometimes with Paul. Other entities are reversed.
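
For completeness, a brief sketch of the hash and encrypt operators; the key below is a dummy 16-character (128-bit) AES key for illustration only:

operators = {"PERSON": OperatorConfig(operator_name="hash",
                                      params={"hash_type": "sha256"}),
             "DEFAULT": OperatorConfig(operator_name="encrypt",
                                       params={"key": "WmZq4t7w!z%C&F)J"})}
anonymized_text = anonymizer.anonymize(text=example_text,
                                       analyzer_results=results,
                                       operators=operators).text
print(anonymized_text)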

Let us add more power to the anonymization part!

Faker

The last but not least component of the pipeline is Faker, a library for generating fake data. As always:

!pip install Faker

Here is a basic example from the library:

from faker import Faker

fake = Faker()
print('random name:', fake.name())
print('random address:', fake.address())
print('random phone number:', fake.phone_number())

Generally, Faker operates with large collections of local names, surnames, prefixes, etc. Its simple interface and variety make it a better choice than hand-maintained lists of replacement values. Interestingly, we may limit the locales from which we generate our entities.

fake = Faker(locale=['ja_JP'])
for i in range(5):
    print(fake.name())

fake = Faker(locale=['en_US', 'en_GB', 'en_CA', 'fr_FR'])
for i in range(10):
    print(fake.name())

To use Faker during the anonymization step, we need to create operators with lambda functions.

fake_operators = {
    "PERSON": OperatorConfig("custom", {"lambda": lambda x: fake.name()}),
    "PHONE_NUMBER": OperatorConfig("custom", {"lambda": lambda x: fake.phone_number()}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.email()}),
    "LOCATION": OperatorConfig("replace", {"new_value": "USA"}),
    "DEFAULT": OperatorConfig(operator_name="mask",
                              params={'chars_to_mask': 10,
                                      'masking_char': '*',
                                      'from_end': False}),
}

anonymized_text = anonymizer.anonymize(text=example_text,
                                       analyzer_results=results,
                                       operators=fake_operators).text
print(anonymized_text)

And that’s it! The tool works quite well out of the box and may be fine-tuned with custom recognizers and by analyzing the decision process.

Full notebook may be found here.

Conclusion

In this article I have briefly described a Python pipeline for text anonymization. The suggested pipeline may be easily adjusted for your particular needs with heavier NER models (en_core_web_trf, for example), custom rules, adjusted confidence thresholds, etc. Moreover, you can upgrade your pipeline to support multiple languages.
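
As a sketch, a multi-language configuration could look like this, assuming the Spanish model has been downloaded with python -m spacy download es_core_news_md:

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"},
               {"lang_code": "es", "model_name": "es_core_news_md"}],
}
provider = NlpEngineProvider(nlp_configuration=configuration)
analyzer = AnalyzerEngine(nlp_engine=provider.create_engine(),
                          supported_languages=["en", "es"])
results_es = analyzer.analyze(text="Mi nombre es Oleg", language="es")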

From the productionization point of view, the code may also be launched with Spark by adding several lines of code (check the full code). Or you can use Presidio as an HTTP service.
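
As a rough sketch (not the exact code from the notebook), a UDF could wrap the two engines, assuming a DataFrame df with a string column "text":

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

_engines = {}

def anonymize_text(text):
    # Build the engines lazily so each executor creates its own instances
    if not _engines:
        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine
        _engines["analyzer"] = AnalyzerEngine()
        _engines["anonymizer"] = AnonymizerEngine()
    results = _engines["analyzer"].analyze(text=text, language="en")
    return _engines["anonymizer"].anonymize(text=text,
                                            analyzer_results=results).text

anonymize_udf = F.udf(anonymize_text, StringType())
df = df.withColumn("anonymized_text", anonymize_udf(F.col("text")))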

Moreover, Presidio works with images.
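
A minimal sketch with the separate presidio-image-redactor package (it relies on OCR, so Tesseract must be installed as well); the file names here are hypothetical:

from PIL import Image
from presidio_image_redactor import ImageRedactorEngine

image = Image.open("document_scan.png")
redactor = ImageRedactorEngine()
# Fill the detected PII regions with a solid colour (here: black)
redacted_image = redactor.redact(image, (0, 0, 0))
redacted_image.save("document_scan_redacted.png")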

PS

I really hope you enjoyed the article and found it useful for you and your needs. Any comments and suggestions are welcome!
