Data Governance, Data Protection

How to anonymize data with Presidio

Microsoft’s SDK for Data Protection and Anonymization

Bruno Cordeiro
Bravo Lab


Image by Freepixels

The phrase “Data is the new oil” became popular after The Economist published a story titled “The world’s most valuable resource is no longer oil, but data” in 2017.

If data is the new oil, then data breaches are its oil spills: they can be as devastating to the people affected as a spill is to the environment.

A data breach is a confirmed incident in which sensitive, confidential or otherwise protected data has been accessed and/or disclosed in an unauthorized fashion. Data breaches may involve personal health information (PHI), personally identifiable information (PII), trade secrets or intellectual property.

Source: https://searchsecurity.techtarget.com/definition/data-breach

Famous cases of data breach

Adobe: In October 2013, it was reported that hackers had stolen nearly 3 million encrypted customer credit card records, in addition to an estimated 150 million usernames and hashed passwords.

eBay: In May of 2014, a reported attack exposed its entire account list of 145 million users, including names, addresses, dates of birth and encrypted passwords.

Facebook: In September 2018, the social network reported that an attack on its computer network had exposed the personal information of nearly 50 million users. Another case was reported in April 2020, in which the profile data of over 267 million users was stolen and reportedly offered for sale on the dark web.

Canva: In May 2019, the Australian graphic design tool website had an incident exposing email addresses, usernames, names, cities of residence, and passwords of 137 million users.

GDPR

Public concern over privacy has shaped business practice in Europe since before the Internet. Strict rules on how companies may use the personal data of European citizens have made Europe a model for how personal data should be protected and regulated.

The General Data Protection Regulation (GDPR) was designed to harmonize data privacy laws across Europe and to give individuals broader protection and rights. It requires companies that collect data on people in European Union (EU) countries to comply with strict rules for protecting that data.

Personally Identifiable Information (PII)

PII has become a major concern because criminals can exploit it to stalk a person or steal their identity, to help plan other crimes, or simply to profit by collecting and reselling it.

PII is any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. Source: https://www.dol.gov/general/ppii

What’s Data Anonymization?

To comply with the GDPR and other data protection legislation, companies need to have a privacy protection policy in place. One of the activities involved in data protection is data anonymization: personally identifiable information is removed from data sets so that the people the data describes remain anonymous.

What’s Presidio?

There are several tools and services for PII anonymization. In this article we are going to focus on Presidio, a Data Protection and Anonymization API created by Microsoft.

Image from https://microsoft.github.io/presidio/

Presidio (Origin from Latin praesidium ‘protection, garrison’) helps to ensure sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data and more.

Source: https://github.com/microsoft/presidio

Presidio Architecture

The Presidio architecture is composed of two main modules for anonymizing PII in text: Presidio Analyzer, which detects the PII, and Presidio Anonymizer, which replaces or masks what was detected.

Presidio Analyzer

Image by https://microsoft.github.io/presidio/analyzer/

Presidio Anonymizer

Image by https://microsoft.github.io/presidio/analyzer/
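
To make the division of labor between the two modules concrete, here is a minimal sketch that uses only the Python standard library. It is not Presidio’s implementation: the `Finding` class, the `analyze` and `anonymize` functions, and the two regular expressions are hypothetical stand-ins chosen for illustration. The point is the shape of the pipeline: one stage reports where the PII is, and a second stage rewrites the text using only those reported positions.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    """One detected piece of PII: what kind it is and where it sits in the text."""
    entity_type: str
    start: int
    end: int


# Toy patterns for illustration only; real detectors are far more robust.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def analyze(text):
    """Stage 1 (analyzer role): locate PII without changing the text."""
    findings = [
        Finding(entity_type, match.start(), match.end())
        for entity_type, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(findings, key=lambda f: f.start)


def anonymize(text, findings):
    """Stage 2 (anonymizer role): replace each reported span with a placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"<{f.entity_type}>" + text[f.end:]
    return text


text = "Contact Jane at jane.doe@example.com or 555-123-4567."
findings = analyze(text)
print(anonymize(text, findings))
```

Note that the name “Jane” passes through untouched: a pair of regular expressions cannot recognize names, which is one reason a dedicated detection module is useful.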

Anonymizing Data

Now that you have a clear idea of why data privacy matters and an overview of Presidio, it’s time to get hands-on and see what Presidio can do for you.
Let’s get started by installing the Presidio Anonymizer:

pip install presidio-anonymizer

Here is a snippet in which you pass text containing sensitive data to a function and get the anonymized text back:

Final Thoughts

We covered the basics of data protection and finished with a very simple code sample. Presidio provides anonymization methods beyond the one shown here, and it can be extended to support additional ones. If you want to dive deeper into the capabilities of Presidio, you can access this tutorial on adding new anonymization methods and this one on supporting detection of new types of PII entities.

Even if you are not a data professional, I hope this article raises your awareness of how sensitive your personal data is and encourages you to pay closer attention to how the services and websites you use protect it.

