Data privacy: anonymization and pseudonymization — Part 1

Andreas Buckenhofer
Mercedes-Benz Tech Innovation
Oct 30, 2020

Trustworthy handling of personal data

The future is digital; there is no doubt about it. How we handle that data is crucial. Compliance with the General Data Protection Regulation (GDPR) is not the only concern that plays a central role: everyone is responsible for the trustworthy handling of personal data. “We need to defend the interests of those whom we’ve never met and never will“ (Jeffrey D. Sachs). This article summarizes my talk on anonymization techniques at the DOAG conference.

Personal data

The processing of personal data must be carried out in accordance with the law. Personal data such as a name, date of birth, email address, or place of residence is fairly obvious. But data like dynamic IP addresses or the vehicle identification number (VIN) also counts as personal data. The processing of personal data always requires a legal basis (e.g. consent) and a previously determined purpose. As a privacy-friendly measure, anonymization can be employed to limit the processing of personal data where feasible. Note, though, that the anonymization of personal data is itself a form of processing and therefore also requires a legal basis.

Pseudonymization and anonymization are very different but often confused. The diagram below summarizes both. Pseudonymization is based on techniques such as hashing or tokenization, while anonymization techniques remove the personal reference entirely. Even after anonymization, a residual risk remains that data can be re-identified; the effort for re-identification in terms of time, money, and complexity must be high enough to reduce this risk to a minimum. The GDPR no longer applies to anonymous data. Pseudonymization is different: with a “decryption key”, it is still possible to trace the data back to individuals, so the GDPR remains applicable. A minimal pseudonymization sketch follows the example below.

The example illustrates the three variants:

  • Personal: “Max Mustermann drives the vehicle with the registration number UL-WB 134”.
  • Pseudonymous: “Mr. fe435rat5 drives the vehicle with the registration number a9f2ebfa70b02d97f7” (if there is a decryption key).
  • Anonymous: “A car owner from Ulm drives an A180” (provided the reference group is large enough).
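To make the pseudonymous variant more concrete, here is a minimal sketch in Python. It assumes keyed hashing (HMAC-SHA256) as the pseudonymization technique; the key value and the truncation to 16 characters are illustrative choices, not part of the talk. The secret key plays the role of the “decryption key”: whoever holds it can re-compute the mapping and link tokens back to identifiers, which is exactly why pseudonymized data remains personal data.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it must be generated randomly and
# stored securely, separate from the pseudonymized data.
SECRET_KEY = b"replace-with-a-securely-managed-key"

def pseudonymize(value: str) -> str:
    """Derive a stable pseudonym from a personal identifier via HMAC-SHA256.

    The same input always maps to the same token, so records remain linkable.
    Without the key, the mapping cannot be reproduced.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("Max Mustermann"))  # stable token for the driver's name
print(pseudonymize("UL-WB 134"))       # stable token for the registration number
```

A plain, unkeyed hash would be weaker: anyone can hash a guessed name or registration number and compare. The key, kept separate from the data, is what makes this kind of pseudonymization defensible.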

Examples of failed anonymization

Incorrect processing of personal data can result in substantial fines, not to mention damage to the company’s reputation.

A taxi company in Denmark did not sufficiently anonymize its customers’ data and was fined by the authorities, who rejected the company’s defence that the telephone number was part of the primary key and could not be deleted. The anonymization failure caused by keeping the phone numbers may seem obvious. Still, there is also a lesson for physical database design: think twice before using personal data as a primary key.

Professor Latanya Sweeney demonstrated in 1997 that 87% of Americans can be re-identified if date of birth, gender, and zip code are present in a data source. She linked a public voter register with other data sources to identify the individuals behind the records. Several years later, the Facebook myPersonality app published data along with date of birth, gender, and zip code. The published data of the myPersonality app was therefore only pseudonymized, not anonymized.
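The attack pattern behind this is a simple join on quasi-identifiers. Here is a toy sketch in Python with entirely fictitious records: a supposedly de-identified dataset is linked against a public register on date of birth, gender, and zip code.

```python
# Toy illustration of a linkage attack on the quasi-identifiers
# date of birth, gender, and zip code. All records are fictitious.

voter_register = [
    {"name": "Alice Example", "dob": "1970-07-31", "gender": "F", "zip": "89073"},
    {"name": "Bob Example",   "dob": "1982-03-12", "gender": "M", "zip": "89075"},
]

# A "de-identified" dataset: names removed, quasi-identifiers kept.
survey_data = [
    {"dob": "1970-07-31", "gender": "F", "zip": "89073", "rating": "openness: high"},
]

def link(records, register):
    """Re-identify records by joining on (dob, gender, zip)."""
    index = {(p["dob"], p["gender"], p["zip"]): p["name"] for p in register}
    return [
        {**rec, "name": index[(rec["dob"], rec["gender"], rec["zip"])]}
        for rec in records
        if (rec["dob"], rec["gender"], rec["zip"]) in index
    ]

print(link(survey_data, voter_register))  # the rating is now tied to Alice Example
```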

Use Cases

An essential principle of the GDPR is data minimization, and anonymization can be an important tool for realizing this principle in practice. The GDPR is no longer applicable to anonymized data. Typical use cases are listed below, together with the anonymization techniques relevant to each use case. The second part of the article describes these techniques in detail.

  • Data usage in test and development environments.
    Data in such environments must be anonymous. Techniques such as “synthetic data with lookups or randomization” can be considered for creating such a data copy (see the first sketch after this list).
  • Data analytics such as visualization, discovering new insights, or Machine Learning.
    The analysis can be done using techniques like “differential privacy” or a “grouping/clustering” approach (see the second sketch after this list). Alternatively, analysis is also possible on synthetic data created by Machine Learning models that preserve the statistical properties of the original data.
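As a first taste of part two, here is a minimal sketch of the lookup/randomization approach for test data, in Python. The lookup tables, field names, and records are invented for illustration; a real implementation would also have to preserve referential integrity across tables and keep replacements consistent across runs.

```python
import random

# Hypothetical lookup tables with plausible but fictitious replacement values.
FIRST_NAMES = ["Anna", "Ben", "Clara", "David"]
CITIES = ["Ulm", "Berlin", "Hamburg", "Munich"]

def randomize_record(record: dict) -> dict:
    """Replace personal fields with random lookup values for test environments."""
    return {
        **record,
        "name": random.choice(FIRST_NAMES),
        "city": random.choice(CITIES),
        # Non-personal fields (e.g. the vehicle model) stay unchanged.
    }

production_row = {"name": "Max Mustermann", "city": "Ulm", "model": "A180"}
print(randomize_record(production_row))
```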
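And here is a minimal sketch of differential privacy for analytics, using the classic Laplace mechanism on a count query. The epsilon value and the dataset are illustrative assumptions; part two covers the technique properly.

```python
import random

def dp_count(values, predicate, epsilon=1.0):
    """Differentially private count: the true count plus Laplace noise.

    A count query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two i.i.d. Exponential(epsilon) draws is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# Illustrative query: how many owners in the (made-up) dataset drive an A180?
owners = [{"model": "A180"}, {"model": "C200"}, {"model": "A180"}]
print(dp_count(owners, lambda o: o["model"] == "A180", epsilon=0.5))
```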

As a single method is often not sufficient, several techniques can be combined. Gartner included laws and privacy-preserving techniques like differential privacy and synthetic data in its Hype Cycle for Privacy (2020). There are also open-source projects such as Google's RAPPOR for differential privacy or ARX, which bundles several anonymization methods.

The second part of the article contains anonymization techniques. Stay tuned for more next week.


Andreas Buckenhofer
Mercedes-Benz Tech Innovation

Principal Vehicle Data Architect. Years of experience in data-driven solutions and end-to-end data products. Lecturer on data topics at DHBW University.