Data privacy: anonymization and pseudonymization — Part 2

Andreas Buckenhofer
Mercedes-Benz Tech Innovation
4 min read · Nov 5, 2020

Trustworthy handling of personal data

Last week I published an introduction to anonymization and pseudonymization. This article follows up with concrete anonymization techniques.

Techniques

The GDPR does not specify which anonymization techniques have to be used. The Article 29 Working Party has published an opinion on various measures that can be considered depending on the use case.

There are two overall approaches: in-place and out-of-place anonymization. Out-of-place methods create a new dataset (data copy) while in-place methods work directly on the production data.

In-Place anonymization

  • Data redaction hides data like credit card numbers from users, while the original data remains unchanged in the database. The technique is usually applied in production systems based on the user’s login credentials. It is a special case that only hides specific sensitive data from certain consumers, e.g. a call-centre agent who is not allowed to see all data.
    Many databases have built-in data redaction functionality, which is far superior to programming the masking in the application. The Oracle solution is shown in the example below with two queries on the credit_card column. A redaction policy controls the second query and returns only the last four digits of the credit_card number.
Oracle sample session for data redaction
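The article shows an Oracle session here; as a conceptual stand-in, the following minimal Python sketch illustrates what such a partial redaction policy effectively returns. The function name and sample values are illustrative assumptions, not the Oracle API: in a real database the masking happens at query time through a built-in policy, not in application code.

```python
# Conceptual sketch only -- not the Oracle DBMS_REDACT API. In a real database
# the stored value stays unchanged and the policy masks it at query time.

def redact_credit_card(value: str) -> str:
    """Mask a credit card number, keeping only the last four digits visible."""
    digits = value.replace("-", "").replace(" ", "")
    return "*" * (len(digits) - 4) + digits[-4:]

row = {"customer": "Jane Doe", "credit_card": "4532-7711-0213-9834"}  # dummy data

# Query 1: privileged user, no redaction policy applied
print(row["credit_card"])                      # 4532-7711-0213-9834

# Query 2: restricted user (e.g. call-centre agent), redaction policy applied
print(redact_credit_card(row["credit_card"]))  # ************9834
```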
  • Differential privacy software injects “noise” into query results. The user (e.g. a data scientist) writes SQL queries against the original data while the software monitors the queries and prevents overly detailed ones. Queries that aggregate many rows are answered with additional noise, while queries targeting individual rows are blocked. The following diagram shows a non-blocked query (in blue) and a blocked query (in red).
    Apple, Google and many others also use the technique within browsers or mobile devices before sending data to a central server.
Differential privacy
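A common building block behind such tools is the Laplace mechanism: noise drawn from a Laplace distribution is added to aggregate results. The sketch below is a minimal Python illustration, not a production implementation; the epsilon value, the blocking threshold and the function names are assumptions made for this example.

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float = 0.5) -> float:
    """Laplace mechanism for a counting query: sensitivity 1, scale 1/epsilon."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

def answer_query(group_size: int, threshold: int = 10, epsilon: float = 0.5) -> float:
    """Answer aggregate queries with added noise; block overly detailed queries."""
    if group_size < threshold:
        raise PermissionError("query blocked: result set is too small")
    return noisy_count(group_size, epsilon)

print(answer_query(25_000))   # large aggregate -> noisy but still useful result
try:
    answer_query(1)           # query for an individual row -> blocked
except PermissionError as exc:
    print(exc)
```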

Out-of-Place anonymization

  • A generated synthetic data copy based on lookups or randomization can hide the sensitive parts of the original data.
    Lookup data can be prepared for, e.g., names, bank accounts, and other personal data. A data copy is then created by replacing the original values with lookup values.
    A similar approach is the use of randomization techniques like shuffling, masking or deleting data. Rules define how to randomize the data, e.g. shuffle all values within one column. Lookup and randomization methods are often combined.
    Additionally, machine learning models can help generate synthetic data with stochastic properties similar to the original data. Parts of the original data are used to train the models. The diagram below shows the overall approach for creating a synthetic data copy with machine learning.
Synthetic data with ML
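As a rough illustration of the lookup and randomization ideas, the following Python sketch uses pandas and NumPy on made-up dummy data; all column names and values are assumptions for this example. The last step is only a very simplified statistical stand-in for the ML-based generation described above — real projects would use generative models or dedicated synthesis tools rather than a plain normal distribution.

```python
import numpy as np
import pandas as pd

# Made-up original data containing personal information (dummy values only)
original = pd.DataFrame({
    "name":    ["Anna Meier", "Ben Huber", "Cara Vogel"],
    "iban":    ["DE11111111", "DE22222222", "DE33333333"],
    "mileage": [42_000, 13_500, 87_200],
})

synthetic = original.copy()

# Lookup: replace real names with prepared lookup values
synthetic["name"] = ["Person A", "Person B", "Person C"]

# Randomization: shuffle the bank accounts across rows, then mask most of them
synthetic["iban"] = np.random.permutation(synthetic["iban"].values)
synthetic["iban"] = synthetic["iban"].str[:2] + "********"

# Simplified statistical stand-in for ML-based generation: fit a distribution
# to a numeric column and sample new values with similar properties
mu, sigma = original["mileage"].mean(), original["mileage"].std()
synthetic["mileage"] = np.random.normal(mu, sigma, size=len(original)).round(-2)

print(synthetic)
```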
  • K-anonymity, L-diversity, and similar methods group or cluster data: the result contains aggregated data only. The diagram shows the original data in a table, including detailed data like VINs. The resulting table contains clustered/grouped data without details. The result is called 2-anonymity in the example because each cluster contains at least two rows.
    ConfigID 1 appears in two rows with different data, so it is not possible to draw conclusions about individuals.
    ConfigID 2 contains only identical values, which is called 1-diversity. ConfigID 2 should therefore be removed from the resulting dataset because conclusions can be drawn about the (identical) individuals.
K-anonymity and l-diversity
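A simple way to verify these properties on a clustered result is to group by the quasi-identifier and check the group sizes and the number of distinct sensitive values per group. The Python sketch below mirrors the example above; the sensitive attribute name, its values and the threshold of two are illustrative assumptions.

```python
import pandas as pd

# Clustered result table from the example: quasi-identifier "ConfigID",
# one sensitive attribute (column name and values are illustrative)
result = pd.DataFrame({
    "ConfigID":  [1, 1, 2, 2],
    "Attribute": ["North", "South", "East", "East"],
})

groups = result.groupby("ConfigID")

# k-anonymity: every quasi-identifier cluster must contain at least k rows
k_anon = groups.size().min()
print(f"{k_anon}-anonymity")                 # 2-anonymity

# l-diversity: every cluster must contain at least l distinct sensitive values
l_div = groups["Attribute"].nunique().min()
print(f"{l_div}-diversity")                  # 1-diversity, caused by ConfigID 2

# Clusters without diverse sensitive values should be removed before publishing
to_remove = groups.filter(lambda g: g["Attribute"].nunique() < 2)
print(to_remove)                             # the two identical rows of ConfigID 2
```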

Concluding remarks

Anonymization is a complex task. Use cases and their requirements differ vastly, which makes every project look unique. The anonymization result might be inadequate for analytics purposes, or, as Paul Ohm wrote in 2010:

Data can be either useful or perfectly anonymous but never both.

Anonymization is not just a one-time step. Combining the anonymized result with new data may introduce previously unknown re-identification risks. The anonymization process and its result therefore require continuous verification.

Trust is becoming essential when working with data — ethics and privacy have to be protected. The entire data lifecycle must ensure data protection, from creation and storage to evaluation and sharing to archiving and deletion of data. Everyone is responsible for the trustworthy handling of personal data: “We need to defend the interests of those whom we’ve never met and never will” (Jeffrey D. Sachs).

The following interview about the topic was recorded at the DOAG conference.

