Anonymization is a hard problem.

Ralph Baddour
2 min read · Jul 30, 2019


Some argue that true “full” anonymization is impossible; it is always relative, since a perfectly anonymized dataset has no utility for analysis. A good way to measure the privacy protection afforded by an anonymization scheme is to work backwards: given a set of data, can it be associated back to its source, for example by re-identifying the person the data describes?
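To make “working backwards” concrete, here is a minimal, hypothetical sketch of the classic version of such an attack: a linkage attack that joins a de-identified release against auxiliary data on a few shared quasi-identifiers. The column names and records below are invented for illustration and have nothing to do with the researchers’ data or their method.

```python
# Hypothetical linkage attack: re-identify rows in a de-identified release by
# joining it against auxiliary data on shared quasi-identifiers.
import pandas as pd

# "Anonymized" release: direct identifiers dropped, quasi-identifiers kept.
released = pd.DataFrame({
    "zip": ["M5V", "M5V", "K1A"],
    "birth_year": [1984, 1990, 1984],
    "sex": ["F", "M", "F"],
    "diagnosis": ["asthma", "diabetes", "hypertension"],
})

# Auxiliary data the attacker already holds (e.g., a voter roll or public profile).
auxiliary = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "zip": ["M5V", "K1A"],
    "birth_year": [1984, 1984],
    "sex": ["F", "F"],
})

quasi_identifiers = ["zip", "birth_year", "sex"]

# Join on the quasi-identifiers; anyone who matches exactly one released
# record is re-identified, and their "anonymous" attributes are exposed.
matches = auxiliary.merge(released, on=quasi_identifiers, how="inner")
match_counts = matches.groupby("name")["diagnosis"].count()
re_identified = matches[matches["name"].isin(match_counts[match_counts == 1].index)]
print(re_identified[["name", "diagnosis"]])
```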

This is exactly what a group of researchers in Europe has tried to do, but at a much larger scale: they automated the process by developing a statistical model that estimates the probability of re-identification from a given set of personal data in many typical cases.

The model developed by these researchers (paper published this past week: https://www.nature.com/articles/s41467-019-10933-3) suggests that complex datasets of personal information cannot be protected against re-identification by current methods of anonymizing data, such as releasing subsets of larger data repositories.

“Even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR [or PIPEDA, CCPA, HIPAA, etc.] and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.”

The implications are wide-reaching, from regulators and lawmakers down to us, the individuals generating and sharing our data every day. Interestingly, the proposed model could also be used to guide technical teams in testing the robustness of new anonymization processes, along the lines of the rough sketch below…
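The proper tool for that testing is the researchers’ generative model itself; purely as an illustration of the kind of check a technical team might run, the sketch below measures how many records in a hypothetical dataset are unique on their quasi-identifiers and assigns each record a naive re-identification probability of 1/k, where k is the number of people sharing its attribute combination. All names and data are invented.

```python
# Crude robustness check (not the paper's generative model): count how many
# records are unique on their quasi-identifiers and treat 1/k as a naive
# per-record re-identification probability. All data here is hypothetical.
import pandas as pd

population = pd.DataFrame({
    "zip": ["M5V", "M5V", "K1A", "K1A", "M5V"],
    "birth_year": [1984, 1990, 1984, 1984, 1984],
    "sex": ["F", "M", "F", "M", "F"],
})

quasi_identifiers = ["zip", "birth_year", "sex"]

# Size of each group of people sharing the same quasi-identifier combination.
group_sizes = population.groupby(quasi_identifiers).size().rename("k").reset_index()

per_record = population.merge(group_sizes, on=quasi_identifiers)
per_record["reid_probability"] = 1.0 / per_record["k"]

unique_share = (per_record["k"] == 1).mean()
print(per_record)
print(f"Share of records unique on their quasi-identifiers: {unique_share:.0%}")
```

The hard part, and the researchers’ contribution, is estimating that uniqueness for the full population when only a small sample has been released; a naive count like this one says nothing reliable about people outside the released subset, which is precisely why “we only released a sample” turns out to be a weak defence.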

Suggested additional reading:

#anonymization #privacy #data #demographics #personalinformation #personaldata #healthdata #identification #identity #GDPR
