Unpacking anonymization techniques
by Paul Stone
Privacy concerns are growing over the ways in which data is increasingly collected, shared and used. In response, public officials are looking at new ways to manage personal data and address re-identification risks as they work to become open by default.
In this context, we had a fantastic turnout at the Charter’s Implementation Working Group (IWG) meeting last week where we had Alistair Ramsden, Senior Analyst at Stats New Zealand, share his expertise on data confidentiality — from a practitioner’s perspective (and a huge thank you to him for being up and ready to present at 6am!). Below are some of the tips he shared.
To start, privacy can be a loaded word. The terms for privacy protection differ (de-identified, anonymised, confidentialised) and mean different things in different places, so it’s important to understand how people use them where you are.
For example, in New Zealand the word ‘anonymised’ is not used (by statisticians at least) — they prefer to say ‘de-identified’.
De-identified means exactly that: direct identifiers have been removed. De-identified data is only suitable to be shared (say, for research purposes) under strict privacy controls, because of the likelihood that people in the data can be re-identified.
Confidentialised data has been de-identified and has also been processed through one or more statistical methods to further prevent the ability to re-identify people. Confidentialised data may be appropriate to release as open data, but the context must be considered. As we discussed in July’s session on privacy, there are scenarios where other information in the public domain can make it easy to identify individuals. As Alistair said “we have to imagine the worst case scenario”.
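That “worst case scenario” is typically a linkage attack: joining a de-identified dataset to public information on quasi-identifiers. A minimal sketch in Python (all names, values and field choices here are invented for illustration; this mirrors the pattern of the Weld case discussed later, not any real dataset):

```python
# A "de-identified" health dataset: names removed, but quasi-identifiers
# (zip code, date of birth, sex) remain.
deidentified_health = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "dob": "1962-02-14", "sex": "F", "diagnosis": "asthma"},
]

# A public dataset (e.g. a voter roll) that contains names alongside
# the same quasi-identifiers.
public_voter_roll = [
    {"name": "J. Doe", "zip": "02138", "dob": "1945-07-31", "sex": "M"},
    {"name": "A. Smith", "zip": "02140", "dob": "1980-01-01", "sex": "F"},
]

def link(records, roll, keys=("zip", "dob", "sex")):
    """Re-identify by joining the two datasets on the quasi-identifiers."""
    matches = []
    for r in records:
        for v in roll:
            if all(r[k] == v[k] for k in keys):
                matches.append({"name": v["name"], "diagnosis": r["diagnosis"]})
    return matches

print(link(deidentified_health, public_voter_roll))
# -> [{'name': 'J. Doe', 'diagnosis': 'flu'}]
```

Removing names was not enough: one unique combination of quasi-identifiers is all it takes, which is why confidentialisation goes beyond de-identification.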
Then we went through the legislation, policy, principles, protocols and rules that govern how statisticians carry out their work. The legislation and policy may differ from country to country, but the principles, protocols and rules are fairly universal — links and definitions to these can be found in his presentation.
When it comes to statistical approaches, you can “suppress certain cells or add random noise”. Perturbation and aggregation are commonly used techniques. Two emerging approaches are machine-learning-based “data confidentiality as a service”, being piloted in New Zealand, and the relatively new “differential privacy” that the US Census is implementing.
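Two of those techniques can be sketched in a few lines of Python (a toy illustration under invented data, thresholds and parameters; not Stats NZ’s or the US Census’s actual method):

```python
import random

# Toy frequency table: counts of people by (region, condition).
table = {
    ("North", "flu"): 153,
    ("North", "rare-disease"): 2,   # small cell -> re-identification risk
    ("South", "flu"): 98,
    ("South", "rare-disease"): 7,
}

def suppress(counts, threshold=5):
    """Cell suppression: hide any count below the threshold."""
    return {k: (v if v >= threshold else None) for k, v in counts.items()}

def laplace_release(counts, epsilon=1.0, sensitivity=1):
    """Differentially private release: add Laplace noise with scale
    sensitivity/epsilon to every count (the mechanism at the heart of
    differential privacy, in a vastly simplified form)."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return {
        k: v + (random.expovariate(1 / scale) - random.expovariate(1 / scale))
        for k, v in counts.items()
    }

print(suppress(table))         # small cells become None ("suppressed")
print(laplace_release(table))  # every cell perturbed by random noise
```

The sketch also shows the utility trade-off raised in the discussion: suppression destroys small cells entirely, while noise distorts every cell a little, and a smaller epsilon means more privacy but less accuracy.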
Questions from the group led to more conversation about the balance of utility versus safety and the trade-offs made through perturbation (when the variable you are “making noise” around is particularly important and accuracy would be more useful).
An interesting question was whether it is risky to be transparent about the methods used in preparing a dataset — could it mean the dataset could be reverse engineered to re-identify people? Alistair said that is unlikely, and that transparency of method is essential for trust and confidence in the data.
- In discussion, the case of the re-identification of Massachusetts Governor William Weld’s medical data came up; here is a link to the case.
- The Extractive Industries Transparency Initiative (EITI) will be joining future IWG meetings. They support 52 governments around the world and may draw on the peer exchange from the calls.
- The Government of Ontario is developing new API guidelines, which are available on GitHub for feedback. They acknowledge that their work has been accelerated by building on previous work of governments elsewhere, including Canada, UK, US, Victoria (Australia) and New Zealand. Hooray for open licences!
- The Data Protection and Regulatory Unit of Uruguay has its own criteria for dissociating personal data for publishing.
If you attended the IWG meeting, then we’d really appreciate your feedback in this very brief survey. And if you have any ideas on topics to explore or you are interested in showcasing a project to lead discussion, please contact us at firstname.lastname@example.org.