Anonymizing Unstructured Data with Aircloak Insights
The limitations around traditional data anonymization methods are well understood:
- The selection of identifiers and quasi-identifiers is a subjective, context-dependent, and error-prone process (e.g. when additional data is added)
- There are no reliable metrics around the level of anonymization, or the difficulty of de-anonymization
- There are well-documented and highly visible examples of datasets being de-anonymized
- There is a fundamental tradeoff between the anonymity and utility of a data set. A high level of anonymization reduces the level of information in the data, reducing the quality of the analytics
But there is a set of data types for which anonymization is virtually impossible: free form text, and more generally unstructured data.
Why is this important: Increasingly the most informative data is available in unstructured form. For example, descriptions of transactional data for payments and e-commerce, the content of documents, metadata, notes in medical records, text representing comments, reviews, and other user provided posts. Estimates suggest that 80% of all data is unstructured. These types of data cannot legally be shared if they contain personally identifiable information, and determining whether they do, and if so which parts are identifiable, is hard at best.
Why anonymization of unstructured data does not work: Determining what parts of a structured data set to mask, and how to mask them, is in itself a process fraught with pitfalls, and the resulting level of anonymization is virtually impossible to quantify. This complexity is exacerbated by having to detect and select (sub)strings in free text fields that need to be masked. The combinatorial explosion of options makes this approach impractical at best, and leaves a dataset de-anonymizable at worst.
As an example, consider free form text that is part of a financial transaction (see figure). The description contains transaction IDs, merchant IDs, merchant names, license plate information, bank account numbers, and, frequently, user-provided content. Doing analytics over the raw data can provide deep insights into user behavior that are virtually impossible to obtain otherwise (for example, which users visit “Starbucks” or pay a certain amount for car insurance). Manually classifying strings such as “Starbucks” as needing to be masked leaves open many more strings that could potentially identify users (e.g. “Peet’s”). Creating exhaustive lists of strings to mask is simply not feasible. The alternative, providing a list of safe strings that should not be masked while masking the rest of the text, severely limits the richness of the insights that can be gathered: the creativity and foresight of the author of the whitelist becomes the limiting factor. Because it is the safer approach, a whitelist that eliminates most of the value of the data is frequently what is used in practice!
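The two masking strategies described above can be illustrated with a small Python sketch. The transaction descriptions, the blacklist, and the whitelist are all made up for illustration; they are assumptions, not taken from any real data set or product.

```python
# Hypothetical transaction descriptions (illustrative only).
descriptions = [
    "STARBUCKS #1234 SEATTLE REF 99812",
    "PEET'S COFFEE BERKELEY REF 77120",
]

# Hypothetical lists, as a human curator might write them.
BLACKLIST = {"STARBUCKS"}      # strings someone decided must be masked
WHITELIST = {"REF", "COFFEE"}  # strings someone decided are safe to keep


def mask_blacklist(text, blacklist=BLACKLIST):
    """Mask only the tokens that appear on the blacklist."""
    return " ".join("***" if tok in blacklist else tok for tok in text.split())


def mask_whitelist(text, whitelist=WHITELIST):
    """Mask every token that is NOT on the whitelist."""
    return " ".join(tok if tok in whitelist else "***" for tok in text.split())


blacklisted = [mask_blacklist(d) for d in descriptions]
whitelisted = [mask_whitelist(d) for d in descriptions]

for line in blacklisted + whitelisted:
    print(line)
```

In this toy example the blacklist leaves "PEET'S" (and the store number, city, and reference number) fully visible, while the whitelist reduces both descriptions to little more than "REF", which is safe but of little analytical value.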
At Aircloak we are taking a completely different approach to data anonymization. Rather than anonymizing a data set, which can then (in theory) be shared for further analysis, with all the attendant concerns about and risk of subsequent de-anonymization, Aircloak Insights runs analytics (queries) on the full data set and anonymizes only the query results.
The result is the ability to do rich text analysis on the unmasked and unfiltered data sets in real time. This includes queries matching on string patterns in any free form text fields. Instant compliance means there is no need for lengthy case-by-case approvals or audits: Aircloak Insights automatically takes care of privacy by adding the minimally required amount of noise or suppressing partial query results in cases where they could allow the re-identification of an individual. For details on Aircloak’s anonymization methods see our previous blog post.
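To make the idea of anonymizing query results rather than data concrete, here is a deliberately simplified Python sketch. It is not Aircloak's actual mechanism: the function name, the `min_users` threshold, the Gaussian noise, and the sample rows are all assumptions chosen for illustration. It counts the distinct users whose free-form description matches a pattern, suppresses the answer when too few users match, and otherwise adds a small amount of noise.

```python
import random
import re

# Hypothetical rows: (user_id, free-form transaction description).
ROWS = [
    (1, "STARBUCKS #1234 SEATTLE"),
    (2, "Starbucks Store 88"),
    (3, "STARBUCKS MOBILE ORDER"),
    (4, "starbucks card reload"),
    (5, "STARBUCKS #77"),
    (1, "STARBUCKS #1234 SEATTLE"),
    (6, "PEET'S COFFEE BERKELEY"),
]


def anonymized_user_count(rows, pattern, min_users=5, noise_sd=1.0, rng=None):
    """Return a noisy count of distinct matching users, or None if suppressed.

    min_users and noise_sd are illustrative parameters, not product settings.
    """
    if rng is None:
        rng = random.Random()
    regex = re.compile(pattern, re.IGNORECASE)
    # The query runs on the raw, unmasked text.
    users = {uid for uid, text in rows if regex.search(text)}
    # Suppress results that describe too few individuals.
    if len(users) < min_users:
        return None
    # Only the aggregate result is perturbed before it is returned.
    noisy = len(users) + rng.gauss(0.0, noise_sd)
    return max(0, round(noisy))


print(anonymized_user_count(ROWS, "starbucks"))  # a count near 5
print(anonymized_user_count(ROWS, "peet"))       # suppressed
```

The point of the sketch is the ordering of steps: the pattern match sees the full text, and anonymization is applied only to the aggregate that leaves the system.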
So, there you have it: a solution to anonymizing unstructured data.
Aircloak Insights helps organizations give their customers and partners unrestricted, private, high-fidelity analytics access to full, unmasked data sets that retain all the information originally available, including free form text and other unstructured data.
Follow Aircloak on Twitter or sign up for our newsletter for future updates!