How to increase the value of your anonymized data?

Why do you need to anonymize your data?

Everyone remembers the painful data breaches of recent years, such as Facebook/Cambridge Analytica, Uber and Equifax. The list [1] is long.
These data breaches and careless data management led to new personal data protection laws. Europe’s General Data Protection Regulation (GDPR) has been in force since the 25th of May 2018, and other countries and states followed soon thereafter, such as California, whose Consumer Privacy Act was signed into law on the 28th of June 2018.

These data protection regulations require businesses to first ask for consent to collect personal data. Businesses are also expected to protect personal data and allow consumers to access, correct and even erase their personal data (“Right to be forgotten, Art. 17 GDPR”). Data breaches and other violations can lead to fines. For GDPR violations, the fines can be as high as 20 million EUR, or 4% of the worldwide annual revenue of the prior fiscal year, whichever is higher [2].

There are two obvious approaches to complying with these data protection regulations. The first is to build and maintain a secure infrastructure. Such a system should ask for consent before collecting personal data, and it has to allow users to access, correct and erase their data. The second approach is to remove all personally identifiable information from the data, effectively anonymizing it. This allows a business to use the information freely, without having to grant users access to the data. It also removes the risk of leaking personal identifiers during a data breach. Compared to building a secure control infrastructure, anonymizing data seems to be the way to go. However, anonymization has a drawback.

The Quality of Anonymization

The huge amount of data collected over the last decades has allowed new and established companies to leverage machine learning techniques to build new products and revolutionize old ones.
“Data is the new oil” is the catchphrase posited in many economics and technology articles [3]. Recently, the quality of collected data has been gaining a lot of attention because it improves the quality and robustness of AI systems [4][5].

This is problematic for conventional anonymization methods since they usually operate by selectively destroying pixel information (masking, blurring, pixelization, etc.). Therefore, there is always a tradeoff between compliance and data quality. Compliance and data quality are two endpoints of a line where one cannot be improved without harming the other.
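To make the loss of pixel information concrete, here is a minimal Python sketch of pixelization, one of the conventional techniques named above. It is an illustration only and not the implementation of any tool compared in this article. The function name pixelate_region, the (top, left, bottom, right) box format and the block size are assumptions made for this example, and NumPy is assumed to be installed.

```python
import numpy as np


def pixelate_region(image, box, block=8):
    """Return a copy of `image` with the region `box` pixelated.

    image: H x W x C uint8 array.
    box:   (top, left, bottom, right) in pixels (hypothetical format).
    block: side length of the square tiles, in pixels.
    """
    top, left, bottom, right = box
    out = image.copy()
    region = out[top:bottom, left:right]  # a view into `out`
    height, width = region.shape[:2]
    for y in range(0, height, block):
        for x in range(0, width, block):
            tile = region[y:y + block, x:x + block]
            # Every pixel in the tile is overwritten with the tile's mean
            # colour, so the detail inside the tile is irreversibly lost.
            tile[...] = tile.mean(axis=(0, 1)).astype(image.dtype)
    return out
```

Each tile of block x block pixels collapses to a single colour value, which is why downstream models receive far less information from the anonymized region than from the original.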

Instead of removing pixel information, our anonymization method changes it while keeping it natural. We call it Deep Natural Anonymization (DNA). DNA detects faces and other identifiers, such as license plates, and generates an artificial replacement for each of them. Each generated replacement is constrained to match the attributes of the source object as closely as possible. This constraint is applied selectively, however, so that we can control which attributes to maintain and which to discard. For faces, for example, it could be important to keep attributes like gender and age intact for further analytics. Apart from the identifiers, all information that does not contain sensitive personal data is kept without modification. Thus, DNA effectively breaks the tradeoff between compliance and data quality.

Preserving Analytics

To measure the impact of anonymization approaches on the quality of the data, we sampled images from the Labeled Faces in the Wild (LFW) [6] dataset. All images were taken from the test set. We compare four anonymization tools, each representing a general group of anonymization techniques [7][8][9][10]. We selected tools which are generally accessible to the public. Figure 1 shows a selection of examples.

Fig. 1: Comparison of anonymization techniques

Structural Consistency

As a first step, we analyze how the overall structure of images changes after they have been processed by the anonymization techniques. For that purpose, we take a closer look at image segmentation results. Image segmentation [11] is the process of partitioning the pixels of an image into multiple segments. Each segment represents one object class. In our example, the most important objects are the person in the profile picture and the background. Figures 2 and 3 show the segmentation maps for two samples of the celebrities in the LFW dataset. The segmentation maps were produced by a state-of-the-art semantic segmentation model called DeepLabv3+ [12]. We used the implementation and model weights from the official TensorFlow repository [12].

In Figures 2 and 3 we can see that the segmentation maps of traditional anonymization methods are clearly degraded, and some of them are completely wrong. Deep Natural Anonymization, however, preserves the semantic segmentation: its segmentation maps are almost identical to those of the original. Figure 3 shows that face images processed by conventional anonymization methods not only produce bad segmentation boundaries, but also cause the segmentation model to infer completely new object classes that were never present in the original image, like cats or bottles.

To quantify the impact of each anonymization technique, we calculated the mean intersection over union (mIOU) across the whole test set [13]. For each image, the segmentation map of the anonymized version was compared against the segmentation map of the original. The results are reported in Table 1.
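The metric can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our evaluation code: it assumes NumPy is installed, that segmentation maps are integer label arrays of equal shape, that the IoU is averaged over the classes present in either map, and that the per-image scores are then averaged over the dataset. The function names mean_iou and dataset_miou are made up for this example.

```python
import numpy as np


def mean_iou(reference, candidate):
    """Mean IoU over all classes that occur in either label map."""
    reference = np.asarray(reference)
    candidate = np.asarray(candidate)
    if reference.shape != candidate.shape:
        raise ValueError("label maps must have the same shape")
    ious = []
    for cls in np.union1d(reference, candidate):
        ref_mask = reference == cls
        cand_mask = candidate == cls
        intersection = np.logical_and(ref_mask, cand_mask).sum()
        union = np.logical_or(ref_mask, cand_mask).sum()
        # union > 0 because cls occurs in at least one of the two maps
        ious.append(intersection / union)
    return float(np.mean(ious))


def dataset_miou(pairs):
    """Average the per-image score over (reference, candidate) pairs."""
    return float(np.mean([mean_iou(ref, cand) for ref, cand in pairs]))


# Toy 2x4 label maps: 0 = background, 1 = person, 2 = a spurious class.
original = [[0, 0, 1, 1], [0, 0, 1, 1]]
shrunk_person = [[0, 0, 1, 1], [0, 0, 0, 1]]
spurious_class = [[0, 0, 1, 1], [0, 0, 2, 1]]

print(round(mean_iou(original, original), 3))        # 1.0
print(round(mean_iou(original, shrunk_person), 3))   # 0.775
print(round(mean_iou(original, spurious_class), 3))  # 0.583
```

Under this averaging scheme, a class that appears only in the anonymized map contributes an IoU of zero, so hallucinated classes lower the score even when they cover only a few pixels.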

Table 1: Semantic segmentation consistency measured with mIOU. Higher is better.

Content Consistency

To assess the general content consistency between the anonymized images and the original ones, we used an independent image tagging model from Clarifai [14]. “The generic image tagging model recognizes over 11,000 different concepts including objects, themes, moods, and more […]”. The tags describe what the model infers from the input image. Additionally, the model gives a confidence score for each tag. Figure 4 shows the top 5 concepts predicted by Clarifai’s public image tagging model [15] for an original image and its DNA version.

Fig. 4: Reese Witherspoon top 5 concepts from Clarifai. Left: Original. Right: DNA (ours).

Ideally, a generic image tagging model should predict exactly the same concepts for both the original and the anonymized image. To measure the consistency, we used Clarifai’s solution to predict the concepts in all our test samples for each anonymization technique. Afterwards, we calculated the mean average precision (mAP) of the top N predicted concepts (where N stands for the number of concepts considered) between the anonymized images and the original ones. With the mAP we evaluate two things: whether the same concepts are predicted for both images, and how their associated confidence scores compare. As an example, consider an anonymized image and its original counterpart that have both been processed by the image tagging model. A concept that receives a lower confidence value in the anonymized image than in the original will have less impact on the final mAP score than a concept that occurs only in the anonymized image and not in the original. The results for the top 5 and top 50 concepts are listed in Table 2.
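The exact mAP formulation is not spelled out here, so the following Python sketch shows only one plausible reading: the top N concepts of the original image are treated as the relevant set, the concepts of the anonymized image are ranked by confidence, and the average precision is computed over that ranking. The function names, the data layout and the concept names in the example are all hypothetical and are not taken from Figure 4 or from Clarifai’s API.

```python
def average_precision(original_concepts, anonymized_scores, top_n=5):
    """Average precision of the anonymized image's ranked concepts.

    original_concepts: concepts of the original image, best first.
    anonymized_scores: dict mapping concept -> confidence for the
                       anonymized image.
    """
    relevant = set(original_concepts[:top_n])
    if not relevant:
        return 0.0
    ranked = sorted(anonymized_scores, key=anonymized_scores.get, reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, concept in enumerate(ranked[:top_n], start=1):
        if concept in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)


def mean_average_precision(pairs, top_n=5):
    """Mean of the per-image average precision over all image pairs."""
    scores = [average_precision(orig, anon, top_n) for orig, anon in pairs]
    return sum(scores) / len(scores)


# Made-up concepts and confidences, for illustration only.
original = ["woman", "portrait", "fashion", "adult", "people"]
anonymized = {
    "portrait": 0.99,
    "woman": 0.98,
    "glamour": 0.95,  # occurs only in the anonymized image
    "adult": 0.93,
    "people": 0.90,
}

print(round(average_precision(original, anonymized), 3))  # 0.71
```

In this reading, a concept whose confidence merely drops only moves down the ranking, while a concept that appears solely in the anonymized image counts as a miss and lowers the precision of every rank below it.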

Table 2: Image concept consistency measured with mAP. Higher is better.

Final Thoughts

Our experiments show that data can be anonymized in a way that preserves its valuable information. They also highlight that even models which have never seen anonymized data before still perform as they would on regular data. This allows businesses to be GDPR compliant and therefore protected from the high costs caused by data breaches or by implementing and maintaining a complex data management infrastructure. Moreover, sufficient anonymization opens up further business opportunities, such as creating revenue by selling high-quality datasets. Resource-heavy processes like image annotation can also be outsourced to third-party partners in non-EU countries to drastically reduce personnel costs.

Stop destroying your data

Interested in being compliant without destroying valuable information? We offer an industry-approved anonymization solution that can easily be deployed directly in your data center, so your data never has to leave your premises. For further information, please contact us at hello@brighter.ai.

About the authors

Patrick Kern, CTO at Brighter.AI 
https://www.linkedin.com/in/pkern90/

Elias Vansteenkiste, Lead Research Scientist at Brighter.AI
https://www.linkedin.com/in/elias-vansteenkiste/

Further reading

We recommend some articles and videos for more details about GDPR.

Resources

[1] https://en.wikipedia.org/wiki/List_of_data_breaches

[2] https://www.gdpreu.org/compliance/fines-and-penalties/

[3] Economist: The world’s most valuable resource is no longer oil, but data; Fortune: Intel CEO Says Data is the New Oil

[4] Why you need to improve your training data, and how to do it

[5] Software 2.0 by Karpathy, Director of AI at Tesla

[6] http://vis-www.cs.umass.edu/lfw/

[7] https://www.facepixelizer.com/

[8] https://youtube-eng.googleblog.com/2017/08/blur-select-faces-with-updated-blur.html

[9] https://fastredaction.com/

[10] https://brighter.ai/

[11] https://en.wikipedia.org/wiki/Image_segmentation

[12] https://github.com/tensorflow/models/tree/master/research/deeplab

[13] https://en.wikipedia.org/wiki/Jaccard_index

[14] https://www.clarifai.com

[15] https://www.clarifai.com/demo