Meet European Data Security and Privacy Compliance with Big Data Analytics in Public Cloud Environments — Part 3: De-identification Framework via Open Source

keepler.io
10 min read · Nov 3, 2021


“The best time to start was yesterday. The next best time is now.”

This famous quote could hardly fit the current situation of European enterprises better as they face the challenges of running Big Data analytics on public clouds. In the previous parts of this article series we saw that the major obstacle to cloud adoption for European companies is data privacy concerns (see article 1, article 2). We presented a solution to this market problem: a technology-agnostic framework for data de-identification, with the goal of building a de-identified data lake. In the second article we applied the theoretical framework and provided three enterprise-ready solution architectures for AWS, GCP and Azure, using cloud-native services for the de-identification of the data. But what if you don't want to move your data to the public cloud before de-identifying it yourself in a trusted environment? What if you do not want to wait for sovereign cloud services to become generally available and need to act now? How would you de-identify your data yourself so that you can fully leverage public clouds for big data analytics?

In this article we answer these questions and show you how to own your data de-identification process in order to create a de-identified data lake on a public cloud and perform big data analytics. We will show you how to perform data de-identification at scale in a trusted on-premise environment before moving your data to the cloud, which technologies to use, and why. You will again find the familiar concept of the de-identification framework introduced in article 1. Finally, we provide an open source repository with a step-by-step guide that lets you quickstart the de-identification framework on-premise.

Foundations

Let's quickly recap the components of our de-identification framework. As described in article 1, our solution requires the following components:

  • Data catalog
  • PII detection/cataloging
  • Data de-identification pipeline
  • Data re-identification pipeline
  • De-identified data lake

Thinking in processes, the first step is to take a sample of a data asset, infer its schema, tag fields that may contain PII, and persist this information in the data catalog. We will refer to this data as the metadata and to the process as the PII detection and cataloging pipeline. The second step is to apply de-identification treatments to the data itself. For this we use the catalog metadata, so we know which fields need to be treated, and apply our privacy policies. After the data is de-identified, we finally move it to the de-identified data lake. We will refer to this second process as the data de-identification pipeline. In case you are not familiar with our de-identification framework, we highly recommend checking out article 1, where we describe it in detail. The next figure shows the de-identification framework as a technology-agnostic architecture, with steps describing each data pipeline.

Figure 1: De-identification framework

Remember, our goal is to own the de-identification process. Therefore the PII detection and data de-identification pipelines must be deployed on-premise. You should consider deploying the re-identification key store in a trusted environment too, since this component holds the keys to undo the de-identification treatments applied to your data. However, it is not required to deploy the data catalog on-premise. In fact, you might use public cloud services for this, such as AWS Glue, GCP Data Catalog or Azure Purview. We do not need to own the data catalog solution, since the catalog metadata does not contain any PII but solely describes our data at the dataset level.

Now that we have recapped the framework with all its components and pipelines, which technologies should we use to implement this solution? We understand that in a scenario where a customer has high mistrust towards cloud service providers, a common objective is to avoid vendor lock-in and other long-term commitments. Therefore we will select only open source technologies, allowing us the highest flexibility.

Technical Solution Overview

Now that we have a clear solution to the problem from a functional point of view, let's see how open source technologies fit each of the different needs. To present the solution, we will follow the scheme outlined in the de-identification framework and explain the reasons behind the choice of technology for each component. The following image shows what our de-identification framework looks like with our choice of open source technologies.

Figure 2: De-identification framework with open source technologies

In our research, we compared two commonly used data catalog solutions: DataHub and Apache Atlas. DataHub is LinkedIn's data catalog solution. Apache Atlas is a popular data catalog solution and part of the Apache Software Foundation. Our research showed that although DataHub is a powerful tool with a lot of potential, it is not yet production-ready for our purposes. Therefore we chose Apache Atlas as the data catalog solution for our proposal. Apache Atlas is used by large enterprises and service companies and has been generally available for more than six years. It is designed and built to support real data governance use cases and is adaptable to most needs.

For the metadata generator pipeline we decided to use Spark. Spark is a well-known framework used by large companies for data engineering, data science and machine learning workloads. With Spark we can access all kinds of data sources, such as file systems, databases or event-driven systems, and perform analysis and transformations on the data.
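As a rough illustration, the following sketch shows how a sample of a data asset could be read with PySpark and its schema inferred as input for the catalog metadata. The CSV path, sample size and local session are assumptions for demonstration, not part of our actual repository.

```python
from pyspark.sql import SparkSession

# Start a Spark session (in production this would run on your trusted on-premise cluster).
spark = SparkSession.builder.appName("metadata-generator").getOrCreate()

# Read the data asset with schema inference enabled; path and format are placeholders.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("data/movie_purchase_history.csv")
)

# Take a limited sample for PII analysis; a larger, more varied sample improves detection.
sample = df.limit(1000)

# The inferred schema is the starting point for the catalog metadata.
for field in sample.schema.fields:
    print(field.name, field.dataType.simpleString())
```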

For the detection and classification of PII in our data we chose Presidio. Presidio is an open source library developed by Microsoft that uses pre-trained ML models to identify and categorize the parts of our data that contain PII. One of the advantages that caught our attention is that Presidio lets you create your own recognizers and integrate them into the library to detect custom PII or sensitive data specific to your business, for example a user ID. In addition, with Presidio we can decide which de-identification method to use for each type of PII, and we can create our own de-identification functions.
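A minimal sketch of how this could look with the Presidio analyzer, assuming a made-up regex for a business-specific user ID and an example record value:

```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Custom recognizer for a business-specific "user ID"; the regex is a made-up example.
user_id_recognizer = PatternRecognizer(
    supported_entity="USER_ID",
    patterns=[Pattern(name="user_id", regex=r"\bU-\d{6}\b", score=0.8)],
)

analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(user_id_recognizer)

# Analyze a sampled field value; each result carries an entity type and a confidence score.
results = analyzer.analyze(
    text="User U-123456 (jane.doe@example.com) bought 'The Matrix'",
    language="en",
)
for r in results:
    print(r.entity_type, r.start, r.end, round(r.score, 2))
```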

The schema and metadata generation pipeline can be broken down into several stages. The first is reading a sample of the dataset; the larger and more varied this sample is, the more accurate the result of the cataloging process will be. The next step is to analyze the structure of the data: the fields that compose it and their types. Once the structure has been analyzed, we analyze the content of the data in search of possible PII. This is where Presidio comes into action, providing a confidence score for each piece of PII it identifies. The last step is to send the results of our analysis to our data catalog tool, in this case Apache Atlas.
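To give an idea of that last stage, the sketch below posts an enriched schema to the Atlas v2 REST API. It assumes a hypothetical custom entity type deid_dataset with these attributes already defined in Atlas, as well as default standalone credentials; none of this is prescribed by our repository.

```python
import requests

ATLAS_URL = "http://localhost:21000/api/atlas/v2/entity"  # default standalone Atlas endpoint

# Hypothetical custom entity type and attributes representing the inferred schema
# enriched with PII tags and confidence scores.
entity = {
    "entity": {
        "typeName": "deid_dataset",
        "attributes": {
            "qualifiedName": "movie_purchase_history@onprem",
            "name": "movie_purchase_history",
            "fields": [
                {"name": "user_email", "type": "string", "pii": "EMAIL_ADDRESS", "score": 0.95},
                {"name": "movie_title", "type": "string", "pii": None, "score": None},
            ],
        },
    }
}

response = requests.post(ATLAS_URL, json=entity, auth=("admin", "admin"))
response.raise_for_status()
print(response.json())
```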

After creating the schema and metadata automatically with the generation pipeline, data owners and data stewards should review the automatically generated catalog entries. By publishing the results in our data catalog, we democratize the data: anyone, not only technical staff, can see what information the different datasets contain, as well as their relationships, such as who owns them or where they are used.

With our metadata accessible from our data catalog, our pipeline is complete. The following image shows in a more visual way what this process looks like for an example "movie purchase history" data asset. Note how the raw data, the inferred schema and the schema enriched with PII metadata are assigned to different parts of the pipeline.

Figure 3: Metadata generator pipeline example
Figure 4: Detailed view of inferred schema enriched with PII metadata

Once all the necessary metadata has been extracted, analyzed, stored and reviewed, it is time to move on to the de-identification process. The de-identification pipeline requires a technology capable of distributed computing in order to handle the workload of several simultaneous pipelines. Our research shows that Apache Flink is a suitable technology for this pipeline. Flink adapts to both batch and streaming ingestion, with low latency and high throughput, while maintaining consistency and being fault tolerant. In addition, by using Spark and Flink we can take advantage of Apache Beam to unify all our data processing jobs under a single programming model.
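A skeleton of such a Beam pipeline targeting the Flink runner could look like the sketch below; the input and output paths, the Flink master address and the deidentify_record helper are placeholders rather than parts of our actual implementation.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def deidentify_record(line: str) -> str:
    """Placeholder: apply Presidio-based treatments to the PII fields of one record."""
    record = json.loads(line)
    # ... de-identify the fields tagged as PII in the catalog metadata ...
    return json.dumps(record)


# Run on an on-premise Flink cluster via the Beam Flink runner; the master address is a placeholder.
options = PipelineOptions(runner="FlinkRunner", flink_master="localhost:8081")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read raw records" >> beam.io.ReadFromText("raw/movie_purchase_history.jsonl")
        | "De-identify PII" >> beam.Map(deidentify_record)
        | "Write de-identified" >> beam.io.WriteToText("deidentified/movie_purchase_history")
    )
```

The same pipeline code can be executed on the DirectRunner for local testing or on other runners, which is precisely the benefit of Beam's unified programming model mentioned above.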

The first part of the de-identification process is to retrieve everything necessary to treat the data: the schema of the dataset and its metadata from Apache Atlas, as well as the encryption keys we need to apply reversible, encryption-based de-identification methods. Once all the necessary elements are obtained, the pipeline reads all the records of the dataset and applies the de-identification methods to the PII that we previously detected with Presidio.
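As a small illustration of this step, the following sketch uses the Presidio anonymizer to apply a reversible encryption treatment to a detected e-mail address and to replace any other PII. The AES key is a dummy standing in for one fetched from the re-identification key store, and the sample text is invented.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text = "User U-123456 (jane.doe@example.com) bought 'The Matrix'"

# Detect PII positions first (in the real pipeline this is driven by the catalog metadata).
analyzer_results = AnalyzerEngine().analyze(text=text, language="en")

# Dummy key; a real run would fetch it from the re-identification key store.
key = "WmZq4t7w!z%C&F)J"

anonymized = AnonymizerEngine().anonymize(
    text=text,
    analyzer_results=analyzer_results,
    operators={
        "EMAIL_ADDRESS": OperatorConfig("encrypt", {"key": key}),   # reversible treatment
        "DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"}),
    },
)
print(anonymized.text)
```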

Finally, we serialize our data with Avro to reduce its size and to validate that we did not break the schema when applying our de-identification treatments. Once our data is ready, it is sent to our de-identified data lake for use in big data analytics processes. The following image shows the de-identification pipeline for an example "movie purchase history" data asset, followed by a small serialization sketch.

Figure 5: De-identification pipeline example
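A minimal serialization sketch, assuming the fastavro library and a placeholder schema for the de-identified records; writing fails if a treatment changed a field in a way that no longer matches the declared types.

```python
from fastavro import parse_schema, writer

# Placeholder Avro schema for the de-identified "movie purchase history" records.
schema = parse_schema({
    "type": "record",
    "name": "MoviePurchase",
    "fields": [
        {"name": "user_email", "type": "string"},   # now holds the encrypted value
        {"name": "movie_title", "type": "string"},
        {"name": "purchase_date", "type": "string"},
    ],
})

records = [
    {"user_email": "qL2n0h...==", "movie_title": "The Matrix", "purchase_date": "2021-10-01"},
]

# Serialize to a compact binary file ready to be moved to the de-identified data lake.
with open("movie_purchase_history.avro", "wb") as out:
    writer(out, schema, records)
```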

It is possible that at some point you may need to re-identify the data in your data lake for some internal company process. To solve this problem, we reuse our de-identification process with the steps reversed, since the technologies involved are the same. The first part of this pipeline is identical: retrieve all the necessary information from our data catalog along with the required decryption keys. Then we read the data we want to re-identify from our de-identified data lake and send it to our on-premise process or generate the necessary artifact, as illustrated by the figure and the sketch below.

Figure 6: Re-identification pipeline example
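As a rough round-trip sketch with Presidio, using a dummy key and a single field value as assumptions: the items returned by the anonymizer carry exactly the offsets and entity types the deanonymizer needs to decrypt the value again.

```python
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig, RecognizerResult

key = "WmZq4t7w!z%C&F)J"  # dummy key; a real run would fetch it from the re-identification key store

# Encrypt one field value (standing in for a record already stored in the de-identified data lake).
original = "jane.doe@example.com"
encrypted = AnonymizerEngine().anonymize(
    text=original,
    analyzer_results=[RecognizerResult(entity_type="EMAIL_ADDRESS", start=0, end=len(original), score=1.0)],
    operators={"EMAIL_ADDRESS": OperatorConfig("encrypt", {"key": key})},
)

# Reverse the treatment using the offsets and entity types produced during de-identification.
restored = DeanonymizeEngine().deanonymize(
    text=encrypted.text,
    entities=encrypted.items,
    operators={"EMAIL_ADDRESS": OperatorConfig("decrypt", {"key": key})},
)
print(restored.text)  # jane.doe@example.com
```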

With this we have covered the de-identification framework with all of its components, pipelines and open source technologies. Our approach allows us to build a rich data catalog and use it to apply data de-identification treatments to build de-identified data lakes.

Public Repository with a Step-by-Step Guide

Did you get curious and want to start applying our de-identification framework yourself? Or do you just want to try out how the pipelines, the de-identification or the Apache Atlas data catalog work? We have got you covered! On top of this article series we are publishing a public repository with a step-by-step guide that helps you run our data de-identification framework within a few minutes. Check out our repository here. You will find detailed documentation, and the README will guide you through all the steps of the metadata generation, de-identification and re-identification pipelines. You can easily deploy our proposal with Docker or extend it to an enterprise-ready solution. The repository covers all pipelines and provides dummy data so you can start working immediately. In case you are lost or have any questions, feel free to contact us at Keepler for consulting services or to discuss these topics or your challenges further.

Figure 7: Getting started README guide

Conclusion

In this article series we have covered how European companies can leverage the public cloud for big data analytics while remaining security and privacy compliant. In the first article we identified the root of the problem: strict data privacy regulations and a general mistrust towards third parties such as external attackers, government authorities and the public cloud providers themselves. We presented a de-identification framework that allows us to build de-identified data lakes on public clouds without sacrificing data or cloud utility. In the second part of the series we presented three enterprise-ready solution architectures for the three major cloud providers AWS, GCP and Azure, and investigated the upcoming sovereign public cloud offerings that may change the European cloud market as we know it. Finally, in this last part of the series, we implemented our de-identification framework with open source technologies, allowing companies to own the de-identification process and deploy it in trusted environments. In addition, we provided a public repository with a deployable, step-by-step guide to our de-identification framework.

Our closing thought is that data privacy and security will become more and more relevant at a time when data is the most valuable asset for companies. We are fully aware that the success of data-driven companies depends directly on their ability to adapt quickly and generate business value with state-of-the-art approaches. Therefore we believe that the use of public cloud services is crucial for the success of European companies. Those who cannot adapt, or adapt slowly, will face intense competition and dramatic consequences. With this article series we hope to contribute to the adoption of the public cloud for big data analytics.

Contact us for more top-notch consulting services. Our vision is to become a key consulting player in Europe. Follow us on LinkedIn!

Authors

Diego Prieto, Cloud Architect at Keepler Data Tech.
Alexander Deriglasow, Cloud Engineer at Keepler Data Tech.
