Data Security & Data Privacy (part 2 of 2)

BigData Republic · Jun 10, 2016

A growing number of companies are on their way to becoming data-driven organizations. To take personal privacy into account, additional measures need to be taken when building data processing infrastructure. Data should be stored and processed securely, and privacy controls should be in place before sensitive information is processed.

In part one of this series, we focused on securing your big data infrastructure. In this second part, we discuss data privacy and look at various measures that can be put in place to address, at least to some extent, the privacy issues in your processing pipeline.

Part 2: Data Privacy in the Data Lake

Although many legislative reports exist on the topics of data privacy, privacy by design, and privacy in big data, few discuss how to actually implement measures in big data processing pipelines. In this blog post, we provide practical measures that can be applied to increase data privacy in your data lake. These measures are based on recommendations by the European Union Agency for Network and Information Security (ENISA), as published in their late 2015 report.

Data Minimization

A first step towards implementing data privacy in a data pipeline is to make sure only relevant personal information is stored. A top-down approach to big data greatly helps with this. Instead of first building a data lake and doing data science afterwards, reverse the process: start from a business value perspective and first perform machine learning analysis on smaller datasets that are already available, rather than gathering lots and lots of data without knowing whether you will ever need it. This gives a good first indication of which data is relevant for your analysis, and you can use it to store only the personal information that actually matters.
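As a minimal sketch of what this could look like in a Spark-based pipeline (the paths and column names below are hypothetical, chosen only for illustration), the curated dataset keeps just the fields the earlier analysis showed to be relevant and drops the rest of the personal information:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-minimization").getOrCreate()

# Hypothetical raw dataset; the path and column names are illustrative only.
raw = spark.read.parquet("/datalake/raw/customer_events")

# Keep only the fields the analysis actually needed, and drop personal
# fields (e.g. name, e-mail address) that the model never used.
minimal = raw.select("customer_id", "event_type", "event_timestamp", "region")

minimal.write.mode("overwrite").parquet("/datalake/curated/customer_events_minimal")
```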

A second step is to determine how long certain personal records should be retained. This requires a similar top-down approach to big data. It might turn out that your machine learning model is only 1% more accurate when trained on a full year of data compared to only a couple of months. That gives you actual facts on which to base your data retention policy.
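A retention policy like this can be enforced with a periodic batch job. The sketch below assumes the curated dataset from the previous example and a hypothetical 90-day window; the actual window should follow from your own analysis:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("retention").getOrCreate()

# Hypothetical retention window: suppose the analysis showed that roughly
# three months of history is enough, so older personal records are dropped.
RETENTION_DAYS = 90

events = spark.read.parquet("/datalake/curated/customer_events_minimal")

recent = events.filter(
    F.col("event_timestamp") >= F.date_sub(F.current_date(), RETENTION_DAYS)
)

# Write to a temporary location; swapping it in place of the original
# dataset would be handled by the surrounding batch job.
recent.write.mode("overwrite").parquet("/datalake/curated/customer_events_minimal_tmp")
```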

After you have gone through this process, you should inform your users which personal information you will actually store and process, and for how long, for example in a public privacy statement. Even though some fields need to keep their explicit values, it may turn out that others can be anonymized. This brings us to the second privacy enhancing technique (PET): anonymization.

Anonymization

Anonymization is the process of altering or masking personal data in such a way that individuals cannot be re-identified and no information about them can be learned. One simple technique is hashing. A hashing algorithm maps the original value (e.g. a person’s name) to a string of seemingly “random” characters. Hashing algorithms have the property that the same input always yields the same output, and knowing only the output, you cannot recover the input. This preserves both data utility and data privacy: data scientists can still link different records based on hashed identifiers, while not being able to directly identify whose information they are processing.
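A minimal sketch of such a deterministic hash in Python is shown below. Note that it uses a keyed hash (HMAC) rather than a bare hash, which is one common way to make it harder to reverse the digests with a dictionary of known names or e-mail addresses; the key name and example value are purely illustrative:

```python
import hashlib
import hmac

# Illustrative only: in practice this key would be stored securely,
# outside the data lake itself.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(value: str) -> str:
    """Deterministically map a personal identifier to a fixed-length digest.

    The same input always yields the same output, so records can still be
    joined on the hashed identifier, but the original value cannot be
    recovered from the digest alone.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(pseudonymize("alice@example.com"))
print(pseudonymize("alice@example.com"))  # identical digest, so records stay linkable
```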

However, although anonymization techniques work well on individual fields, it might still be possible to identify a person by combining fields and/or records from different datasets. This remains a trade-off. In addition, although many personal data fields are good candidates for anonymization (e.g. e-mail address, name), the original values of other personal data, such as age, may be needed to retain the utility of the dataset. Good techniques to still reach a reasonable level of privacy are aggregation in combination with data separation.

Aggregation & Separation

Data privacy can still be assured to a certain extent for personal data that cannot be anonymized. One way is to only report analysis results at a higher aggregate level, for example by first clustering personal data by age or by geographical region such that at least 10 people end up in each group. This way no individual’s privacy is at stake, while proper analysis is still possible. This technique is actually easier to apply to big data than to small data, simply because the sheer numbers will still show patterns after grouping.
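Below is a minimal sketch of this kind of aggregation with a minimum group size in PySpark. The dataset path, column names, and the 10-year age bands are hypothetical; the key point is that groups smaller than the threshold are suppressed before results are reported:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aggregation").getOrCreate()

MIN_GROUP_SIZE = 10  # report nothing about groups smaller than this

people = spark.read.parquet("/datalake/curated/customers")  # hypothetical path

aggregated = (
    people
    # Bucket exact ages into 10-year bands before aggregating.
    .withColumn("age_band", (F.floor(F.col("age") / 10) * 10).cast("int"))
    .groupBy("region", "age_band")
    .agg(F.count("*").alias("group_size"), F.avg("monthly_spend").alias("avg_spend"))
    # Suppress groups that are too small to protect individual privacy.
    .filter(F.col("group_size") >= MIN_GROUP_SIZE)
)

aggregated.show()
```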

Another accompanying technique is data separation in a distributed environment: storing different personal data fields (columns) across different servers. Used in combination with aggregation and anonymization, this prevents re-identification of an individual, since the fields belonging to one person are kept apart. Separation can be accomplished with technology that spreads columnar storage across different storage servers, such as Apache Cassandra or columnar HDFS storage formats like Parquet, which can be processed with Apache Spark.
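One simple way to sketch this separation is to write direct identifiers and analytical attributes to different storage locations, linked only through a pseudonymous key. The paths and columns below are hypothetical, and in a real deployment the two locations would sit on different storage servers or security zones:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("separation").getOrCreate()

customers = spark.read.parquet("/datalake/raw/customers")  # hypothetical path

# Direct identifiers and analytical attributes are written to separate
# locations, linked only through the pseudonymous customer_id.
identifiers = customers.select("customer_id", "name", "email")
attributes = customers.select("customer_id", "age", "region", "monthly_spend")

identifiers.write.mode("overwrite").parquet("/secure-zone/customer_identifiers")
attributes.write.mode("overwrite").parquet("/datalake/curated/customer_attributes")
```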

Auditing

Just as for data security, auditing who has access to the data and when is of the utmost importance. Although this might raise privacy concerns for the data scientists themselves, it allows companies to pro-actively signal when a data leak is happening and the privacy of potentially thousands of people is at stake. The Hadoop ecosystem provides tools such as Apache Ranger to perform audit logging. This detailed logging records which user has accessed (or tried to access) which part or service of your data platform. Applying anomaly detection algorithms to this data could enable the data lake to monitor its users’ actions all by itself. Together, these tools and techniques help with the root-cause analysis of security breaches or data leaks. This additional layer of security ensures that data leaks can be traced even when they were caused by authorized users.
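As a rough illustration of what such anomaly detection could look like, the sketch below assumes the audit trail has been exported as JSON records with at least a user and an event-time field; the actual Ranger field names and export location will differ per installation, and a real deployment would use a proper anomaly detection model instead of this naive rule:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("audit-anomalies").getOrCreate()

# Hypothetical audit export: one JSON record per access attempt, with at
# least a "user" and an "event_time" field (actual field names may differ).
audits = spark.read.json("/datalake/audit/ranger/")

daily_counts = (
    audits
    .withColumn("day", F.to_date("event_time"))
    .groupBy("user", "day")
    .count()
)

# Naive anomaly rule: flag a user/day whose access count is far above that
# user's historical average.
stats = daily_counts.groupBy("user").agg(
    F.avg("count").alias("avg_count"), F.stddev("count").alias("std_count")
)

flagged = (
    daily_counts.join(stats, "user")
    .filter(F.col("count") > F.col("avg_count") + 3 * F.coalesce(F.col("std_count"), F.lit(0)))
)

flagged.show()
```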

Control / Right to Be Forgotten

As a last measure, one should take into account that individuals should have control over their personal data. This means that an individual should be able to request that their personal data be removed from the system whenever they want. Large corporations such as Google already offer such options: links to personal information can be removed from Google’s search results upon legitimate request.

Implementing this measure might be one of the hardest, as it directly opposes many data engineering best practices, for example storing all data in an append-only format where new data is added but existing data is considered read-only until it hits its retention limit. There are, however, ways to implement it. One could, for example, write a batch job that removes all references to a specific person from all of the data lake’s datasets and re-compacts them to prevent data fragmentation (which degrades performance).
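A minimal sketch of such a batch job is shown below. The function name, identifier column, dataset path, and partition count are all hypothetical, and the final atomic swap of the rewritten data into place is left out:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("forget-user").getOrCreate()

def forget(customer_id: str, dataset_path: str) -> None:
    """Remove every record for one person from a dataset and rewrite it.

    The remaining records are repartitioned to keep file sizes healthy
    (re-compaction) and written to a temporary location; a final step
    (not shown) would atomically swap it in place of the original path.
    """
    df = spark.read.parquet(dataset_path)
    remaining = df.filter(F.col("customer_id") != customer_id)
    remaining.repartition(8).write.mode("overwrite").parquet(dataset_path + "_rewrite")

forget("c-1234", "/datalake/curated/customer_events_minimal")  # hypothetical id and path
```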

Conclusion

Privacy, like security, should be incorporated by design, not as an afterthought. If we are able to deal with data privacy challenges by applying appropriate measures, this can build more trust in the big data ecosystem. However, current state-of-the-art big data technology is not at all privacy focused, so applying these privacy measures still requires in-depth analysis and careful planning.

BigData Republic provides Integrated Data Solutions. We are experienced in deploying large-scale big data pipelines that take security and privacy into account. Interested in what we can do for you? Feel free to contact us.
