Data Security & Data Privacy (part 1 of 2)

BigData Republic · Published in bigdatarepublic · May 27, 2016 · 5 min read

Photo credit: perspecsys.com

In 2014, the European Union Court of Justice ruled (EU Data protection factsheet) that individuals have the right to ask search engines to remove links to personal information about them. In the Netherlands, since 1 January 2016, companies and public authorities are required to report data leaks (website in Dutch) to a supervisory authority and, in some cases, to the individuals involved as well.

At the same time, a growing number of companies are on their way to becoming data-driven organizations. To comply with these legal developments, additional measures need to be taken when building data processing infrastructure. These measures, however, might clash with data engineering best practices. A simple example is the preferred append-only approach to data gathering: stored events can easily contain personal information about your customers. What happens to your system when someone requests the removal of some of their personal data and thereby breaks the append-only assumption?

In this two-part blog post, we look at the implications of these legal developments for your big data infrastructure and analytics pipeline. In part one we focus on securing your big data infrastructure. In part two we look at measures that can be put in place to address, at least in part, privacy issues in your processing pipeline.

Part 1: Data Lake Security

Good data lake security is like an onion: the core is surrounded by multiple layers of security measures. Reaching the core, and with it the valuable information (your customers’ data and the algorithms that encode how you drive your business towards a competitive edge), means that all those layers must be breached. Important layers in data lake security are firewalling, authentication/authorization, encryption, and auditing. We will address each of these layers in this blog post.

Firewalling / access control

A firewall forms the first layer of defense around your data lake. A firewall is a system that protects a computer, or a network of computers, against abuse from outside sources. What “outside” means depends on the level of security you want to put in place. Unfortunately, security breaches and data leaks are not always caused by external sources. In large organizations it is therefore good practice to place sensitive infrastructure, such as a data lake, behind a dedicated firewall. This way, access to the data lake can be tightly controlled at the network level. A firewall lets you secure the network with relatively simple measures such as packet filtering, or with more advanced measures such as stateful filtering and application-layer filtering.
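As a rough illustration of the difference, the sketch below evaluates packets against a stateless, ordered rule list; all addresses, ports, and rules are made up for the example. A stateful filter would additionally track established connections before deciding.

```python
from ipaddress import ip_address, ip_network

# Ordered rule list for a stateless packet filter: each packet is judged on its
# own, without any notion of an existing connection.
RULES = [
    {"action": "allow", "src": "10.0.0.0/8", "dst_port": 8443},  # internal clients -> gateway
    {"action": "deny",  "src": "0.0.0.0/0",  "dst_port": None},  # everything else
]

def filter_packet(src_ip: str, dst_port: int) -> str:
    for rule in RULES:
        src_matches = ip_address(src_ip) in ip_network(rule["src"])
        port_matches = rule["dst_port"] in (None, dst_port)
        if src_matches and port_matches:
            return rule["action"]
    return "deny"  # default deny when no rule matches

print(filter_packet("10.1.2.3", 8443))    # allow
print(filter_packet("203.0.113.7", 22))   # deny
```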

To further strengthen the security enforced by a firewall, you can put your data lake behind a proxy. A proxy hides the network topology of the cluster behind it. Besides this security advantage, it also helps the scalability of your architecture. Furthermore, proxies can be integrated with authentication systems. In the Hadoop ecosystem, Apache Knox is a well-known example of such a proxy. It integrates tightly with components from the Hadoop stack and can connect to identity management systems such as Active Directory or LDAP. This brings us to the topic of authentication.
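As a minimal sketch of what client access through such a proxy can look like, the snippet below lists an HDFS directory via a Knox gateway using Python’s requests library. The hostname, topology name, credentials, and certificate path are assumptions made for the example, not values from a real deployment.

```python
import requests

# Hypothetical Knox gateway endpoint; host, port, and topology name ("default")
# depend on your own deployment.
KNOX_URL = "https://knox.example.com:8443/gateway/default/webhdfs/v1"

# List a directory on HDFS through the Knox proxy instead of talking to the
# NameNode directly; Knox authenticates the caller and hides the cluster topology.
response = requests.get(
    f"{KNOX_URL}/user/alice?op=LISTSTATUS",
    auth=("alice", "s3cret"),               # validated against LDAP / Active Directory
    verify="/etc/ssl/certs/knox-ca.pem",     # trust the gateway's TLS certificate
)
response.raise_for_status()
print(response.json()["FileStatuses"]["FileStatus"])
```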

Authentication / Authorization

The next layers of security are authentication and authorization. Authentication is the process of ensuring that people or services accessing data or algorithms on the data lake really are who they claim to be. A widely used method for this is the Kerberos protocol, which is integrated into enterprise identity management systems such as Active Directory. For human end users, security can be further enhanced by requiring two-factor authentication: end users first provide their username/password combination, followed by a code generated on an external device, usually by a smartphone app.
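For services, the sketch below shows what Kerberos-authenticated access to a SPNEGO-protected endpoint such as WebHDFS can look like from Python, assuming a ticket has already been obtained with kinit. The hostname and port are made up, and the requests-kerberos package is just one of several client options.

```python
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

# Assumes a Kerberos ticket is already in the cache (e.g. obtained via `kinit alice`).
# The service verifies the ticket via SPNEGO instead of receiving a password.
kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

response = requests.get(
    "https://namenode.example.com:9871/webhdfs/v1/data?op=LISTSTATUS",  # made-up host/port
    auth=kerberos_auth,
    verify=True,
)
response.raise_for_status()
print(response.json())
```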

Knowing who is requesting access to data or a service is not enough. It may well be that certain persons or services should only access a limited subset of the data and predictive services available on your infrastructure. Most organizations that take security seriously apply the principle of least privilege: both users and applications only get access to the minimal set of data and services they need to operate properly. This is what authorization describes: who gets access to what. In the Hadoop ecosystem, Apache Ranger offers centralized security administration and lets you control access to the data and services in your data platform.
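The toy sketch below illustrates the least-privilege idea with a default-deny policy check. It is not the Apache Ranger API, which manages such policies centrally, but the decision logic is conceptually similar: nothing is allowed unless a policy explicitly grants it.

```python
from fnmatch import fnmatch

# Toy policy store: each entry grants a group a set of operations on a resource
# pattern. Anything not explicitly granted is denied.
POLICIES = [
    {"resource": "/data/sales/*", "group": "analysts",  "allowed": {"read"}},
    {"resource": "/data/sales/*", "group": "engineers", "allowed": {"read", "write"}},
]

def is_authorized(user_groups: set, resource: str, operation: str) -> bool:
    for policy in POLICIES:
        if (policy["group"] in user_groups
                and fnmatch(resource, policy["resource"])
                and operation in policy["allowed"]):
            return True
    return False  # default deny

assert is_authorized({"analysts"}, "/data/sales/2016.csv", "read")
assert not is_authorized({"analysts"}, "/data/sales/2016.csv", "write")  # least privilege
```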

Auditing

Knowing who is using the infrastructure, and trusting that this person or service really is who it claims to be, is not enough. In case of security breaches or data leaks it is essential to understand who has had access to what. The Hadoop ecosystem provides tools such as Apache Ranger to perform audit logging. This detailed logging records which user or service has accessed (or tried to access) which part of your data platform. Applying anomaly detection algorithms to this data could even enable the data lake to monitor its users’ actions by itself. Together, these tools and techniques help with the root-cause analysis of security breaches and data leaks. This additional layer of security ensures that data leaks can be traced, even when they were caused by authorized users.
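As a toy example of such monitoring, the sketch below flags users with an unusually high number of denied requests in a made-up audit log. Real audit records, for instance those produced by Apache Ranger, carry far more detail, but the idea is the same.

```python
from collections import Counter
from statistics import mean, stdev

# Made-up audit records: who tried to access what, and whether it was allowed.
audit_log = [
    {"user": "alice", "resource": "/data/hr/salaries", "allowed": False},
    {"user": "bob",   "resource": "/data/sales/2016",  "allowed": True},
    {"user": "alice", "resource": "/data/hr/salaries", "allowed": False},
    # ... many more records in practice
]

# Count denied requests per user and flag users far above the typical count.
denied_per_user = Counter(r["user"] for r in audit_log if not r["allowed"])
counts = list(denied_per_user.values())
threshold = mean(counts) + 3 * stdev(counts) if len(counts) > 1 else float("inf")

suspicious = [user for user, n in denied_per_user.items() if n > threshold]
print(suspicious)  # users whose denied-request count is anomalous
```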

Encryption

Finally, data encryption is key to securing the data in a data lake. It is imperative to encrypt data both in transit and at rest. Encryption in transit ensures that the data communication between different servers is secured: eavesdroppers will not be able to make sense of the data transferred between the components of the system. Encryption at rest encrypts the data on the physical hard disks, which renders the data useless when physical storage disks are compromised.
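To make the at-rest idea concrete, the sketch below encrypts an event record before writing it to disk, using the cryptography package’s Fernet recipe. In a real cluster this is handled transparently by the platform (for example HDFS encryption zones), and keys live in a key management service rather than next to the data.

```python
from cryptography.fernet import Fernet

# In practice the key comes from a key management service, not from the code.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt an event record before it touches disk ...
with open("events.enc", "wb") as f:
    f.write(fernet.encrypt(b'{"customer_id": 42, "event": "purchase"}'))

# ... and decrypt it when reading it back. A stolen disk only yields ciphertext.
with open("events.enc", "rb") as f:
    plaintext = fernet.decrypt(f.read())
print(plaintext)
```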

Many components in the Hadoop ecosystem allow encryption to be configured. Configuring encryption correctly is crucial for securing the data in the cluster.

Conclusion

Security is only as good as the weakest link, or, to stay with the onion metaphor, the thinnest layer. Stacking multiple layers of security measures strengthens the overall defense of your valuable data and intellectual property. In any case, data platform security is something to take into account from the beginning, not something to add as an afterthought. Many of the security measures mentioned above require analysis and careful planning to be implemented properly.

BigData Republic provides Integrated Data Solutions. We are experienced in deploying large-scale big data pipelines that take security and privacy into account. Interested in what we can do for you? Feel free to contact us.
