Data Privacy & Data Science: The Next Generation of Data Experimentation.


Collection of consumer data is now mainstream and global. The generally accepted premise of the big data movement is to collect as much data as possible, since storage is, in theory, cheap. You then use distributed computing and advanced analytics to sift through the mess and answer complex questions across your enterprise. I still believe this isn't necessarily a great idea (there is such a thing as bad data), but I digress.

To the point of this article: as a society, we have reached the limit of our tolerance for the exposure of our data. The global populace is asking that the IT industry be held responsible for safeguarding individual data. If the cat is out of the bag and collection will not stop, then the next logical question is: how do we protect the privacy of individuals and groups?

In this new world order, data collection must come with a corporate responsibility to protect data. Sometimes this is a legal responsibility; other times it's a social responsibility. Social responsibility is complicated and truly a grey area, because it's all about what you feel is "right." Recently, society has influenced policy, and some very rigidly defined data privacy controls now exist in the form of legislation.

Let's take the EU General Data Protection Regulation (GDPR) as an example of enforceable data privacy legislation. The intent of the regulation is to provide a single set of enforceable rules for data protection throughout the EU, thereby making it easier for non-European companies to comply. The regulation applies if the data controller or processor (organization) or the data subject (person) is based in the EU. Furthermore, and unlike the Data Protection Directive it replaces, the regulation also applies to organizations based outside the European Union if they process the personal data of EU residents.

GDPR is not just a slap on the wrist. If you suffer a breach or misuse the data, you may be fined up to €20,000,000 or, in the case of an enterprise, up to 4% of the annual worldwide turnover of the preceding financial year, whichever is greater.
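To make the "whichever is greater" clause concrete, here is a minimal sketch with a hypothetical turnover figure (illustration only, not legal advice):

```python
def gdpr_fine_cap(annual_turnover_eur: float) -> float:
    """Return the maximum possible GDPR fine: the greater of
    EUR 20 million or 4% of annual worldwide turnover."""
    return max(20_000_000, 0.04 * annual_turnover_eur)

# A hypothetical enterprise with EUR 2 billion in annual turnover:
# 4% of 2,000,000,000 = 80,000,000, which exceeds 20,000,000,
# so the applicable cap is EUR 80 million.
print(gdpr_fine_cap(2_000_000_000))  # 80000000.0
```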

So what does all this mean? Enterprises must begin to separate security from privacy. Encryption, defensive cyber controls, and the like are security policies.

Privacy, on the other hand, is a data management problem with a business process wrapped around it, culminating in a data governance strategy for the organization. This includes actual human roles such as a data protection officer, data controllers, and data processors. And it includes audit and compliance reporting, with data lineage and provenance attached to the data.

Sounds boring? Kind of, but a well-built governance strategy creates a workflow in which advanced analytics are built with data privacy at the core of the design. And that is important. Designing models and analytics first and then going back to add data privacy controls is much, much more difficult, and sometimes impossible, or at least very risky.

The problem is that there is little to help organizations get started; enforcement is left entirely to leadership. My time working with sensitive data within the US Government taught me that data privacy is an ongoing process, and tools must be inserted to abstract legal policy decisions away from individual humans. Otherwise, engineers either take their own approach to the rules and regulations or, worse, circumvent them. And don't forget that regulations change: if you bake policy logic into your code in a custom way, your total O&M costs skyrocket with every change.
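To illustrate the maintenance problem, here is a minimal sketch (hypothetical names, not any particular product's implementation) contrasting a policy baked into an analytic with one fetched from an external policy store at run time:

```python
import pandas as pd

# Hard-coded approach: the regulation lives inside the analytic.
# Every time the rule changes, this code must be found, edited,
# re-tested, and re-deployed.
def load_customers_hard_coded(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["country"] != "DE"]             # rule baked into code
    return df.drop(columns=["ssn", "email"])   # masking baked into code

# Externalized approach: the analytic asks a policy store what to do,
# so a rule change is a configuration change, not a code change.
POLICY_STORE = {  # stand-in for a centrally governed policy service
    "customers": {"exclude_countries": ["DE"], "drop_columns": ["ssn", "email"]}
}

def load_customers_governed(df: pd.DataFrame) -> pd.DataFrame:
    policy = POLICY_STORE["customers"]
    df = df[~df["country"].isin(policy["exclude_countries"])]
    return df.drop(columns=policy["drop_columns"], errors="ignore")
```

The analytic code on the governed path never needs to change when the regulation does; only the policy entry does.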

This is why we started a company, Immuta, to solve this problem. Any sane data scientist would rather inherit data lineage and data access controls than implement a custom solution of their own. Our goal is to make the entire legal process around data completely transparent to the data scientist.

Data scientists face two problems when it comes to designing models in highly regulated environments:

  1. How do I design models on top of regulated data without violating regulation, risking the privacy of the consumer, or spending a lot of time writing custom controls into my code?
  2. How do I deploy models that run on top of data whose policies are constantly changing?

To mitigate these issues, the following must be implemented and enforced (a rough sketch in code follows the list):

  • Policies built into the data source that can be changed dynamically
  • A common access layer to enforce policies and control data access
  • An abstraction of existing identity management from the app or analytic, much like a single sign-on but for data
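Here is a minimal sketch of what those three pieces might look like together (hypothetical names and APIs, not Immuta's actual implementation): policies live with the data source, every read goes through a common access layer, and the analytic never handles identity directly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import pandas as pd

# 1. Policies attached to the data source, changeable at run time.
@dataclass
class User:
    username: str
    attributes: Dict[str, str]  # e.g. {"purpose": "fraud-detection", "region": "EU"}

@dataclass
class DataSource:
    name: str
    data: pd.DataFrame
    policies: List[Callable[[pd.DataFrame, User], pd.DataFrame]] = field(default_factory=list)

# 2. A common access layer: the only path to the data, so policies
#    are enforced consistently for every app and analytic.
class AccessLayer:
    def __init__(self) -> None:
        self._sources: Dict[str, DataSource] = {}

    def register(self, source: DataSource) -> None:
        self._sources[source.name] = source

    def read(self, source_name: str, user: User) -> pd.DataFrame:
        source = self._sources[source_name]
        df = source.data
        for policy in source.policies:  # policies evaluated on every request
            df = policy(df, user)
        return df

# 3. Identity abstraction: the analytic receives a User resolved from
#    existing identity management (SSO, LDAP, ...) rather than handling
#    credentials or entitlements itself.
def resolve_user(token: str) -> User:
    # stand-in for a lookup against the enterprise identity provider
    return User(username=token, attributes={"region": "EU"})
```

A data owner could then register a source with, say, a policy that masks a column for certain purposes, and every analytic that reads through the access layer inherits that rule automatically.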

This concept starts with data access. You must empower data owners to expose their data in a way that lets them control and monitor access, while providing a simple way to apply mission-unique policies. The following video walks through the Immuta approach of virtualizing data and applying GUI-driven policies to data without needing an engineer in the loop. Data owners and data custodians are able to expose data sources backed by databases, APIs, and file systems through a GUI-based approach.

Once a data source is exposed and access is controlled, the next logical problem is how to execute code on top of that data while enforcing dynamic policies per user and/or machine. The following video goes through the Immuta process of enforcing policies while data scientists query and analyze data, as the policies on the data change and the authorizations of each user are managed:
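Conceptually (again a hypothetical sketch, not the product's API), dynamic enforcement means the policy is re-evaluated on every query for the calling user, so the same analytic can return different results as policies or authorizations change, without any change to the analytic's code:

```python
import pandas as pd

# In-memory stand-ins for a governed data source and its current policies.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "country":    ["US", "DE", "FR"],
    "diagnosis":  ["A", "B", "C"],
})

# Current policy: which countries each purpose may see, and which
# columns are masked. A data owner can change this at any time.
policy = {
    "allowed_countries": {"research": ["US", "DE", "FR"], "marketing": ["US"]},
    "masked_columns":    {"research": [], "marketing": ["diagnosis"]},
}

def query(user: dict, columns: list) -> pd.DataFrame:
    """Re-evaluate the current policy for this user on every call."""
    purpose = user["purpose"]
    df = patients[patients["country"].isin(policy["allowed_countries"][purpose])]
    visible = [c for c in columns if c not in policy["masked_columns"][purpose]]
    return df[visible]

researcher = {"username": "alice", "purpose": "research"}
marketer   = {"username": "bob",   "purpose": "marketing"}

print(query(researcher, ["patient_id", "country", "diagnosis"]))  # all rows, all columns
print(query(marketer,   ["patient_id", "country", "diagnosis"]))  # US rows only, diagnosis hidden

# Later, regulations tighten: EU rows are restricted for research too.
policy["allowed_countries"]["research"] = ["US"]
print(query(researcher, ["patient_id", "country", "diagnosis"]))  # same code, new result
```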

At Immuta, our goal is to help bring forth the next evolution of data experimentation, and we believe privacy will be at the core of it. It is no longer acceptable to risk the exposure of sensitive data. But we cannot stop the use of data, and therefore we must help the enterprise build privacy in from the start.
