Introduction to Privacy-preserving AI & analytics

Maxime
Sep 21 · 5 min read
Privacy-preserving learning is about studying groups of individuals while not revealing information about each individual (Photo of a flock of birds by Kristaps Ungurs on Unsplash)

Data is powerful because it reveals how people think and behave. Whether you analyze business trends or train an AI model, the bigger and deeper the data, the more valuable the results. But as the opportunities for innovation increase, so do the privacy risks, with costlier breaches and stricter regulations.

To reconcile innovation and data protection, a new field has emerged: Privacy-preserving Learning. It is a set of theoretical frameworks and technologies that aim to develop AI and analytics solutions while preserving privacy throughout the data science workflow.

We will explore some of these frameworks, explain how they protect privacy, and show how Sarus offers a novel approach to privacy-preserving learning.

The goal of Privacy-preserving Learning

Privacy-preserving learning aims at realizing two seemingly contradictory objectives: learning all sorts of insights from data without revealing the event-level information the data is made of. To understand how, we need to define more precisely what must be kept secret and what can be shared. It is helpful to distinguish personal information, which relates to one individual, from general knowledge, which remains true regardless of whether any given individual is added to or removed from the data.

To preserve privacy, information that relates to individuals must be protected. General knowledge, on the other hand, can be extracted. Privacy-preserving learning is possible because machine learning is precisely about learning patterns that apply to individuals in general but are not specific to any one individual in particular. For example, one might want to learn whether there is a correlation between smoking and cancer without having any interest in whether a given participant smokes or has cancer.

The goal of Privacy-preserving Learning is to enable the acquisition of general knowledge while protecting personal information. These two objectives are not as contradictory as they might initially appear.

Protecting data throughout the data journey

When it comes to data protection, as with any security objective, one is only as protected as the weakest link, so personal data must be protected everywhere. A simple learning workflow has two steps: staging the data and sharing the results. Solutions have emerged to address data protection risks in both.

The staging phase consists of creating an environment in which data practitioners can work on the data. Protecting personal data during this staging process is referred to as Input Privacy.

A common approach to securing this flow is to turn the original dataset into a less sensitive version using Data Masking techniques. However, this approach has many known limitations: there is no way to assess how much protection it provides, it is hard to apply to rich or unstructured data, and expertise is needed to define masking rules for each use case.
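To make this concrete, here is a minimal sketch of what such masking rules could look like in Python with pandas, on a hypothetical table with name, email, and age columns (all names and values are illustrative, not taken from any real dataset):

```python
import hashlib
import pandas as pd

# Hypothetical individual-level records
df = pd.DataFrame({
    "name": ["Alice Martin", "Bob Stone"],
    "email": ["alice@example.com", "bob@example.com"],
    "age": [34, 57],
})

SALT = "change-me"

# Rule 1: pseudonymize direct identifiers with a salted hash
df["name"] = df["name"].apply(
    lambda v: hashlib.sha256((SALT + v).encode()).hexdigest()[:12]
)

# Rule 2: drop contact information entirely
df = df.drop(columns=["email"])

# Rule 3: generalize quasi-identifiers into coarse buckets
df["age"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["<30", "30-50", ">50"])

print(df)
```

Each rule has to be written by hand for every dataset and use case, and nothing in the process measures how much protection is actually achieved, which is exactly the limitation described above.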

To further minimize risk, the data owner can set up a remote execution environment so that data scientists work without accessing the data directly. This avoids copying the data and exposing the whole dataset to data scientists, which is the source of many leaks. With a remote execution framework, only what is learned is shared with the data scientist. When the data is spread across multiple locations (e.g., personal devices or hospital servers), learning remotely is often referred to as Federated Learning.
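As an illustration, here is a toy sketch of this data flow in Python with NumPy, assuming two hypothetical sites that each hold their own records: every site fits a model locally, and only the fitted parameters travel to a coordinator that averages them. Real federated learning frameworks add secure aggregation and many other safeguards; this only shows that the raw records never leave the sites.

```python
import numpy as np

def local_fit(X, y):
    # Closed-form least squares, computed where the data lives
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical data split across two sites (e.g. two hospitals)
rng = np.random.default_rng(0)
sites = []
for _ in range(2):
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
    sites.append((X, y))

# Each site shares only its fitted coefficients, never its records
local_models = [local_fit(X, y) for X, y in sites]

# The coordinator averages the parameters into a global model
global_model = np.mean(local_models, axis=0)
print(global_model)
```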

Following a similar logic, the staging phase can implement cryptographic techniques that allow learning on encrypted data, so that data practitioners no longer see the original data used in their computations. A whole field of cryptography research focuses on computation over encrypted data, with techniques like Homomorphic Encryption or Secure Multi-Party Computation. Encryption is especially useful when the computation cannot take place where the data was originally located: it allows moving the data without the associated risk.
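For a flavor of what computing on encrypted data looks like, here is a small additively homomorphic sketch, assuming the open-source python-paillier (phe) package is installed; the values are hypothetical. An untrusted party can sum numbers it cannot read, and only the key holder can decrypt the total.

```python
from phe import paillier

# The data owner generates the key pair and keeps the private key
public_key, private_key = paillier.generate_paillier_keypair()

# Individual values are encrypted before leaving the data owner
salaries = [42_000, 55_500, 61_250]
encrypted = [public_key.encrypt(s) for s in salaries]

# An untrusted server adds ciphertexts without seeing any plaintext
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the key holder can decrypt the aggregate result
print(private_key.decrypt(encrypted_total))  # 158750
```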

At Sarus, we believe that the ability to learn remotely is a must-have to overcome the limitations of data masking. We will always make sure that the data stays safe and where it belongs. Because our clients have computing power in their own data infrastructure, they do not need to move data and can compute in the clear, without the need for complicated encryption layers.

One thing to keep in mind is that learning remotely or on encrypted data does not, by itself, guard against leakage of personal information. The output of the computation can be as sensitive as the original data: it may focus on one individual or even be a copy of it. To address this risk, we need to focus on the second step of our workflow.

Whether the output of the computation is the response to a query or the result of a full machine learning training run, it may reveal personal information. Making sure that the output protects privacy is referred to as Output Privacy.

Historically, researchers have resorted to simple heuristics to ensure that outputs no longer contain identifying information. Techniques like aggregating with a high enough threshold, applying k-anonymity, or l-diversity provide some benefits, but all have well-documented weaknesses. The main one is that they assume an attacker has limited background knowledge, an assumption that becomes less and less tenable given the profusion of public information available on individuals. They also become impractical when the output data is high-dimensional.
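The aggregation-threshold heuristic, for instance, might look like the following pandas sketch on a hypothetical table of quasi-identifiers: groups smaller than k are simply suppressed before release. Nothing in this rule bounds what an attacker with enough background knowledge can still infer from the rows that are published.

```python
import pandas as pd

# Hypothetical event-level data keyed by quasi-identifiers
df = pd.DataFrame({
    "zip": ["75001", "75001", "75002", "75002", "75002"],
    "age_band": ["30-50", "30-50", "<30", "<30", "<30"],
    "smoker": [1, 0, 1, 1, 0],
})

K = 3  # minimum group size before a statistic may be released

stats = df.groupby(["zip", "age_band"]).agg(
    n=("smoker", "size"),
    smoking_rate=("smoker", "mean"),
)

# Suppress groups with fewer than K individuals
released = stats[stats["n"] >= K]
print(released)
```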

Recent research in mathematics provides us with the right framework to address this: Differential Privacy. It bounds how much the presence or absence of any single individual can influence the output of a computation, typically by adding carefully calibrated noise. Implementing it guarantees that the output of a calculation does not reveal significant information about any individual.
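As a minimal illustration, here is a sketch of the Laplace mechanism, the textbook way to release a counting query with differential privacy; the cohort and the epsilon value are hypothetical. The smaller epsilon is, the more noise is added and the stronger the guarantee.

```python
import numpy as np

def dp_count(records, epsilon):
    """Release a count with the Laplace mechanism.

    Adding or removing one individual changes a count by at most 1
    (sensitivity = 1), so Laplace noise of scale 1/epsilon yields
    epsilon-differential privacy for this query.
    """
    noise = np.random.default_rng().laplace(loc=0.0, scale=1.0 / epsilon)
    return len(records) + noise

# Hypothetical list of participants who both smoke and have a diagnosis
cohort = ["id_017", "id_142", "id_309"]
print(dp_count(cohort, epsilon=0.5))
```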

All prior protections may be worthless if the output is not safe. At Sarus, we built Differential Privacy into the core of the engine so that any result that is learned is privacy-protected, irrespective of the type of data or calculation.

Building the foundations for safer and more efficient data innovation

At Sarus, we believe that data science workflows should address privacy concerns across the entire data journey. Resorting to a robust mathematical framework is indispensable for both safety and scalability. Limiting the number of people who access sensitive data is also key to reducing risk. This is not so much a question of trusting an internal data science team; it is the only way to leverage data across business lines and countries, or to collaborate on data with external partners.

We also want to weave privacy preservation into existing data infrastructures and workflows without having to radically change how data is harvested, stored, and analyzed today.

This is why we are building a versatile Differentially-Private Remote Learning platform that addresses both input privacy and output privacy in one go. Sarus is installed next to the original data, and data practitioners can work on any dataset seamlessly through the Sarus API. They interact with remote data just as they would with local data, except that they never access individual information. As a consequence, all models and insights are provably anonymous.

Sarus both accelerates innovation and improves data protection practices. Companies can work with external partners such as AI vendors or consultants without the risk of leaking personal data. It creates new collaboration opportunities around sensitive data for innovation or research, with numerous applications in industries where data is highly protected, such as healthcare.

The use of data for AI applications will grow exponentially if data can be leveraged across departments, borders, and companies. But to get there, it cannot be the data that travels; it must be the general knowledge, which is also where the real value lies.

Sarus Technologies

Powerful data science & analytics — Stronger data protection

Written by Maxime

Cofounder and CEO @ Sarus

Sarus Technologies

Sarus empowers enterprises who want to leverage their most sensitive data to create better AI models and database analyses.
