Privacy Enhancing Technologies and why they’re vital for healthcare innovation

Data is the lifeblood of AI-powered research. But in healthcare, a data access problem is stifling much-needed innovation. An emerging set of privacy-enhancing technologies ('PETs') offers a clever solution, enabling AI researchers and startups to access data on some of the world's biggest problems.

Rachel Dulberg · CodeX · 9 min read · Sep 2, 2021

image by freepik.com

The COVID-19 pandemic has supercharged the scope of the issues the global healthcare industry was already grappling with. When the pandemic arrived, healthcare organisations often struggled to find the basic information they needed to respond — whether it was disease and death rates or the availability of hospital beds and critical supplies. Among other problems, the pandemic highlighted the desperate need for collaborative data analytics in healthcare.

As McKinsey observed, healthcare's digital barriers are often decidedly non-technological. The technology is out there (or rapidly evolving). In October 2020, Pfizer and IBM researchers announced that they had developed a machine learning technique that can predict Alzheimer's disease years before symptoms develop. IBM, Salesforce, and Google, among others, have developed AI tools to predict the onset of conditions like diabetes, diabetic retinopathy, breast cancer, and schizophrenia. AWS launched Amazon HealthLake in December 2020, a HIPAA-compliant tool that enables users to aggregate, search and analyse data to make more precise predictions about the health of their patients and populations. In the past year alone, over US$13.8 billion has been poured into companies and projects to bring the power of machine learning to drug discovery.

THE DATA ACCESS CONUNDRUM

According to McKinsey, culture and mindset, organisational structure, and governance are common roadblocks to digital adoption in the healthcare sector. Dig a little deeper and you'll find that one of the biggest barriers stifling progress and slowing innovation in healthcare today is the private data access problem.

the data access problem in healthcare
Source: https://owkin.com/federated-learning/data-privacy-healthcare/

The interoperability issue

One half of the data access problem is known as the 'interoperability issue'. Put simply, data is the lifeblood of AI-powered research. Algorithms are only as good as the data we use to train them. Accessibility and availability of high-quality data is the first step in innovation. The biggest AI breakthroughs, from ImageNet to AlphaGo and Deep Blue, were achieved thanks to access to large amounts of data.

The existence of data in itself is not the problem. In the healthcare sector, medical data is abundant. Today, around 30% of the world's data volume is generated by the healthcare industry, and that share is expected to grow at roughly 36% a year through 2025. The World Economic Forum estimates that hospitals produce 50 petabytes of data per year. Yet 97% of all global data produced by hospitals each year goes unused.

The reason? Data is often extremely siloed and highly regulated, and there is a lack of common standards for connected health services. Interoperability can bring patients’ records together from a range of systems and provide access to data from disparate sources, thus enabling greater visibility, research and innovation.

Interoperability standards, such as Fast Healthcare Interoperability Resources (FHIR), Digital Imaging and Communications in Medicine (DICOM) and Integrating the Healthcare Enterprise (IHE), have been around for years, and others are currently being developed (e.g. in Australia). But the market is still fragmented. As a result, medical data exchange and integration remain difficult, time-consuming and costly.
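To make the interoperability piece concrete, here is a minimal sketch of a patient record expressed as a FHIR R4 Patient resource in Python. The field values and identifiers are invented for illustration; real records carry far more detail.

```python
import json

# A minimal FHIR R4 "Patient" resource expressed as a Python dict.
# All identifiers and values below are invented for illustration only.
patient = {
    "resourceType": "Patient",
    "id": "example-patient-001",
    "identifier": [
        {"system": "http://hospital.example.org/mrn", "value": "MRN-12345"}
    ],
    "name": [{"family": "Smith", "given": ["Jane"]}],
    "gender": "female",
    "birthDate": "1980-04-12",
}

# Because FHIR-conformant systems agree on this structure, records can be
# exchanged and merged without bespoke point-to-point mappings.
print(json.dumps(patient, indent=2))
```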

the growth of healthcare data

The AI privacy issue

The second half of the data access problem stems from data custodians’ reluctance to share data due to legal, privacy and security concerns.

ML algorithms train on a lot of data and update their parameters to encode relationships and patterns in that data, in order to make accurate, reliable and useful predictions. In many sectors, such as retail, banking, government and healthcare, this data includes sensitive and personally identifiable information ('PII'), e.g. names, addresses, age, gender, biometric data, genetic data, financial records and tax data.

Ideally, we want ML models to encode general patterns from the data (e.g. ‘‘Patients who smoke are more likely to have heart disease’’) rather than facts about specific training records in the underlying dataset (e.g. “John Smith has heart disease”). Unfortunately, ML algorithms do not learn to ignore these specifics by default. If we open source an ML model to make it available to, say, the medical community or the public at large, we might accidentally reveal information about the specifics of the training set. A malicious attacker might then be able to reverse engineer the model and learn private information about individuals within the dataset using common data mining techniques.
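One way to see this leakage, sketched below under toy assumptions (an invented tabular 'patient' dataset and a deliberately overfit scikit-learn model), is to compare the model's confidence on records it trained on against records it has never seen; a large gap is exactly the signal that membership-inference attacks exploit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Invented "patient" data: 6 numeric features, binary outcome.
X = rng.normal(size=(300, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deliberately overfit: unbounded-depth trees memorise individual records.
model = RandomForestClassifier(n_estimators=100, max_depth=None, random_state=0)
model.fit(X_train, y_train)

def mean_confidence(model, X, y):
    """Average probability the model assigns to each record's true label."""
    proba = model.predict_proba(X)
    return proba[np.arange(len(y)), y].mean()

print("confidence on training records:", round(mean_confidence(model, X_train, y_train), 3))
print("confidence on unseen records:  ", round(mean_confidence(model, X_test, y_test), 3))
# A large gap between the two numbers means the model has encoded
# record-level specifics, which is what membership-inference attacks exploit.
```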

THE PROMISE OF PETs

Enter Privacy Enhancing Technologies (‘PETs’), a set of emerging machine learning technologies that address privacy concerns. PETs are hailed for their ability to minimise personal data use, maximise data security, and empower developers to build privacy into their models. Gartner named privacy-enhancing computation as one of the top strategic technology trends for 2021 and predicts that by 2025, 50% of large organisations will adopt PETs for processing data in untrusted environments and multiparty data analytics use cases.

Google has been leading the way in recent years with RAPPOR (Randomised Aggregatable Privacy-Preserving Ordinal Response), alongside LinkedIn's PriPeARL framework (privacy-preserving analytics and reporting) and Apple's and Microsoft's differential privacy deployments.

The UK Government's Centre for Data Ethics and Innovation has recently published an Adoption Guide for PETs, intended to help organisations consider how PETs could unlock opportunities for data-driven innovation whilst protecting the privacy and confidentiality of sensitive data.

Today, PETs are being used by many companies and governments across a variety of sectors and contexts, not only to protect sensitive information but also to cement better privacy practices and enhance digital trust in the market. Orange, Salesforce and NVIDIA are real-world examples of how PETs can be harnessed to protect customer, patient and institutional data while building groundbreaking products.

PETs cover a range of technologies, from relatively simple ad-blocking browser extensions to the Tor network for anonymous communication, and are commonly divided into two main categories: traditional and emerging.

Traditional PETs are well-established privacy techniques, such as encryption schemes that secure information in transit and at rest, and de-identification/anonymisation techniques such as tokenisation and k-anonymity. The trouble with anonymisation is that there is always a residual risk that data can be re-identified by linking it with other datasets, inferring information from proxy variables, or by applying advanced data mining techniques. While traditional privacy techniques may feel familiar and “good enough”, they often de-value data and fail to protect privacy. The de-anonymisation of Netflix’s data is a case in point.
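To see why 'familiar' anonymisation can fall short, the sketch below computes the k-anonymity of a toy, name-stripped table: if any combination of quasi-identifiers (age band, postcode, gender) appears only once, that row can be re-identified by linkage. The data and column names are invented for illustration.

```python
import pandas as pd

# An invented "de-identified" release: names stripped, quasi-identifiers kept.
df = pd.DataFrame({
    "age_band":  ["30-39", "30-39", "30-39", "70-79", "70-79"],
    "postcode":  ["2000", "2000", "2000", "2612", "2612"],
    "gender":    ["F", "F", "F", "M", "F"],
    "diagnosis": ["flu", "asthma", "flu", "cancer", "diabetes"],
})

def k_anonymity(df, quasi_identifiers):
    """Size of the smallest group sharing a quasi-identifier combination."""
    return int(df.groupby(quasi_identifiers).size().min())

k = k_anonymity(df, ["age_band", "postcode", "gender"])
print(f"k-anonymity of this release: k = {k}")
# k = 1 here: the lone 70-79 / 2612 / M row is unique, so linking it with an
# external dataset (an electoral roll, say) re-identifies that patient and
# exposes their diagnosis, even though no name was ever released.
```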

Emerging PETs, on the other hand, are a group of technologies that provide novel solutions to privacy challenges in modern data-driven systems. They achieve similar privacy goals but enable deriving higher value from data. Primarily this category refers to five technologies: homomorphic encryption (HE), trusted execution environments, secure multi-party computation, differential privacy, and federated data processing.

Homomorphic encryption helps organisations securely and privately share data across jurisdictions or internal/external data silos by allowing operations (searches or analytics) to be performed directly on encrypted data, without ever exposing the underlying raw data.

Entities can securely collaborate in a decentralised/distributed manner without replicating or moving data between jurisdictions, all while prioritising data privacy. This saves significant time and resources and reduces operational risk relating to the possible mishandling of sensitive or regulated data.
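A minimal sketch of the idea, using a toy Paillier cryptosystem (an additively homomorphic scheme) with deliberately tiny, insecure parameters; the hospital scenario is invented, and a real deployment would use a vetted HE library with large keys.

```python
import math
import random

# Toy Paillier cryptosystem: additively homomorphic encryption.
# The tiny primes below are wildly insecure and for illustration only.
p, q = 251, 269
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                      # valid because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    x = pow(c, lam, n_sq)
    return ((x - 1) // n) * mu % n

# Two hospitals each encrypt a local patient count and share only ciphertexts.
c1, c2 = encrypt(1234), encrypt(5678)

# Anyone can add the counts under encryption by multiplying the ciphertexts,
# without ever seeing either hospital's raw number.
c_total = (c1 * c2) % n_sq
print(decrypt(c_total))                   # 6912, the combined count
```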

WHAT IS DIFFERENTIAL PRIVACY?

Differential privacy is an emerging PET that is gaining momentum. It allows statistical analysis without compromising data privacy.

Google has long been advocating differential privacy as a tool to provide its developers and customers with greater access to data and insights while keeping people's personal information private and secure (its core library is now open sourced, as are TensorFlow Privacy, TensorFlow Federated, and Private Join and Compute). Differential privacy was deployed in the Chrome browser seven years ago, and has since been used in Google Maps, Assistant and Google Play.

Differential privacy's premise is that you can query a database while making mathematical guarantees about the privacy of the individuals whose data it contains. This is achieved by randomising (adding 'noise' to) part of the algorithm's behaviour.

The rationale for introducing randomness into an ML algorithm is to make it hard to tell which aspects of the model's behaviour (defined by the learned parameters) came from the randomness and which came from the original training data.

Differential privacy requires that the probability of learning any particular set of parameters from a given dataset stays roughly the same if we change a single training example, i.e. adding, removing or changing the values within one training example should not noticeably change the distribution of the algorithm's outputs. Intuitively, if the output of a function barely changes when a record is removed from the dataset, that output is not conditional on that record's private data, and privacy is respected.

using differential privacy to protect health data
Source: https://www.winton.com/research/using-differential-privacy-to-protect-personal-data

Differential privacy is achieved when an adversary cannot distinguish the answers the randomised algorithm produces on the original dataset from the answers it produces on a neighbouring dataset that differs in a single record.
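A small sketch of that guarantee, assuming a simple counting query answered with the Laplace mechanism (noise scaled to the query's sensitivity divided by the privacy parameter epsilon); the patient records are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(records, predicate, epsilon):
    """Counting query answered with the Laplace mechanism.

    A count changes by at most 1 when one record is added or removed
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy.
    """
    true_count = sum(predicate(r) for r in records)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Invented records: (smoker?, has heart disease?) per patient.
dataset = ([(True, True)] * 40 + [(True, False)] * 60 +
           [(False, True)] * 20 + [(False, False)] * 80)
neighbour = dataset[1:]          # same dataset with one patient removed

has_disease = lambda record: record[1]
epsilon = 1.0

# Repeated noisy answers on the two neighbouring datasets are statistically
# hard to tell apart, so the removed patient's data stays protected.
print([round(dp_count(dataset, has_disease, epsilon), 1) for _ in range(5)])
print([round(dp_count(neighbour, has_disease, epsilon), 1) for _ in range(5)])
```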

The basic tradeoff in differential privacy is this: how much noise/randomness can be added to the computation to give the greatest amount of privacy while still delivering accurate results?
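The sketch below makes that tradeoff visible for the same kind of Laplace-noised count: smaller values of the privacy parameter epsilon (stronger privacy) mean larger noise and a larger average error. The numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
true_count = 60                  # e.g. number of patients with heart disease

# For a count query (sensitivity 1), the Laplace mechanism uses noise of
# scale 1/epsilon, so its expected absolute error is 1/epsilon.
for epsilon in (0.01, 0.1, 1.0, 10.0):
    noisy = true_count + rng.laplace(scale=1.0 / epsilon, size=10_000)
    error = np.abs(noisy - true_count).mean()
    print(f"epsilon={epsilon:>5}: mean absolute error = {error:8.2f}")
```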

The question is whether we can still answer important questions while using AI privacy techniques. In other words, can private AI models still be accurate, effective and useful?

Differential privacy has now been endorsed by experts as a regulariser capable of addressing some of the problems commonly encountered by ML practitioners, even in settings where privacy is not a requirement, such as reducing overfitting so that models generalise well. After all, ML is all about searching for latent patterns or trends that occur across multiple records in a dataset, rather than identifying one-off occurrences. Thus differential privacy is said to be compatible with ML's general approach to unearthing insights in data, and useful for improving generalisation as well as protecting privacy.
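Inside model training, this usually takes the form of DP-SGD: clip each example's gradient so no single record can dominate, then add noise before the weight update. Below is a stripped-down sketch of that idea for logistic regression on invented data, with no privacy accounting; production work would use a library such as TensorFlow Privacy or Opacus.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: 5 features, binary label.
X = rng.normal(size=(500, 5))
y = (X @ np.array([1.0, -0.5, 0.3, 0.0, 0.0])
     + rng.normal(scale=0.3, size=500) > 0).astype(float)

w = np.zeros(5)
clip_norm, noise_multiplier, lr, batch_size = 1.0, 1.1, 0.1, 50

def per_example_grads(w, Xb, yb):
    """Gradient of the logistic loss for each example in the batch."""
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    return (p - yb)[:, None] * Xb

for _ in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    grads = per_example_grads(w, X[idx], y[idx])

    # 1. Clip each example's gradient to bound any single patient's influence.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)

    # 2. Add Gaussian noise calibrated to the clipping norm, then average.
    noise = rng.normal(scale=noise_multiplier * clip_norm, size=w.shape)
    w -= lr * (grads.sum(axis=0) + noise) / batch_size

preds = (1.0 / (1.0 + np.exp(-(X @ w))) > 0.5).astype(float)
print("training accuracy with DP-SGD:", round(float((preds == y).mean()), 3))
```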

THE FUTURE OF AI PRIVACY

Privacy today is big business: Crunchbase estimates that the privacy tech sector amassed over US$10 billion in investment in 2019 alone. The global market for federated learning (another type of PET) is projected to grow to US$201 million by 2028 (at a CAGR of 11.4%), driven by the potential to leverage a shared ML model collaboratively while keeping data on devices, and the ability to enable predictive features on smart devices without impacting user experience or leaking private information.
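As a rough sketch of how federated learning keeps data on devices, the snippet below simulates federated averaging (FedAvg) across three clients: each trains locally on its own invented data and only the model weights are shared and averaged.

```python
import numpy as np

rng = np.random.default_rng(7)
true_w = np.array([2.0, -1.0, 0.5])

# Three clients (e.g. hospitals or phones), each holding private local data.
clients = []
for _ in range(3):
    X = rng.normal(size=(100, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

def local_update(w, X, y, lr=0.05, epochs=5):
    """A few steps of gradient descent on one client's own data."""
    for _ in range(epochs):
        w = w - lr * (2.0 / len(y)) * X.T @ (X @ w - y)
    return w

global_w = np.zeros(3)
for _ in range(20):
    # Each client trains locally; only the updated weights leave the device.
    local_ws = [local_update(global_w.copy(), X, y) for X, y in clients]
    # The server averages the client models (FedAvg); it never sees raw data.
    global_w = np.mean(local_ws, axis=0)

print("learned weights:", np.round(global_w, 2))   # close to [2.0, -1.0, 0.5]
```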

Concepts such as ‘privacy-first’, ‘digital trust’ and ‘responsible AI’ are becoming mainstream across all industries, driven by increased regulation and compliance imperatives (GDPR and other similar regulations around the world); increased consumer awareness and demand for better privacy practices; and privacy by design becoming an integral part of product development best practices.

As privacy enhancing technologies become more prevalent, mature and integrated in good data engineering practices, they can help realise AI’s potential to tackle some of humanity’s greatest problems and be used in real world scenarios (the US Census Bureau used differential privacy in its 2020 census). However, PETs are not a silver bullet and should be used alongside data protection best practices, adequate data security and robust data governance.

Many important questions, in healthcare and beyond, require troves of personal data to be answered. No doubt, privacy technologies will have a key role to play in future innovation. It is important that we continue to encourage collaboration between data custodians, governments and the AI community, and build secure and robust data privacy infrastructure, so that information on the world’s biggest problems becomes more accessible.

________________________________________________________________

For more technical details on the mechanics of differential privacy:


Rachel Dulberg · CodeX
Privacy, data + product nerd. Former tech lawyer + founder. I write about issues at the convergence of innovation, technology, product & privacy.