An introduction to XAIN’s GDPR-compliance Layer for Machine Learning

XAIN
Aug 23 · 7 min read

In 2017, XAIN was founded with the vision of building an infrastructure network that provides scalable machine learning. Since then, the company has carried out extensive research in the field of privacy-preserving AI and collaborated with enterprise partners on various pilot projects. Interdisciplinary, academic-grade research remains part of our DNA and has been our distinguishing feature from the start. We are now ready to speak openly about these developments and our technological advancements. This article introduces our work towards machine-learning technology that complies with regulations and enables enterprises to bring their AI applications to production grade.

In general, machine learning applies methods and techniques from the field of statistics in order to solve two related problems:

  • Model Training: Data is prepared and subjected to training via an algorithm that learns a mathematical model. This model represents patterns identified in the training data.
  • Model Inference: The learned model is applied to new input data in order to make inferences, based on the learned patterns, for decision support.
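
As a minimal illustration of these two steps, the following sketch trains a straight-line model on four made-up data points and then uses it for inference on a new input. The function names `train` and `infer`, the choice of a least-squares line, and the data are all assumptions made for this example; they are not part of any XAIN product.

```python
def train(xs, ys):
    """Model training: fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x):
    """Model inference: apply the learned pattern to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Illustrative training data that follows y = 2x + 1 exactly.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
prediction = infer(model, 5.0)
print(model, prediction)
```

The model here is just two numbers, the slope and the intercept. Real models are much larger, but they are likewise vectors of learned numbers.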

But what is Federated Machine Learning (FedML) and why is it important?

FedML is a form of distributed machine learning that was pioneered by people at Google in a use case for the autocompletion of search queries entered on handheld devices. Due to this use case, FedML is often thought to be about data from small devices. However, the technology we are building at XAIN expressly includes use cases that involve large datasets hosted on enterprise servers.

In a nutshell, FedML is a form of distributed learning involving a set of Clients and a Parameter Server. The latter coordinates the learning process: Clients compute updates of their local models by learning from their own local datasets and then communicate these updates to the Parameter Server. That server then aggregates these local updates into an updated global model. That global model is sent to all Clients who then start that same process again in another iteration until the learning terminates with a shared global model. You can visit our website for further information.
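
The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not XAIN's implementation: the model is a single weight `w` in `y = w * x`, each Client runs a few gradient-descent steps on its own data, and the Parameter Server aggregates with a plain unweighted mean. The names `local_update` and `aggregate`, the learning rate, and the datasets are all hypothetical.

```python
def local_update(global_w, dataset, lr=0.01, steps=5):
    """Client side: start from the global model and learn from local data only."""
    w = global_w
    for _ in range(steps):
        # Gradient of the mean squared error of y = w * x on the local dataset.
        grad = sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)
        w -= lr * grad
    return w


def aggregate(local_ws):
    """Parameter Server side: combine local updates (here: unweighted mean)."""
    return sum(local_ws) / len(local_ws)


# Three Clients; each dataset stays with its Client. All follow y = 3x.
client_datasets = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
    [(5.0, 15.0)],
]

global_w = 0.0
for _ in range(50):  # one iteration = local updates followed by aggregation
    local_ws = [local_update(global_w, data) for data in client_datasets]
    global_w = aggregate(local_ws)

print(global_w)  # approaches 3.0
```

Only the local weights travel to the server; the raw `(x, y)` pairs never leave the Client that owns them.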

Why does XAIN develop GDPR-compliant AI technology?

At XAIN, we recognize that FedML has huge potential as an enabler of AI and AI applications in industrial sectors, above and beyond data from small devices. Company data that could be used for training purposes often resides in data silos separated across physical boundaries (for example different IT systems) or political or legal boundaries (for example across departments, subsidiaries or consortia). These factors create real obstacles to combining such data for effective learning. Trade secrets and regulation, notably GDPR, are often key factors that prevent AI applications from going into production in industries.

To resolve these challenges, XAIN is building infrastructure solutions that are based on Federated Learning and built for enterprises. FedML allows these companies to get the best of both worlds: meeting regulatory demands such as those of GDPR while still learning valuable insights for decision support. We therefore believe that our infrastructure solutions build a much-needed bridge that enables enterprises to bring their AI applications into production. Continue reading to find out how FedML can reconcile the seemingly incompatible needs of data-privacy compliance and deployment of AI in enterprises.

What does XAIN do so that FedML enables AI adoption?

  • All Clients share the same model structure and learning algorithm. XAIN’s FedML API and its connection to local datasets is all that companies will need.
  • XAIN’s FedML supports automated machine learning, for example learning a deep-learning network topology with Efficient Neural Architecture Search.
  • The trained global model captures insights from local datasets at practically the same quality as learning over an aggregation of all datasets would.

The aggregation of all local datasets is precisely what is often not possible in practice. So if model quality is not impaired by FedML, where is the catch? Well, if FedML technology won’t interfere with regulatory constraints, then there is no catch!

What does XAIN do so that FedML meets regulatory demands?

We focus on the EU GDPR as a key regulation around data privacy and data protection. XAIN has partnered with a leading global law firm to ensure that the XAIN FedML technology is designed, implemented, and operated in a manner that is compliant with GDPR. We think of this as a co-evolution between the design of technology and the interpretation of law. Such co-evolution will ensure that FedML technology is production ready and will meet the scrutiny of IT security and data privacy departments of potential customers and regulators. We are very excited about this collaboration and will announce key results soon. Here, we offer a sneak preview of how FedML can meet GDPR compliance.

Preliminary findings:

Recital 26 of the GDPR states that anonymous information is not subject to GDPR. But what qualifies as “anonymous”? Put simply, any information that cannot be related to a data subject (an identified or identifiable natural person) is considered anonymous. How does one show that information “cannot be related” in this sense? The recital says that account should be taken of all the “means reasonably likely to be used” for making such a relation. Put simply, if we can show that such relations cannot be made with state-of-the-art tools and within reasonable costs and timelines, then the information is deemed to be anonymous.

Why bother with FedML? Why not simply anonymize and aggregate datasets?

This is a valid question. A range of anonymization techniques is available. But these transformations involve manual work, so they are costly and do not scale well. Moreover, GDPR sets a very high bar for showing that information is anonymous, and most anonymization techniques have weaknesses in that regard: they often cannot prevent a third party from deducing, with significant probability, the value of one attribute from the values of a set of other attributes. Also, for learning to be effective, anonymization of raw data cannot remove too much structural information. Yet companies may be reluctant to share anonymized datasets that still exhibit the structure of the raw data with other companies. These are only some of the issues with this approach to GDPR compliance. In our view, the anonymization of datasets therefore cannot effectively enable AI and its wide adoption in industry.

FedML, however, is able to do this. Datasets stay where they are and so do not have to be transformed in any special manner. Local models are updated based on learning from local datasets only. It is conservative to assume that local models are personal data, at least in many use cases. Communicating local models to a Parameter Server is then potentially problematic, since the Parameter Server has to process (that is, aggregate) local models, which are personal data, from all Clients into a global model.

So how can FedML meet GDPR if local models are personal data?

Fortunately, we can design the communication and aggregation protocols in such a manner that:

  1. Locally updated models are “private inputs” to the aggregation process.
  2. Updated global models pass legal tests for being anonymous information.

“Private inputs” is a technical term, not a legal one. It means that neither the other Clients nor the Parameter Server can learn the value of a given Client’s input, that is, her local model. The legal tests require some care in boundary cases. For example, when only two Clients learn together and the models are averaged without added noise or other protection measures, one Client can compute the other Client’s local model from her own local model and the global model.
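
The two-Client boundary case is easy to reproduce. In the sketch below, which assumes plain unprotected averaging and uses made-up model vectors and hypothetical names (`average`, `reconstruct_other`, `alice`, `bob`), one Client recovers the other’s local model exactly, because with two participants the global model satisfies `global = (own + other) / 2`.

```python
def average(models):
    """Unprotected aggregation: element-wise mean of the local models."""
    n = len(models)
    return [sum(values) / n for values in zip(*models)]


def reconstruct_other(own_model, global_model):
    """With exactly two Clients: other = 2 * global - own, element-wise."""
    return [2 * g - own for g, own in zip(global_model, own_model)]


alice = [0.5, -1.25, 2.0]   # Alice's local model (illustrative values)
bob = [1.5, 0.75, -4.0]     # Bob's local model, never sent to Alice

global_model = average([alice, bob])
recovered_bob = reconstruct_other(alice, global_model)

print(recovered_bob == bob)
```

With more participants, the same subtraction only yields the sum of all other local models, not any individual one. This is why the number of participants and any added protection measures matter for the legal test.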

One key advantage of FedML over anonymization of datasets is that the privacy-preserving aggregation of local models converts personal data (the local models) into anonymous information (the global model). Local and global models share the same structure, a vector of real numbers. But this structure is not what makes local models personal data; it is the concrete real numbers in their vectors that do this. And the vector of the global model is such that one cannot deduce the values of the vectors that were aggregated. Let us illustrate this with a simple analogy. Say you are 175cm tall and you are told that the average height of 10 people, including you, is 180cm. Then you know that at least one of the other nine people is taller than you. But you may be the shortest or the second tallest of the ten people!
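
The height analogy can be checked with concrete numbers. The two groups below are invented for illustration (the names `others_a`, `others_b` and `rank_from_tallest` are hypothetical). Both are consistent with what you know, namely your own height of 175cm and an average of 180cm over ten people, yet they place you at opposite ends of the ranking.

```python
def mean(values):
    return sum(values) / len(values)


def rank_from_tallest(height, group):
    """1 = tallest; counts how many people in the group are strictly taller."""
    return 1 + sum(1 for h in group if h > height)


my_height = 175

# Scenario A: all nine others are taller than you.
others_a = [176, 177, 178, 179, 180, 181, 182, 183, 189]
# Scenario B: eight others are shorter, one is an (implausibly tall) outlier.
others_b = [174] * 8 + [233]

group_a = [my_height] + others_a
group_b = [my_height] + others_b

print(mean(group_a), rank_from_tallest(my_height, group_a))
print(mean(group_b), rank_from_tallest(my_height, group_b))
```

Both groups have the same average, so the average alone cannot tell you which scenario you are in. In the same way, a global model aggregated from sufficiently many local models does not pin down any individual local model.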

Thus, we can argue that the global model does not fall under the scope of GDPR. Understanding the intellectual property rights pertaining to such global models is a separate issue though, and one that we are happy to discuss in a future article.

What use cases does XAIN apply this to?

In general, our AI technology supports any use case in which privacy-sensitive data is needed for AI training. We are working on the design of features and user experiences for this infrastructure-as-a-service to make it universally applicable for enterprises and app developers, but also for academia and AI startups.

As a first use case, we are currently deploying our AI application ANDY in a large German company. ANDY, an acronym for Anomaly Detection, supports accounting workflows around invoice processing in enterprises. It will be the first AI app based on our FedML technology. ANDY provides a solution for the accounting industry, which could reap tremendous benefits from AI-backed automation and, at the same time, holds some of the most sensitive data that firms own, namely supplier invoices. Backed by our technology, companies that use ANDY will profit from pre-trained yet privacy-proven AI models. ANDY comes as Software as a Service and has tremendous potential to become a standard AI product for accounts payable departments.

You can find out more about XAIN, ANDY, and our FedML work on our website.

XAIN: The eXpandable AI Network

XAIN, the eXpandable AI Network, is a Berlin-based technology company. We tackle the dilemma of unlocking the full power of AI without compromising data privacy and build solutions that bring enterprises to the forefront of AI utilization.
