Will GDPR hamper the ability to collect training data for machine learning algorithms?

Published in Revain · 4 min read · Sep 29, 2018

By Olga Grinina

The EU’s new data privacy rules, the General Data Protection Regulation (GDPR), will have a negative impact on the development and use of artificial intelligence (AI) in Europe, putting EU firms at a competitive disadvantage compared with their competitors in North America and Asia. The GDPR’s AI-limiting provisions do little to protect consumers, and may, in some cases, even harm them. The EU should reform the GDPR so that these rules do not tie down its digital economy in the coming years.

Much has been made about the potential impact of the EU’s General Data Protection Regulation (GDPR) on data science programs. But there’s perhaps no more important — or uncertain — question than how the regulation will impact machine learning, in particular. Given the recent advancements in machine learning, and given increasing investments in the field by global organizations, machine learning is fast becoming the future of enterprise data science.

Does this mean that machine learning will face insurmountable obstacles? Not really. It certainly will not be prohibited in the EU once the GDPR takes effect. It will, however, carry a significant compliance burden, which I'll address shortly. At first blush, though, the answer misleadingly appears to be yes. As a matter of law, the GDPR does contain a blanket prohibition on automated decision-making whenever that decision-making occurs without human intervention and produces significant effects on data subjects. Importantly, the GDPR applies to any use of EU data that could potentially identify a data subject, which, in a data science program working with large volumes of data, means the regulation will apply to almost all activities (study after study has shown that individuals can be identified given enough data).
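The re-identification point deserves a concrete illustration. A classic demonstration is that supposedly anonymized records can often be re-linked to named individuals by joining on quasi-identifiers such as ZIP code, birth date, and sex. The sketch below is purely illustrative; all data, field names, and the `reidentify` helper are invented for this example.

```python
# Illustrative sketch: "anonymized" records can still identify people when
# quasi-identifiers (ZIP, date of birth, sex) are joined against a public
# dataset. All records below are invented.

anonymized_health = [
    {"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "omitted"},
]
public_voter_roll = [
    {"name": "Jane Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"},
]

def reidentify(anon_rows, public_rows):
    """Link anonymized rows to named public rows via shared quasi-identifiers."""
    keys = ("zip", "dob", "sex")
    matches = []
    for a in anon_rows:
        for p in public_rows:
            if all(a[k] == p[k] for k in keys):
                matches.append({"name": p["name"], **a})
    return matches

print(reidentify(anonymized_health, public_voter_roll))
```

With enough columns, the combination of quasi-identifiers becomes unique for most of the population, which is why the GDPR treats such data as personal data even without names attached.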

While a substantial number of AI's uses do not involve personal data, the many that do will be subject to the GDPR. Consumers who routinely interact with AI-enabled services, such as personal assistants that respond to spoken queries, robo-advisors that provide automated financial advice, and movie recommendations on streaming services, will be significantly affected, as will virtually every European company that processes personal data (payroll, for instance) and could use AI to make its operations more efficient. By both indirectly limiting how the personal data of Europeans gets used and raising the legal risks for companies active in AI, the GDPR will negatively impact the development and use of AI by European companies. Different jurisdictions have different goals when it comes to privacy, but policymakers and citizens in the EU should understand that the GDPR will come at a significant cost in terms of innovation and productivity. At a time when two major world powers, the United States and China, are vying for global leadership in AI, EU policymakers need to recognize that a failure to amend the GDPR to reduce its impact on AI will all but consign Europe to second-tier status in the emerging AI economy.

The new rules will mostly affect innovative projects whose algorithms leverage AI and feed machine learning models with personal user data. Online reviews are a good example. Imagine an emerging start-up building an AI engine that analyzes user reviews by collecting existing Yelp scores and comments, all of which fall squarely under the definition of 'personal data'. We asked a Revain representative whether they foresee potential obstacles: 'No doubt the GDPR is another challenge to consider; however, we are ready to allocate resources for legal compliance'. The question, however, is whether early-stage startups really have the budget to cover such substantial legal costs. Most likely not.

When the GDPR uses the term "automated decision-making," the regulation is referring to any model that makes a decision without a human being directly involved. This could include anything from the automated "profiling" of a data subject, such as bucketing them into groups like "potential customer" or "40-50 year old males," to automatically determining whether an applicant is eligible for a loan. As a result, one of the first major distinctions the GDPR draws about machine learning models is whether they are deployed autonomously, without a human directly in the decision-making loop. If the answer is yes, as in practice it will be for a huge number of machine learning models, then that use is likely prohibited by default.
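To make the distinction concrete, here is a minimal sketch of what "automated decision-making" looks like in code: a function that both profiles a data subject and produces a decision with a significant effect (loan denial), with no human review anywhere in the loop. The function name, segments, and thresholds are all hypothetical, chosen only to mirror the examples in the text.

```python
# Hypothetical sketch of an autonomous decision in the GDPR sense:
# the model both profiles the subject and produces a significant
# effect (loan eligibility) with no human in the loop.

def profile_applicant(age: int, income: float, credit_score: int) -> dict:
    """Bucket a data subject into a segment and decide loan eligibility."""
    # Profiling: assigning the subject to a group, e.g. an age bracket.
    segment = "40-50" if 40 <= age <= 50 else "other"
    # Decision with a "significant effect": approving or denying a loan.
    # Thresholds here are invented for illustration.
    eligible = credit_score >= 650 and income >= 30_000
    return {"segment": segment, "eligible": eligible}

print(profile_applicant(age=45, income=42_000.0, credit_score=700))
print(profile_applicant(age=25, income=10_000.0, credit_score=500))
```

If a loan officer reviewed each output before it took effect, the same model would no longer be "automated decision-making" in the GDPR's sense; the legal status turns on the deployment, not on the model itself.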

So why is interpreting the GDPR as placing a ban on machine learning so misleading? Because there are significant exceptions to the prohibition on the autonomous use of machine learning — meaning that “prohibition” is way too strong of a word. Once the GDPR goes into effect, data scientists should expect most applications of machine learning to be achievable — just with a compliance burden they won’t be able to ignore.

Do data subjects have the ability to demand that models be retrained without their data? This is perhaps one of the most difficult questions to answer about the impact of GDPR on machine learning. Put another way: if a data scientist uses a data subject’s data to train a model, and then deploys that model against new data, does the data subject have any right over the model that their data helped to originally train? At this point, however, at least one thing is clear: thanks to the GDPR, lawyers and privacy engineers are going to be a central component of large-scale data science programs in the future.
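Whatever the legal answer turns out to be, the only uncontroversial way to honor a "remove my data from the model" request today is to drop the subject's rows and retrain from scratch. The sketch below shows that workflow with a toy dataset and a toy training function (the model is just a feature mean); all names and data are hypothetical.

```python
# Hedged sketch of honoring an erasure request against a trained model:
# drop every row belonging to the subject, then retrain. Dataset, field
# names, and the toy train function are invented for illustration.

def erase_and_retrain(rows, subject_id, train_fn):
    """Remove all rows belonging to subject_id, then retrain the model."""
    retained = [r for r in rows if r["subject_id"] != subject_id]
    return train_fn(retained), retained

rows = [
    {"subject_id": "a", "x": 1.0},
    {"subject_id": "b", "x": 3.0},
    {"subject_id": "a", "x": 5.0},
]

# Toy "training": the model is simply the mean of feature x.
def train(data):
    return sum(r["x"] for r in data) / len(data)

model, retained = erase_and_retrain(rows, "a", train)
print(model)          # 3.0 (only subject "b" remains)
print(len(retained))  # 1
```

Full retraining is expensive at scale, which is exactly why the question of whether data subjects can demand it is so consequential for large machine learning programs.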
