Applications of artificial intelligence often depend on vast amounts of data, which creates obvious conflicts with the GDPR. This article outlines the background and possible solutions.
Machine Learning as a Subcategory of Artificial Intelligence
Artificial intelligence (AI) cannot be defined precisely, since the term is largely a marketing buzzword, similar to Cloud or Industry 4.0. In the abstract, and without being too concrete, AI refers to applications that perceive, understand, act and learn. In the 1980s, Elaine Rich described the research field of artificial intelligence as “the study of how to make computers do things at which, at the moment, people are better”. This description shows that a wide range of techniques can be subsumed under AI, all of which share a similar purpose.
One subarea of AI that is particularly interesting from a data privacy point of view is machine learning and, within it, neural networks.
Models as Aggregated Knowledge Storage
Machine learning is based on data, so-called training data, which is used to train an algorithm. This training data contains, for example, orchestra photos together with the classification “cellist”, which marks the photos in which a cellist can be seen. The knowledge acquired from such pattern matching is stored as a model. Contrary to popular belief, a model is simply a mathematical structure filled with numerical values; in the case of a neural network, for example, it consists of several layers of points and weighted connections between those points. On the basis of such a model, in the example mentioned above, a musician can be identified as a cellist in a previously unseen photo of an orchestra.
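To make the idea of a model as a structure of numerical values more tangible, the following minimal sketch shows a tiny neural network with one layer of two points. All weights, the two input features and the function name predict_cellist_score are hypothetical and chosen by hand for illustration; a real model would learn a far larger set of such numbers from the training data.

```python
import math

# Hypothetical, hand-picked numbers for illustration only. In practice these
# values are the result of training and there are usually millions of them.
HIDDEN_WEIGHTS = [[0.8, -0.4], [0.3, 0.9]]  # weighted connections: 2 inputs -> 2 points
HIDDEN_BIASES = [0.1, -0.2]
OUTPUT_WEIGHTS = [1.2, -0.7]                # weighted connections: 2 points -> 1 output
OUTPUT_BIAS = 0.05


def sigmoid(x):
    """Squash any number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_cellist_score(features):
    """Return a score between 0 and 1 for the classification "cellist".

    `features` stands for two numbers extracted from a photo (an assumption
    made to keep the sketch small; real inputs would be pixel values).
    """
    hidden = [
        sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
        for weights, bias in zip(HIDDEN_WEIGHTS, HIDDEN_BIASES)
    ]
    output = sum(w * h for w, h in zip(OUTPUT_WEIGHTS, hidden)) + OUTPUT_BIAS
    return sigmoid(output)


score = predict_cellist_score([0.9, 0.2])
print(f"cellist score: {score:.3f}")
```

The model here is nothing more than the four constants at the top; the prediction is plain arithmetic on those numbers.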
Legal Basis for the Training
The obvious issue with this example is the training of a model on the training data. Among the innumerable data protection questions, only one is addressed here, namely the legal basis. The legal basis for the training will typically not lie in the performance of a contract (Art. 6 (1) 1 lit. b GDPR), and consent (Art. 6 (1) 1 lit. a GDPR) is usually ruled out in practice for lack of acceptance among data subjects. Essentially, what remains is the weighing of interests under Art. 6 (1) 1 lit. f GDPR, in connection with the privilege for statistical purposes in Art. 89 (1) GDPR and Section 27 (1) of the German Federal Data Protection Act (BDSG). The weighing of interests depends on the individual case and takes into account, for example, the criticality of the data concerned. Where the training data originates from different data sources (e.g. smartphones), the argumentation requires special attention, because the training data would first have to be collected centrally by the provider before being used for the training process. In practice, three approaches have emerged:
1. Anonymous Data
The first solution would be to anonymize the training data before training. Although the GDPR no longer applies once the data has been sufficiently anonymized, this solution is generally not suitable, because it often requires substantial intervention in the training data and therefore usually destroys valuable information. This is in particular the case because data is personal not only if it contains direct links to natural persons, but also if inferences can be made about natural persons. The amount of data that has to be removed during anonymization can therefore be enormous, which limits the effectiveness of the trained models.
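The trade-off between inference risk and information loss can be illustrated with a small sketch. It uses k-anonymity, a technique not mentioned above and chosen here only as an illustration; passing such a check does not by itself mean that data is anonymous under the GDPR. The records, the field names and the helper functions smallest_group and generalize are hypothetical.

```python
from collections import Counter

# Hypothetical records with names already removed.
records = [
    {"postcode": "10115", "birth_year": 1984, "instrument": "cello"},
    {"postcode": "10115", "birth_year": 1984, "instrument": "viola"},
    {"postcode": "10117", "birth_year": 1990, "instrument": "cello"},
    {"postcode": "10119", "birth_year": 1991, "instrument": "cello"},
]

# Fields that could be combined with outside knowledge to single out a person.
QUASI_IDENTIFIERS = ("postcode", "birth_year")


def smallest_group(rows, keys):
    """Size of the smallest group of rows sharing the same key values (the k)."""
    groups = Counter(tuple(row[key] for key in keys) for row in rows)
    return min(groups.values())


def generalize(row):
    """Coarsen postcode and birth year, which deliberately discards detail."""
    return {
        "postcode": row["postcode"][:3] + "**",
        "birth_year": row["birth_year"] // 10 * 10,
        "instrument": row["instrument"],
    }


k_before = smallest_group(records, QUASI_IDENTIFIERS)
k_after = smallest_group([generalize(r) for r in records], QUASI_IDENTIFIERS)
print(k_before, k_after)
```

Before generalization, at least one record is unique in its combination of postcode and birth year even though no name is stored. After generalization, every record shares its combination with at least one other, but the exact postcode and birth year are no longer available for training.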
2. Synthetic Data
Another approach is to generate data artificially and to train with this synthetic data, so that the GDPR does not apply to the training at all. Here too, a practical problem arises: the statistical properties of the synthetic data must sufficiently resemble those of the original data, otherwise the model would not be able to make precise predictions and might, in the example above, recognize violists as cellists. To minimize this information deficit, the original data would have to be used to generate the synthetic data, but this only shifts the problem.
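The dependency on the original data can be seen in a minimal sketch. It assumes a deliberately simple generator, a normal distribution fitted to a single numeric feature; the measurements and the function names fit_gaussian and sample_synthetic are hypothetical, and real synthetic data generators are considerably more complex.

```python
import random
import statistics

# Hypothetical original measurements (e.g. instrument body lengths in cm).
original_lengths_cm = [75.0, 76.5, 74.0, 75.5, 76.0]


def fit_gaussian(values):
    """Derive the generator's parameters. This step processes the original data."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(mean, stdev, count, seed=0):
    """Draw artificial values that follow the fitted distribution."""
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(count)]


mean, stdev = fit_gaussian(original_lengths_cm)
synthetic_lengths_cm = sample_synthetic(mean, stdev, count=1000)
print(f"original mean:  {mean:.2f}")
print(f"synthetic mean: {statistics.mean(synthetic_lengths_cm):.2f}")
```

The synthetic values resemble the originals only because fit_gaussian was run on the original data first, which is exactly the processing step that still needs a legal basis.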
3. Federated Learning
Federated (Machine) Learning has emerged as another approach to solving the privacy issue. In this case, the data, which is stored in different places, is not brought to a centrally running algorithm; instead, the algorithm is brought to the data. The local computing environment of each data source (e.g. a smartphone) trains with its own training data to compute a so-called local model based on a global model. After the training, the local model is uploaded to the provider, who aggregates it with the local models from various other data sources and generates a new global model from those local models. The global model is in turn sent to the local environments of the data sources, which use it as the basis for the subsequent training. This cycle is repeated until the global model has reached a sufficiently good quality. The aggregation can be arranged with cryptographic techniques in such a way that the aggregating party does not learn anything about the values of the individual local models but can nevertheless aggregate them correctly. In addition, the aggregation can be designed to ensure that the global model itself no longer contains personal data. As a result, the aggregating party does not come into contact with personal data at any point.
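The cycle described above can be sketched in a few lines. The sketch makes several simplifying assumptions: the model is a single weight w in y = w * x, every data source performs one gradient step per round, and the cryptographic aggregation is imitated with pairwise random masks that cancel in the sum. All names and data are hypothetical. In a real protocol the masks would be agreed between the devices themselves and computed with modular integer arithmetic; they are generated in one place here only to keep the simulation short.

```python
import random


def local_update(global_weight, local_data, learning_rate=0.1):
    """Train on one data source: a single gradient step for y = w * x."""
    gradient = sum(
        2 * (global_weight * x - y) * x for x, y in local_data
    ) / len(local_data)
    return global_weight - learning_rate * gradient


def pairwise_masks(num_clients, seed=0):
    """For each pair (i, j), client i adds a random mask and client j subtracts it."""
    rng = random.Random(seed)
    masks = [0.0] * num_clients
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            mask = rng.uniform(-100.0, 100.0)
            masks[i] += mask
            masks[j] -= mask
    return masks


def federated_round(global_weight, clients, masks):
    """One cycle: local training, masked upload, central aggregation."""
    # Each data source uploads only its masked local model, never its data.
    masked_uploads = [
        local_update(global_weight, data) + mask
        for data, mask in zip(clients, masks)
    ]
    # The masks cancel in the sum, so the average equals the true average
    # even though each individual upload looks random to the provider.
    return sum(masked_uploads) / len(masked_uploads)


# Hypothetical local training data of three data sources, roughly y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(1.5, 3.0), (3.0, 5.9)],
    [(0.5, 1.1), (2.5, 5.0)],
]

global_weight = 0.0
for round_number in range(50):
    masks = pairwise_masks(len(clients), seed=round_number)
    global_weight = federated_round(global_weight, clients, masks)

print(f"global weight after training: {global_weight:.3f}")
```

The provider in this sketch only ever handles masked_uploads and their average; the (x, y) pairs stay inside local_update.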
The advantage of Federated Learning is obvious: personal training data never leaves the local data source. Only models, which typically qualify as anonymous under the GDPR, are shared. As part of the weighing of interests under Art. 6 (1) 1 lit. f GDPR, this approach can support the controller’s argument for a legitimate interest, provided that the controller can ensure, by applying appropriate state-of-the-art measures, that access to local models by third parties is practically impossible.
Summary and Conclusion
Machine learning depends on a wide variety of training data in order to train models that make good predictions for previously unseen input data. Besides the anonymization of training data and the synthetic generation of training data, Federated Learning appears to be a promising way to create AI models based on personal data in accordance with data privacy law.
As with other AI procedures, machine learning should only be used if the right course has been set for the training with regard to data privacy. If utmost care is not applied at this stage, models risk becoming contaminated with personal data and thus falling within the scope of the GDPR. Such models may then no longer be usable, e.g. because no legal basis is available or because data cannot be deleted from the model although its removal would be necessary to comply with the right to be forgotten.