GDPR does not only impact the work of data teams — a data scientist perspective

Maxime Huvet
dunnhumby Science blog
Aug 29, 2019

For more than a year, the General Data Protection Regulation (GDPR) has been part of European law and has been offering a new range of protections and rights to consumers. As with all companies hosting and using consumers’ data, dunnhumby has invested a lot of time and effort in preparing for the GDPR and, since it went live, to ensure continued compliance with the new regulation.

Data scientists may assume that the GDPR will not drastically impact their work, as long as they: (a) work on projects covered by the consent agreement; (b) use the most up-to-date version of the data; and (c) follow governmental and internal policies. However, these steps, while necessary, may not cover all the consequences of the GDPR. To illustrate this point, I will focus on how the rights to rectification and erasure (Articles 16 and 17) could impact model maintenance and reproducibility.

I hope this article will motivate data scientists to think more about the implications of the GDPR on their work and use this to inform their practices. Now, let’s get into it.

A new data life cycle

With the establishment of the GDPR, European consumers now have real leverage to force the rectification or erasure of their personal data. It is still too soon to assess how these new rights will be used, but personal data are now much more likely to go through a dynamic life cycle (including the potential turnover or “loss” of a substantial percentage of consumers’ data). In this article, I will refer to a dataset with a significant number of rectification or erasure requests as a “dynamic” dataset. This new data life cycle has the potential to impact different aspects of a data scientist’s work, such as model maintenance and reproducibility.

Side note on data anonymization

In an ideal world, data used by data scientists would be anonymized and so the GDPR would not apply. However, there are currently no guidelines describing how to generate an anonymized dataset. It is likely that such guidelines will never exist, as work done on re-identification and k-anonymity suggests that anonymizing data while maintaining its intrinsic properties may be impossible (see [1] & [2]). Therefore, anonymization will not be considered here as a viable solution.

Model maintenance

Over its life span, a trained model may lose some of its predictive power. To solve this issue, it is crucial to perform the necessary maintenance. A “simple” solution, even if potentially computationally heavy, is to retrain the model using the latest version of the data. For growing datasets, retraining a model from scratch is possible without information loss. However, with a highly dynamic dataset, some of the original training data may no longer be available (information loss), which could impact the power of the retrained model.

If the aim is only to train a model on the latest dataset, then retraining is not an issue (assuming that you still have enough data). However, it can be problematic if a trained model contains valuable information that would be lost by retraining it. It is not all doom and gloom, as some solutions already exist to avoid this information loss; however, they require planning, as they may not be implementable a posteriori. Here are two potential solutions:

1. Incremental training

Training on “dynamic” datasets shares properties with training on streaming and, to an extent, “big” data. In these scenarios, the data cannot be loaded or accessed all at once. One of the solutions developed to cater for this is incremental learning, which allows models to be trained using blocks of data. Using an incremental learning approach on a GDPR-regulated dataset would allow the trained model to be updated using only the data acquired since the last training cycle, without losing the “knowledge” already embedded in the model.
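As a minimal sketch of this idea, scikit-learn exposes incremental learning through the `partial_fit` method on estimators such as `SGDClassifier`. The synthetic data below is a stand-in for the blocks acquired between training cycles; the model accumulates knowledge across cycles without ever re-reading earlier blocks, which may no longer exist.

```python
# Sketch: incremental training with scikit-learn's partial_fit.
# Each loop iteration represents one training cycle, using only the
# data acquired since the previous cycle. Earlier blocks can be erased
# without losing the knowledge already embedded in the model.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared for partial_fit

for cycle in range(3):  # three training cycles over time
    # stand-in for the block of data collected since the previous cycle
    X_new = rng.normal(size=(200, 5))
    y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
    model.partial_fit(X_new, y_new, classes=classes)

# The fitted model can score fresh data without touching old blocks.
X_test = rng.normal(size=(10, 5))
print(model.predict(X_test))
```

Not every estimator supports `partial_fit`, so choosing an incremental-capable model family is part of the a-priori planning mentioned above.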

2. Ensemble modelling

In this approach, different models are used to generate predictions. The different predictions are aggregated into a final prediction using ensembling methods (e.g. averaging, majority vote, bagging, and stacking). With this approach, different models could be trained over time using the data available at the time. All trained models can then be used to generate predictions, which are combined by the ensembling method into a final prediction.

It is important to highlight that if a trained model allows consumers to be re-identified, it must be cleansed of any data belonging to consumers who requested erasure, or it will be in breach of the GDPR. This implies that the methods suggested above would not be usable for these types of models.

Reproducibility

Not being able to reproduce, or even repeat, past results may cast doubt on the quality of a piece of work. Under the new regulation, a dataset used commercially cannot be kept as a static copy that ignores user requests. A dynamic data life cycle could make the recreation of a historical dataset difficult, or even impossible, and with it repeatability and reproducibility. So, how can this be solved?

1. Pre-evaluate the impact of data variability

In this framework, predictions would be generated with different subsets of the data and compared to predictions made with the full training dataset. The comparison should provide a view of the robustness of the trained model. This is very similar to cross-validation and can be seen as an extension or variation of it. As long as the variation of the live data is monitored, it will be possible to identify when the data has diverged too much to allow reproducibility. Technically speaking, this does not solve the problem at hand, but it should trigger discussions with the different stakeholders, reducing potential issues in the long run.
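One way this pre-evaluation could look in practice is the following sketch (synthetic data, illustrative erasure rate): retrain on random subsets that simulate consumers dropping out, and measure how often the subset models disagree with the full-data model. A high disagreement rate flags a model whose results will be hard to reproduce once erasure requests accumulate.

```python
# Sketch: pre-evaluating robustness to data loss. We simulate erasure
# requests by dropping a random 20% of rows, retrain, and measure how
# often predictions diverge from the full-data model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

full_model = LogisticRegression().fit(X, y)
X_eval = rng.normal(size=(200, 5))
full_preds = full_model.predict(X_eval)

disagreements = []
for _ in range(10):
    keep = rng.random(len(X)) > 0.2  # ~20% of consumers "erased"
    sub_model = LogisticRegression().fit(X[keep], y[keep])
    disagreements.append(np.mean(sub_model.predict(X_eval) != full_preds))

print(f"mean prediction disagreement: {np.mean(disagreements):.1%}")
```

The 20% erasure rate here is an arbitrary assumption; in practice it should reflect the rectification/erasure rates actually observed on the live data.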

2. Simulate data

Always use at least one set of simulated data as part of the evaluation process. As simulated data does not contain personal information, it can be archived or regenerated on demand. This provides a set of data for which results can always be reproduced. Unfortunately, it is not always possible to generate data mimicking the characteristics of the real data, potentially limiting the value of this approach.
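As a small sketch of this idea, a synthetic dataset generated from a fixed random seed (here via scikit-learn's `make_classification`) contains no personal data and can be regenerated bit-for-bit on demand, giving a permanent reference point for past results.

```python
# Sketch: a seed-pinned simulated dataset for reproducible evaluation.
# No personal data is involved, so it can be archived or regenerated
# on demand and used to reproduce historical results indefinitely.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def reference_dataset(seed=42):
    """Deterministic synthetic data: same seed, same dataset, forever."""
    return make_classification(n_samples=500, n_features=8,
                               n_informative=4, random_state=seed)

X, y = reference_dataset()
model = LogisticRegression(max_iter=1000).fit(X, y)
acc = accuracy_score(y, model.predict(X))
print(f"reference accuracy: {acc:.3f}")

# Regenerating yields identical data, so past results can be re-checked.
X2, y2 = reference_dataset()
assert (X == X2).all() and (y == y2).all()
```

Pinning the seed (and the library versions) is what makes the dataset a durable reference; the specific generator and parameters above are illustrative.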

Summary

If your dataset is likely to have a “dynamic” life cycle, anticipating its potential impacts on your models and solutions is crucial. Choosing a given modelling framework could impair your ability to perform model maintenance. Not planning for reproducibility a priori may leave you in a situation where a stakeholder challenges your work but you have no way to reproduce past results, as the necessary data is no longer available.

The potential solutions I highlighted are not new or revolutionary, and should not be used blindly. However, as many of these solutions must be implemented a priori, it is crucial to identify the risks and deal with them in due time.

I do believe that the GDPR is a positive move, as a European consumer and as a data scientist. The point I hope to convey is that the GDPR has the potential to impact the ways we perform our job as data scientists. We should not ignore it and hope for the best. It is our responsibility, as practitioners, to pro-actively engage with it. By identifying the potential consequences of the GDPR and tackling them, we (the data science community) can nurture the future of this field.

1: http://digital.law.washington.edu/dspace-law/bitstream/handle/1773.1/417/vol5_no1_art3.pdf

2: https://www.uclalawreview.org/pdf/57-6-3.pdf
