GDPR: A Call to Remove Technical Debt from Data Science

For those of you here for the TLDR, here ‘tis:

  • Utilize Data Provenance
  • Document how you are Using Customer Data
  • When in Doubt, Delete or Pseudonymize (or even Anonymize!)

Happy GDPR Day! 🎉 As millions of last-minute emails shot around the world on May 25th, customers, business analysts and pundits took part in conversations about what GDPR meant for businesses in Europe and around the world. For data science, the new regulation has a special significance, putting boundaries, informed consent and documented processing at the forefront.

For myself and many data scientists I know, some of these new restrictions made obvious what we all know but would likely prefer not to say out loud: we hadn’t done a very good job keeping track of our data. We often didn’t document data provenance. We didn’t document our processes and processing very well (ahem, reproducibility issues, anyone?). We often didn’t holistically delete data when users removed their accounts. These were often artifacts of legacy software systems and ETL processing — side effects of quick-moving teams working on whatever problem was next. In essence, we have some serious technical debt.

So, does GDPR allow us to take some time to clear our technical debt? Or will we simply kick the can farther down the road and wait to see how the regulators apply enforcement and fines?

Cleaning (your data) House

I would argue that taking time to reassess how we manage our data is long overdue. In an era where reproducible, tested, deterministic and idempotent data science is greatly needed, many data scientists are still running around with long-forgotten batch jobs, a smattering of scripts and legacy “data lakes” that have been through so many schema changes and aggregations that no one remembers what the initial data looked like.

It’s time to take GDPR as a marker for “spring cleaning” and set some standards for how we collect, use and manage data. Let’s do some excavation and build a new foundation. In order to do so, we’ll focus on understanding data provenance and data use first.

Data Provenance: Where Did this Even Come From?

Understanding and tracking data provenance is no new topic. In fact, in a 2005 survey of data provenance techniques for science, the authors outline how large data-collection efforts like the Large Hadron Collider and other “big data” systems need new solutions and standards for collecting data at scale. Despite that early warning on large-scale data collection, the lack of a cohesive, standard metadata marker for datasets has meant data provenance (outside of a few open science standards) has been a side project at best.

In my research, I came across a few fantastic ideas for tracking data provenance, including Trio-One, a database focused on data quality and provenance. A research effort from Stanford’s Jennifer Widom (who has a long series of publications on this topic), Trio-One would allow data scientists to write confidence and provenance queries, specifying data sources and confidence levels directly in SQL. Unfortunately, it seems the industry might not have been ready for this project, and it remained a primarily academic endeavor.

There are other open source or framework based solutions, such as Titian for Apache Spark or Apache NiFi’s provenance repository. Depending on the frameworks you use or pipelines you have built, you might be able to simply use them to additionally store provenance information in a metadata table.
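If no framework-level solution fits your stack, even a lightweight, hand-rolled metadata table is a start. Here is a minimal sketch in plain Python — the field names and in-memory storage are hypothetical; in practice this record would live in a real table alongside your datasets (e.g. in your warehouse or metadata store):

```python
from datetime import datetime, timezone

# A hypothetical in-memory "metadata table"; in practice this would be
# a persistent table next to your datasets.
provenance_table = {}

def register_dataset(name, source, consent_basis):
    """Record where a dataset came from and under what consent it was collected."""
    provenance_table[name] = {
        "source": source,
        "consent_basis": consent_basis,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

register_dataset(
    "clicks_2018_05",
    source="s3://events/2018-05/clicks.parquet",
    consent_basis="opt-in, marketing analytics",
)
```

Even something this simple means the question “where did this dataset come from, and can we use it?” becomes a lookup rather than an archaeology project.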

But we often have many disparate data sources: we funnel them together with a data workflow system, combine them via Spark, pandas or R dataframes, then munge, wrangle, preprocess and encode the data. After that, we send it along to our machine learning or statistical model, or some other process that uses it as input. Sometimes we even grab public datasets, or old data or models we worked with once, documented only in a few notes to ourselves or a long-ago commit that says something like “Testing new features for recommender.”

What could go wrong, right?

Ermmm….

So now that we might see a need beyond GDPR for data provenance tracking, how exactly do we do this given our current processes?

  1. Do a data documentation sprint

Take time to have the team document data they have been using with metadata. Establish a protocol for the documentation, making it easier the next time someone creates a new dataset or utilizes older data in a new way. Remember, data should also have user consent information tracked!
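A documentation protocol can be as simple as an agreed-upon record schema. A hypothetical sketch (all field names here are illustrative, not a standard — agree on your own with the team):

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class DatasetRecord:
    """One entry in the team's dataset documentation."""
    name: str
    owner: str                      # team or person accountable
    source: str                     # where the raw data came from
    consent_basis: str              # e.g. "explicit opt-in, 2018-03 ToS"
    retention_until: str            # ISO date after which the data expires
    downstream_uses: List[str] = field(default_factory=list)

record = DatasetRecord(
    name="clickstream_daily",
    owner="growth-analytics",
    source="web event collector",
    consent_basis="opt-in via cookie banner",
    retention_until="2019-05-25",
    downstream_uses=["recommender training", "weekly KPI dashboard"],
)
```

The point is less the exact fields than the protocol: every new dataset gets a record, and a record with an empty `consent_basis` is a red flag.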

  2. Process your logs to automate documentation

Have logs? Use them. Build a script or query that can parse logs to trace origin, destination and routing of data through your processing pipeline. Append this metadata to the final data.
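As a sketch, assume a hypothetical log format in which each movement of data is logged with a source, destination and job name; a lineage-tracing script could then be as simple as:

```python
import re

# Hypothetical log format:
# "2018-05-25T10:01:02 MOVE src=raw_events dst=clean_events job=etl_daily"
LOG_LINE = re.compile(
    r"(?P<ts>\S+) MOVE src=(?P<src>\S+) dst=(?P<dst>\S+) job=(?P<job>\S+)"
)

def trace_lineage(log_lines):
    """Parse data-movement events out of pipeline logs into provenance records."""
    records = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match:
            records.append(match.groupdict())
    return records

lineage = trace_lineage([
    "2018-05-25T10:01:02 MOVE src=raw_events dst=clean_events job=etl_daily",
    "2018-05-25T10:05:40 MOVE src=clean_events dst=features_v2 job=feature_build",
])
```

Your real log format will differ, but the idea stands: if the pipeline already logs what it moves where, provenance metadata can be generated rather than hand-written.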

  3. Talk with data engineering, software and SRE/Ops teams about tagging provenance and expiration (and implement this new system!)

You need help on this problem, and it’s best to get the conversation going with more than just one team. Figure out who can own which parts of the tracking, logging and tagging solution you develop for your data management and processing. Make sure there are clear owners and tasks for all teams, and that the vision and outcome are a shared goal. How will you mark users who have opted out of processing and remove their data from the pipeline? How will you create deletion or data portability pipelines so users can remove or transfer data in and out of the system? These should be joint responsibilities.
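The opt-out handling mentioned above can be sketched as a simple pipeline stage. This is a minimal illustration; in practice the opted-out IDs would come from your consent-management system, and the stage would run before any training or aggregation:

```python
def apply_opt_outs(records, opted_out_ids):
    """Drop records for users who have withdrawn consent to processing."""
    opted_out = set(opted_out_ids)
    kept = [r for r in records if r["user_id"] not in opted_out]
    removed = len(records) - len(kept)  # log/alert on this count in production
    return kept, removed

kept, removed = apply_opt_outs(
    [{"user_id": 1}, {"user_id": 2}, {"user_id": 3}],
    opted_out_ids=[2],
)
```

A deletion pipeline is the same idea applied to storage rather than a stream: the shared piece both need is an authoritative, queryable list of who has opted out.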

  4. Mark questionable data for further review

Evaluate all data you currently use and data you have in cold storage. Write up data documentation for everything you can, specifically marking data for which you don’t know the details. We’ll get to this questionable data at a later step…

Now you have a better grasp of data provenance, hooray! That was one difficult but necessary step. Hopefully, you have started a conversation that will continue and some processes which enable you to do better data science. Understanding provenance is key in determining your data quality and trustworthiness, and now that information is a short query away. Utilize it!

Documenting and Testing Your Processing

The next step is to apply this same documentation and discussion to your data processing. But hopefully you’ve already been doing some of this, right?? You are using version control, you are using continuous integration, you are testing your models and workflows in an automated and regularized way, yes?

If not, you probably either know you should be or have never read a book on software design and should start there. 🔥🔥🔥

I’m only half joking here. If you aren’t doing these things now, please, please take some time to do them. Reproducible processing and tracking errors and quality issues in our code, as well as in its byproducts and outcomes, are requirements for being a person who builds useful computer products (or data models) — and that’s what we want to do, right?

Need some resources to get started? Depending on your framework, there might already be some projects or tools you can utilize to help you get started. Or start with the documentation!

What information should your documentation cover?

  • How does the process work?
  • What frameworks does it use?
  • Are there any errors, bugs or long-standing crashes that are known and open?
  • How can it be tested and debugged? (i.e. logs, test suite)
  • Who owns the process? (team, primary developer, etc)
  • What customer data does it use and in what way (in plain English)?

What should your data processing systems or workflow tools have in place?

  • Automated testing of workflow or processing changes

Workflows should be as automated as possible, with tests! I have more opinions on this, but the idea is that your workflows and data processing are software and good software design means we can automate testing.
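As a tiny illustration of the kind of check such a test suite might run — the schema and function here are hypothetical — a workflow test could assert that every batch carries the columns downstream steps expect:

```python
def validate_batch(batch, required_columns):
    """Return (index, column) pairs for every record missing a required column."""
    missing = [
        (i, col)
        for i, record in enumerate(batch)
        for col in required_columns
        if col not in record
    ]
    return missing

# These checks would live in your test suite and run in CI on every change.
good = validate_batch([{"user_id": 1, "ts": "2018-05-25"}], ["user_id", "ts"])
bad = validate_batch([{"user_id": 1}], ["user_id", "ts"])
```

Run it against a small fixture batch in CI, and a schema-breaking change to a workflow fails the build instead of silently corrupting downstream data.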

  • Logged and alerted messaging for errors or data quality issues (at least some counters, everyone)

Errors, alerts and messaging to processing owners are essential when managing data at scale. Ensure your logs are surfaced and your errors are made clear and available for review and debugging; this makes your life much easier when it comes to trusting deployments and catching failures. At the very least, implement some counters to track errors and data quality over time (e.g. 40% of the last batch did not pass data quality tests).
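A minimal sketch of such counters, assuming a hypothetical quality check on an `email` field (in production these counts would feed your metrics/alerting system rather than live in memory):

```python
from collections import Counter

quality_counters = Counter()

def check_record(record):
    """Run a data quality check, counting what we saw and what failed."""
    quality_counters["seen"] += 1
    if record.get("email") is None:
        quality_counters["missing_email"] += 1
        return False
    return True

for rec in [{"email": "a@b.de"}, {"email": None}, {"email": "c@d.de"}]:
    check_record(rec)

# The ratio you would alert on if it crosses a threshold.
failure_rate = quality_counters["missing_email"] / quality_counters["seen"]
```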

  • Flags or tags on data with potential errors or quality issues (or determine a policy)

How should potentially erroneous data, or data which doesn’t pass current quality or schema checks, be handled? Develop a policy, or flag this data so you understand what you might need to delete or reprocess at a future date (and set this review date ASAP after the error has occurred).
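For example, a flagging policy might tag the record with a reason and an immediate review date, rather than silently dropping it. A sketch (the flag field names are illustrative):

```python
from datetime import date, timedelta

def flag_for_review(record, reason, review_days=14):
    """Tag a record that failed a quality or schema check, with a review date set now."""
    flagged = dict(record)  # don't mutate the original record
    flagged["_quality_flag"] = reason
    flagged["_review_by"] = (date.today() + timedelta(days=review_days)).isoformat()
    return flagged

flagged = flag_for_review({"user_id": 7, "age": -3}, reason="negative age")
```

A periodic job can then sweep for records whose `_review_by` date has passed and force the delete-or-reprocess decision.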

  • Non-data scientist readable documentation on how user data is processed (yes, for ALL of the workflows)

Part of GDPR, and simply good practice, is properly documenting your data processing for the average person or user. This is also a great way to showcase your team’s work and to stop answering questions about how a report or dashboard is generated. Once everything is documented, keeping it maintained is easy: code change? Documentation change!

  • Documentation of other partners or “downstream” consumers of your processing

Document the consumers of your workflows, so you can also track downstream consent agreements. This is essential for managing which consent covers which third-party processors, and is also important for determining whether downstream data breaches might affect your customers. (Best policy? Provide pseudonymized or anonymized data to your partners!)

This process is a bit arduous, no doubt. But, what it will do is allow for easier onboarding of new hires, better visibility for the work that your team is doing and an easy way to add new processes or partners. The initial energy put in will pay off in time saved later (in explanations and training). Extra bonus for the removal of information silos which tend to pile up when these processes go undocumented (or as I like to sometimes say, under-documented).

When in Doubt, Delete or Pseudonymize (or even Anonymize!)

Finally, let’s address the elephant in the room: what to do with all the data and processes you can’t properly document, where you don’t know where the data came from, whether it should have expired, or whether proper consent was ever given.

Under GDPR, data that cannot be traced back to a particular individual can be kept and used for processing. Now, as data scientists, we know that proper data anonymization is no small task. It is still unclear as to how regulators might treat anonymization and large datasets (or say a determined adversary with access to external information), but the general consensus is that GDPR does not call for differential privacy-level anonymization…at least, yet.

So, what can you do with this questionable data? One option is to simply expire and delete data and processes which should not be maintained any longer. This is likely a useful step even if you do not use the data anymore and are simply holding onto it because no one has bothered to delete it or you are a data packrat. Don’t do that! Pick it up, hold it in your hand, see if it brings you joy, and if not, please just delete the damn data.

But let’s say you are actively using the data or need to keep it as it is part of some larger process. This means you need to figure out a way to make it compliant, and therefore minimally pseudonymize it. (GDPR specifically calls for pseudonymization of Personally Identifiable Information).
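As a baseline, keyed hashing (an HMAC with a secret key) is a step up from a bare hash: without the key, an attacker cannot rebuild the mapping simply by hashing guessed identifiers. A minimal sketch — the key here is a placeholder; a real key belongs in a vault and should be rotatable, and this alone is not a complete pseudonymization scheme:

```python
import hashlib
import hmac

# Placeholder only: store and rotate the real key via your secrets management.
SECRET_KEY = b"rotate-me-and-store-in-a-vault"

def pseudonymize(identifier: str) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

token = pseudonymize("alice@example.com")
```

The mapping is stable (the same input always yields the same token, so joins still work) but reversible only for whoever holds the key, which is exactly the property GDPR’s pseudonymization language is after.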

You can also opt for anonymization, but it’s unlikely you can guarantee differential privacy with a fixed privacy budget for most ways we do data science presently. The utility of the data would then be somewhat short lived and keeping within a reasonable privacy budget over a long time period is difficult to guarantee for most data processes (although I am very hopeful given some of the great research coming out of differentially private machine learning and data collection: you should read more about it).

So if you can’t guarantee differential privacy within a particular privacy budget over a longer period of time, what should you do? In my opinion, you should strive to do the best you can given the constraints of your data format and current storage or stream implementations. This means taking a bit more care and effort than hashing a string and calling it a day. At KIProtect, we are building what we believe is the most easy-to-use, fast and secure pseudonymization and anonymization process for data science.

We use a process similar to dimensionality reduction to encode the data and map it into a similarly-characterized space. Therefore, we retain some of the same structure of your data, while allowing for output that doesn’t reflect any “real” part of the data. We offer pseudonymization as well as anonymization by enforcing a policy similar to differential privacy (small perturbations, suppressing some data points).

If you want to give us a try, please sign up and test out our beta API. Have another request? At KIProtect, we want to hear your problems around properly privatizing and securing your data, and build simple, fast solutions. We believe data privacy and security should be easy for your team; allowing you to continue to focus on your core competencies while trusting that your data is secure and compliant.


Whatever you do, do not simply do NOTHING with this questionable data. Waiting to see how regulators come down on larger corporations could take time, and delaying compliance will leave the data in purgatory, slowing down your team and processing in the interim. Make a determination, follow through, and make sure your CURRENT data collection is done in a compliant and secure way.

Be proud that you have built a new way of working which brings confidence not only in the data you are using every day, but also in your entire team — you are following best practices, removing technical debt and creating a culture where you can experiment, learn and know that errors don’t simply go unnoticed.

So, my fellow data scientists, take GDPR as an opportunity to cull the current technical debt, increase documentation and visibility for your work, and gain customer trust and internal accountability for data privacy and consensual data collection.