Semantic anonymisation for databases via django

Create GDPR-compliant database dumps for test environments and local development

Ronny Vedrilla
ambient-digital
Dec 20, 2019 · 6 min read


What this is all about

At the latest since the European General Data Protection Regulation (GDPR) came into force on May 25th, 2018, people have started worrying about how to handle sensitive data.

One of the main issues addressed by the GDPR is that if you store (or process) personal data of any kind, there needs to be a justification for doing so. Furthermore, if you lose personal data, this needs to be reported and made public. Usually something you want to avoid at all costs.

Therefore, in theory, only a carefully selected number of administrators get access to the highly sensitive database. But in practice you need a valid dataset for your test and staging servers, your developers need one to work with locally, and so on. And this is exactly where this article comes in.

If you are just looking for the how-to (nobody will blame you), please scroll down to the part after the Rubik’s cube.

Photo by James Sutton on Unsplash

Test dataset: Production data?

There are multiple ways to create a good, working and valid dataset which you can use for test servers and local development.

The obvious solution is taking the production data. On the plus side, you will be able to reproduce all bugs related to the current state of your data. Furthermore, it is easy to get: just dump it from the database server. But if one of your team members’ laptops gets stolen, even with hard drive encryption enabled, you might have exposed personal data. This theoretically needs to be reported (which is exactly what you do not want to happen).

In addition, in most data processing contracts you ensure that sensitive data will only be handled by your administrators. Sure, every intern can be an administrator…

In a nutshell: The more sensitive your data is, the less you want to use your production data for any other purpose than your production system.

Test dataset: Fixtures?

So if you cannot take the real thing, the next best thing is manually created data, right? Whether you create fixtures or start with an empty system and build data by hand, in both cases you will have a hard time generating all the data you need to simulate the production system.

Simple example: Imagine you maintain an online shop. Every day people buy goods, invoices are created, dunning letters are sent. You start hacking all this data into your system, making up customers and fake purchases. After hours of work, everything looks neat. Great! But then you work for a week, a month, several months on the project. The created data will become stale and drift further and further away from a production-like state. Just picture all the business logic targeting current data, like statistics. When all test data timestamps are six months in the past, those features will be hard or even impossible to use.

If you go for implemented fixtures, it’s basically the same. Certainly, you can implement some logic which makes sure certain cases are always the way you need them to be. But the fancier and more convenient your fixture setup gets, the more work you have to put in. And if you change the code of your system, you will probably have a ton of adjustments there as well.

Test dataset: Anonymisation?

So, if we cannot use the production dump but creating everything from scratch is troublesome as well — what can we do?

The keyword we are looking for is “anonymisation”.

The procedure is quite simple: your admin takes the production dump and replaces all the sensitive data like names, emails, bank accounts etc. with something generic.

A real plus is the fact that you keep your original IDs. So if a bug is reported for the record with ID 27, you should still be able to reproduce and therefore fix it.

As we see, the benefits are obvious. Unfortunately, imagine you see only hashed values everywhere. At some point, it gets quite hard to get the same look and feel as on the production system.

Luckily, there is a solution for this problem as well…

Photo by Olav Ahrens Røtne on Unsplash

The way to go: Semantic anonymisation

Working frequently in the django ecosystem, we found the django-scrubber package a couple of months ago. The great thing about that tool is that you can define all the fields which contain sensitive data and not just hash or empty them, but fill them with data that has the same meaning as your production state.

Imagine we have a django model for our customer data.
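
A minimal sketch of such a model could look like this (the concrete field options are just for illustration):

```python
from django.db import models


class Customer(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
    last_login = models.DateTimeField(null=True, blank=True)

    def __str__(self):
        return f'{self.first_name} {self.last_name}'
```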

If we take a closer look, we can see that the fields first_name and last_name contain sensitive data whereas last_login is quite uncritical.

With django-scrubber, we can define a subclass within the model like this:
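
(The following sketch is based on the django-scrubber documentation; the field set simply mirrors the model from above.)

```python
from django.db import models
from django_scrubber import scrubbers


class Customer(models.Model):
    first_name = models.CharField(max_length=100)
    last_name = models.CharField(max_length=100)
    last_login = models.DateTimeField(null=True, blank=True)

    class Scrubbers:
        # replace every real name with a random (optionally localised) Faker value
        first_name = scrubbers.Faker('first_name')
        last_name = scrubbers.Faker('last_name')
```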

If we now run the management command scrub_data provided by scrubber, the package knows which fields to handle and how. When you anonymise the dataset, scrubber will pick a random first name and a random last name for every customer record you have in your database.
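
On a copy of the production database, this boils down to a single call:

```bash
python manage.py scrub_data
```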

Scrubber utilises the Faker package which provides an abundance of helpful data types, like job descriptions, street names and many more. And it can even provide localised (language-specific) data! You can read all about your options in the Faker documentation.
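
To get a feeling for what Faker can generate, you can try it directly in a Python shell (de_DE is just an example locale):

```python
from faker import Faker

fake = Faker('de_DE')   # locale-specific providers, here German

fake.name()         # a random full name
fake.job()          # a random job description
fake.street_name()  # a random street name
```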

In addition, scrubber itself provides a handful of useful tools, like emptying values or simply hashing the existing value, which you can read about here. A really nice feature I would like to point out is the value casting: Faker only generates strings, which the django ORM will not save in field types other than char or text fields. Scrubber tries to cast the faked values so they fit the field declaration in the django model.
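
A hashing scrubber, for example, is registered just like the Faker one; the iban field here is purely hypothetical:

```python
class Scrubbers:
    # keep the value unreadable instead of replacing it with semantic fake data
    iban = scrubbers.Hash('iban')
```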

Limitations

As always when you tackle a complex problem with an automated approach, there are some drawbacks.

First, you might want to fix a specific bug caused by a specific combination of values. May it be a rounding issue or a long text breaking the design: if you alter your data, you might alter away the very bug you are trying to narrow down. Unfortunately, this limitation is intrinsic, and we cannot do anything about it.

Secondly, when your field values are not atomic, meaning they have a semantic connection to other fields, this approach won’t work as well. A good example is an invoice with several invoice positions. When you start anonymising the values of the positions, this might affect the invoice itself, for example its total. In this case, a simple scrubbing with this package is insufficient.

Handling all the special cases

Usually you want as little hassle as possible when creating an anonymised dump. So manipulating all the special cases mentioned above and afterwards creating test users, forwarding credentials etc. is something you surely want to avoid doing by hand every time.

For this reason, we implemented an abstract class called AbstractScrubbingService in our ambient-toolbox package.

On PyPI you’ll find the details for installing the package. Once you have added it, you can create your own scrubbing service.
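
The following is a sketch with example pre- and post-scrub steps; the import path refers to the current ambient-toolbox release and might differ for older versions:

```python
from django.contrib.auth import get_user_model
from django.contrib.sessions.models import Session

from ambient_toolbox.services.custom_scrubber import AbstractScrubbingService


class CustomScrubbingService(AbstractScrubbingService):
    # methods registered here run before / after the regular scrubbing
    pre_scrub_functions = ['drop_sessions']
    post_scrub_functions = ['create_local_test_user']

    def drop_sessions(self):
        # example pre-scrub step: session data is worthless in a test dump
        Session.objects.all().delete()

    def create_local_test_user(self):
        # example post-scrub step: a well-known login for every developer
        get_user_model().objects.create_superuser(
            username='local-admin',
            email='admin@example.com',
            password='change-me',
        )
```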

The service wraps the general scrubbing command and truncates the scrubber data table at the end. This table contains preprocessed information to speed up the scrubbing process, but you do not need it afterwards. We do this so the database will be as small as possible for any kind of export.

Furthermore, you can create functions for handling data in any way you like, which run before or after the scrubbing. You can see this in the example above. Simply register the functions in the class attributes pre_scrub_functions or post_scrub_functions and implement them. Done!

Finally, you need to create a new management command called custom_scrub.
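
An example implementation could look like this; the service import path is made up for illustration, and process() as the entry point follows the toolbox documentation, so double-check the exact method name for your version:

```python
# e.g. apps/core/management/commands/custom_scrub.py
from django.core.management.base import BaseCommand

from apps.core.services.scrubbing import CustomScrubbingService  # wherever the service from above lives


class Command(BaseCommand):
    help = 'Anonymises the local database using our custom scrubbing service'

    def handle(self, *args, **options):
        CustomScrubbingService().process()
```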

Conclusion

As far as we know, you can tackle any given problem with the approach shown here. Naturally, if your database contains complex semantic dependencies or other special cases, the overhead of writing the scrubbing logic increases. But compared to the other solutions around, I still believe that this is the way to go.

I would be happy to get some feedback and comments. It feels to me that since the DjangoCon talk, the django ecosystem has been exceptionally quiet about this issue.


Ronny Vedrilla
ambient-digital

Tech Evangelist and Senior Developer at Ambient in Cologne, Germany.