Ecto_anon: Our open-source library for anonymizing data easily
As a European company, Welcome to the Jungle has to be GDPR compliant, which includes, among other things, ensuring our users have the right to erasure. As engineers, we didn’t want to settle for our existing manual solution, so we looked for a more efficient way to do this while taking into account the technical requirements of the entire tech team. This led us to release our own open-source library for data anonymization, which we called ecto_anon. It is built in Elixir, our main backend language, and comes with Ecto integration. In this article, we explain in more detail why we ended up creating our own library and how it works.
The right to erasure
Ever since GDPR entered into application in 2018, companies have had to adapt and enforce a firm stance regarding their data privacy and security and are now required to take data protection into account “by design and by default.”
This has guaranteed many things, among them stronger protection of users’ rights and companies being educated on how to respect their users’ fundamental rights to privacy and data protection when using or processing their information.
There are several noteworthy provisions in this regulation, including the right to erasure, also known as the “right to be forgotten”. Article 17 of the regulation stipulates, among other things, that when the data subject withdraws their consent, and no exemption or legitimate reason to retain the data applies, the data should be erased within one month, provided the subject’s identity can be verified. Erasure can be performed via unrecoverable deletion or via anonymization, which means individuals are no longer identifiable and can never be re-identified.
It should be noted that the scope of GDPR is really wide and that this article and our library focus only on the right to erasure.
How we used to do it (and why we don’t do it like that anymore)
Before our user volume reached today’s figure of millions, we simply deleted everything related to a user from our database, our S3 buckets, and so on.
Whenever a user contacted us, we had to perform a hard delete manually, which caused several problems:
- It was necessary for users to send an email to request deletion of their account.
- There was no history of user deletions.
- The process was extremely time-consuming.
- As it was a manual process, some users had to wait a while before the deletion actually took place.
- We couldn’t scale the process as we grew.
As our data team grew, new requirements came along. Among them was the need to avoid the discrepancies in our total number of users that hard deletes were causing. Simply deleting our users was an extremely simple yet effective solution, but it made data analysis further down the road much harder.
Therefore we needed to find a new way to remain GDPR compliant, maintain the history of user deletions more efficiently, automate this process, and make it easier for our data team by providing a soft delete.
Creating our own library
So we needed a simple way to anonymize our data, one that would stay easy to maintain every time a new personal field was added to our database.
At the time, Ecto was being used extensively as a database wrapper within our projects, and more broadly within the Elixir community, so our first move was to search for a library that could anonymize our fields easily. But we couldn’t find anything!
That’s how we ended up creating the ecto_anon library!
How does ecto_anon work?
As we were already defining schemas with Ecto to interface with our database, we created an anon_schema to specify all the fields we want to anonymize.
When a user wants to delete their account, we simply run the library and let it anonymize their data based on each field’s type.
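To make the idea concrete, here is a minimal, dependency-free sketch of type-based anonymization in plain Elixir. It does not use ecto_anon’s actual macros or functions (the README documents those). The `AnonSketch` module, its `@anon_fields` list, and the replacement values are all assumptions chosen for illustration.

```elixir
defmodule AnonSketch do
  # Hypothetical stand-in for an anon_schema: the personal fields and their types.
  @anon_fields [firstname: :string, email: :string, age: :integer, birthdate: :date]

  # Replace every listed field with a type-based default; leave other fields untouched.
  def run(record) when is_map(record) do
    Enum.reduce(@anon_fields, record, fn {field, type}, acc ->
      if Map.has_key?(acc, field) do
        Map.put(acc, field, default_for(type))
      else
        acc
      end
    end)
  end

  # Assumed defaults, one per field type.
  defp default_for(:string), do: "redacted"
  defp default_for(:integer), do: 0
  defp default_for(:date), do: ~D[1970-01-01]
end

user = %{
  id: 42,
  firstname: "Jane",
  email: "jane@example.com",
  age: 24,
  birthdate: ~D[1998-03-14],
  locale: "fr"
}

# Personal fields are replaced; id and locale are kept as they were.
IO.inspect(AnonSketch.run(user))
```

Keeping the list of personal fields next to the schema is what makes maintenance cheap in this kind of design: adding a new personal field means adding one entry to the list.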
What about associations?
Running the same function for every association related to a user quickly becomes tedious. So we created a cascade option that propagates anonymization to everything related to our users that requires it, as long as it’s specified in the anon_schema.
We then simply run the library with the cascade option enabled.
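The cascading behaviour can be sketched in plain Elixir as a recursive walk over nested records. This is an illustration of the idea only, not ecto_anon’s implementation: the `CascadeSketch` module, the `@config` map, and the `:cascade` option as handled here are hypothetical, and records are plain maps rather than Ecto structs.

```elixir
defmodule CascadeSketch do
  # Hypothetical per-entity configuration: which fields to redact and which
  # associations (name => kind of the associated records) to follow.
  @config %{
    user: %{fields: [:email, :firstname], assocs: [comments: :comment]},
    comment: %{fields: [:author_name], assocs: []}
  }

  def run(record, kind, opts \\ []) do
    %{fields: fields, assocs: assocs} = Map.fetch!(@config, kind)

    # Redact the record's own personal fields first.
    redacted = Enum.reduce(fields, record, fn field, acc -> Map.put(acc, field, "redacted") end)

    if Keyword.get(opts, :cascade, false) do
      # Then recurse into every configured association present on the record.
      Enum.reduce(assocs, redacted, fn {assoc, assoc_kind}, acc ->
        case Map.fetch(acc, assoc) do
          {:ok, children} -> Map.put(acc, assoc, Enum.map(children, &run(&1, assoc_kind, opts)))
          :error -> acc
        end
      end)
    else
      redacted
    end
  end
end

user = %{
  email: "jane@example.com",
  firstname: "Jane",
  comments: [%{body: "Great job offer!", author_name: "Jane"}]
}

# Without :cascade only the user is redacted; with it, the comments are too.
IO.inspect(CascadeSketch.run(user, :user))
IO.inspect(CascadeSketch.run(user, :user, cascade: true))
```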
The library also supports anonymization options such as random UUID generation and pseudo-anonymization with partial data anonymization, custom options to fit everyone’s needs, helpers to filter anonymized entries, and migration helpers. You can read more about that in our README.
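As a rough illustration of two of these options, the sketch below generates a random version-4 UUID and partially masks an email address using only the Elixir standard library and Erlang’s `:crypto` module. The `AnonOptionsSketch` module and both function names are hypothetical, and the masking format is an assumption; the library’s own options may behave differently.

```elixir
defmodule AnonOptionsSketch do
  # Random version-4 UUID built from 16 random bytes, with the version and
  # variant bits overwritten as required by the UUID format.
  def random_uuid do
    <<a::48, _::4, b::12, _::2, c::62>> = :crypto.strong_rand_bytes(16)
    hex = Base.encode16(<<a::48, 4::4, b::12, 2::2, c::62>>, case: :lower)

    <<p1::binary-size(8), p2::binary-size(4), p3::binary-size(4), p4::binary-size(4),
      p5::binary-size(12)>> = hex

    Enum.join([p1, p2, p3, p4, p5], "-")
  end

  # Partial masking: keep the first character of the local part and the domain.
  # Anything that does not look like an email is fully redacted.
  def partial_email(email) when is_binary(email) do
    case String.split(email, "@", parts: 2) do
      [<<first::utf8, _rest::binary>>, domain] -> <<first::utf8>> <> "***@" <> domain
      _ -> "redacted"
    end
  end
end

IO.puts(AnonOptionsSketch.random_uuid())
IO.puts(AnonOptionsSketch.partial_email("jane@example.com"))
# prints "j***@example.com"
```

Partially masked values can still narrow down who a person is, which is why this kind of option counts as pseudo-anonymization rather than full anonymization.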
Our usage of ecto_anon
Our deletion process is now enhanced thanks to our new library. The following process starts as soon as a user requests deletion via their profile:
- The user deletion request is logged in a dedicated table.
- Anything related to the user (session cookies, uploaded files, email alerts, etc.) is deleted.
- Ecto_anon is used to anonymize the user and anything connected to them.
- The completion of the request is timestamped in the log.
- All DPOs (Data Protection Officers) of the companies the candidate applied to are emailed.
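The steps above can be sketched as a small pipeline in which each step receives the state produced by the previous one and the whole flow stops at the first failure. Everything in this example is a hypothetical placeholder operating on in-memory maps: the `DeletionFlowSketch` module, its step functions, and the field names are assumptions, and a real implementation would talk to the database, file storage, and a mailer instead.

```elixir
defmodule DeletionFlowSketch do
  # Runs the five steps in order; `with` stops at the first step that does not
  # return {:ok, state}. For simplicity the same timestamp is used throughout.
  def run(user, now \\ DateTime.utc_now()) do
    state = %{user: user, log: nil, notified: []}

    with {:ok, state} <- log_request(state, now),
         {:ok, state} <- delete_related(state),
         {:ok, state} <- anonymize(state),
         {:ok, state} <- complete_log(state, now),
         {:ok, state} <- notify_dpos(state) do
      {:ok, state}
    end
  end

  # 1. Log the deletion request.
  defp log_request(state, now) do
    {:ok, %{state | log: %{requested_at: now, completed_at: nil}}}
  end

  # 2. Delete related data (sessions, uploaded files, email alerts).
  defp delete_related(state) do
    {:ok, %{state | user: Map.merge(state.user, %{sessions: [], files: [], alerts: []})}}
  end

  # 3. Anonymize the user (stand-in for the call to the anonymization library).
  defp anonymize(state) do
    {:ok, %{state | user: Map.merge(state.user, %{email: "redacted", firstname: "redacted"})}}
  end

  # 4. Timestamp the completion of the request in the log.
  defp complete_log(state, now) do
    {:ok, %{state | log: %{state.log | completed_at: now}}}
  end

  # 5. Collect the DPO addresses of the companies the user applied to.
  defp notify_dpos(state) do
    dpos = state.user |> Map.get(:applied_to, []) |> Enum.map(& &1.dpo_email)
    {:ok, %{state | notified: dpos}}
  end
end

user = %{
  email: "jane@example.com",
  firstname: "Jane",
  sessions: ["abc"],
  files: ["cv.pdf"],
  alerts: ["weekly"],
  applied_to: [%{name: "Acme", dpo_email: "dpo@acme.example"}]
}

{:ok, result} = DeletionFlowSketch.run(user)
IO.inspect(result.notified)
```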
By the end of all this, the user is completely anonymized and we are not able to re-identify them.
This library is currently being used in production on welcometothejungle.com and has already been used to anonymize thousands of users.
The process now scales as we attract increasing numbers of users, and it has become very easy to include a new personal field in the anonymization performed when a user is deleted.
We have also strengthened our GDPR compliance in addition to fulfilling the wishes of our data team. Things are still in their early stages and we already have plenty of ideas for improving the process.
We would love to receive any feedback and contributions that would help us — and others, too — so please don’t hesitate to contribute!
Written by Clément Quaresma, Back-end developer @ WTTJ
Illustration by David Adrien