GDPR compliant event sourcing with HashiCorp Vault

Johan Sydseter
Sep 7, 2019 · 11 min read

Events are records from the past. Just as we cannot rewrite the history books, we cannot remove immutable events. How, then, do we design microservices that use event sourcing in a way that complies with GDPR article 17, “the right to be forgotten”, and enforces data retention policies according to article 5?

This article is a written version of the presentation I gave at HashiConf EU 2019 under the same title.

GDPR Compliant Event Sourcing With HashiCorp Vault

I have been a developer for some time, but I didn’t understand what it meant “to be compliant” until the end of 2017, when the General Data Protection Regulation came into focus. That’s when I came to understand that we all, as European citizens, have the right to privacy and the right to data protection, and that these are fundamental rights that we and our clients are legally obligated to respect.
GDPR is mainly a reiteration of the Data Protection Directive. What is new is that data protection must now be implemented by design and by default. So what does it mean to do “data protection by design and by default”?

Data protection by design and by default is derived from two terms.
One is “data protection” and the other is “privacy by design and by default”.
Originally, the European Commission used the terms “Privacy by Design” and “Data Protection by Design” to mean the same thing, but in order to make a separation between the right to privacy and the right to data protection, the term “Data Protection by Design and by Default” was chosen. In this way, a distinction was made between the two fundamental European rights: on the one hand “The Right to Privacy” and on the other “The Right to Data Protection”. “Privacy by Design and by Default” is, however, still recognized as what the European Commission calls “an essential component for fundamental privacy protection” and “privacy engineering”.

“Privacy by Design” is based on 7 foundational principles originally conceived by Dr. Ann Cavoukian during the 1990s, when she was the Information and Privacy Commissioner of Ontario, Canada. These 7 foundational principles are like a “manifesto” for the General Data Protection Regulation, as they are the foundation for many of its articles, for example article 25, which is called “Data Protection by Design and by Default”.

HashiCorp Vault is a tool that can help us implement these principles, especially in relation to data protection. For example, with HashiCorp Vault you can ensure that private data is only exposed unencrypted after the user has been authenticated, authorized and audited for being allowed to decrypt the data. That is a proactive approach to data protection, as it assumes that the data can be stolen and protects the privacy of the data subject even if that were to happen. It also means that, “by default”, private data is inaccessible until proper trust has been established, which is in line with principle 2 regarding “the default setting”.
HashiCorp Vault therefore helps us implement these principles, except perhaps for principle 6 regarding “visibility and transparency”.
To ensure visibility and transparency when processing private data we need a way to keep track of changes to the private data as they happen over time. We should be able to see what operations are responsible for changing the private data and why.
A very interesting architectural pattern for achieving this is event sourcing.

When using traditional SQL or NoSQL databases we can query an application’s state, and this answers many questions.
However, there are times when we not only want to see where we are, but also want to know how we got there.
Event Sourcing ensures that all changes to the application’s state are stored as a sequence of events.
Meaning that instead of storing the result of the computation, we store the series of transactions that in sum represent the result.

Early 19th-century German ledger. RaphaelQS — Own work

Take as an example your bank account. Your bank account is not a secure box where you keep your money.
Your bank account is a principal book or computer file for recording and totaling economic transactions.
Each time you buy a t-shirt or take out cash from an ATM, an event is registered as a transaction in your bank account, and your balance is the result of applying the sum of all transactions that have happened to your account.
Tracking your spending makes you capable of planning ahead to ensure you don’t spend more than you have, but what if someone has taken advantage of you and charged you unreasonably for a service?
Maintaining transaction logs makes sure everyone stays honest in regards to buying and selling as all transactions are transparently recorded.
So event sourcing can be a tool not only for keeping track of state, but also for ensuring transparency and visibility into what is happening and what has happened to your private data.
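To make the ledger analogy concrete, here is a minimal, hypothetical sketch (the account events and amounts are made up for illustration): the balance is never stored as such, it is derived by replaying the recorded events.

```java
import java.util.List;

// Minimal event-sourcing sketch: the balance is never stored directly; it is
// derived by replaying the recorded transactions in order.
public class AccountEventSourcingSketch {

    record MoneyTransacted(String description, long amountInCents) {}

    // The current state is a fold over the full event history.
    static long balance(List<MoneyTransacted> events) {
        return events.stream().mapToLong(MoneyTransacted::amountInCents).sum();
    }

    public static void main(String[] args) {
        List<MoneyTransacted> events = List.of(
            new MoneyTransacted("Salary", 250_000),
            new MoneyTransacted("Bought a t-shirt", -19_900),
            new MoneyTransacted("ATM withdrawal", -50_000));

        // The balance is the sum of all transactions that have happened to the account.
        System.out.println("Balance: " + balance(events) + " cents");
    }
}
```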

But events are records from the past. Just as we cannot rewrite the history books, we cannot remove immutable events either. According to GDPR article 17, however, the data subject has “the right to be forgotten”, so we need mechanisms for deleting the data subject’s personal information. At the same time, according to article 5, private data should not be kept around “for longer than what is necessary”.
So how do we handle the “right to be forgotten” and data retention when the data is immutable, undeletable and spread around in logs, backups and different microservices?

To address these challenges we can apply a technique called crypto-trashing.
Crypto-trashing is a technique where we encrypt the data and, once we want to remove that data, delete the encryption key instead of the data itself. Once the encryption key is deleted, private data consumed by other services or saved to event stores, backups or logs becomes unreadable.
So how can we use this technique to implement GDPR compliance? According to GDPR, we need to implement the user’s “right to be forgotten”. At the same time, private data should not be stored for longer than what the data subject has agreed to according to our privacy policy. So let’s say I create an encryption key specific to each individual data subject. I call this key the PIIRef key. Then I create a temporal encryption key that I will use during a specific data retention period. I then use double encryption and encrypt the private data first with the personal PIIRef key and then with the temporal key. When an individual, for some specific reason, asks to get their private data deleted, I delete the PIIRef key, making the data unreadable. When the data retention period passes, I create a snapshot of the data and encrypt it with the key for the next retention period, then I delete the key for the previous retention period, making all private data events created during the previous retention period unreadable.
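As a minimal, hypothetical illustration of the idea, independent of Vault: two local AES keys stand in for the PIIRef key and the temporal key, and discarding a key is all it takes to make the ciphertext permanently unreadable.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;
import java.util.Base64;

public class CryptoTrashingSketch {

    // Encrypt with AES-GCM; a fresh random IV is prepended to the ciphertext.
    static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = cipher.doFinal(plaintext);
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return out;
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator gen = KeyGenerator.getInstance("AES");
        gen.init(256);

        SecretKey piiRefKey = gen.generateKey();     // specific to one data subject
        SecretKey retentionKey = gen.generateKey();  // specific to one retention period

        // Double encryption: first the personal PIIRef key, then the temporal key.
        byte[] event = "Stig Johansen: Brintellix 20 mg".getBytes("UTF-8");
        byte[] ciphertext = encrypt(retentionKey, encrypt(piiRefKey, event));
        System.out.println(Base64.getEncoder().encodeToString(ciphertext));

        // "Deleting" the data subject means discarding piiRefKey. Without it, the
        // inner layer can never be decrypted again, no matter how many copies of
        // the ciphertext exist in event stores, logs or backups.
        piiRefKey = null;
    }
}
```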

Let‘s have a look at how this works.

I have installed HashiCorp Vault and an event store from AxonIQ called Axon Server, and created a microservice for registering health treatment for elderly patients in need of medical assistance at a home for the elderly. From Axon Server I have a full overview of all the events that are generated by the microservice. As you can see, the event store is completely empty, so I will go ahead and register some health treatments for two patients living at the home.

So I will prescribe 10 mg of Brintellix to Jan Johansen, as he has become increasingly depressed over the last year, and 20 mg to his brother Stig Johansen, as he has been depressed for many years now. As you can see, our event store has started to fill up.
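As a rough, hypothetical sketch of what such a command handler could look like with the Axon Framework (the command, event and aggregate names are my assumptions, not the actual demo code): the command does not update state directly, it applies an event, which is appended to the event store in Axon Server and replayed to rebuild the aggregate’s state.

```java
import org.axonframework.commandhandling.CommandHandler;
import org.axonframework.eventsourcing.EventSourcingHandler;
import org.axonframework.modelling.command.AggregateIdentifier;
import org.axonframework.modelling.command.AggregateLifecycle;
import org.axonframework.modelling.command.TargetAggregateIdentifier;
import org.axonframework.spring.stereotype.Aggregate;

// Hypothetical command and event for registering a health treatment.
record RegisterHealthTreatment(@TargetAggregateIdentifier String patientId,
                               String medication, int dosageMg) {}
record HealthTreatmentRegistered(String patientId, String medication, int dosageMg) {}

@Aggregate
public class PatientAggregate {

    @AggregateIdentifier
    private String patientId;

    protected PatientAggregate() {} // required by Axon for event replay

    @CommandHandler
    public PatientAggregate(RegisterHealthTreatment command) {
        // Applying the event is what gets recorded in the event store.
        AggregateLifecycle.apply(new HealthTreatmentRegistered(
            command.patientId(), command.medication(), command.dosageMg()));
    }

    @EventSourcingHandler
    public void on(HealthTreatmentRegistered event) {
        this.patientId = event.patientId();
    }
}
```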

To better see that the data has been encrypted, I have created an endpoint for viewing the encrypted data and another endpoint showing how the data looks when it is unencrypted.

To encrypt the data we are using the Vault transit secret engine, for two reasons. The first reason is that the transit secret engine exposes encryption as a service, meaning that it does the encryption and decryption for us without exposing the encryption key to the application. Even if an attacker were able to gain access to our service, he still wouldn’t be able to decrypt the data, as he wouldn’t have access to the encryption key. The second reason is that the transit secret engine supports key derivation, which allows the same key to be used for multiple purposes by deriving a new key based on a user-supplied context value. By making this context specific to each person, I can encrypt and decrypt each data subject’s health treatments using a single encryption key. I then store this context value in a normal database.

Now let’s say Jan’s brother dies and his family requests the deletion of Stig’s data. As I am using event sourcing, I can’t simply delete all his events. What if this data is being used to generate a report for the government that they need once a year? If I delete the data, I might risk losing data needed for that report. Besides, deleting data from the event store can be quite resource consuming, complicated to implement and risky in itself. And what about logs, backups and copies of the data that have been spread to other microservices? So what I want to do instead is to delete the context value represented by the NINRef. So that is what I am going to do.
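Before doing that, here is a minimal sketch of what such an encryption call looks like against the transit engine’s HTTP API; the key name health-treatments, the NINRef value and the naive response handling are assumptions made for this example, not the demo code. Decryption requires the exact same context value, which is why deleting the stored NINRef makes the ciphertext unreadable even though the transit key itself still exists in Vault.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Encryption as a service with key derivation: the application never sees the
// encryption key, it only sends plaintext plus a per-person context value (the
// NINRef) to Vault's transit engine.
public class TransitEncryptionSketch {

    static final String VAULT_ADDR = "http://127.0.0.1:8200";
    static final String TOKEN = System.getenv("VAULT_TOKEN"); // assumes a Vault token is set

    static String post(String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(VAULT_ADDR + path))
            .header("X-Vault-Token", TOKEN)
            .POST(HttpRequest.BodyPublishers.ofString(json))
            .build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body(); // a real service would parse the JSON response properly
    }

    static String b64(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes());
    }

    public static void main(String[] args) throws Exception {
        String ninRef = "NINRef-for-stig";   // the context value stored in a normal database
        String treatment = "Brintellix 20 mg";

        // Transit expects both plaintext and context as base64; the response
        // contains a ciphertext of the form "vault:v1:...".
        System.out.println(post("/v1/transit/encrypt/health-treatments",
            "{\"plaintext\":\"" + b64(treatment) + "\",\"context\":\"" + b64(ninRef) + "\"}"));
    }
}
```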

There, now, as you can see, the microservice only exposes Jan’s health treatments, and if I go to the unencrypted data endpoint, I understand why, as it is no longer possible to decrypt Stig’s health treatments.

Encrypted HTTP endpoint
Unencrypted HTTP endpoint

So, in this way, with a very simple operation, we have made sure that Stig’s data is deleted, but we can still use his data for our reports, as the product data is still there and completely anonymous. I have also made sure that Stig is not only deleted from this microservice; Stig’s data is also removed from any other microservice and from any logs or backups which have processed his data.
For data retention, I would apply the same technique, but I would make sure to create a snapshot of all the data using the retention key for the new period before deleting the old key.
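A rough sketch of what such a rollover could look like against the transit HTTP API, assuming hypothetical period keys named retention-2019 and retention-2020 and omitting proper JSON parsing (the ciphertext and plaintext placeholders would come from the snapshot):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical retention rollover: re-encrypt the snapshot under the new period
// key, then delete the old period key so every event, log entry and backup
// encrypted during the previous period becomes unreadable.
public class RetentionRolloverSketch {

    static final String VAULT = "http://127.0.0.1:8200";
    static final String TOKEN = System.getenv("VAULT_TOKEN");

    static String call(String method, String path, String json) throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(VAULT + path))
            .header("X-Vault-Token", TOKEN)
            .method(method, json == null
                ? HttpRequest.BodyPublishers.noBody()
                : HttpRequest.BodyPublishers.ofString(json))
            .build();
        return HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString())
            .body();
    }

    public static void main(String[] args) throws Exception {
        // 1. Decrypt the snapshot with the old period key (ciphertext shortened here).
        call("POST", "/v1/transit/decrypt/retention-2019", "{\"ciphertext\":\"vault:v1:...\"}");

        // 2. Re-encrypt the plaintext from step 1 with the new period key.
        call("POST", "/v1/transit/encrypt/retention-2020", "{\"plaintext\":\"<base64 from step 1>\"}");

        // 3. Allow deletion of the old key, then delete it.
        call("POST", "/v1/transit/keys/retention-2019/config", "{\"deletion_allowed\":true}");
        call("DELETE", "/v1/transit/keys/retention-2019", null);
    }
}
```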
Please keep in mind that you should encrypt the PIIRef and not share it with external consumers other than the internal services you have control over. The reason is that the PIIRef is itself a personal identifier. It is OK to leave it undeleted as long as you only use it between your internal services, but if external services were to use it, they could theoretically use it to retrieve the sensitive information that you told the data subject you had deleted. That could potentially lead to a very awkward situation. You should therefore consider encrypting the PIIRef with a table/column/field key and never share the PIIRef as I have done in the examples above, which are meant for demonstration purposes. You should obviously never use the value of the PIIRef itself as the user-supplied context value either. The PIIRef should point to a context value which is unique for each service that will be encrypting your data. You therefore need to create an encryption service, responsible for encrypting and decrypting your data, as a wrapper around the HashiCorp Vault transit secret engine, or possibly a Vault secret engine plugin.
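The shape of such a wrapper service could, for example, look like the interface below; the name and methods are assumptions for illustration. Callers only ever pass the PIIRef, while the service looks up the per-service context value and talks to the transit secret engine, so the PIIRef itself is never used as the derivation context and never leaves your internal services.

```java
// Hypothetical wrapper around the Vault transit secret engine.
public interface EncryptionService {

    /** Encrypts a field for the data subject identified by piiRef. */
    String encrypt(String piiRef, String plaintext);

    /** Decrypts a field; fails once the context value behind piiRef has been deleted. */
    String decrypt(String piiRef, String ciphertext);

    /** Implements the right to be forgotten by deleting the stored context value. */
    void forget(String piiRef);
}
```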

There is one more thing. How do I ensure that the data the data controller has flagged as personal data really is encrypted, how do I test that the correct data is being encrypted, and how can I prove to an auditor that this encryption is being applied in production?
To be able to do this, we use JSON Schemas for data flow validation. Using JSON Schema draft version 7, we can specify which fields we want to encrypt.
So I have created a very small library that takes a JSON Schema, the context value used for key derivation encryption and the object that should be encrypted.

Here I have the JSON Schema that is used for encrypting the health treatment. The library uses the “contentEncoding” and “contentMediaType” properties from draft version 7 of the JSON Schema standard to specify which of the fields will be encrypted. The library then validates that encryption is being correctly applied, logs a warning if something is wrong and logs an info entry if everything is OK. Based on the log output I can test the encryption, monitor how encryption is being applied and do a data flow analysis for auditing purposes.
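As a rough, hypothetical illustration of that validation step (the flattened schema stand-in, the field names and the “vault:v…” ciphertext prefix check are my assumptions, not the actual library):

```java
import java.util.Map;

// Sketch of schema-driven encryption validation: walk the fields that the
// schema flags for encryption and check that each one actually holds a transit
// ciphertext, which always starts with a "vault:v<N>:" prefix.
public class EncryptionValidationSketch {

    // Stand-in for the JSON Schema: field name -> whether it is flagged for encryption.
    static final Map<String, Boolean> SCHEMA_ENCRYPTED = Map.of(
        "patientName", true,
        "medication", true,
        "dosage", true,
        "treatmentId", false);

    static void validate(Map<String, String> document) {
        for (Map.Entry<String, Boolean> field : SCHEMA_ENCRYPTED.entrySet()) {
            if (!field.getValue()) continue;
            String value = document.get(field.getKey());
            if (value != null && value.startsWith("vault:v")) {
                System.out.println("INFO: " + field.getKey() + " is encrypted");
            } else {
                System.out.println("WARN: " + field.getKey() + " is NOT encrypted");
            }
        }
    }

    public static void main(String[] args) {
        validate(Map.of(
            "patientName", "vault:v1:8SDd3WHDOjf7mq...",
            "medication", "Brintellix",              // not encrypted: triggers a warning
            "dosage", "vault:v1:KjsdD2fDdkj0a...",
            "treatmentId", "T-1001"));
    }
}
```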
So why am I using a JSON Schema for this? Why not just validate the input with code?
I want to be able to model how the data structure is encrypted in a modeling tool such as Sparx Enterprise Architect and export that data model as a JSON Schema directly from the modeling tool into the code. That gives me complete traceability from the design phase until the schema is used in production, and it enables me to design and review the encryption of personal data for my microservices. It also allows me to make assumptions based on the solution design, so that when doing the data flow analysis I can ensure that private data is encrypted and protected by design and by default using HashiCorp Vault.

Johan Sydseter

Co-leader for OWASP Cornucopia and co-creator of Cornucopia Mobile App Edition, an application security engineer, developer, architect and DevOps practitioner.