Grakn to manage GDPR

Samuel Pouyt
Vaticle
Jan 18, 2018

Second part of a series. Read the first part to understand what GDPR is: a threat or an opportunity?

In my understanding, GDPR requires a company to track every piece of information that it collects from a user and to ask the user for permission for each piece of information. Thus all the systems a company uses need to be able to request each and every authorization they need. That is not a small task.

The difficulty of tracking and managing all these authorisations is the reason I propose to use an external system, or meta system, that all other software used in the company will have to refer to. These systems will have to verify whether a user has granted permission to access the system and, if not, ask for it and record the answer back in the meta system.

The idea is to make this as simple and flexible as possible and not to be cornered by choices that were made a few months ago. This is where graph databases shine. Graph databases are flexible by design, and it is really easy to add new relations (edges) between nodes (vertices) without doing a migration as we would have to in the RDBMS world. Searches are typically very fast, especially across many relations.

But we want more: we want the benefits of a graph database, but we also want to be able to easily extract knowledge from the data we will be inserting into it. We could have chosen OWL or RDF for that purpose, but these semantic web technologies are too verbose and, because they were not designed to be used as a database, not really up to the task at hand.

The solution I have found is Grakn, which is a distributed knowledge base. While it is not technically a graph database, it leverages the technical strengths of graph databases and also uses a schema to define its structure. The schema in Grakn is very simple to write and gives us data validation. Moreover, Grakn has a built-in inference engine, which will help us simplify queries and provide basic recommendations.

Building an ontology

As mentioned before, GDPR requires data handlers/processors to track every piece of information. An ontology would usually be designed with an entity person that has many attributes. You could have an entity man and another entity woman that inherit attributes from the person entity. This is the typical object inheritance concept. That works, but it is not very practical from a GDPR point of view.

That is where Grakn can be used to leverage the opportunity of GDPR. The knowledge base lets us decompose user data into single items, each of which can then be individually granted an authorization to be used in a system. Inference rules can help simplify queries. Let's start with entities.

Entities

User properties

I think it is better to decompose each of the attributes so that they are entities in their own right. Thus they will be able to take an active part in the graph relations. In Grakn, attributes of an entity are declared as follows:

value sub attribute datatype string;

attribute is a reserved keyword, value is the name of the attribute, and we declare a datatype, here string, that will be used to validate the incoming data. In the graph, my attributes will be entities; therefore, I declare an abstract entity named property (which is not a reserved keyword), and all my attributes (email, firstname, etc.) will inherit from this abstract entity. Grakn will prevent me from creating a property directly; I will only be able to create its children. This is a great feature of Grakn, as it really helps with the creation of schemas and their maintenance. Let's see what the property entity looks like:

property sub entity is-abstract
has value
plays owned
plays demand
plays authorizer
plays exported
plays imported
plays revoker
plays withdrawer;

We can then declare our sub entities as one liners:

last-name sub property;
first-name sub property;
email sub property;

The advantage of an abstract property entity is very clear here: all the roles and the attribute (singular intended) are declared once and inherited by all the sub entities. When another relation is required, add its role to the parent entity, and all sub entities will reflect the change.

The other advantage is that it simplifies querying for properties. I do not have to query each sub entity; instead, I can query the abstract entity, and get all the sub entities:

match
$a isa property;
$p isa person, has identifier 123456;
(exporter:$x, exported:$a, exported-to:$z);
(owned:$a, owner:$p);
get;

The above request will return all the properties that have been exported to a system, and by whom (i.e. which user), limiting this request to a specific owner of type person. But we will come back to those relations. What is important in this request is to understand that I can query the abstract property, to get all its children. The identifier’s value would of course be passed dynamically.
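For readers more at home with object-oriented code than with Graql, the abstract property idea can be approximated in Python. This is only an analogy and says nothing about how Grakn works internally. The class names mirror the schema above, while `properties_of` and the sample values are hypothetical and introduced purely for illustration.

```python
class Property:
    """Analogue of the abstract `property` entity: carries the shared `value`."""

    def __init__(self, value):
        # Mimic `is-abstract`: only sub types may be instantiated.
        if type(self) is Property:
            raise TypeError("property is abstract; create one of its sub types")
        self.value = value


class Email(Property):
    pass


class LastName(Property):
    pass


class Person:
    """Not a property, so a query on `property` must not return it."""


def properties_of(nodes):
    # Rough analogue of `$a isa property`: one test matches every sub type.
    return [node for node in nodes if isinstance(node, Property)]


nodes = [Email("member@test.com"), Person(), LastName("Test")]
matched = [type(node).__name__ for node in properties_of(nodes)]
```

Querying the parent type returns the `Email` and `LastName` nodes without naming either of them, which is the convenience the Graql query relies on.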

In the same manner, a sub entity of a sub entity can be declared in Grakn. It is, for example, very useful for addresses:

address sub property
has value;
city sub address;
zip sub address;
street1 sub address;
street2 sub address;
street3 sub address;

Here the address is not abstract, as I want to be able to give a value to that address. Typical values could be “home”, “professional”, “billing”, “delivery”. As before, I want to be able to query all the children of an address at once and get their value (singular intended). The trade-off here is that I did not declare has name for the address, as this attribute would have been inherited by all its children. It does not make sense for a zip to have both a name and a value, and I think that it is easy enough to remember that an address has a value (i.e. the name of the address).

Person entity

Now that we have decomposed all the properties of a person into individual nodes (or entities), we need a node in the graph to attach all these properties, that can be numerous, to one user, or, as I named it in my schema, one person. As this person has been deprived of all its properties, the person node is very simple:

person sub entity
has timestamp
has type
has identifier
plays identified
plays imported
plays importer
plays exported
plays exporter
plays owner;

It has three attributes. What the timestamp does is obvious. The type is the type of person that makes sense for the European Respiratory Society (ERS); the value could be any kind of person linked to our organisation: members, non-members, or staff. This raises an interesting point. GDPR is for the users of your product, i.e. your clients, but it is also for your staff. They also have the right to be forgotten and to retrieve their data from your systems. The identifier is a unique identifier that makes sense for your data. At ERS this is an integer, but it could be a string or anything else.

Having an identifier makes querying the person node easier. Without it, you always have to go through a property and its relation with the person node you are trying to reach:

match 
$p isa person;
$e isa email, has value "member@test.com";
($e, $p) isa belongs;
get $p;

Instead of:

match $p isa person has identifier 1; get;

Both queries work and return exactly the same result, but one is obviously shorter. I did not do any benchmarking, but the first query should logically be slower, as it does not directly access the information we are interested in. In the visualizer, there is no visible difference.

Before going further into relations, let’s finish with the entities. I want to mention an important “trick”. The knowledge base is intended to hold all our users, but also all of their interactions and all of our content (or at least references to our content). The idea is to add relations between a user and content, between content and content, between content and topics, between a user and topics, etc. Doing this will give the ERS good knowledge of its users, of its content, and of how all these things relate. Grakn therefore serves as the ideal base for our recommender system.

What happens when a user has to be deleted? Does the system become dumber? In fact, it will if I have a user connected to content items, or topics, etc. I will have to delete all those relations when a user asks to be deleted. If other users were suggested content based on that user’s behaviour, the recommender system would lose information.

The solution? Abstract the user away. How could this be done? I have added a new entity in the graph:

anonymous sub entity
has timestamp
plays incognito;

The idea is to connect all interactions with content and all of the user's behaviours (clicked, read, went to an event) to the anonymous node:

When a user requests that their data be deleted, we can delete everything that is on the right of the anonymous node. Anything that is connected to the anonymous node will stay in the graph. In my view this is a very good compromise. The anonymous node does not store any information about the person node. Thus, when the relation between the anonymous node and the person node is deleted, there is no way to know which anonymous node the person was connected to; at the same time, all the data around the user stays. The deletion of the relation between the anonymous node and the person is, to my knowledge, irreversible, unless the person who asks to be deleted is the first one: we could then query for all the anonymous nodes, filter out those that still have an identifies relation with a person, and be left with the one that was just disconnected. To avoid that, we could create a few fake anonymous nodes.
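To make the effect of such a deletion concrete, here is a minimal Python sketch that treats the graph as a set of edges. It is an illustration under stated assumptions, not code from the repository: the node ids, the `clicked` and `attended` relation names, and the `forget_person` helper are all hypothetical.

```python
# Each edge is (relation, source node, target node); every id is made up.
edges = {
    ("identifies", "anon-1", "person-1"),
    ("belongs", "person-1", "email-1"),
    ("clicked", "anon-1", "article-7"),
    ("attended", "anon-1", "event-3"),
}


def forget_person(edges, person_id, property_ids):
    """Return the edges that survive a deletion request.

    Every edge touching the person node or one of its property nodes is
    dropped, including the `identifies` link to the anonymous node.
    """
    personal = {person_id, *property_ids}
    return {
        edge for edge in edges
        if edge[1] not in personal and edge[2] not in personal
    }


remaining = forget_person(edges, "person-1", {"email-1"})
```

After the call, only the behaviour edges attached to `anon-1` are left, so the recommender keeps its signal while nothing links it back to the person.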

There is one caveat though. We saw in the previous post that few personal data points suffice to re-identify the person. If we record all the events the person went to, and then match that to the history of payment transactions, we should not be far away from identifying an individual. Legally, in Switzerland, we need to keep the accounting data for 10 years. This presents a kind of a paradox. Honestly I do not know how far one should go to prevent re-identification. This is a question for a lawyer and not for the basement guy in a hoody.

System entity

To make our schema complete, we need to add other entities, because we need to know which authorizations were given and which systems are used in the company. I intentionally use the word system, as it is vague enough. In the case of ERS, for example, systems would encompass our CRM, websites, apps, and Mailchimp, but also any kind of export one could do, such as Excel, CSV, or why not JSON, as well as external systems used when you exchange data with partners.

system sub entity
has value
has icon
plays importer
plays exporter
plays requester
plays authorized
plays exported-to
plays imported-to;

The icon is not necessary. We simply store a string that will allow us to display a nice icon for each system in the dashboard, in order to make it more appealing.

Authorization entity

Finally, we need one last entity: the authorization. I thought it was better to have the authorization as an entity, as we could have many authorizations per property. Mailchimp requires an authorization to use the email address to send the newsletter, but it could ask for additional authorizations to send advertisements, job listings, etc. The email address is also used in the CRM, and authorizations there could range from system mails to staff contacting the member.

Additionally authorizations need to be queried by the systems in order to display the list of data required from the user as well as displaying all the necessary authorizations.

Although the above graph does not display the system name, we can easily see from the authorization values that the system in question is sending emails. It requires three authorizations that are all related to emailing (the timestamp is just composed of fake numbers). The descriptions can be displayed to the users. In my understanding, this is required by GDPR, as it is not possible to ask for general or vague authorizations, such as “your data will be used to improve your experience”. Modeling your schema in such a manner is very granular. If you decide, or if your lawyers tell you, that you do not have to be so detailed, this schema works as well: just connect more user properties to the same authorization, and do the same for systems.

One of the biggest advantages of using an external system is that all authorizations are managed from a central point; thus, you can update all your systems at once as they all query the central “authorization repository”. The request to get all authorizations is very simple:

match 
$a isa authorization;
$s isa system;
(requisite:$a, requester:$s);
get;

And it can easily be used to create an API endpoint where the only parameter needed is some data that specifies the system. But this is another story, and will be dealt with in a separate blog post.
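As a rough idea of what such an endpoint would hand back, the sketch below filters an in-memory list in Python. It does not talk to Grakn and is not the API from the follow-up post. The `REQUIRES` rows, the `authorizations_for` function, and the returned field names are assumptions made for illustration.

```python
# (authorization name, description, requesting system); made-up sample rows
# standing in for the results of the Graql query above.
REQUIRES = [
    ("newsletter", "Send the newsletter", "mailchimp"),
    ("job-listing", "Send job listings", "mailchimp"),
    ("staff-contact", "Let staff contact the member", "crm"),
]


def authorizations_for(system, requires=REQUIRES):
    """Return what a system must ask the user, ready to be serialized as JSON."""
    return [
        {"name": name, "description": description}
        for name, description, requester in requires
        if requester == system
    ]


mailchimp_authorizations = authorizations_for("mailchimp")
```

The only input is the system name, matching the single parameter mentioned above; an unknown system simply yields an empty list.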

We now have all our entities, and you have already seen that they are of course all interconnected. We should now speak about the edges, or the relations among all these entities.

Relations

The relations are very simple. They just connect nodes together. A particularity of Grakn is that it is a hypergraph. Thus we can connect not only one node to another but many nodes together.

To start the discussion let’s query the graph and go from an anonymous user, find the email, the authorization it requires and finally a system in which it is used.

match 
$p isa person has identifier 1;
$i isa anonymous;
$e isa email;
$a isa authorization;
$s isa system;
($i, $p) isa identifies;
($p, $e) isa belongs;
($e, $a) isa needs;
($a, $s) isa requires;
get;

Which gives the following result:

Actually, in this example, I cheated and clicked on the email node to show a hyper-relation that is among three nodes. We can see that the email has been exported by a person which has the id V65592 and that it was exported to some system.

The idea is to help the company track what is happening with the data. Here the exporter is a person, but if you pay attention to the system entity definition, you will notice that a system can also play the role of exporter (or importer, for that matter).

When users request the deletion of their email, we know that each email is used in two systems. If it had been exported to Excel, we would know who has that Excel sheet, and we could require from that person a confirmation that the file has been deleted or that the person in question was removed from it.

We have a relation that identifies the anonymous node. Each attribute has a belongs relation connecting it to its owner. Each attribute needs an authorization. Finally, each system requires an authorization. Notice how every relation name is simply a third person singular verb and how the text can easily be translated into a schema.

Each relation needs at least two roles, and each node it connects plays one or the other role. Let’s check the definition of the authorization entity to see all the roles it can play.
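The follow-up on a deletion request can be sketched in a few lines of Python. This is a toy illustration, not code from the repository: the tuples mirror the three roles of the export relation (exporter, exported, exported-to), while the sample ids and the `holders_to_notify` helper are hypothetical.

```python
# (exporter, exported property, exported-to system); made-up sample data.
exports = [
    ("staff-42", "email-1", "excel"),
    ("crm", "email-1", "mailchimp"),
    ("staff-42", "email-2", "excel"),
]


def holders_to_notify(exports, property_id):
    """Map each destination to the exporters who must confirm the deletion."""
    holders = {}
    for exporter, exported, exported_to in exports:
        if exported == property_id:
            holders.setdefault(exported_to, []).append(exporter)
    return holders


to_notify = holders_to_notify(exports, "email-1")
```

For `email-1`, the result points at the staff member holding the Excel export and at the CRM that pushed the address to Mailchimp, which is exactly the list of parties to chase for a confirmation.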

authorization sub entity
has name
has description
has timestamp
has expiration-date
plays needed
plays requisite
plays revoked
plays withdrawn;

As a side note, notice that there is a timestamp attribute but also an expiration-date. Let's say you need an authorization to use the data for one year (the length of a membership): the expiration timestamp can already be calculated upon creation. A simple inference rule could compare that timestamp with the “now” timestamp and automatically revoke access to the data. As it is not yet possible to get the “now” timestamp from within the rules, a simple cron job could update the rule every day in order to revoke authorizations that are too old.

Keeping data for a limited amount of time is actually a requirement of GDPR (there is an exception that seems broad about archiving and statistical purposes, but that is beyond our discussion here). You should collect data only if and when you need it, and keep it only for the period of time you need it. Thus, it should be deleted when a user’s membership is deleted. Or you could send an email to users when their membership expires to ask whether they agree to let you keep the data, in order to simplify the renewal of their membership and to make sure they will be able to retrieve their history at a later stage. It should be precisely explained how long you will keep the data. That process could be repeated when that authorization is over.

After this digression, back to the schema definition above: the authorization plays two pairs of similar roles, needed and requisite, and revoked and withdrawn. The second role in each pair is used for inferences. As we will see below, we need those additional roles because both roles of a pair can be in use at the same time, from different entities. In any case, I thought it made the schema clearer. Inference is part of the magic of Grakn, and it is dynamic. We can change a relation and other relations will adapt automatically, as they are inferred from our data.
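The date arithmetic behind that daily job is small enough to sketch in Python. This is an assumption-laden illustration, not the author's cron script: `expiration_for`, `expired`, the one-year default, and the sample authorizations are all hypothetical, and a real job would then update the knowledge base.

```python
from datetime import datetime, timedelta, timezone


def expiration_for(created, days=365):
    """Compute the expiration date once, when the authorization is created."""
    return created + timedelta(days=days)


def expired(authorizations, now):
    """Return the names of authorizations whose expiration date has passed."""
    return [name for name, expires in authorizations.items() if expires <= now]


created = datetime(2018, 1, 18, tzinfo=timezone.utc)
authorizations = {
    "newsletter": expiration_for(created),            # one year of membership
    "job-listing": expiration_for(created, days=30),  # shorter-lived consent
}

# What a daily job running on 1 June 2018 would select for revocation.
to_revoke = expired(authorizations, now=datetime(2018, 6, 1, tzinfo=timezone.utc))
```

Passing `now` in explicitly keeps the check deterministic and mirrors the constraint described above: the current time has to come from outside the rule.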

Inferences

For now, in this ontology, I use inference as a way to simplify queries. Indeed, the query we studied above can be simplified. When you want to check whether an email can be used, you do not need to know the details of the authorization; you just want to know whether a relation exists between an attribute and a system, and whether that relation is of type authorizes rather than withdraws. If the relation is of type withdraws, the user can be notified, when trying to access the system, that the authorization has been withdrawn, and can be asked to authorize that data again in order to access the system. Both relations are inferred relations. Let's first see how the previous query could be simplified:

match 
$p isa person has identifier 1;
$i isa anonymous;
$e isa email;
$s isa system;
($i, $p) isa identifies;
($p, $e) isa belongs;
($e, $s) isa authorizes;
get;

If you did not erase the previous result in the visualizer and you have ticked the box activate inference in the settings of the visualizer, you should see the inferred relation pop up.

If the underlying data changes (for example, the needs relation is replaced by revoke), then the authorizes inferred relation will be replaced by the withdraws inferred relation. You have certainly noticed that revoke does not use the third person singular. In the way I have defined the schema, revoke is a child of an abstract relationship, action.

In fact, I have defined some relationships as actions that can be taken on the data: import, export and revoke are actions. This is just my way of seeing it, and it could easily be modified to fit the third-person way of naming relationships. The only advantage I find is that I can query action, as we did with the property entity, to easily find all the movements of data.
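The logic of the two inferred relations can be mimicked in plain Python. This sketch only illustrates the idea as described in this post; the actual Graql rules in the repository may be written differently, and the `infer` function and sample ids are hypothetical.

```python
def infer(needs, revoke, requires):
    """Derive (property, system) pairs for the two inferred relations.

    needs, revoke: sets of (property, authorization) pairs.
    requires: set of (authorization, system) pairs.
    """
    authorizes = {
        (prop, system)
        for prop, needed in needs
        for required, system in requires
        if needed == required
    }
    withdraws = {
        (prop, system)
        for prop, revoked in revoke
        for required, system in requires
        if revoked == required
    }
    return authorizes, withdraws


requires = {("newsletter", "mailchimp")}

# While the email needs the authorization, the system is authorized.
before = infer({("email-1", "newsletter")}, set(), requires)

# Once needs is replaced by revoke, the inferred relation flips.
after = infer(set(), {("email-1", "newsletter")}, requires)
```

Nothing is stored for authorizes or withdraws: both are recomputed from the needs, revoke and requires data, which is why they follow a change in the underlying relation automatically.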

When we put everything together, here is the full schema. The visualizer displays very well the parent and children entities as well as relationships:

Now that we have a schema ready, we will load some data and create an API that will be the foundation of the dashboard, but also will be used by all systems to create, read, modify and delete data in the knowledge base.

The whole code is available in the following GitHub repository: https://github.com/idealley/grakn-gdpr. You can clone it:

>>> git clone https://github.com/idealley/grakn-gdpr <a folder>

I assume that Grakn is installed and running. You can then use the script load.sh to load the ontology, the rules, and some fake data (one user and one staff member). Run the following command:

>>> ./load.sh grakn grakn

The first grakn is the relative path from $HOME to where Grakn is installed. The second one is the keyspace you want the data loaded in. It takes a few seconds; then you will have the same schema as above.

The data is here. In the next post, we will see how to interact with the data and start connecting the dots with an API.
