A primer on GDPR — Summary from a talk at JavaZone (plus some notes…)

Aiko Yamashita (aikochama) · 10 min read · Oct 20, 2017

With a looming deadline of May 2018, companies have less than 7 months to adhere to the new EU regulation before they could face fines of up to 4% of global revenue or €20 million (whichever is higher) for noncompliance. No wonder the GDPR (General Data Protection Regulation) is on everyone’s lips in the tech industry (basically, in any company that holds personal data).

The impact of this regulation is HUGE, and the vast majority of companies are only now (very slowly, in my view) starting to realise this. There is a perfect meme for that, and I’m wondering why it’s not going viral…

And to make matters worse, it is the perfect nightmare for tech people, because it involves terms like regulations, rules, law… (and let’s be honest here: when was the last time you even checked the license agreement of a library you use in your project?). But the truth is that:

The GDPR is tremendously intertwined with technology, organisational aspects and documentation. And at the bottom of all of that, there is data.

So, what is GDPR about? I found a nice talk (in Norwegian) given by Simen Sommerfeldt at this year’s JavaZone conference in Oslo. This talk offered a nice overview, so here’s a summary, spiced up with some personal side-notes.

Disclaimer: this article is based on my interpretation of the talk, and by no means can it be considered a 100% faithful translation. Ok, let’s get started…

Terms that are important:

Simen started by pointing at some terms that we cannot ignore, and that organisations should know and understand, such as: Protected Health Information (PHI), Personally Identifiable Information (PII), Payment Card Industry (PCI), Data Processing Agreement (databehandleravtale in Norwegian), legal basis (hjemmel) and e-Privacy (all of these are good to keep on your list of “stuff I need to know about”).

User Agreement should be human-readable:

One of the most salient aspects of the GDPR is that the agreements put forward to users should be understandable. So, basically, confusing the user with lengthy, jargon-loaded mumbo-jumbo is a no-go when it comes to GDPR.

Rule specifics:

Simen goes on to discuss the core parts of the regulation that touch technology. He points at several articles, namely 7, 15, 16, 17, 19, 20, 25 and 32–35 (no worries, we’ll go through them very briefly).

Article 25: Data privacy by design and by default

This article indicates that privacy should be considered in every single aspect of the design and development of a product or service, and that whenever the user is presented with an option, the default should always be the most private one. Some useful checkpoints to comply with this are:

  • Are you using only the minimum data needed?
  • Can you unlink data from individuals?
  • Are you using anonymisation and pseudonymisation?

Simen suggests carrying out a DPIA (Data Protection Impact Assessment) whenever a privacy breach in a specific context would involve high risks and consequences.
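
As a small illustration of the “private by default” part of this article, here is a minimal sketch of a settings object where every option starts in its most protective state, so the user has to opt in explicitly. The field names are my own invention:

```python
from dataclasses import dataclass

@dataclass
class ProfileSettings:
    # Privacy by default: every flag starts in its most private state;
    # sharing only happens if the user actively opts in.
    profile_public: bool = False
    share_with_partners: bool = False
    analytics_opt_in: bool = False
```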

Article 7: Conditions for consent

The GDPR stipulates that the company needs to document or prove that consent has been given by the user, and if the consent is withdrawn, that withdrawal should have an immediate effect.
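
To make “documented consent, immediate withdrawal” concrete, here is a minimal sketch of how a consent record could be modelled. The class and field names are my own illustration, not something prescribed by the regulation or the talk:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class ConsentRecord:
    """One auditable consent event for one user and one purpose."""
    user_id: str
    purpose: str                       # e.g. "newsletter", "analytics"
    granted_at: datetime
    withdrawn_at: Optional[datetime] = None

    def withdraw(self) -> None:
        # Withdrawal takes effect immediately: no grace period.
        self.withdrawn_at = datetime.now(timezone.utc)

    @property
    def is_active(self) -> bool:
        return self.withdrawn_at is None
```

Any processing step would then check `is_active` at the time of processing, so a withdrawal stops further use right away.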

Article 15: Right of access by the data subject

Another important aspect is that companies should provide users with access and an overview of their personal data. This can be done via something like an Information Portal, which could also be used for consent management purposes.

Note by Aiko: You can also check an interesting article in The New York Times written by this year’s Nobel laureate Richard Thaler on exactly this topic.

Article 16: Right to rectification

The Information Portal can further be used so that users can amend their personal data. This implies a series of challenges, such as validation of the new data and potential “ripple effects” in the systems relying on it.
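
A minimal sketch of what such a rectification could look like: validate first, then notify dependent systems. The `event_bus` client and event name are hypothetical placeholders:

```python
def rectify_email(user: dict, new_email: str, event_bus) -> None:
    """Apply a user-requested correction, then propagate it."""
    if "@" not in new_email:  # stand-in for real validation
        raise ValueError("invalid email address")
    user["email"] = new_email
    # event_bus is a hypothetical publish/subscribe client; systems
    # holding copies of the data subscribe and update themselves.
    event_bus.publish("user.email.rectified", {"user_id": user["id"]})
```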

Article 17: Right to be forgotten

Should we use anonymisation or pseudonymisation? It depends on the case! Simen notes that, for example, for tax audit purposes it is required to hold on to certain data for 2.5 years before it can be deleted.
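
So an erasure request often becomes “delete what you may, pseudonymise what you must keep”. A minimal sketch under illustrative assumptions (the record layout, the retention rule and the separately secured vault are all my own):

```python
import secrets
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=913)  # roughly the 2.5 years from the talk

def handle_erasure_request(record: dict, pseudonym_vault: dict) -> dict:
    """Delete outright if allowed, otherwise pseudonymise."""
    # created_at is assumed to be a timezone-aware datetime.
    age = datetime.now(timezone.utc) - record["created_at"]
    if age >= RETENTION:
        return {}  # retention period over: the record can be deleted
    # Still under retention: strip direct identifiers and keep a token
    # whose mapping back to the person lives in a separately secured
    # vault (pseudonymisation: recoverable, unlike anonymisation).
    token = secrets.token_hex(8)
    pseudonym_vault[token] = record.pop("user_id")
    record["name"] = None
    record["email"] = None
    record["subject_ref"] = token
    return record
```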

Article 19: Notification obligation

Any change requested by a registered individual needs to be followed by a notification; that would be: a change of personal data, a request for removal, or a withdrawal of consent.

Article 20: Right to data portability

This means that it should be possible to consolidate and download all the data concerning an individual.
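
A minimal sketch of such a consolidated export, which could also back the Article 15 information portal mentioned above. The in-memory “stores” and field names are purely illustrative:

```python
import json

def export_user_data(user_id: str, stores: dict) -> str:
    """Consolidate everything held on one person across all systems
    into a portable, machine-readable format (here, JSON)."""
    bundle = {
        name: [row for row in rows if row.get("user_id") == user_id]
        for name, rows in stores.items()
    }
    return json.dumps(bundle, indent=2, default=str)

# Usage with illustrative in-memory stores:
# export_user_data("u42", {"orders": [...], "profile": [...]})
```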

Note by Aiko: In some cases, it can be quite scary to see the data they have on you… as shown in this article written by The Guardian on Tinder data.

Articles 32–35: Security of processing

Techniques and an organisational strategy need to be in place for Data Access Management. This includes secure programming, and the mechanisms need to be in accordance with the disclosure risk. Simen also suggests NOT putting all the eggs in the same basket, in case a system is compromised. Finally, mechanisms for predicting risk or flagging unauthorised accesses are not a bad idea!

Anonymisation and pseudonymisation:

There seems to be some confusion involving these terms. According to Simen, there are three categories of data:

  • Personal data (the data is connected to an individual)
  • Pseudonymised data (the connection between the data and the individual is “hidden” and is recoverable)
  • Anonymised data (the connection between the data and the individual is destroyed and is unrecoverable).

GDPR stipulates that the level of consent, the level of notification, the length of data retention, etc. depend on how “easy” it is to connect the data to a single individual (in simple terms, it depends on which of the above three categories the data falls into). Simen refers to an article by Mike Hintze, “Viewing the GDPR through a De-Identification Lens: A Tool for Compliance, Clarification, and Consistency”, for assessing the different cases.

Table by Mike Hintze in “Viewing the GDPR through a De-Identification Lens: A Tool for Compliance, Clarification, and Consistency”

Note by Aiko: I think Anonymous/Aggregated data seems to lie in a “gray” zone, given that there are different ways of anonymising that can still allow re-identification; see for example this article in KDNuggets.

Techniques for anonymisation and pseudonymisation:

Here’s a super-summarized version of what Simen said (a few of these techniques are sketched in code right after the list):

  • Tokenisation: replace data with a “token”. Often a reversible operation
  • Hashing: a mapping algorithm that takes a string of characters to a fixed-length value or key
  • Noise addition: “tweak” values, e.g., go from 72 kg to 74.5 kg
  • Substitution: replace the value with completely different values (can be combined with noise addition)
  • Aggregation: should normally satisfy a certain k-anonymity level
  • K-anonymity means that a given “row” is indistinguishable from at least k-1 other rows in the dataset; it requires removing data if k-anonymity cannot be guaranteed
  • Generalisation: for example, instead of having Age: 23, use Age: 20–30.
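
To make a few of these concrete, here is a small sketch of keyed hashing, noise addition, generalisation and a k-anonymity check. All names, values and parameters are illustrative, not from the talk:

```python
import hashlib
import os
import random
from collections import Counter

record = {"name": "Kari Nordmann", "weight_kg": 72.0, "age": 23}

# Hashing (pseudonymisation): a keyed hash, so the mapping cannot be
# rebuilt by simply hashing a phone book; the key is stored separately.
key = os.urandom(16)
record["name"] = hashlib.sha256(key + record["name"].encode()).hexdigest()

# Noise addition: tweak numeric values, e.g. 72 kg becomes ~74.5 kg.
record["weight_kg"] = round(record["weight_kg"] + random.gauss(0, 2.5), 1)

# Generalisation: replace the exact age 23 with the bucket "20-30".
low = record["age"] // 10 * 10
record["age"] = f"{low}-{low + 10}"

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k
    times, i.e. each row is indistinguishable from >= k-1 others."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return all(c >= k for c in counts.values())
```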

L-diversity:

From k-anonymity, Simen moves on to l-diversity by introducing the notion of Inference Attacks (or interference attacks, although in academic papers it is mostly referred to as inference, so watch out ;) and suggests the usage of l-diversity. This technique protects anonymity by requiring that, within every group of records sharing the same quasi-identifiers, the sensitive attribute takes at least L different values.
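
A minimal l-diversity check, along the same lines as the k-anonymity sketch above (the names are illustrative):

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if, within every group of rows sharing the same
    quasi-identifier values, the sensitive attribute takes at least
    l distinct values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())
```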

Differential privacy:

This technique basically consists of adding noise (with a known statistical distribution) to the data. An analyst who knows that distribution can correct for the noise in aggregate analyses, without being able to recover any single individual’s true values.
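
The textbook instance of this idea is randomized response, which also underlies approaches like RAPPOR mentioned in the note below. A minimal sketch, with an illustrative truth probability:

```python
import random

def randomized_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """With probability p_truth report the truth, otherwise report a
    fair coin flip. No single response betrays the individual."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(responses, p_truth=0.75):
    # In aggregate, observed = p_truth * true + (1 - p_truth) * 0.5,
    # so the analyst can solve for the true rate without ever seeing
    # an individual's real answer.
    observed = sum(responses) / len(responses)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```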

Note by Aiko: Apple boasts that they use differential privacy, although they don’t disclose much info on their particular implementation. Google uses an approach called RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) for collecting data from Chrome, which is said to guarantee differential privacy.

And exactly when I was thinking of “de-identification”, Simen mentioned an article by WSGR on the de-identification risks for each of the different techniques… here’s the table:

Table by WSGR

Ok, so what should we do?

Simen stressed that it is important to perform a DPIA for all sensitive data whenever there is a high risk of personal consequences. And this needs to be properly documented.

There should be a personal data flow model, consistent with the business architecture, that defines how the data is going to be handled in each of the systems, and the reasons behind each of the data treatments. An internal control system should be in place that can model and control access: type, object, actor and even the reason (the grounds on which the data is being accessed).
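
As a sketch of what recording the type, object, actor and reason of each access could look like (the logger name and fields are my own):

```python
import logging

# A dedicated audit logger; in practice this would ship to tamper-proof
# storage rather than the console.
logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data-access")

def log_access(actor: str, obj: str, access_type: str, reason: str) -> None:
    """Record who accessed what, how, and on what grounds."""
    audit_log.info("actor=%s object=%s type=%s reason=%s",
                   actor, obj, access_type, reason)

log_access("caseworker-17", "customer-42", "read", "open support case")
```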

What to do at organisational level:

GDPR traverses many boundaries within an organisation, so a coordinated effort is needed. Some useful questions/checkpoints are:

  • Is there a domain regulation applicable? (finance, healthcare, etc)
  • Are any other laws involved? (in Norway, Datatilsynet has adapted the national privacy regulation to be compliant with GDPR)
  • IT operations, lawyers, security and developers should be involved
  • Pedagogical initiatives to train the staff should be undertaken

Privacy, in the same way as security, should be “embedded”, i.e. treated as a cross-cutting concern. At the operational level, GDPR depends on the security policies (e.g., ISO 27000), and both privacy and security measures need to be described in the internal control system. The individual projects within the organisation depend on this overall framework, since each project should in principle have a DPIA that takes both privacy and security into account.

What to do at project level:

The main thing to do at project level is to make sure you can answer yes to this question: “Are the users’ rights upheld in our system?”, and to perform a DPIA that includes:

  • Mapping of values (personal data)
  • Threat analysis
  • Are the values vulnerable to threats?
  • What are the consequences? what is the tolerance level?
  • What are the measures to reduce the risk?

Building staff competency:

  • Team members should get a basic introduction to privacy (in particular on security and privacy elements relevant to the development process)
  • The team needs to think of system-level design for privacy
  • Programmers need to code with security in mind (e.g., OWASP)
  • Security architecture, mechanisms and infrastructure need to be in place
  • UX designers need to know what is allowed or not according to GDPR
  • The team needs to define/coordinate how integration should be handled

Test data will be regulated:

Yes, this is bad news… Simen says synthetic data is preferable, but corner cases are often not captured by it. It is, however, possible to scrape real data for testing as long as you have consent; for that you need to explain the reason (and provide info on the data flow and disposal procedures). And btw, a disclosure agreement is the bare minimum to have in place (very important when working with e.g. outsourcing).
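
For the synthetic route, a library like Faker can generate realistic-looking but fake personal data. A minimal sketch (the locale and fields are chosen for illustration):

```python
from faker import Faker  # pip install Faker

fake = Faker("no_NO")  # Norwegian-looking test data
Faker.seed(42)         # reproducible test fixtures

def synthetic_user() -> dict:
    return {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
    }

users = [synthetic_user() for _ in range(100)]
```

Keep Simen’s caveat in mind, though: generated data like this will rarely exercise the corner cases that real production data would.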

What to do at the technology/infrastructure level:

It is important to assess the minimum amount of data/information needed for things to work, since data should be handled on a “need to know” basis. Also:

  • Take control over the whole stack (at each level, there are risks for leaking of personal data)
  • Consider infrastructure and system-support for managing security-incidents
  • Anchor the development on security policies
  • Security should be tested in many dimensions like penetration tests, etc.
  • Separate the implementation of GDPR policies from the systems (for example, a GDPR API?)
  • Potential tools Simen mentioned are: ARX — Data Anonymization Tool, ardoq, and sesam.
  • A security surveillance system needs to be in place for data access.

What to document for compliance?

In Norway, Datatilsynet has guidelines for documenting compliance, but in essence, this is what you should document:

  • Why the system needs the data it manages
  • How long the system needs to retain the data
  • What (design) measures are in place to protect the data
  • Trade-offs and decisions that need to be made to be compliant
  • What processes/routines you follow to protect users’ interests
  • What processes/routines should be followed in case of incidents

GDPR and Machine Learning/AI

Machine Learning and other data analytics techniques can involve biases, thus GDPR stipulates that the user should have the right to see the grounds for a decision and the algorithms used to arrive at it. An important aspect here is that an individual can reject the results of an automated decision-support system and request a human in the loop if the consequences are serious (e.g., when processing a case at NAV, you can reject the result of an automated consultation decision and ask for the case to be handled by an advisor).

Finally, a question from the audience was about the effect of GDPR on Google and Facebook. Simen says that Facebook and Google are the ones that will survive, due to having a direct contract with the users and controlling the whole information ecosystem.

Note by Aiko: I tend to slightly disagree with Simen on this. No doubt that Google and Facebook will do ok, but GDPR will definitely affect their business models, since the rule applies to personal data of any European resident. There is a really interesting article by PageFair on how the different rules will affect the different services offered by these two tech giants.

TL;DR

So, winter is coming… we’d better start moving towards aligning tech, organisation and documentation (and lawyers) to prepare for this.

GDPR basically stipulates that:

  • Users should be asked for consent explicitly, in a human-readable way
  • Users should be able to withdraw their consent at any moment and that should have an immediate effect
  • Users should be able to view/amend/download their own data
  • Security and Privacy should be upheld to protect sensitive data

What to do?

  • Perform a DPIA, whenever there are high risks for personal consequences
  • A personal data flow model should be described (and implemented): what happens with the data, where, by whom and why?
  • A coordinated effort across the organization is needed, including appropriate training
  • Infrastructure and system-support for managing security-incidents as well as access control should be in place.

That’s all for now! Until the next blogpost.. :)
