What is what? clarifying terminologies for data privacy

Aiko Yamashita
aikochama
7 min read · Jan 5, 2018

In a previous blog entry, I wrote a summary of a talk on GDPR at JavaZone, which mentioned a couple of things about anonymisation and how relevant it is for organisations seeking to be compliant with GDPR. But the more I speak with people about GDPR issues, the more I realise there is quite some confusion when it comes to terminology. What is masking? What is anonymising? Pseudonymising? Hashing? Encrypting? What is the difference, and how and when should we use each one?

Nice depiction of “Jungle of Terminologies” by [1]

So, I decided it could be a good idea to write a concise list of terms and definitions for techniques relevant to the management of private data, so we can avoid the same (cute) expression of confusion as in the photo above.

Anonymisation vs. Pseudonymisation

  • To anonymise is to permanently destroy identifiable data.
  • To pseudonymise is to substitute identifiable data with a reversible, consistent value.
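The contrast can be sketched in a few lines of Python (a toy illustration; the surrogate format and the mapping scheme are made up for the example):

```python
import secrets

# Pseudonymisation: replace the identifier with a consistent surrogate,
# keeping a mapping so the original can be recovered later.
pseudonym_map = {}

def pseudonymise(name: str) -> str:
    if name not in pseudonym_map:
        pseudonym_map[name] = "user-" + secrets.token_hex(4)
    return pseudonym_map[name]

def de_pseudonymise(token: str) -> str:
    # Reversible: look the original value back up via the mapping.
    reverse = {v: k for k, v in pseudonym_map.items()}
    return reverse[token]

# Anonymisation: the identifier is destroyed; no mapping is kept,
# so there is nothing to reverse.
def anonymise(name: str) -> str:
    return "ANONYMOUS"

token = pseudonymise("Alice")
assert de_pseudonymise(token) == "Alice"   # reversible
assert pseudonymise("Alice") == token      # consistent
assert anonymise("Alice") == "ANONYMOUS"   # irreversible: "Alice" is gone
```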

So, next time you think of anonymisation, think of this box:

What is Tokenization?

In the financial domain, people speak more often about tokenisation than about pseudonymisation. Tokenisation is:

“The process of consistently replacing sensitive elements, such as credit card numbers, with non-sensitive surrogate values or tokens.”

It is mostly done over the PAN (Primary Account Number).

Remember also how devalued the tokens were? You needed ridiculous amounts of tokens to get a toy or a stuffed animal…

Remember when you used to play in one of those arcade game centres, where you would get tokens instead of money to play? Well, it is pretty much the same.

What I like most about the term tokenisation is that it is pretty much an “umbrella” term covering both the ano- and pseudo- terms, by providing a two-level classification:

1. Reversible or De-tokenizable (Pseudonymisation)

  • Cryptographic: in cipher systems, values are transformed through a set of unchanging rules or steps called a cryptographic algorithm, together with a set of variable cryptographic keys.
  • Non-cryptographic: code systems that rely on codebooks to transform plain text into code text.

2. Irreversible (Anonymisation)

  • Authenticatable: an authenticatable irreversible token is created mathematically through a one-way function; it can be used to verify that a given PAN was used, but cannot be reversed to recover the PAN.
  • Non-authenticatable: irreversible tokens that are not authenticatable represent little to no risk of disclosing the PAN. They can never be linked to a specific PAN, though they may still be linked to a customer or account within the merchant.
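This distinction can be sketched in Python: an HMAC over the PAN yields an authenticatable irreversible token, while a purely random surrogate yields a non-authenticatable one (the key and PAN below are illustrative, not real):

```python
import hashlib
import hmac
import secrets

# Hypothetical secret held by the tokenisation service (illustrative only).
SECRET_KEY = b"example-key-not-for-production"

def authenticatable_token(pan: str) -> str:
    # One-way: the HMAC-SHA256 digest cannot be reversed to recover the PAN...
    return hmac.new(SECRET_KEY, pan.encode(), hashlib.sha256).hexdigest()

def verify(pan: str, token: str) -> bool:
    # ...but given a candidate PAN, we can verify that it produced this token.
    return hmac.compare_digest(authenticatable_token(pan), token)

def non_authenticatable_token(pan: str) -> str:
    # Purely random surrogate: cannot be linked back to any PAN at all.
    return secrets.token_hex(8)

t = authenticatable_token("4111111111111111")
assert verify("4111111111111111", t)        # authenticatable
assert not verify("5500000000000004", t)    # wrong PAN fails verification
```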

Technical approaches to Tokenisation

There are mainly three ways to implement tokenisation:

Vault-based tokenisation: uses a large database table to create lookup pairs that associate each token with the encrypted sensitive information.

Vault-less tokenisation: a token is generated from the original data and a secret key or parameter, so the original data can be recalculated from the token and the secret key. Vault-less tokenisation does not require a database of key-value pairs, reducing the time required to complete a transaction that requires PAN recovery.

Stateless tokenisation: a middle ground between vault-based and vault-less. It does not require generating a database with lookup pairs, but tokens are drawn from a pre-defined, “static database” of pre-generated random values.
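A vault-based tokeniser can be sketched as a lookup table (a toy illustration; a real vault would encrypt the stored values and live in a hardened database):

```python
import secrets

class TokenVault:
    """Toy vault-based tokeniser: a lookup table pairs each token with the
    original value. In practice, the stored value would be encrypted."""

    def __init__(self):
        self._vault = {}      # token -> original value
        self._by_value = {}   # original value -> token (for consistency)

    def tokenise(self, value: str) -> str:
        # Consistent replacement: the same value always gets the same token.
        if value in self._by_value:
            return self._by_value[value]
        token = secrets.token_hex(8)
        self._vault[token] = value
        self._by_value[value] = token
        return token

    def detokenise(self, token: str) -> str:
        # Reversal is just a lookup in the vault.
        return self._vault[token]

vault = TokenVault()
tok = vault.tokenise("4111111111111111")
assert vault.detokenise(tok) == "4111111111111111"
assert vault.tokenise("4111111111111111") == tok  # consistent replacement
```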

For further discussion on trade-offs between vault-based and vault-less, check out this article.

Hashing vs. Encrypting

This one is often a cause of confusion and discussion, and with good reason, since both concepts are quite intertwined…

Hashing is the transformation of a string of characters into a (usually shorter) fixed-length value or key that represents the original string.

Hashing is used for different things, like:

  • To index and retrieve items in a database.
  • As part of encryption algorithms (e.g., in digital signatures).
  • In different tokenisation strategies.

The hashing algorithm is called the hash function, and when it produces the same hash value from two different inputs, we say that there is a collision.
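A quick illustration with Python's hashlib: SHA-256 maps inputs of any length to a fixed-length digest, deterministically:

```python
import hashlib

# Inputs of very different lengths produce digests of the same length.
short = hashlib.sha256(b"hi").hexdigest()
long_ = hashlib.sha256(b"a much, much longer input string").hexdigest()

assert len(short) == len(long_) == 64  # SHA-256 always yields 64 hex chars
assert hashlib.sha256(b"hi").hexdigest() == short  # deterministic
assert short != long_  # different inputs should give different digests
```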

Ok, but what is the difference between encryption and hashing?

Hashing is a one-way operation, so it should not be possible to “reverse engineer” the original value by analysing the hashed values. Conversely, encryption is a two-way operation, comprising encryption and decryption.

Visual explanation of the difference between hashing and encrypting by [2]

I think one of the main sources of confusion about hashing and encryption is the HMAC (hash-based message authentication code), which is a key component in many cryptographic implementations, used for verifying message integrity and authenticity. But keep in mind that a cryptographic hash function is a special class of hash function with certain properties that make it suitable for use in cryptography; it is not an encryption method per se. Some Python resources for hashing are the hmac and hashlib modules.
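A minimal sketch of HMAC-based message authentication with Python's hmac module (the shared key and message below are made up for the example):

```python
import hashlib
import hmac

# Hypothetical shared secret between sender and receiver (illustrative).
key = b"shared-secret"
message = b"transfer 100 EUR to account 42"

# The sender attaches an HMAC tag; the receiver recomputes it to check
# that the message was not tampered with and came from a key holder.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

def is_authentic(key: bytes, message: bytes, tag: str) -> bool:
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # compare_digest avoids timing side channels in the comparison.
    return hmac.compare_digest(expected, tag)

assert is_authentic(key, message, tag)
assert not is_authentic(key, b"transfer 9999 EUR to account 42", tag)
```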

Ok, let's dive into more detail on encryption… As mentioned before, encryption is used for reversible tokenisation, and for securely sending messages over the Internet or a network. Data that has not been encrypted is known as plain text, while encrypted data is known as cipher text.

What types of encryptions exist?

  • Symmetric encryption — Uses the same secret key to encrypt and decrypt the message.
  • Asymmetric encryption — Deploys two keys: a public key known by everyone and a private key known only by the receiver. Asymmetric encryption is slower than symmetric encryption and uses more processing power when encrypting.
  • Hybrid encryption — Blends symmetric and asymmetric encryption, taking advantage of the strengths of each and minimising their weaknesses.
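The symmetric "same key both ways" property can be illustrated with a deliberately toy XOR cipher (this is NOT secure and only stands in for a real algorithm such as AES):

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR each byte with a repeating key; applying the same operation
    # with the same key again recovers the original bytes.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

key = b"secret"
plain = b"attack at dawn"

cipher = xor_cipher(plain, key)           # "encryption"
assert cipher != plain
assert xor_cipher(cipher, key) == plain   # same key "decrypts" (symmetric)
```

In asymmetric schemes, by contrast, the key used to encrypt (public) differs from the key used to decrypt (private), which is why key distribution becomes much easier at the cost of speed.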

What algorithms exist for encryption?

There is a plethora of algorithms, but here are some of the most famous ones:

  • AES (Advanced Encryption Standard) — Symmetric
  • PGP (Pretty Good Privacy) — Hybrid (asymmetric key exchange combined with symmetric encryption)
  • Twofish — Symmetric, good performance in SW/HW
  • RSA + AES — A hybrid solution

What is Dynamic Masking?

There is a lot of discussion about dynamic masking, so I thought it would be good to touch upon the topic. Basically, a solution that implements dynamic masking changes the data stream so that the data requester does not get access to the sensitive data. To do so, policies can be established to return an entire field tokenised, or to dynamically mask parts of a field in real time, depending on who the data requester is.

The important thing to keep in mind is that no physical changes to the original production data take place, and normally there is no need for complex solutions (hashing, etc.) because dummy values are used.

There is full masking and partial masking, as well as random masking for numeric data, and most solutions can mask sensitive data without modifying existing queries. Further reading on this topic can be found here and here.
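Partial masking can be sketched in a few lines (the role names and the policy below are hypothetical):

```python
# Toy dynamic-masking policy: the stored value is never modified; what the
# requester sees depends on their role. Roles and rules are made up.
def mask_pan(pan: str, role: str) -> str:
    if role == "fraud-analyst":
        return pan                                # full access
    if role == "support-agent":
        return "*" * (len(pan) - 4) + pan[-4:]    # partial: last 4 visible
    return "*" * len(pan)                         # full mask for the rest

pan = "4111111111111111"
assert mask_pan(pan, "fraud-analyst") == pan
assert mask_pan(pan, "support-agent") == "************1111"
assert mask_pan(pan, "marketing") == "****************"
```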

Questions you need to answer before going for one option or another

Directions, directions, directions…(image by [3])

When to use anonymisation? When to use masking? When to use what? Well… that depends on what your data privacy needs are. Here are a couple of questions you should ask yourself (or the organisation you are working with) to help you decide which techniques and options to go for:

Q1: Do you need to release data in the cloud?

Uhh, that's heavy duty (at least from a GDPR compliance perspective); in that case, you may even consider using RAPPOR, for example.

Q2: Do you need to modify the original source or just control the level and access of data?

If you are allowed to retain the data, dynamic masking and/or dynamic authentication are easily implemented in Netezza or in standard Microsoft solutions.

Q3: Should the tokenisation be reversible or not?

If you need to recover the original data, you need a pseudonymisation solution.

Q4: How “safe” or “strong” should the reversible tokenisation be? (for example, for GDPR compliance)

Stronger requirements call for more sophisticated solutions, such as encryption, while weaker requirements can be met with codebooks or lookup tables.

Q5: I’m anonymising data, but do I need to verify that a given (original) value was used for a token?

If yes, you definitely need an authenticatable anonymisation technique (e.g., hashing). If you don't need to verify anything, a non-authenticatable technique (e.g., dummy masking) will do just fine.

Q6: Do I have any performance/scalability constraints that need to be observed?

If so, you may opt for vault-less or stateless tokenisation instead of vault-based. Also, some encryption schemes are more costly than others (symmetric, asymmetric, or hybrid?), so that is another aspect to take into account.

Q7: Do you need to preserve the data type after encryption?

If so, you may consider for example Format-Preserving Encryption.

In case you get lost amongst all the questions, here is a decision diagram, just to get you started…


Understand your privacy requirements…

The bottom line here is that only after understanding the differences between these concepts and techniques, and after identifying the data transformation needs, can we make an informed decision on how to treat sensitive data. As a rule of thumb, always make sure to follow these three steps:

  1. Identify your sensitive data (know where, what, how, etc..)
  2. Identify/map out your data transformation needs (i.e., privacy requirements).
  3. Choose your algorithms/solutions in close cooperation with business stakeholders and data governance experts.

I hope you found this article helpful in clarifying and consolidating some of the concepts that are essential for managing data privacy. If so, please clap and follow our data science blog posts!

Aiko Yamashita
Researcher, engineer, data scientist, educator, explorer.