Protecting Sensitive Data in Analytics: A Data Engineering Perspective

Our team shares the most effective ways to keep data safe, including key techniques such as suppression, tokenisation and cryptographic hashing and encryption.

Data-driven solutions help organisations make better decisions, improve efficiency, create better experiences for customers and ultimately bring in more revenue. But the growth of big data is outpacing the protection of such information. With the ever-increasing amount of data being collected, stored and processed, it is essential for data engineers to understand how best to handle personal information for analytics.

Data engineers frequently spend their days striking a balance between two responsibilities: harnessing large amounts of sensitive or personal data to innovate and drive change, while also adhering to strict standards that govern how that data should be handled and used.

The first responsibility is not possible without the second. As a result, many data privacy-enhancing techniques such as anonymisation, pseudonymisation, synthetic data generation, differential privacy and hybrid strategies to de-identify personal data have grown in popularity.

In this article, we will discuss some of the important data privacy strategies that minimise the risk of disclosure. We will primarily focus on simple techniques like suppression, format-preserving tokenisation, cryptographic hashing, binning and perturbation.

Differences between pseudonymisation and anonymisation

Data identifiability can be compared to a spectrum of visibility. At one end, data is completely visible, meaning it can be used to identify an individual. At the other, data is completely invisible, anonymising the details. Anonymisation and pseudonymisation are two processes that can shift the visibility of data along this spectrum, allowing organisations to protect the privacy of individuals while still allowing the data to be used for analytics and other purposes.

Keeping this spectrum in mind, we can now define pseudonymisation as the process of de-identifying a dataset so that no data subject can be individually identified without the use of additional information, while the data can still be used in analytics projects. The main advantage is the balance it strikes between protecting an individual’s privacy and keeping the data useful. However, it is important to keep in mind that pseudonymised data can still be re-identified if an attacker gains access to the linking key.

Pseudonymising a large dataset is an especially complex process: the data attributes need to be clearly labelled as personally identifiable information in the data dictionary and signed off by the owner even before data sharing. It is also practically difficult to validate consent stating the exact purpose for which the data will be used. Therefore, we minimise the data used for analytics by limiting it to the critical elements needed to prove the model hypothesis. We then anonymise the data to the level that subjects cannot be re-identified and linked to an individual by the user without the use of additional data that is stored under the custody of the owner.

Pro tip: Adding data privacy metrics like sensitivity and usefulness of each data attribute to a data dictionary is recommended for enterprise-level sustained governance. This allows you to track if the data is being utilised for the relevant and agreed-upon reasons with the right level of sensitivity.
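As a minimal illustration of that idea, a data dictionary extended with these metrics might look like the sketch below; the field names and values are assumptions for this example:

```python
# Illustrative data-dictionary entries; the fields and values are assumptions
data_dictionary = {
    "email": {
        "classification": "direct identifier",
        "sensitivity": "high",   # drives which de-identification technique applies
        "usefulness": "low",     # rarely needed as a model feature
        "approved_uses": ["cross-sell actioning"],
    },
    "date_of_birth": {
        "classification": "indirect identifier",
        "sensitivity": "medium",
        "usefulness": "high",    # e.g. binned into age bands for churn features
        "approved_uses": ["churn prediction"],
    },
}
```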

Personal data attributes

Finding personal identifiers in a given dataset before using it for analytics is an important step in protecting individuals’ privacy. Personal identifiers fall into two main categories:

Direct identifiers: This is personal data that can be used to identify an individual without any additional information. Examples include: name, address, email, phone number, passport number and driver’s license number.

Indirect identifiers: This is personal data that can be combined with other information to identify an individual. Examples include: date of birth, gender, race, and occupation.

To find personal identifiers in a dataset, you can use a combination of automated tools and manual review:

Automated tools: There are several options on the market to auto-identify personally identifiable information (PII) columns in large datasets. However, it’s vital to add additional profiling checks as wrappers around these AI-assisted tools to prevent any accidental omissions or misidentification of PII columns. For example, in Google Cloud, the Data Loss Prevention (DLP) service can be used to scan data for specific patterns, auto-classify personal data and handle it accordingly, as sketched below.

Manual review: Even with automated tools, it is important to manually review the dataset to identify any personal identifiers that may have been missed. This can be done by reviewing the data elements one by one and checking if they match any of the examples.
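As a rough sketch of the automated route, the google-cloud-dlp client library can scan free text for PII patterns. The project ID below is a placeholder and the request fields may vary slightly between library versions:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/your-project-id/locations/global"  # placeholder project ID

response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "min_likelihood": "POSSIBLE",
        },
        "item": {"value": "Contact Jane on 555-0100 or jane.doe@example.com"},
    }
)

# Each finding reports the detected info type and how confident the match is
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```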

After recognising the personal identifiers, it is important to use techniques such as anonymisation, pseudonymisation, synthetic data generation, differential privacy and hybrid strategies to de-identify the data before using it for analytics. These techniques can be used to remove, mask or obscure sensitive information, while still preserving the data’s analytic utility.

Five privacy-enhancing techniques

1. Data suppression:

Data suppression is a technique used to de-identify personal data by removing or masking certain information. It is often used to remove direct identifiers such as names, addresses and phone numbers, as well as indirect identifiers such as date of birth and gender. You can also suppress specific rows for which the customer’s consent isn’t available.

Key considerations: Implementation is simple, the output is fully anonymised and re-identification is not possible; however, no analytics is possible on the suppressed personal or sensitive attributes.

Here’s a simple sketch of how data suppression could be applied in Python to de-identify personal data; the dataset and column names below are illustrative:
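```python
import pandas as pd

# Illustrative dataset; the column names are assumptions for this sketch
df = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Cara Lee"],
    "email": ["alice@example.com", "bob@example.com", "cara@example.com"],
    "age": [34, 51, 29],
    "consent": [True, False, True],
    "monthly_spend": [120.50, 89.90, 230.00],
})

# Row suppression: keep only records where the customer has given consent
df = df[df["consent"]]

# Column suppression: remove direct identifiers entirely
df = df.drop(columns=["name", "email"])

print(df)
```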

2. Data tokenisation:

Data tokenisation is the process of replacing a single piece of sensitive data with a non-sensitive random string of characters, often called a token. Tokens serve as a reference to the original data but cannot be used to guess those values. That’s because, unlike encryption, tokenisation does not use a mathematical process to transform the sensitive information into the token. There is no key or algorithm that can be used to derive the original data from a token. Instead, tokenisation uses a database or secured file storage, called a token vault, which stores the relationship between the sensitive value and the token. The real data in the vault is then secured, often via encryption.

For both table-based and file-based tokenisation, you can apply format preserving tokenisation, a technique that keeps the format and length of the original data while replacing it with a unique token.

The token value can be used to support business operations after a machine learning (ML) model is operationalised. If the real data needs to be retrieved, for example to identify the actual email address when actioning cross-sell predictions, the token is submitted to the vault and the index is used to fetch the customer’s email address for use in the authorisation process. To the end user, this operation is performed seamlessly by the browser or application, nearly instantaneously; they’re likely not even aware that the data is stored in the cloud in a different format.

Key considerations: There is no mathematical relationship between the format-preserving token and the real data, so there is no risk in adopting it for analytics as long as the vault is highly secured.
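As a toy illustration of the vault idea (not a production design), the sketch below maps each sensitive value to a random token of the same length; a real vault would live in secured, encrypted storage with strict access controls:

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

class TokenVault:
    """Toy vault mapping sensitive values to random same-length tokens."""

    def __init__(self):
        self._to_token = {}
        self._to_value = {}

    def tokenise(self, value: str) -> str:
        if value in self._to_token:  # reuse the token for a known value
            return self._to_token[value]
        # Loosely format preserving: same length, random characters
        token = "".join(secrets.choice(ALPHABET) for _ in range(len(value)))
        self._to_token[value] = token
        self._to_value[token] = value
        return token

    def detokenise(self, token: str) -> str:
        return self._to_value[token]

vault = TokenVault()
token = vault.tokenise("jane.doe@example.com")
print(token)                    # random string, same length as the email
print(vault.detokenise(token))  # 'jane.doe@example.com'
```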

3. Secure-keyed cryptographic hashing and encryption:

Cryptographic hashing and encryption transformations are de-identification methods that replace the original sensitive data values with encrypted or hashed values. Some of the key techniques here include:

  • Secure-keyed cryptographic hashing is a method that involves creating a cryptographic hash of an input string using a secret key, similar to HMAC, and is generally considered a more secure approach than using only a hash function. Unique identifiers with primary-key behaviour are de-identified this way for very large datasets (see the sketch after this list).
  • Format preserving encryption (FPE) is an encryption algorithm which preserves the format of the information while it is being encrypted. It involves replacing an input value, such as a credit card number (CCN), with an encrypted value that has been generated using format-preserving encryption; Python libraries implementing FPE are noted below.
  • A deterministic encryption scheme is a cryptosystem which always produces the same ciphertext for a given plaintext and key, even over separate executions of the encryption algorithm. It replaces an input value with a token that has been generated using AES in Synthetic Initialization Vector mode (AES-SIV).
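As a minimal sketch of the first and third techniques, the snippet below uses Python’s standard hmac module for secure-keyed hashing and the AESSIV class from the cryptography package for deterministic encryption; the key material is a placeholder and would come from a secure key store in practice:

```python
import hashlib
import hmac

from cryptography.hazmat.primitives.ciphers.aead import AESSIV

customer_id = "CUST-000123"

# Secure-keyed hashing (HMAC-SHA256): irreversible, but deterministic for a
# given key, so hashed identifiers can still be joined across tables
hash_key = b"placeholder-key-from-a-secure-key-store"
hashed_id = hmac.new(hash_key, customer_id.encode(), hashlib.sha256).hexdigest()
print(hashed_id)

# Deterministic encryption (AES-SIV): the same plaintext and key always yield
# the same ciphertext, and key holders can recover the original value
siv_key = AESSIV.generate_key(512)
cipher = AESSIV(siv_key)
ciphertext = cipher.encrypt(customer_id.encode(), None)
print(cipher.decrypt(ciphertext, None).decode())
```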

Key considerations: These techniques rely heavily on the usage of secure keys in order for them to be effective, therefore they are mostly deployed in secure and restricted analytics ecosystems with capabilities for storing keys safely.

Useful techniques in Python: pyfpe, cryptography and pycryptodome are Python libraries that can be used to implement the above techniques.

4. Data generalisation:

Data generalisation involves sorting sensitive columns into bins or groups for analysis, removing specifics and creating a more generalised view. This technique, in combination with other pseudonymisation methods, is particularly effective for large datasets.

Binning:

Numerical binning: In this method, the numerical data is first sorted and the sorted values are then distributed into a number of buckets or bins; this is also known as bucketing or discretisation. For example, instead of showing a person’s exact age, the data could be grouped into age ranges (for example, 18–30 and 31–65) when you need to develop features for a churn prediction model.

Useful techniques in Python: Pandas, OptBinning. For example, you can use the Pandas functions cut for fixed-width binning and qcut for quantile-based binning.
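Here’s a minimal sketch using Pandas; the ages and bin edges are illustrative:

```python
import pandas as pd

ages = pd.Series([19, 23, 35, 41, 58, 64], name="age")

# Fixed-width binning: generalise exact ages into explicit ranges
age_band = pd.cut(ages, bins=[18, 30, 65], labels=["18-30", "31-65"])

# Quantile-based binning: each bin holds roughly the same number of records
age_quartile = pd.qcut(ages, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"age": ages, "age_band": age_band, "age_quartile": age_quartile}))
```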

Categorical recoding:

By grouping similar categorical data points together, categorical recoding or binning reduces the granularity of the data. This can be done by creating broader categories or replacing specific categories with more general ones.

For example, let’s say you have a customer demographics dataset that contains information about people’s occupation. The original data might have categories such as ‘teacher’, ‘nurse’, ‘engineer’ and ‘architect’. To protect the privacy of individuals, you could use categorical recoding to group similar professions together. For example, you could group ‘teacher’ and ‘nurse’ into a broader category called ‘education and healthcare’, and group ‘engineer’ and ‘architect’ into a broader category called ‘construction and design’.

Another example could be, in the case of location data, instead of showing the exact address, we could group the data into larger areas like city, state or region.

Key considerations: Use categorical binning when you need to perform feature engineering to generalise customer information as per model training requirements.

Useful techniques in Python: In Pandas, you can leverage the replace() and map() functions to replace specific categories with more general ones.
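A minimal sketch of the occupation example above, using map() (replace() with the same mapping would achieve the same result):

```python
import pandas as pd

df = pd.DataFrame({"occupation": ["teacher", "nurse", "engineer", "architect"]})

# Recode specific occupations into broader, less identifying categories
recoding = {
    "teacher": "education and healthcare",
    "nurse": "education and healthcare",
    "engineer": "construction and design",
    "architect": "construction and design",
}
df["occupation_group"] = df["occupation"].map(recoding)
print(df)
```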

5. Data perturbation:

Data perturbation is a technique used to protect the privacy of individuals in a dataset by adding random noise to the data. This makes it difficult for an attacker to infer sensitive information about individual records in the dataset, while still allowing for meaningful analysis of the overall trends and patterns in the data.

Here’s a minimal sketch of data perturbation in Python; the data values and noise scale are illustrative:
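```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative salary column; the noise scale is an assumption and should be
# tuned to the attribute's sensitivity and the analysis you need to preserve
salaries = np.array([52_000.0, 61_500.0, 75_000.0, 48_250.0])
noise = rng.laplace(loc=0.0, scale=1_000.0, size=salaries.shape)
perturbed = salaries + noise

print(np.round(perturbed, 2))                                 # individual values shift
print(round(salaries.mean(), 2), round(perturbed.mean(), 2))  # means stay close
```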

The techniques needed to pseudonymise a dataset are highly dependent on the individual use case, the type of PII in the dataset and the environment in which the data is stored.

Apart from the techniques discussed above, there are also other emerging and advanced data privacy-enhancing techniques, such as AI-enabled synthetic data generation, federated learning, secure multi-party computation, homomorphic encryption and the use of generative models to anonymise PII in unstructured data like images and PDFs. These are still under extensive research and worth keeping an eye on.

This then gives rise to an important question that all data practitioners may need to consider: how do we check that the data is pseudonymised after these techniques are applied? At each stage of the data pipeline, it is crucial to set up additional tests to reduce the risk of re-identification. Statistical privacy models developed by researchers, such as k-anonymity, l-diversity and t-closeness, can also help with this by providing thresholds and values that estimate re-identifiability and, in turn, indicate to what degree the dataset has been pseudonymised.
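For instance, a quick k-anonymity check can be sketched in Pandas by grouping on the quasi-identifiers and taking the smallest group size; the columns below are illustrative:

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return k: the size of the smallest group sharing the same quasi-identifiers."""
    return int(df.groupby(quasi_identifiers).size().min())

# Illustrative pseudonymised output
df = pd.DataFrame({
    "age_band": ["18-30", "18-30", "31-65", "31-65", "31-65"],
    "region": ["North", "North", "South", "South", "South"],
    "monthly_spend": [10, 20, 30, 40, 50],
})

# k = 2: every (age_band, region) combination covers at least two records
print(k_anonymity(df, ["age_band", "region"]))
```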

Privacy engineering is a difficult problem in analytics, but it will only continue to grow in significance for analytics projects as an increasing amount of data is captured. As data practitioners, we will play a crucial role in building the infrastructure that ensures this ever-growing mountain of data can be utilised effectively, while also being stored and deployed responsibly.

We hope this article has been useful for understanding how best to protect data privacy in analytics whilst ensuring model effectiveness remains unaffected by these measures. Feel free to share your thoughts below on which approaches work best in different scenarios.


Authors: Kavya Nagarajan, Saravanakumar Subramaniam and Roshini Ashokkumar, Data Engineers, QuantumBlack, AI by McKinsey
