Understanding Data Masking Techniques in Talend

Published in

Data Engineering Indonesia

8 min read · Aug 6, 2023

Introduction

Data has transformed from a mere description of information and transactions into a valuable asset essential in various aspects of life. As its value increases, data has become susceptible to theft, misuse, and exploitation, causing many to experience the repercussions of data breaches. To strike a balance between safeguarding data contents and ensuring seamless flow across systems, the solution lies in data masking. Data masking is a term for various techniques and strategies used to protect confidential or sensitive information while preserving data utility.

Now, we will explore different methods and strategies used in Data Masking and understand how they play a pivotal role in securing valuable data assets.

Data Masking Types

Static Data Masking

Static Data Masking is a technique to protect sensitive information by permanently changing it in a dataset. It creates a safe copy of the data that can be shared or used for testing without revealing the original sensitive details. Unlike dynamic masking, the changes are fixed and do not change over time. It’s a reliable way to keep data private and secure.

Dynamic Data Masking

Dynamic Data Masking is a data protection method that masks sensitive information in real time, at the moment it is accessed or viewed. The stored data is left unchanged: users with sufficient privileges see the original values, while other users see sensitive parts hidden according to their access privileges. Dynamic Data Masking is particularly useful where real-time data access is required and data privacy must be maintained without creating separate masked copies of the data.
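To make the idea concrete, here is a minimal Python sketch of masking at read time. The field names, the `read_record` function, and the boolean privilege flag are all assumptions made for illustration; real systems usually enforce this inside the database or an access layer.

```python
# Hypothetical field names, chosen for illustration only.
SENSITIVE_FIELDS = {"email", "mobile"}


def read_record(record: dict, can_view_sensitive: bool) -> dict:
    """Return a view of the record; the stored record itself is never modified."""
    if can_view_sensitive:
        return dict(record)
    # Users without the privilege get sensitive fields replaced by asterisks.
    return {
        key: "*" * len(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


stored = {"first_name": "Jane", "email": "jane@example.com", "mobile": "5550100"}
analyst_view = read_record(stored, can_view_sensitive=False)
admin_view = read_record(stored, can_view_sensitive=True)
```

The stored record stays intact, which is the key difference from static masking: no separate masked copy is created.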

Data Masking Techniques

Substitution

Substitution involves replacing real data with different (yet authentic-looking) data of the same type. This technique may utilize two separate databases (source data and a lookup/reference database). It can be helpful for values that typically require validation, such as credit card numbers, where completely randomized values may not pass specific validation tests.
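As a rough illustration, the sketch below substitutes a card number with a random one of the same length that still passes the Luhn checksum, the check commonly applied to card numbers. The function names are my own and the approach is a simplified assumption, not how any particular tool implements substitution.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a digit string that lacks one."""
    total = 0
    # Walk right to left; the rightmost digit of the partial number is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def is_luhn_valid(number: str) -> bool:
    """Check that the last digit matches the Luhn check digit of the rest."""
    return luhn_check_digit(number[:-1]) == number[-1]


def substitute_card(number: str, rng: random.Random) -> str:
    """Replace a card number with a random one of equal length that passes Luhn."""
    body = "".join(rng.choice("0123456789") for _ in range(len(number) - 1))
    return body + luhn_check_digit(body)


masked_card = substitute_card("4111111111111111", random.Random())
```

A fully random 16-digit string would fail the checksum about nine times out of ten, which is why substitution is preferred for fields like this.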

Encryption

Another popular method uses encryption algorithms to replace personal data with seemingly random strings known as ciphertext. In most cases, decryption is only possible for those who hold the cryptographic keys used to encrypt the data. This technique is useful for securely storing and transferring data but may be less suitable for testing environments that require more realistic data.

Shuffling

A simple method to obfuscate data involves taking characters within a string and randomly rearranging them (e.g., “Smith” becomes “Tmhsi”). While not the most secure or sophisticated data masking method, it can be effective for certain data types, particularly in conjunction with more advanced data masking techniques for other fields.
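A minimal sketch of character shuffling in Python, assuming a helper named `shuffle_chars` (my own name). Note that the result keeps exactly the same characters, which is part of why this method is weak on its own.

```python
import random


def shuffle_chars(value: str, rng: random.Random) -> str:
    """Randomly rearrange the characters of a string."""
    chars = list(value)
    rng.shuffle(chars)  # in-place shuffle of the character list
    return "".join(chars)


shuffled = shuffle_chars("Smith", random.Random())
```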

Redaction

The simplest technique among all data anonymization methods involves completely overwriting the source data with generic characters. Under this method, “Smith” would become “XXXXX”. This method is quick, easy, and very secure but has limited applications in testing and is not reversible.
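In code, redaction can be as small as the sketch below. The `redact` helper and its `fill` parameter are illustrative assumptions.

```python
def redact(value: str, fill: str = "X") -> str:
    """Overwrite every character with a generic fill character."""
    return fill * len(value)


redacted = redact("Smith")
```

Because only the length survives, there is no way to recover the original value from the output.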

Data Masking Behaviors

Random Data Masking

Random Data Masking replaces each value with randomly generated output. Consequently, identical values in the input data can be masked to different values, and different input values can happen to receive the same masked value. This approach is suitable for preserving data format when the relationship between original and masked values does not need to be kept.

Consistent Data Masking

When the same value appears more than once in the input data, this approach always produces the same masked value for it. However, two different input values can still be replaced with the same masked value. It serves as a preliminary step before using Bijective Data Masking.
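The difference between random and consistent behavior can be sketched in Python as follows. The replacement pool, the key, and both function names are assumptions for illustration; the consistent variant here uses a keyed hash (HMAC) to pick a replacement, which is one possible way to get this behavior.

```python
import hashlib
import hmac
import random

# Hypothetical pool of replacement values, for illustration only.
POOL = ["Alex", "Blake", "Casey", "Drew"]


def random_mask(value: str, rng: random.Random) -> str:
    """Random masking: the output does not depend on the input value."""
    return rng.choice(POOL)


def consistent_mask(value: str, key: bytes) -> str:
    """Consistent masking: the same input always maps to the same pool entry."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(POOL)
    return POOL[index]


names = ["Smith", "Jones", "Smith", "Brown", "Lee", "Khan"]
masked = [consistent_mask(name, b"demo-key") for name in names]
```

Both occurrences of "Smith" get the same replacement, but with five distinct names and only four pool entries, at least two different names must share a masked value. That collision is exactly what bijective masking rules out.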

Bijective Data Masking

Bijective Data Masking is characterized by the following key points:

Consistent Masking Function: The masking process remains stable and predictable, ensuring that the same input value consistently produces the same masked output. This feature is essential for maintaining data consistency and traceability.
Unique Masked Values: Each distinct input value always results in a different masked value. This one-to-one mapping between input and masked values ensures data integrity and prevents information loss during the masking process.

Bijective Data Masking is particularly useful when preserving a one-to-one relationship between original and masked values is crucial. It becomes essential when combining or searching multiple datasets that have undergone the masking process.
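One simple way to picture a bijective mask is a shuffled lookup table over a finite domain. The sketch below assumes four-digit numeric IDs and hypothetical `customers` and `orders` datasets; it is an illustration of the property, not Talend's implementation.

```python
import random


def build_bijective_table(domain_size: int, seed: int) -> list[int]:
    """Build a permutation of 0..domain_size-1; a permutation is one-to-one."""
    table = list(range(domain_size))
    random.Random(seed).shuffle(table)
    return table


table = build_bijective_table(10_000, seed=42)

# Hypothetical datasets that share a customer ID.
customers = [{"id": 17, "name": "A"}, {"id": 4242, "name": "B"}]
orders = [{"customer_id": 4242, "total": 10}, {"customer_id": 17, "total": 5}]

masked_customers = [{**c, "id": table[c["id"]]} for c in customers]
masked_orders = [{**o, "customer_id": table[o["customer_id"]]} for o in orders]

# The join still works on masked IDs because distinct inputs never collide.
masked_ids = {c["id"] for c in masked_customers}
joinable = all(o["customer_id"] in masked_ids for o in masked_orders)
```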

Repeatable Data Masking

Repeatable Data Masking uses a defined seed so that every execution of a Job generates the same masked values for the same input. When combined with Bijective Data Masking, it becomes powerful because it guarantees that the masked data remains exactly the same, no matter where it is used.
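The role of the seed can be shown with a small Python sketch. The `mask_digits` function is a hypothetical helper; in this simplified version the output also depends on the order of the input rows, which a real tool may handle differently.

```python
import random


def mask_digits(values: list[str], seed: int) -> list[str]:
    """Replace every character with a pseudo-random digit drawn from a seeded generator."""
    rng = random.Random(seed)  # same seed, same sequence of random digits
    return ["".join(rng.choice("0123456789") for _ in value) for value in values]


phones = ["5550100", "5550101", "5550102"]
first_run = mask_digits(phones, seed=7)
second_run = mask_digits(phones, seed=7)
```

Running the function twice with the same seed and the same input yields identical output, which is what makes results reproducible across executions.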

Format-Preserving Encryption (FPE)

Format-Preserving Encryption (FPE) is a cryptographic technique that allows encrypting data while preserving its original format. The main advantage of FPE is that the encrypted data can still be decrypted, known as “Unmasking,” using the correct password or secret key. Additionally, FPE ensures that the encryption process maintains a one-to-one mapping between original and encrypted values (bijectivity) and produces the same encrypted value for the same input (repeatability). To achieve this, FPE requires a specific secret key or password to generate unique masked values.
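To illustrate these properties, here is a toy format-preserving cipher for digit strings, built as a Feistel network with HMAC-SHA256 as the round function. This is a teaching sketch under my own assumptions: it is not the standardized FF1 algorithm, the function names are hypothetical, and it must not be used to protect real data.

```python
import hashlib
import hmac

ROUNDS = 10  # an even number of rounds keeps the two halves in their original sizes


def _prf(key: bytes, round_no: int, length: int, half: str) -> int:
    """Keyed round function: HMAC over the round number, total length, and one half."""
    msg = f"{round_no}|{length}|{half}".encode("ascii")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big")


def _check(digits: str) -> None:
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two ASCII digits")


def toy_fpe_encrypt(key: bytes, digits: str) -> str:
    """Encrypt a digit string into another digit string of the same length."""
    _check(digits)
    n = len(digits)
    u, v = n // 2, n - n // 2
    a, b = digits[:u], digits[u:]
    for i in range(ROUNDS):
        m = u if i % 2 == 0 else v
        c = (int(a) + _prf(key, i, n, b)) % 10**m
        a, b = b, str(c).zfill(m)
    return a + b


def toy_fpe_decrypt(key: bytes, digits: str) -> str:
    """Invert toy_fpe_encrypt by running the rounds backwards."""
    _check(digits)
    n = len(digits)
    u, v = n // 2, n - n // 2
    a, b = digits[:u], digits[u:]
    for i in reversed(range(ROUNDS)):
        m = u if i % 2 == 0 else v
        c = (int(b) - _prf(key, i, n, a)) % 10**m
        a, b = str(c).zfill(m), a
    return a + b


secret = b"demo-password"  # hypothetical key for the example
masked_phone = toy_fpe_encrypt(secret, "5550100")
unmasked_phone = toy_fpe_decrypt(secret, masked_phone)
```

The output has the same length and character set as the input, the same key and input always give the same output, and decrypting with the same key restores the original value.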

Understanding Talend

Talend is a powerful data integration tool that simplifies complex data tasks. Talend Data Integration provides an intuitive platform for data extraction, transformation, and loading (ETL). It offers a wide range of features and capabilities to cater to various data management needs, including data masking, which ensures the protection of sensitive information while maintaining data utility.

[Image: Talend Data Fabric, a comprehensive integration solution]

Talend for Data Masking

Talend offers various data masking techniques like substitution, encryption, and shuffling to cater to different privacy requirements. By incorporating data masking into data integration workflows, Talend ensures the security and privacy of data across various processes and systems.

Data Masking Project Demonstration

Project Description

In this demo project, I will showcase a simple implementation of Data Masking and Data Unmasking techniques in Talend with MongoDB as the data source and destination. The job illustrations for this project are shown in the images below.

Project Steps

Create a Database & Collection in MongoDB

To start, ensure MongoDB is installed, open MongoDB Compass, and create a new database and collection. Import data using “Add Data” and “Import JSON or CSV File.” Create another collection for processed data. This setup will allow us to explore Data Masking and Data Unmasking techniques in Talend using MongoDB as the data source and destination.

Create a New Job in Talend

If Talend is already installed on your computer, open the application and create a new project. In the Repository section, click on “Job Designs,” then right-click and select “Create Standard Job” under the “Standard” category. This will allow you to start building your data integration job using Talend.

Create a MongoDB Connection

In the Repository section, right-click on “NoSQL Connections” and select “Create Connections.” Write the connection name and choose “MongoDB” in the DB Type column. Then, select the MongoDB version previously installed in the DB Version column. Enter the name of the database you are targeting in the Database column. Click the “Check” button, and if the connection is successful, click “Finish” to save the MongoDB connection details.

Create MongoDB Components

Drag the connection components that have been created into the design workspace, then select tMongoDBInput for input and tMongoDBOutput for output.

Edit Schema in MongoDB Components

Double-click the MongoDB_Connection component and select “Edit Schema” in the Component tab. Enter the schema according to the MongoDB Collection for both input and output stages. Fill in the Collection name specified in MongoDB for seamless data processing in your Talend job.

Create a tDataMasking Component

On the Palette tab, go to the Data Privacy section and drag the tDataMasking component to the workspace. Connect the MongoDB input component (tMongoDBInput) to the tDataMasking component. Double-click on tDataMasking, and click “Sync Columns” to verify that the schema matches the one in MongoDB. In the Modifications table, configure the Data Masking process as follows:

Perform email masking by replacing the first part (local part) with the character “x”.
Apply address masking.
Generate a mobile number using a US phone number format.
Apply Character Handling using FF1 with AES on the first_name column.
Apply Character Handling using FF1 with AES on the last_name column.
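For readers who want to see what the email rule in the list above does to a value, here is a rough Python equivalent. It assumes each character of the local part is replaced by “x” and the length is preserved; the helper name is my own, and Talend's actual output may differ in detail.

```python
def mask_email_local_part(email: str, fill: str = "x") -> str:
    """Replace every character before the last '@' with the fill character."""
    local, sep, domain = email.rpartition("@")
    if not sep:
        # No '@' found: treat the whole value as sensitive and mask all of it.
        return fill * len(email)
    return fill * len(local) + sep + domain


masked_email = mask_email_local_part("jane.doe@example.com")
```

The domain is kept, so the masked value still looks like an email address and can be grouped by provider.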

Create a tDataUnmasking Component

Go to the Palette tab and find the Data Privacy section. Drag the tDataUnmasking component into the workspace. Connect the tReplicate component to tDataUnmasking. Double-click on tDataUnmasking and click “Sync Columns” to verify if the schema matches the one in MongoDB. Since the Data Unmasking component applies only to the Data Masking process with the FF1 method, configure the Data Unmasking process in the Modifications section as follows:

Apply Character Handling using FF1 with AES encryption on the first_name column.
Apply Character Handling using FF1 with AES encryption on the last_name column.

By doing this, you will successfully implement the specified Data Unmasking techniques for the data processed by the Data Masking component using the FF1 method in your Talend job.

Run The Job & Compare the Output

The image below displays the original data alongside the results of the Data Masking and Data Unmasking stages.

[Image: Comparing Data Masking & Data Unmasking output]

Conclusion

This blog introduces the concept of data masking and its importance in ensuring data privacy and security. It highlights Talend as a solution for implementing data masking techniques. The blog discusses various data masking approaches and how they can be utilized in Talend to protect sensitive information while maintaining data usability. With practical examples and step-by-step instructions, the blog demonstrates how to set up MongoDB connections and apply data masking and unmasking techniques using Talend.