Unmasking the Power of Custom Masking: A Step-by-Step Guide

Pratibha Chowdary
intelia
Published in
5 min readFeb 15, 2024

Welcome back, data enthusiasts! Today, we’re diving into the fascinating world of custom masking rules using Google BigQuery’s user-defined functions (UDFs). As mentioned in my previous blog post (link), data masking rules are a crucial part of column-level access controls, which are configured using policy tags (learn more).

Before we get too serious,

Why did the data masking rule break up with the database?

Because it couldn’t handle the unmasked truth!

With that out of my system, let us get to some practical use cases where dynamic data masking would be a very valuable technique for protecting sensitive data. A few use cases I can think of are Healthcare Systems, Human Resources Systems, Financial Applications, Compliance and Regulatory Requirements, Testing and Development Environments and many more. So pretty much anywhere where sensitive data needs to be protected, isn’t it?

In this instalment, we’ll delve into a real-world use case where a custom masking routine is the hero we need to safeguard sensitive information.

— -

Consider this — you have a dataset with numerical account information, and you need to ensure that only a portion of the account number is visible. So, the use case at hand is to mask characters between the first six and last four digits only while both preserving data privacy and maintaining the usability of the dataset.

Now, armed with this mission we can consider the plethora of data masking options available in Google BigQuery, which includes Hash (SHA-256), Email mask, Last four characters, First four characters, Date year mask, Default masking value, and Nullify. However, we can see a custom masking routine is the best fit for this particular scenario as none of the other options give us the desired outcome.

Yes, Cloud Data Loss Prevention (Cloud DLP) is undoubtedly a viable option for protecting sensitive data. However, its utility extends beyond just safeguarding data at rest; it also serves as an effective solution for securing data in transit and text fields where sensitive information might be present without a predefined format. In scenarios where the sensitive data follows a structured format at rest, Cloud DLP could be considered excessive and costly. For those interested in understanding the cost implications of using Cloud DLP, Google provides detailed pricing information. You can find more about Cloud DLP pricing [here] and estimate costs for your specific project by visiting [this link]. Additionally, for insights into pricing considerations regarding Policy Tags, refer to Google’s documentation [here].

Coming back to our scenario, I’ve created some sample dataset with dummy values as shown in the screenshot below. We will only focus on custom masking the “Account Number” column for the purpose of this blog post.

— -

Step 1: Handcraft the Perfect UDF

Our first step begins with the creation of a custom UDF, the most important part of our masking routine. Open your BigQuery console and execute the following SQL query:

```sql

CREATE OR REPLACE FUNCTION dataset.custom_mask(NUMBER_ACCOUNT STRING) RETURNS STRING

OPTIONS (data_governance_type=”DATA_MASKING”) AS (

SAFE.REGEXP_REPLACE(

NUMBER_ACCOUNT,

r’^(.{6}).*(.{4})$’,

r’\1' || REPEAT(‘*’, LENGTH(NUMBER_ACCOUNT) — 8) || r’\2'

));

```

This UDF employs regular expressions to dynamically mask sensitive information. It smartly conceals all characters between the first six and last four characters with asterisks, securing data privacy without offering up utility. This pattern of custom masking can be extended to other applications by modifying the regex to suit specific requirements.

— -

Step 2: Creating a Policy Tag

Now that our UDF is aching to start, let’s prepare the ground for implementing it. Head to the BigQuery Console once again and create a policy tag devoted to our custom masking rule. This tag acts as a marker, helping us easily link the rule to the relevant columns.

— -

Step 3: Attaching the Rule to the Policy Tag

With our policy tag in place, it’s time to give it a purpose. Attach the custom masking rule to the policy tag by joining them together. This connection ensures that our UDF knows where to work its spell.

— -

Step 4: Dynamic Masking in Action

Now the drama hits its high point as we tag the specific column that needs dynamic masking. Once we link a column to our policy tag, ta-da! your sensitive data is guarded against prying eyes while still maintaining the reliability of the dataset.

— -

Bravo! You’ve just mastered the art of custom masking in Google BigQuery. By handcrafting a customised UDF, creating a policy tag, and linking the two, you’ve effortlessly blended dynamic masking into your data workflow. Now, sensitive information can coexist with analytics, ensuring privacy is maintained without compromising data-driven insights.

So, the next time you find yourself piloting the complexities of data governance, remember the power of custom masking. Your data deserves nothing but the best protection, and with versatile tools available in Google BigQuery, you are well-equipped to make it happen.

Happy masking!

Disclaimer: This article is based on my research and my understanding stemming from that. I’m happy to make corrections and learn along the way in case of any misunderstandings.

--

--