Demystifying Data Masking: A Comprehensive Guide on BigQuery and Snowflake

Published in Odicis-Data-Engineering · 6 min read · Oct 30, 2023

Introduction:

In an era dominated by data-driven decision-making, protecting sensitive information is a crucial responsibility for organizations. With data breaches on the rise, stringent data protection mechanisms are imperative. Data masking is a vital strategy for safeguarding sensitive data, especially in non-production environments, while keeping the data usable for testing, development, analytics, and reporting.

Typically applied to Personally Identifiable Information (PII), data masking ensures that confidential data remains inaccessible to unauthorized personnel. While the majority of roles within an organization may not require access to PII, specific teams such as customer support might need access to this sensitive information to perform their duties effectively.

This blog aims to explore how data masking can be effectively implemented within two leading data warehouses: Google BigQuery and Snowflake, providing insights on achieving a balance between data security and functionality.

Understanding the Imperative of Data Masking:

Data Masking: A Necessity, Not a Choice:

Compliance and Regulatory Adherence:
Ensuring compliance with stringent regulations such as GDPR, CCPA, and HIPAA is non-negotiable. Data masking helps in meeting these requirements, particularly concerning personally identifiable information (PII).
Facilitating Secure Development and Testing:
Developers and testers need realistic data to ensure the effectiveness of their work. Data masking gives them data that keeps the shape and format of production data, minus the sensitive elements, striking the right balance between utility and security.
Enhancing Privacy in Analytics and Reporting:
Many analytical workflows can function effectively with masked data, ensuring user privacy while extracting valuable insights.

Data Masking Strategies

Substitution: Replacing sensitive data with fabricated, yet realistic data. For instance, transforming “John Doe” to “Alan Smith.”
Shuffling: Disassociating data by rearranging values within a column, preserving data format.
Redaction: Hiding specific data, replacing it with generic or blank values.
Randomization: Generating data based on random values, ensuring unpredictability.
Tokenization: Replacing original data with tokens, securely storing the mappings separately.
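
As a rough illustration, several of these strategies can be expressed directly in SQL. The sketch below uses Snowflake syntax against a hypothetical customers table (customer_id, full_name, email, and phone are assumed column names). The SHA2 hash only stands in for tokenization, since a real tokenization service keeps a separate, secured mapping back to the original values.

```sql
-- Hypothetical table: customers(customer_id, full_name, email, phone)
SELECT
    customer_id,
    'REDACTED'                               AS full_name_redacted,  -- redaction: generic value
    REGEXP_REPLACE(email, '^[^@]+', '*****') AS email_partial,       -- redaction of the local part only
    SHA2(phone, 256)                         AS phone_hashed,        -- deterministic stand-in for a token
    UNIFORM(100000, 999999, RANDOM())        AS random_reference     -- randomization: new value per row
FROM customers;
```

Substitution and shuffling usually need a lookup table or a row-reordering step, so they are left out of this sketch.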

Implementing Data Masking in BigQuery:

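BigQuery approaches masking differently from Snowflake: column-level access control is built on Policy Tags, and data policies attached to those tags can apply dynamic masking rules at query time. A common complementary pattern, covered in the steps below, combines Policy Tags with authorized views.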

Steps:

a) Set Up Policy Tags:

Go to BigQuery in the Google Cloud Console.
Open the policy tag taxonomy page, which is managed through “Data Catalog”.
Here, you can create a taxonomy of Policy Tags, which are used to categorize columns that contain sensitive information.

b) Apply Policy Tags to Columns:

Once your tags are created, apply them to appropriate columns in your tables.

c) Use Authorized Views:

Create views that mask the data in columns tagged with specific Policy Tags.
For example, if the customer_id column carries a sensitive-data tag, the view might replace some of its characters or return a fixed default value instead.

d) Grant Permissions:

Provide access to only the authorized views for specific user roles, ensuring that they can only see the masked data.
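
Steps c) and d) might look like the following BigQuery SQL sketch. The project, dataset, table, column, and group names are hypothetical, and the view still has to be authorized on the source dataset (through that dataset's sharing settings) before users without access to the underlying table can query it.

```sql
-- Step c) A view that exposes only masked versions of sensitive columns.
CREATE OR REPLACE VIEW `my_project.masked_views.customers_masked` AS
SELECT
  CONCAT(SUBSTR(CAST(customer_id AS STRING), 1, 2), '****') AS customer_id,  -- keep first two characters
  TO_HEX(SHA256(email)) AS email_hash,                                       -- one-way hash instead of the address
  signup_date                                                                -- non-sensitive column passed through
FROM `my_project.raw_data.customers`;

-- Step d) Let a group read the view only, not the source table.
GRANT `roles/bigquery.dataViewer`
ON VIEW `my_project.masked_views.customers_masked`
TO "group:analysts@example.com";
```

Keeping the view in a separate dataset from the raw table makes it easier to grant access to the masked data without exposing the source.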

Implementing Data Masking in Snowflake:

Dynamic Data Masking (DDM) in Snowflake lets organizations define policies that mask sensitive data based on the user’s role. Authorized users see the original values, while everyone else sees a masked version of the data. Below is a step-by-step example of implementing DDM in Snowflake using an alternative approach in which a dedicated custom role, rather than each object owner, manages the masking policies.

Steps for an Alternative Approach:

a) Grant Masking Policy Management Privileges:

Objective: Empower a designated security or privacy officer with the ability to manage masking policies through a custom role.

Steps:

Create a custom role, e.g., MASKING_ADMIN.
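Grant the CREATE MASKING POLICY privilege on the relevant schema and the APPLY MASKING POLICY privilege on the account to this role:

GRANT CREATE MASKING POLICY ON SCHEMA <db_name>.<schema_name> TO ROLE MASKING_ADMIN;
GRANT APPLY MASKING POLICY ON ACCOUNT TO ROLE MASKING_ADMIN;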

Optionally, grant APPLY ON MASKING POLICY <policy_name> to an object owner’s role so that the owner can set or unset that specific policy on its own columns.
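
A minimal sketch of the role creation and the optional grant, assuming the Snowflake system role USERADMIN is available to you. email_mask and DATA_OWNER_ROLE are hypothetical names, and the optional grant can only run once that policy exists and MASKING_ADMIN has been assigned to you.

```sql
-- Create the custom role that will manage masking policies.
USE ROLE USERADMIN;
CREATE ROLE IF NOT EXISTS MASKING_ADMIN;

-- Optional, and only after the policy has been created:
-- let an object owner's role set or unset this one policy on its own columns.
-- email_mask and DATA_OWNER_ROLE are placeholder names.
USE ROLE MASKING_ADMIN;
GRANT APPLY ON MASKING POLICY email_mask TO ROLE DATA_OWNER_ROLE;
```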

b) Assign Custom Role to Users:

Objective: Allocate the MASKING_ADMIN role to the appropriate security or privacy officer.

Steps:

Execute:

GRANT ROLE MASKING_ADMIN TO USER <user_name>;

c) Create and Define Masking Policies:

Objective: Using the MASKING_ADMIN role, establish masking policies and connect them to columns that store sensitive data.

Steps:

Define a masking policy using the CREATE OR REPLACE MASKING POLICY statement:

CREATE OR REPLACE MASKING POLICY <policy_name> AS (val STRING) RETURNS STRING -> 
CASE   
  WHEN CURRENT_ROLE() IN ('FULL_ACCESS_ROLE') THEN val   
  ELSE 'MASKED' END;

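Tailor the conditions to your user roles or to other context functions. Note that CURRENT_ROLE() only checks the session’s active primary role, whereas IS_ROLE_IN_SESSION() also takes the role hierarchy into account.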
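
One possible way to tailor a policy is to return different levels of detail to different roles. In this sketch, email_mask and SUPPORT_ROLE are hypothetical names, and the column is assumed to hold email addresses.

```sql
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE
    -- Full value for the privileged role (or any role that inherits it).
    WHEN IS_ROLE_IN_SESSION('FULL_ACCESS_ROLE') THEN val
    -- Support staff see the domain but not the local part.
    WHEN IS_ROLE_IN_SESSION('SUPPORT_ROLE') THEN REGEXP_REPLACE(val, '^[^@]+', '*****')
    -- Everyone else gets a fixed placeholder.
    ELSE '*********'
  END;
```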

d) Apply Masking Policies to Columns:

Objective: Link the crafted masking policies to the specific table or view columns.

Steps:

Use ALTER TABLE or ALTER VIEW to assign the masking policy to the necessary columns:

ALTER TABLE <table_name> MODIFY COLUMN <column_name> 
SET MASKING POLICY <policy_name>;
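
The same pattern should carry over to views and to removing a policy. In this sketch, customers, customers_v, email, and email_mask are hypothetical object names.

```sql
-- Attach a policy to a column of a view.
ALTER VIEW customers_v MODIFY COLUMN email
SET MASKING POLICY email_mask;

-- Detach whatever masking policy is currently set on a table column.
ALTER TABLE customers MODIFY COLUMN email
UNSET MASKING POLICY;
```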

e) Execute Queries and Observe Masking:

Objective: Conduct queries as different users to confirm the proper application of the masking policies.

Steps:

Run queries under various roles and inspect the output to ensure data is masked or revealed appropriately.
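
A quick check could look like the sketch below. It assumes the first policy shown earlier (the one that returns 'MASKED') is attached to a hypothetical customers.email column, that ANALYST_ROLE is a hypothetical role without full access, and that your user has been granted both roles along with SELECT on the table.

```sql
-- As the privileged role, the original values should come back.
USE ROLE FULL_ACCESS_ROLE;
SELECT email FROM customers LIMIT 5;

-- As any other role, every row should show the literal 'MASKED'.
USE ROLE ANALYST_ROLE;
SELECT email FROM customers LIMIT 5;
```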

f) Use Additional Masking Examples (Optional):

Objective: Explore and implement supplementary masking examples catering to specific scenarios, such as partial masks, full masks, or using functions like DECRYPT for encrypted data.

Examples:

For partial masking:

CREATE OR REPLACE MASKING POLICY partial_mask_policy AS (val STRING) RETURNS STRING -> 
CASE   
  WHEN CURRENT_ROLE() IN ('FULL_ACCESS_ROLE') THEN val   
  ELSE LEFT(val, 2) || '****' || RIGHT(val, 2) END;

Apply this policy to the relevant columns: users without ‘FULL_ACCESS_ROLE’ see only the first two and last two characters. Note that values of four or fewer characters would be shown in full, so short values may need an extra length check.

In essence, Dynamic Data Masking in Snowflake allows for the granular control of data visibility, ensuring that sensitive information remains secure while still accessible to authorized personnel. The alternative approach outlined above provides a comprehensive strategy for implementing DDM, ensuring robust data protection across the organization.

Considerations for Effective Data Masking:

Performance: Be mindful of performance impacts, especially with large datasets. Keeping masking expressions simple and relying on each platform’s partitioning and clustering features can mitigate these effects.
Consistency: Aim for consistent results in masking to maintain data integrity and avoid confusion.
Audit and Monitoring: Implement robust auditing and monitoring practices to review access patterns, ensuring adherence to compliance and security standards.
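
The consistency and audit points can be sketched in Snowflake as follows. hash_mask, my_db, and my_schema are hypothetical names. Hashing is only one way to get consistent output, and unsalted hashes of low-entropy values can be guessed, so treat this as an illustration rather than a recommendation.

```sql
-- Consistency: the same input always yields the same masked output,
-- so masked columns can still be joined or grouped on.
CREATE OR REPLACE MASKING POLICY hash_mask AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('FULL_ACCESS_ROLE') THEN val
    ELSE SHA2(val, 256)
  END;

-- Audit: list the columns this policy is currently attached to.
SELECT *
FROM TABLE(my_db.INFORMATION_SCHEMA.POLICY_REFERENCES(
    POLICY_NAME => 'my_db.my_schema.hash_mask'));
```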

Conclusion:

Embracing data masking is a proactive step towards fortifying data security and ensuring compliance in today’s data-centric world. Both Google BigQuery and Snowflake offer viable paths for implementing data masking, each with its unique approach. By understanding the nuances of each platform and employing best practices, organizations can effectively safeguard sensitive information, fostering a secure and compliant data environment.
