Snowflake Data Classification & Tagging

Fru
DemoHub Tutorials
Published in
4 min readJun 28, 2024

Note: This tutorial is self-contained, with sample data and scripts provided to ensure you can follow the exercise in your practice environment. For more tutorials, visit tutorials.demohub.dev

As businesses accumulate vast amounts of data, it becomes increasingly crucial to understand the sensitivity of different data elements and apply appropriate protection measures. Snowflake Data Classification addresses this challenge head-on, providing a streamlined way to:

  • Identify Sensitive Data: Pinpoint Personally Identifiable Information (PII), financial data, or other confidential information within your tables.
  • Apply Security Measures: Tailor access controls, masking, or anonymization based on the sensitivity level of each data element.
  • Meet Compliance Requirements: Ensure adherence to data privacy regulations like GDPR, CCPA, and HIPAA.

How Data Classification Works in Snowflake

Snowflake offers two types of data classifiers:

  • System Classifiers: Pre-defined classifiers provided by Snowflake to identify common sensitive data types like PII, financial data, and more.
  • Custom Classifiers: Allow you to define your own rules for identifying specific patterns or types of data that are unique to your organization.

A Practical Example: Sales Data Analysis

Let’s explore how to use Snowflake Data Classification in a practical scenario.

⛁ Sample Data Model: salesdb-data-model

Setup the SalesDB sample data model and proceed with the rest of the exercises. This model provides a robust foundation for tracking customers, buyers, clients, and sales opportunities.

-- Assign the database schema
use schema salesdb.custs;

Let’s delve into how Snowflake Copilot can supercharge your data workflow:

1. System Classification: A Starting Point

Snowflake’s system classifiers are an excellent starting point. They can automatically identify common patterns in your data and apply corresponding tags.

-- Classify all tables in the custs schema using built-in classifiers
CALL SYSTEM$CLASSIFY_SCHEMA('SalesDB.custs', {'auto_tag': true});

This will attempt to identify sensitive data in your schema and automatically apply relevant tags like PII.

2. Custom Classifiers

While Snowflake provides built-in classifiers for common data types like PII (Personally Identifiable Information), your organization may have unique data patterns that require custom classification. This chapter will guide you through creating and applying custom classifiers to identify and tag specific data elements within your Snowflake environment, enhancing your data governance and security practices.

Custom classifiers allow you to define your own rules for identifying specific types of sensitive data. This is crucial when you have data patterns that don’t fit the mold of Snowflake’s built-in classifiers.

Imagine you have a LoyaltyNumber column in your Customer table that follows the format “LTY-XXXXX” (where X’s represent digits). Snowflake’s built-in classifiers might not recognize this as sensitive data. Custom classifiers let you tailor your data governance strategy to your specific needs.

3. Creating a Custom Classifier

Follow these steps to create a custom classifier in Snowflake:

Create the Classifier:

CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER loyalty_codes();

This creates an empty custom classifier instance named loyalty_codes.

Verify Creation:

SHOW SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER;

This command verifies that your custom classifier has been created successfully. The output should show your loyalty_codes classifier.

Add Regular Expression (Regex) Rules:

CALL Loyalty_codes!ADD_REGEX(
'LTY_CODES', -- Semantic category name for the tag
'IDENTIFIER', -- Privacy category (e.g., 'IDENTIFIER', 'QUASI_IDENTIFIER', 'SENSITIVE')
'\\bLTY-\\d{5}\\b', -- Regular expression to match the loyalty number pattern
'LTY*', -- Optional: column name pattern to match
'Regex to identify Loyalty codes in a column' -- Comment describing the rule
);

This adds a rule to the loyalty_codes classifier. The regular expression \bLTY-\d{5}\b matches the specific format of your loyalty numbers (LTY followed by a dash and five digits).

Verify the Regex:

SELECT varnumber -- Assumes the column name is 'varnumber' in this example
FROM customer
WHERE varnumber REGEXP('\\bLTY-\\d{5}\\b');

Run this query to verify that the regex pattern is working. You may need to replace varnumber with the actual column name in your table.

List the Rules:

SELECT loyalty_codes!LIST();

This will display the details of your added rule, including the regex pattern, semantic category, and comment.

4. Classifying Your Data

With your custom classifier in place, you can now use it to classify the relevant columns in your tables. This process involves two methods:

Automated Tagging (Recommended):

CALL SYSTEM$CLASSIFY(
'salesdb.custs.customer', -- Specify the table to classify
{'auto_tag': true, 'custom_classifiers': ['Loyalty_codes']} -- Enable auto-tagging and use your custom classifier
);

Manual Tagging:

SELECT SYSTEM$GET_CLASSIFICATION_RESULT('salesdb.custs.customer');

Review the suggested classifications and manually apply them to your table columns.

Note: Repeat this Step using the Snowsight UI Interface.

5. Dropping the Custom Classifier (Optional)

If you no longer need the custom classifier, you can drop it:

DROP SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER data.classifiers.loyalty_codes;

6. Applying Masking Policies to Tags

Now that our data is tagged, we can create masking policies to protect sensitive information: To apply the masking policies to the tagged columns, you would use the following SQL commands:

⛁ Sample Data Model: salesdb-data-model. [Execute STEP #7 CREATE MASKING AND APPLY TO TAGS/COLUMNS]

These masking policies will automatically obscure data in columns tagged with ‘PII’ or ‘Financial’, ensuring that sensitive data is hidden from unauthorized viewers.

7. Key Takeaways

  • Custom classifiers are a powerful way to identify and tag sensitive data that doesn’t fit standard patterns.
  • Snowflake provides a flexible framework to define your own classification rules using regular expressions.
  • By combining custom classifiers with RBAC, you can implement granular access controls for your sensitive data.
  • Regular review of your data classification strategy ensures ongoing data protection and compliance.

8. Documentation

Originally published at https://tutorials.demohub.dev.

--

--

Fru
DemoHub Tutorials

🎥 800K+ YouTube views 📺 6K+ Subscribers 🌐 10K+ DemoHub.dev 💡 Tech Simplifier 🚀 Innovation Catalyst 🤝 Data Analytics & AI Leader | 🔍 All views are my own