Approach to Data Security in Snowflake Part 2 — Snowflake and Custom Data Classification

Ramesh Sanap · Published in clouddataplatform · Mar 30, 2024

Disclaimer: This blog does not delve into the technical intricacies of writing SQL scripts for data classification, which are better explained in the Snowflake documentation. The purpose of this blog is to explain the architecture and approach to data classification, especially in scenarios where enterprise data classification does not align with Snowflake-defined categories, a common occurrence in large-scale enterprises.

Part 1 — Introduction to Snowflake Data Security

Part 3 — Custom Data Classification using TAG automation approach

In Part 2 of this blog series on Data Security in Snowflake, we explore Snowflake Data Classification and Custom Data Classification. This segment elaborates on the significance of data classification within Snowflake and how it can be managed effectively using the various features the platform offers.

Every organization deals with a myriad of data types, spanning personal information, financial records, trade secrets, research findings, and patents. Without robust data governance and security measures encompassing all systems, data security remains a pressing concern.

The foundational step towards fortifying data security lies in data classification — the process of categorizing data based on its characteristics. This involves identifying whether data pertains to individuals, subjects, finances, trades, organizational secrets, or other specific domains.

Of paramount importance is personally identifiable information (PII), which in many jurisdictions is subject to stringent regulations such as GDPR or CCPA that mandate how enterprises must handle such data.

Understanding the overall approach to classification in Snowflake

Snowflake Data Classification
  1. Approach to data classification
  2. Snowflake data classification using Snowflake's classification functions (a minimal sketch follows this list)
  3. PRIVACY CATEGORIES as defined by Snowflake: IDENTIFIER, QUASI_IDENTIFIER, and SENSITIVE
  4. SEMANTIC CATEGORIES covering values such as PII, account, address, salary, and geography information
  5. A custom Snowflake classifier created with Snowflake's CREATE CUSTOM_CLASSIFIER, defining a custom SEMANTIC CATEGORY that ultimately maps to a Snowflake PRIVACY CATEGORY
  6. Semi-custom data classification (termed Custom Classification by Snowflake), where the enterprise defines its own classification for semantics Snowflake does not map and ties each to a specific privacy category; for example, the semantic category may be Medical Code and the privacy category SENSITIVE
  7. Enterprise custom classification automation, independent of Snowflake's classification, covering a custom data governance model with broader categories such as trade secrets, financial figures, and secret codes
  8. An end-to-end automation process that enables the enterprise to define custom classification without using Snowflake's classification functions or CREATE CUSTOM_CLASSIFIER. The enterprise retains greater control with its own classification, which can be extended to Data Mesh capabilities where domain owners control classification based on their governance model
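As a minimal sketch of items 2 to 6, the built-in classification functions and a custom classifier can be combined roughly as follows. The table HR.EMPLOYEE.STAFF, the classifier name MEDICAL_CODES, and the regex pattern are hypothetical placeholders; verify the exact syntax and current function names against the Snowflake documentation.

```sql
-- Built-in classification (items 2-4): extract semantic and privacy
-- categories for a table, then apply Snowflake's system tags.
SELECT EXTRACT_SEMANTIC_CATEGORIES('HR.EMPLOYEE.STAFF');

CALL ASSOCIATE_SEMANTIC_CATEGORY_TAGS(
  'HR.EMPLOYEE.STAFF',
  EXTRACT_SEMANTIC_CATEGORIES('HR.EMPLOYEE.STAFF')
);

-- Custom classifier (items 5-6): define an enterprise-specific semantic
-- category (medical codes) and map it to a Snowflake privacy category.
CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER medical_codes();

CALL medical_codes!ADD_REGEX(
  'MEDICAL_CODE',                          -- custom semantic category
  'SENSITIVE',                             -- Snowflake privacy category
  '[A-TV-Z][0-9]{2}(\\.[0-9A-Z]{1,4})?'    -- illustrative ICD-10-like value pattern
);
```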
Typical flow of the approach

The diagram above illustrates how an overarching approach can be implemented to define custom data classification using Snowflake TAGS and TAG-based policies.

The first step involves utilizing the existing Data Classification Matrix that the Data Governance team has developed over time within the enterprise. This includes understanding and creating TAG definitions (the tag objects themselves, not tag assignments on columns) that can be used to classify fields, as sketched below.
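As an illustration of this first step, tag definitions derived from a classification matrix might look like the sketch below. The schema GOVERNANCE.TAGS and the allowed values are hypothetical placeholders for whatever the Data Governance team has defined.

```sql
-- Tag definitions derived from the enterprise Data Classification Matrix.
-- Database, schema and allowed values are placeholders.
CREATE TAG IF NOT EXISTS GOVERNANCE.TAGS.DATA_CLASSIFICATION
  ALLOWED_VALUES 'PUBLIC', 'INTERNAL', 'CONFIDENTIAL', 'RESTRICTED'
  COMMENT = 'Enterprise-wide classification level from the governance matrix';

CREATE TAG IF NOT EXISTS GOVERNANCE.TAGS.SEMANTIC_CATEGORY
  ALLOWED_VALUES 'SALARY', 'MEDICAL_CODE', 'TRADE_SECRET', 'FINANCIAL_FIGURE'
  COMMENT = 'Enterprise-specific semantic category of the field';
```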

The second step is to analyze fields semantically, by field name or by sampled values, in the same spirit as Snowflake's own extract-semantics capability. For instance, if a field is named "employee_monthly_earning," Snowflake's extract semantics may not recognize it as salary data, whereas the custom semantic process can flag it for tags such as "SALARY" and "SENSITIVE."
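A minimal sketch of such a name-based custom scan is shown below; the pattern list and the proposed categories are hypothetical enterprise rules, not output of Snowflake's own classification.

```sql
-- Name-based custom semantic scan over a database's column metadata.
-- The pattern-to-category mapping is an illustrative enterprise rule.
SELECT
  c.table_catalog,
  c.table_schema,
  c.table_name,
  c.column_name,
  'SALARY'    AS proposed_semantic_category,
  'SENSITIVE' AS proposed_privacy_category
FROM HR.INFORMATION_SCHEMA.COLUMNS AS c
WHERE c.column_name ILIKE ANY ('%SALARY%', '%EARNING%', '%COMPENSATION%');
```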

The third step entails reviewing the output of the custom semantic process and deciding whether or not to associate the proposed tags with each field.
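One way to support this review (purely illustrative; the CLASSIFICATION_REVIEW table and its columns are hypothetical constructs, not Snowflake objects) is to stage the proposed tags and let data owners approve or reject each one:

```sql
-- Hypothetical staging table holding the output of the custom semantic scan.
CREATE TABLE IF NOT EXISTS GOVERNANCE.REVIEW.CLASSIFICATION_REVIEW (
  table_name        STRING,
  column_name       STRING,
  semantic_category STRING,
  privacy_category  STRING,
  approved          BOOLEAN,        -- NULL = pending review
  reviewed_by       STRING,
  reviewed_at       TIMESTAMP_NTZ
);

-- A data owner approves (or rejects) a proposed classification.
UPDATE GOVERNANCE.REVIEW.CLASSIFICATION_REVIEW
SET approved    = TRUE,
    reviewed_by = CURRENT_USER(),
    reviewed_at = CURRENT_TIMESTAMP()
WHERE table_name = 'STAFF'
  AND column_name = 'EMPLOYEE_MONTHLY_EARNING';
```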

The fourth step involves applying the output of the custom semantic extraction to the fields so that TAG-based policies can automatically manage masking on the classified fields according to each role's visibility.
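A sketch of this final step follows, reusing the hypothetical tags from the first step; the role DATA_ADMIN and the table and column names are placeholders. Note that a tag can carry one masking policy per data type, so a numeric column would need a NUMBER variant of the policy.

```sql
-- Apply the approved classification tags to the column.
ALTER TABLE HR.EMPLOYEE.STAFF
  MODIFY COLUMN EMPLOYEE_MONTHLY_EARNING
  SET TAG GOVERNANCE.TAGS.SEMANTIC_CATEGORY   = 'SALARY',
          GOVERNANCE.TAGS.DATA_CLASSIFICATION = 'RESTRICTED';

-- Tag-based masking policy: every string column carrying the tag is masked
-- automatically for roles outside the allowed list.
CREATE MASKING POLICY IF NOT EXISTS GOVERNANCE.TAGS.MASK_RESTRICTED_STRING
  AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() IN ('DATA_ADMIN') THEN val
    ELSE '***MASKED***'
  END;

ALTER TAG GOVERNANCE.TAGS.DATA_CLASSIFICATION
  SET MASKING POLICY GOVERNANCE.TAGS.MASK_RESTRICTED_STRING;
```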

The third blog in this series will cover the overall custom classification approach built on the four steps above. It is important to note that this outlines the architecture, not the technical steps to achieve it. For technical implementation at the field level, please refer to the Snowflake documentation mentioned in the disclaimer. Alternatively, for detailed technical steps, please reach out via comments on this blog series.
