Announcing Public Preview of Custom Classification and new SQL interfaces

Data classification in Snowflake identifies sensitive information in your data. Once data has been classified and tagged, it is easier to discover and govern. Classification also helps you understand the risks associated with the data, so that the right protection policies are in place to control access to sensitive information.

To help with sensitive data classification, Snowflake provides several native classifiers that can identify PII (Personally Identifiable Information), PCI (Payment Card Industry) data, and other sensitive categories in your data, such as names, addresses, Social Security numbers, passport numbers, and credit card numbers.

While native classifiers are great at identifying general sensitive information, there is often industry- and domain-specific data that customers want to classify and protect. For example, internal employee IDs and medical codes associated with patients are often considered sensitive and confidential. We are excited to announce support for Custom Classifiers in Public Preview. Customers can write Custom Classifiers to extend Snowflake’s native classification capabilities and identify sensitive information across all their data in Snowflake.

Define and use custom classifiers

Snowflake provides the CUSTOM_CLASSIFIER class in the SNOWFLAKE.DATA_PRIVACY schema to enable data engineers to extend the data classification capabilities. Classes and instances are schema-level objects in Snowflake. You can think of a class as an extensible Snowflake object type and an instance as a Snowflake object. A class provides a public API through stored procedures and functions. Collectively, these are referred to as class methods. You can learn more about Snowflake classes in the Snowflake documentation.

Once you create an instance of the class, you can call a method to define your own semantic category, specify the privacy category, and specify a regular expression that matches column value patterns and, optionally, one that matches the column name.

As an example, to identify ICD-10 medical codes in the data, you can create an instance named medical_codes of the CUSTOM_CLASSIFIER class like this:

CREATE OR REPLACE SNOWFLAKE.DATA_PRIVACY.CUSTOM_CLASSIFIER medical_codes();

Then, call the custom_classifier!ADD_REGEX method on the instance to specify the system tags and the regular expression that identifies ICD-10 codes in a column. The regular expression in this example is written to match the general format of ICD-10 codes. The column name regular expression and the description are optional:

CALL <custom_classifier>!ADD_REGEX(
'<semantic_category>',
'<privacy_category>',
'<value_regex>'
[ , '<column_name_regex>' ]
[ , '<description>' ]
);

Example:

CALL medical_codes!ADD_REGEX(
'ICD_10_CODES',
'IDENTIFIER',
'[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}',
'ICD.*',
'Add a regex to identify ICD-10 medical codes in a column'
);
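Before registering a pattern, it can help to try it against a few known values. The following is a minimal Python sketch of such a check, not part of Snowflake: the two patterns are copied from the example above, while the sample values, the helper names, and the choice of full-string, case-sensitive matching are assumptions made for illustration. Python's re module may also differ in small ways from the regular expression engine Snowflake uses.

```python
import re

# Patterns copied verbatim from the ADD_REGEX example above.
VALUE_REGEX = r"[A-TV-Z][0-9][0-9AB]\.?[0-9A-TV-Z]{0,4}"
COLUMN_NAME_REGEX = r"ICD.*"


def matches_value(value: str) -> bool:
    # Assumption: the whole column value must match the pattern.
    return re.fullmatch(VALUE_REGEX, value) is not None


def matches_column_name(name: str) -> bool:
    # Assumption: the whole column name must match, case-sensitively.
    return re.fullmatch(COLUMN_NAME_REGEX, name) is not None


# Hypothetical sample values, chosen only to exercise the pattern.
for sample in ["E11.9", "S72001A", "J45", "U07.1", "12345"]:
    print(sample, matches_value(sample))

# Hypothetical column names.
for column in ["ICD_CODE", "DIAGNOSIS"]:
    print(column, matches_column_name(column))
```

One thing this check surfaces: because the first character class is [A-TV-Z], the pattern as written rejects codes that begin with U, such as U07.1. If your data contains such codes, you may want to adjust the pattern before calling ADD_REGEX.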

Finally, call the new SYSTEM$CLASSIFY stored procedure, passing in the custom classifier instance, to classify the table that contains the column with medical codes and automatically assign the recommended custom tag to the column.

CALL SYSTEM$CLASSIFY(
'data.tables.patient_diagnosis',
{'auto_tag': true, 'custom_classifiers': ['data.classifiers.medical_codes']}
);

Auto-tagging the columns in the table is optional. If you prefer to evaluate the classification result first, remove the auto_tag argument from the stored procedure call. After you evaluate the result, assign the SEMANTIC_CATEGORY tag to the columns using ASSOCIATE_SEMANTIC_CATEGORY_TAGS or set custom tags using an ALTER TABLE … MODIFY COLUMN … SET TAG command.

Additionally, when you have multiple custom classifier instances, you do not have to list each one. Instead, you can specify 'use_all_custom_classifiers': true.
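If you generate these calls from a script, the two ways of selecting classifiers differ only in the options object. The following is a small Python sketch that renders a SYSTEM$CLASSIFY statement for either form. The helper names render_value and classify_call are hypothetical, and the rendering rules (single-quoted strings, lowercase booleans) are an assumption based on the object literal shown in the example above; the sketch only builds text and does not connect to Snowflake.

```python
def render_value(value):
    """Render a Python value as a SQL-style literal (illustrative only)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(render_value(v) for v in value) + "]"
    if isinstance(value, dict):
        items = ", ".join(
            render_value(k) + ": " + render_value(v) for k, v in value.items()
        )
        return "{" + items + "}"
    raise TypeError("unsupported option type: " + type(value).__name__)


def classify_call(table, options):
    """Build the text of a SYSTEM$CLASSIFY call for one table."""
    return "CALL SYSTEM$CLASSIFY(" + render_value(table) + ", " + render_value(options) + ");"


# Form 1: list each custom classifier instance explicitly.
explicit = classify_call(
    "data.tables.patient_diagnosis",
    {"auto_tag": True, "custom_classifiers": ["data.classifiers.medical_codes"]},
)

# Form 2: let the procedure use every custom classifier instance.
use_all = classify_call(
    "data.tables.patient_diagnosis",
    {"auto_tag": True, "use_all_custom_classifiers": True},
)

print(explicit)
print(use_all)
```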

You can learn more about Custom Classification and all the supported SQL commands by reading our documentation.

New Classification SQL interfaces

In an effort to continuously improve and simplify the classification process, we recently introduced, in private preview, native data classification in Snowsight and new SQL interfaces:

  • SYSTEM$CLASSIFY
  • SYSTEM$CLASSIFY_SCHEMA
  • SYSTEM$GET_CLASSIFICATION_RESULT
  • DATA_CLASSIFICATION_LATEST

Currently, the classification process requires two separate calls: EXTRACT_SEMANTIC_CATEGORIES to identify sensitive data, and ASSOCIATE_SEMANTIC_CATEGORY_TAGS to associate the SEMANTIC_CATEGORY tags with the corresponding columns. Both have to be executed for each table that you want to classify.

The new stored procedure, SYSTEM$CLASSIFY, combines the two steps into one, so you can identify and tag sensitive information with a single call. You can also get the most recent classification result for an individual object using the new SYSTEM$GET_CLASSIFICATION_RESULT function, and for all objects using the new DATA_CLASSIFICATION_LATEST view in the ACCOUNT_USAGE schema.

CALL SYSTEM$CLASSIFY('hr.tables.empl_info', {'auto_tag': true});

SELECT SYSTEM$GET_CLASSIFICATION_RESULT('hr.tables.empl_info');

SELECT * FROM SNOWFLAKE.ACCOUNT_USAGE.DATA_CLASSIFICATION_LATEST;

As the name suggests, the SYSTEM$CLASSIFY_SCHEMA stored procedure classifies all tables in a schema, which greatly simplifies the classification process. Repeat the call for each schema that contains tables you want to classify.

CALL SYSTEM$CLASSIFY_SCHEMA('hr.tables', {'auto_tag': true});
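Because the call has to be repeated per schema, a short script can generate the statements for you. The following Python sketch assumes you keep the schema names in a list; the function name classify_schema_statements and the list contents are hypothetical (the two names are simply the schemas that appear in this post's examples). It only produces statement text, which you would then run through whatever client you already use.

```python
def classify_schema_statements(schemas, auto_tag=True):
    """Return one SYSTEM$CLASSIFY_SCHEMA statement per schema name."""
    flag = "true" if auto_tag else "false"
    statements = []
    for name in schemas:
        # Schema names are assumed to be trusted, fully qualified identifiers.
        statements.append(
            "CALL SYSTEM$CLASSIFY_SCHEMA('" + name + "', {'auto_tag': " + flag + "});"
        )
    return statements


# Hypothetical list of schemas to classify.
schemas_to_classify = ["hr.tables", "data.tables"]

for statement in classify_schema_statements(schemas_to_classify):
    print(statement)
```

Passing auto_tag=False produces statements that classify without tagging, which fits the review-first workflow described earlier.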

You can learn more about the new Classification SQL APIs by reading our documentation.

We are continuously improving the Data Classification capabilities in Snowflake, so stay tuned for more updates. Feel free to reach out to us with any feedback or questions.
