Data Vault on Snowflake: a Classification & Tagging Framework
A common challenge in modelling any data is identifying which data (if any) needs to be treated as highly protected, confidential or private, and deciding what to do about it. Privacy semantics pervade customer-related data, and sometimes a combination of columns can be quasi-identifying and therefore vulnerable to unauthorised exposure. For a multi-national organization, the complexity is compounded by having to comply with multiple jurisdictions and regulations. The consequences of failing to protect customer data you are a custodian of range from hefty fines imposed by industry regulators to the loss of a license to continue trading legally.
Data modelling and enterprise architecture play a crucial role, and Snowflake includes features that help you protect your data. What Snowflake provides out of the box can in fact influence how you model your data and how you apply storage life cycle policies to it. This tagging and classification framework should be used in combination with a storage life cycle policy that automates which data is archived and purged according to your security posture. Data classification can also influence your data architecture: how you decide to manage perimeters around sensitive data, which roles are used to process your data, and how.
Let’s begin by identifying (pun unintended) what data classification is, how to apply it and its effect on data vault.
Episode 22: a Classification & Tagging Framework
Data classification leads to different data governance treatments; in other words, how data is classified directly determines how access to that data is granted and controlled. Data classification is generally categorised as:
- Public — information that is publicly available such as promotional publications and media information; for example, public websites, public research outputs. The intent of this data is for public use and poses no external threat to your organisation when published.
- Internal — data or information disseminated with a business need for an internal audience; such as project documentation, aggregated data, de-identified low cardinality data, company policies and procedures. If this data is exposed it could pose a short-term competitive disadvantage to the custodians of that data.
- Sensitive / Confidential — data the business needs to fulfil specific business objectives; this may include contact information, pricing information and marketing details. If leaked, the information could be damaging to the organisation.
- Highly sensitive / Restricted — information that is restricted and can be used to uniquely identify individuals, or even to impersonate them. Examples include passport numbers, tax file numbers, social security numbers, credit card numbers, date of birth or address, personal financial or medical information and biometric data. If this data is leaked the custodian could face those hefty penalties and reputational risk, yet depending on the industry it may be necessary for the organisation to record that information.
Corporations operate in different industries that have overlapping legal, regulatory and contractual requirements for managing customer data. Some examples include,
- General Data Protection Regulation (GDPR) — a regulation in European Union (EU) law on data protection and privacy; similar regulations include PIPEDA in Canada and the CCPA in California.
- Health Insurance Portability and Accountability Act (HIPAA) — a US federal law providing data privacy and security provisions for safeguarding medical information.
- Payment Card Industry Data Security Standard (PCI DSS) — a set of security standards designed to ensure that all companies that accept, process, store, or transmit credit card information maintain a secure environment.
- System and Organization Controls 2 (SOC 2) — a set of standards for managing customer data based on five “Trust Service Criteria” (TSC), designed specifically for service providers storing customer data in the cloud.
How do we mix technology with the constraints we’ve identified above and still deliver business value?
It’s all semantics
Data Isn’t The New Oil — Time Is — bit.ly/3zDBcsg
Conceptually, data has properties not comparable to oil:
- Unlike oil, data has no scarcity. If it is shared then others will have “copies” of that data.
- Data on its own has no value unless it is given context and refined to extract meaningful insight, whereas oil has intrinsic value as raw material.
- Data has sensitivity and requires classification for its use.
- Data is ephemeral; it can easily be destroyed out of existence. Oil cannot be deleted, only transformed into another state (gasoline, plastics, clothing, etc.).
- The value of data depreciates sharply over time.
Various sources: Wired.com, Forbes, MIT Technology Review, McKinsey & Company
These properties of data make the planning and execution of data classification crucial.
What does Snowflake offer?
As an enterprise AI Data Cloud platform, Snowflake keeps data privacy and security top of mind in the features it releases for customers. The features we will highlight in this article are,
- Built-in classification — Snowflake will analyse target table columns and optionally auto-tag them according to their semantic categories (e.g. driver’s license number, passport number…) and privacy categories (identifier and quasi-identifier).
- Custom classification — you can define your own custom classifier using a regular expression to identify your own semantic and privacy categories.
- Object tagging — the ability to tag table objects and columns with free-form or constrained labels. Snowflake data classification provides the auto-tags SEMANTIC_CATEGORY and PRIVACY_CATEGORY after classification, and you can expand tagging with tag-based masking policies, as sketched below.
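As a minimal sketch of object tagging (the governance database, tag, table and column names here are purely illustrative, not a Snowflake convention), a constrained tag can be created once and then applied at table and column level:

```sql
-- Create a constrained tag once; ALLOWED_VALUES limits what can be assigned
create tag if not exists governance.tags.data_classification
  allowed_values 'PUBLIC', 'INTERNAL', 'CONFIDENTIAL', 'RESTRICTED';

-- Tag a whole table; its columns inherit the tag
alter table raw_db.landing.customer
  set tag governance.tags.data_classification = 'CONFIDENTIAL';

-- Override the classification on a single column
alter table raw_db.landing.customer
  modify column passport_number
  set tag governance.tags.data_classification = 'RESTRICTED';
```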
Snowflake data classification has two options for running classifications,
- Asynchronous execution via "call system$classify_schema('${target-schema}', '${options}')", and
- Synchronous execution via "call system$classify('${target-table}', '${options}')"
Asynchronous execution classifies all table objects in a schema, but you will not know when the classification results are ready, so you need to poll the output of system$get_classification_result until the results are in. Schema-level classification can also be run from Snowsight, which uses this mechanism behind the scenes.
Synchronous execution profiles one table at a time, but it has the benefit of returning only once classification is complete, so there is no need to poll for a result.
${options} includes whether to apply Snowflake’s privacy and semantic tags automatically based on the classification results; a sketch of both execution modes follows.
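Putting the two modes together, a minimal sketch might look like the following. The database, schema and table names are illustrative, and the exact option syntax should be confirmed against the current Snowflake documentation:

```sql
-- Asynchronous: classify every table in the schema; the analysis runs in the
-- background and each table's result must be polled afterwards
call system$classify_schema('raw_db.landing', {'auto_tag': true});

-- Poll a table's result until the background classification has completed
select system$get_classification_result('raw_db.landing.customer');

-- Synchronous: classify one table, blocking until the analysis is done and
-- the SEMANTIC_CATEGORY and PRIVACY_CATEGORY tags are applied
call system$classify('raw_db.landing.customer', {'auto_tag': true});
```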
With custom classification you define regular expressions and specify which privacy and semantic categories the matching values fall under.
If we know what our business keys look like, custom classification lets us effectively trace wherever those business keys occur. For a denormalized landed file it may even teach us more about the business process landed for consumption than we previously knew. More on this later! A sketch of a custom classifier for business keys follows.
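As a rough sketch of how that could look (the classifier name, categories and regex are invented for illustration, and the method and option names should be verified against Snowflake's custom classification documentation):

```sql
-- Instantiate a custom classifier as a schema object
create or replace snowflake.data_privacy.custom_classifier business_keys();

-- Register a regex: semantic category, privacy category, value pattern,
-- optional column-name pattern and description
call business_keys!add_regex(
  'CUSTOMER_ID',
  'IDENTIFIER',
  '^CUST-[0-9]{8}$',
  '.*cust.*(id|number).*',
  'In-house issued customer business key');

-- Include the custom classifier when classifying a table
call system$classify('raw_db.landing.customer',
  {'auto_tag': true, 'custom_classifiers': ['business_keys']});
```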
First.. what is…
The impact on data architecture?
When a business case is presented, the usual delivery stakeholders are gathered,
- Technical business analyst (BA) — the technical resource with a clear understanding of the business case; the BA will know what class of data she needs.
- Data Architect / data modeller — how the source model will proliferate through the layers of the data architecture and what to do with the various classes of data.
- Source-system subject matter expert (SSSME) — the best person to provide insight on how to get the data you need and what class of data is available; with their input you can explicitly set the classification for columns, values or attributes. However, if you are about to ingest 500 or 5,000 columns, profiling all these attributes manually will prove challenging.
Together, these roles articulate what the data model will be and the steel thread to carry that data from raw to insight. Traditionally, the SSSME can articulate what classification levels are being projected by the data supplied downstream. Even if they do, you should still run your data classification process on the provided data. The question is: if you’re running your data classification process after the data is landed, is the data already being consumed downstream?
You must design your data architecture layers, role-based access control (RBAC) and data governance to cater for that scenario. There are generally three approaches to managing this scenario:
- New data is not visible, RBAC based on individual objects without future grants.
- New data is visible but masked everywhere (tag-based masking, where tags applied to a table are inherited by the table’s columns by data type); see the sketch after this list.
- New data is loaded to a completely hidden schema and classified before being made available to a Demilitarised Zone (DMZ) for downstream layers; this can be done using secure views, table cloning or copy operations.
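For the second approach, a minimal tag-based masking sketch could look like this, reusing the illustrative classification tag from earlier; the policy and role names are assumptions, not prescribed names:

```sql
-- A masking policy for string columns (the privileged role name is illustrative)
create or replace masking policy governance.policies.mask_restricted_string
  as (val string) returns string ->
  case
    when is_role_in_session('PII_READER') then val
    else '***MASKED***'
  end;

-- Bind the policy to the classification tag; any string column carrying the
-- tag, directly or inherited from its table, is masked for other roles
alter tag governance.tags.data_classification
  set masking policy governance.policies.mask_restricted_string;
```

Because the policy is bound to the tag rather than to individual columns, a newly landed string column that inherits the tag is masked by default until a role with the privileged access reviews it.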
Whatever your solution, you must cater for schema evolution: what happens if a new column is added, changed or dropped? Do you reclassify? And can you see that an overloaded column makes data classification unfeasible?
Data mesh proposes that each domain owns and manages its own data and makes that data available through a data catalogue as a data product with defined data contracts. A completely decentralised approach to managing analytical data is diametrically at odds with the intent of analytics in the first place: maintaining a single version of the truth (SVOT; data is not a microservice). Can we meet data mesh halfway? Yes: a single Snowflake account whose layers are split by data producers, data aggregators and data consumers.
The significance of this architecture is profound: because we have profiled the data and landed it in the data producers’ zone once, all downstream layers inherit that upfront profiling work. Should a downstream domain have roles with privileged access to work with de-identified data, that is all controlled through explicit RBAC and the data governance policies applied by data classification. Define the classification and tagging once, and all downstream domains inherit this classification through tags.
Now.. what about…
The impact on the Data Model?
Data profiling is a vital aspect of data modelling; for analytical data models the data typing is mostly defined by what the data providers supply, so that part of the job is easy. If data classification is already provided by data producers, then executing data classification procedures may seem redundant, unless you’re dealing with hundreds of columns or semi-structured data. Snowflake’s built-in classification will identify what are called identifiers and quasi-identifiers. Basically, an identifier is a single column that can uniquely identify an individual, whereas quasi-identifiers are a combination of columns that together can uniquely identify an individual.
These identifiers are usually government issued (social security numbers, passport numbers), financially sensitive (credit card numbers) or, in combination, personally identifiable (address, date of birth, etc.). We mentioned that there is legislation and regulation governing these; you should look them up for your industry! The defining property of these attributes is that they are identifying, meaning they hardly (if ever) change. This property is profound for your data model: it means you should split these descriptive attributes into their own satellite table in a data vault model. While the other state information about a customer may change at whatever rate it needs to, you will likely only ever have one record of this personally identifiable information. Should you need to remove an individual from your data warehouse, all you need to do is permanently obfuscate that record in the personally identifiable satellite table without affecting the aggregate value of the rest of your data warehouse. By splitting these attributes from the non-identifying attributes, existing queries utilizing an SQL equijoin will still return the same records and remain auditable. A sketch of such a split follows.
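As an illustrative sketch only (the table, column and key names are invented, and in practice your data vault automation would generate these structures and loads), the split satellite and the subsequent obfuscation might look like:

```sql
-- Hypothetical PII satellite, split from the regular customer state satellite
create table if not exists rv.sat_customer_pii (
  hub_customer_hk     binary(32)    not null,  -- hash key shared with hub_customer
  load_date           timestamp_ntz not null,
  record_source       string        not null,
  passport_number     string,
  date_of_birth       date,
  residential_address string
);

-- "Right to be Forgotten": obfuscate the one identifying record in place; the
-- hub key remains, so existing equijoins and aggregates stay auditable
update rv.sat_customer_pii
   set passport_number     = null,
       date_of_birth       = null,
       residential_address = null
 where hub_customer_hk = (
   select hub_customer_hk
   from   rv.hub_customer
   where  customer_id = 'CUST-00000001');
```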
The second point we will make about government-issued identifiers is that you must never use them as business keys (never load them to hub tables). They are personally identifiable; instead, your enterprise architecture must have a process to issue and use your own customer ids to uniquely identify each customer throughout your business.
Since these ids will be designed, defined and issued by your business, you must have an algorithm to produce them. You can define these under your custom classification and use Snowflake’s built-in and custom classification to identify business key candidates and personally identifiable attributes. Yes, data classification can help you model your data vault.
Business keys provide the passive integration between source systems and raw (RV) and business vault (BV). RV + BV = DV.
Note that you might choose to use your own data classification tags instead of those supplied by Snowflake. This is a common use case; we generally find customers want the flexibility of applying their own tags, however many they need. In the same spirit, we would apply our own tags using custom classification to identify business key columns.
How does this…
Impact Consumer Domains
Point-in-Time (PIT) and Bridge tables are not Business Vault artefacts because they do not provide the same auditability, agility and automation as the rest of data vault. What happens in the data vault does not affect the downstream PITs and Bridges because these tables only contain keys and dates. In fact, PITs and Bridges are ephemeral structures built to shorten the path between hub tables (bridges) and provide snapshots of the satellite records at a point in time (PITs).
The impact of data classification surfaces in data management when an ordinance like GDPR’s “Right to be Forgotten” (aka Article 17) is issued. Although the ordinance specifies that a custodian of that customer’s data has 30 days from the time the request is made to forget that customer, there are caveats to that rule. For instance, for financial fraud purposes we would not want to forget that customer and thus that domain will have a substantive need to retain that customer information. Yes, a consumer domain on top of the aggregate domain may have a need for data that is no longer available upstream, and for good reason.
Domain owners have the right to authorise deletion requests, do not assume that deletions should automatically propagate to downstream domain data.
It takes a village
Just like building a data vault, standing up a data governance practice needs data executive sponsorship, investment and time. Utilise as much automation as you can: let the technology auto-classify and auto-tag your data assets as early as possible, and the data products built from that data will carry the necessary classification and tagging to your downstream domains. What we have shown is why certain aspects of data analytics on a data mesh favour being centralised over being completely decentralised. We have also shown that data vault’s repeatable patterns favour a repeatable data classification and tagging framework.
As the title of this section professes, IT and/or analytics should not define and design data classification alone. Snowflake is merely a tool that enables you to automate data privacy and semantic classification; you and your business colleagues should still do the work to define what that means and how it fits your organization’s data security posture. Getting this wrong affects the whole of your business!
Until next time!
References
- Spirion Data Classification Guide, spirion.com/data-classification
- Snowflake Data Classification, bit.ly/3zI7HWk
The views expressed in this article are that of my own, you should test implementation performance before committing to this implementation. The author provides no guarantees in this regard.