Sensitive Data Governance Design Patterns in DBSQL

Databricks SQL SME

Published in

DBSQL SME Engineering

8 min readApr 10, 2024

Authors:
Heeren Sharma, Solutions Architect @ Databricks
Sebastian Kollmann, Solutions Architect @ Databricks

Introduction

In all data warehousing use cases, data management & governance is a critical pillar for getting value out of your data at scale. When building a warehouse on Lakehouse with the Databricks Intelligence Platform, Unity Catalog serves as the backbone for unified governance and visibility across all your data assets. In another Databricks blog, we covered how the data privilege model and access controls work in Unity Catalog, and if you are new to Unity Catalog, then the Primer for Unity Catalog Onboarding might provide some insights. In this post, we will touch on one significant data governance and security requirement: how to control sensitive data using different data access control patterns with rules based and AI-based masking policies.

Dynamic masking allows you to control access to specific rows and columns in your data assets — backed by Unity Catalog. It’s particularly essential when you need fine-grained access to sensitive data (such as Personal Identifiable Information PII) that should not be accessible to all users by default — as it is an important topic to ensure data security and compliance.

Design Patterns

In DBSQL* (Databricks SQL), there are four different patterns to control fine-grained access: Views, Data Masking, Dynamic Views, and Row-Level Security and column Masking. In the following sections, we will discuss each of them individually and give you some guidelines for selecting an approach..

Views

Views are the most common abstraction pattern and are straightforward to implement: they are read-only and can encompass logic from one or more tables (or a subset of them) without providing full access to the underlying tables.

Views provide a stable interface: You can consistently consume a view while the underlying objects change, promoting reliability. Rather than exposing the whole underlying objects (which can be base tables or other views), you can simply select the non-sensitive column or sanitize/anonymise sensitive columns in the view definition.

Things to keep in mind:

Avoid using implicit select operations (such as SELECT *) and instead explicitly specify columns. During the creation of a view, the signature (which columns and their corresponding data type) is saved in Unity Catalog and therefore, changes in the underlying objects will not be reflected automatically — which can be unexpected if an implicit select operation is used in the view. This is intended behaviour to avoid exposing any sensitive columns that might be introduced in newer versions of the underlying objects.
Views provide flexibility, but you should avoid making convoluted view definitions. Leveraging complex case-when logic or specific custom table-valued functions can make it difficult to understand or debug, which might result in unintended data leakage.

Below we show an example using system tables which expose usage information across an entire account. Leveraging views, you can disclose usage data to users but only for their relevant workspaces (such as the workspaces for their business unit) while hiding sensitive data about other workspaces (e.g. from another business unit).

create view main.default.usage_for_specific_workspaces as
select
-- explicitly selecting the columns to include
  sku_name,
  usage_start_time,
  usage_end_time,
  usage_quantity
from
  system.billing.usage
where
-- limit the results to filter only on relevant rows
  workspace_id in ('1122334455667788', '8877665544332211')

Data Masking

Data masking helps protect sensitive parts of a specific column. By applying a masking function to a column, the sensitive data will be substituted with the result of the masking function at query time.

Things to keep in mind:

Having no sensitive data is always better than having sensitive data. Try to remove sensitive columns when possible and use data masking if a masked column adds value to your result.
In case there is not only one distinct type of sensitive data (such as email, phone number, etc.) in a column but continuous text, the ai_mask function can be handy which helps you mask specified entities in a given text using SQL.
You can also use tools such as the (pre-installed) Fernet encryption library to unobtrusively secure highly sensitive data instead of masking it. This gives you additional security, as sensitive data is encrypted at rest and can only be accessed by privileged users with the decryption key.

As an example, using hard-coded rule-based data masking, you can hide specific parts of columns in a deterministic way. In this example, we hide the host part of our IP addresses and only show the network part — to make sure this potentially sensitive data can still be used for analytics (e.g. is it coming from an internal/external network) while not exposing all details.

create view main.default.masked_audit as
select
  workspace_id,
  event_time,
  action_name,
  audit_level,
  -- mask the field 'source_ip_address' to show only the first half
  concat(regexp_extract(source_ip_address, '^(\\d+\.\\d+)\.(\\d+\.\\d+)', 1), '.xx.xx') as source_ip_address
from system.access.audit

Alternatively, using ai_mask can help to hide sensitive data within a column without the need for a clearly defined rule, specifically where a hard-coded data masking function is not feasible or too complex to implement. As an example, the full history of cluster configurations can include sensitive data in the cluster name (especially for personal compute clusters). Applying the ai_mask function the column values like Firstname Lastname’s Personal Compute Cluster would become [MASKED]’s Personal Compute Cluster and thus help us to hide sensitive data specifically for such values.

create view main.default.masked_clusters as
select 
  workspace_id,
  cluster_id,
  dbr_version,
  ai_mask(cluster_name, array('person')) as cluster_name_masked
from system.compute.clusters

Dynamic Views

However, if you need to apply specific filters or selectively mask columns for different groups of users, Dynamic Views come in handy. They allow you to configure fine-grained access control with a programmatic approach. Sophisticated logic — such as column-level and row-level filters and/or advanced data masking — can be implemented as a dynamic view to fine-tune filtering decisions at query runtime.

With Unity Catalog, you can leverage built-in functions to dynamically limit users’ ability to access a row, column, or even a specific record in a view definition. There are three functions available but we will only cover one of them: is_account_group_member() returns TRUE if the current user is a member of a specific account-level group. It is the recommended function to use in dynamic views against Unity Catalog data.

Things to keep in mind:

It is best practice to implement group membership-based checks and avoid specific user-based checks. Even if you want to provide access to only one user, we recommend using a group (and creating one for this specific permission) instead of validating the condition against their specific user.
Dynamic views can be used if maintaining separate views per group is becoming a maintenance burden. They help you to give access to different rows and columns of the same data set for multiple groups.
Consider modularizing various filters into different user-defined functions (UDFs) governed by the Unity Catalog. Instead of writing logic in a view definition, define it as a UDF and then call the appropriate function over a row/column filter or data masking function.

The following example illustrates how to create a dynamic view in Unity Catalog. It uses data masking on the column level to show only the source_ip_address to members of the group auditors and a row-level filter to ensure that only account_admins can see events on the account level.

create view main.default.redacted_audit as
select
  event_time,
  -- redact the field 'source_ip_address' and only show it to auditors
  case
    when is_account_group_member('auditors') then source_ip_address
    else 'REDACTED'
  end as source_ip_address,
  action_name,
  request_params,
  response,
  audit_level
from system.access.audit
where
  -- show ACCOUNT_LEVEL events only to the group account_admin and otherwise filter the results out for all other users
  case
    when is_account_group_member('account_admin') then TRUE
    else audit_level = 'WORKSPACE_LEVEL'
  end

In the result set, you can see that both access control patterns are executed (source_ip_address is redacted, and only account level events are hidden) as I am neither an auditor nor an account_admin.

Row Filters & Column Masking

Row Filters and Column Masking — also known as Row-Level Security (RLS) and Column-Level Mask (CLM) — are the latest features that enable fine-grained access control directly at the table level in Unity Catalog (whereas in the previous sections, we implemented access control via a view interface). You can apply this logic either at table creation time (CREATE TABLE) or later by defining the appropriate function and appending it to a table using an ALTER TABLE statement.

Things to keep in mind:

Using Row Filters and Column Masking you can apply the masking logic on a table directly without introducing any new object with a different name.
For complex security models leverage mapping tables to accomplish fine-grained control to implement different rules for groups based on your organizational structure.
Store your SQL user-defined functions (UDF) in Unity Catalog. These functions will be applied and evaluated for any table access operation, and therefore, temporary session-defined UDFs will not work.
Be mindful of various limitations which come with tables associated with row filters and column masks.

Reusing the example from dynamic views, the same logic could be implemented on the table level directly**: we are defining two UDFs with the filtering/masking-logic and registering them in Unity Catalog to then apply them on the table to hide sensitive data.

use main.default;

-- as system tables are read-only
create table system_access_audit as
select * from system.access.audit;

create or replace function filter_workspace_level (audit_level string)
return if(is_account_group_member('account_admin'), true, audit_level='WORKSPACE_LEVEL');

create or replace function redact_for_non_auditors (column string)
return if(is_account_group_member('auditors'), column, 'REDACTED');

alter table system_access_audit
alter column source_ip_address set mask redact_for_non_auditors;

alter table system_access_audit
set row filter filter_workspace_level on (audit_level);

Conclusion

These approaches are not mutually exclusive and can be combined based on your requirements.

First of all, if sensitive data does not add further insights, don’t ingest it: having no sensitive data is always better than having sensitive data. Second, as a simple rule of thumb: if you need to handle sensitive data, try to avoid dynamic views and instead use Row Filters and Column Masks where possible.

Dynamic views have their right to exist (e.g. if you need to rename columns, join data from different objects, or actively hide specific columns) and are especially viable as long as Row Filters and Column Masks have their current limitations.
For both options, use group membership checks instead of user checks.
Apply general engineering best practices and break up complex logic into reusable components (such as sub-functions or lookup tables/permission tables).
Keep it simple: use an ordinary view instead of a dynamic view if it is sufficient.

As a next step, discover our demo on how to define fine-grained access controls in practice with Unity Catalog. We also highly recommend having a look at Unity Catalog best practices for a holistic understanding of data governance in Databricks.

Note:

* DBSQL warehouses are based on the “Shared Access Mode” architecture. Therefore, we haven’t explicitly mentioned the limitations of specific functionalities in single-user access mode (read more about the differences between these two). Especially for Row Filters & Column Masking, you will not be able to access a table via single-user cluster access mode that has a filter or mask applied (at the time this blog is published — please ask your account team for a private preview that addresses fine-grained access control via single-user clusters).
** with one caveat: system tables are read-only, and therefore, we have to make a copy to reproduce the previously used example. Supporting row filters and column masking for system tables is on the roadmap.

Sensitive Data Governance Design Patterns in DBSQL

Introduction

Design Patterns

Views

Data Masking

Dynamic Views

Row Filters & Column Masking

Conclusion

Written by Databricks SQL SME