PII data privacy in Snowflake

Ritika Srivastava
Published in Voi Engineering
11 min read · Jun 13, 2023

At Voi, we use Snowflake as our data warehouse, importing data from different internal and external sources. Some of those sources contain PII whose exposure we must limit, such as details of our riders (name, phone, email, geolocation, etc.). This post covers that work. If you prefer listening in video format instead, please go ahead and take a look at our Heroes of Data Meetup talk.

Agenda

  • What is PII and why does it need to be protected
  • Voi’s PII journey; from idea to reality
  • PII implementation details
  • Areas of improvement and plans for the future

Who is Ritika?

Ritika is a Scientist turned Data Engineer. She started her career as a Research Scientist at big Pharma, and eventually fell in love with technology and data. She has worked extensively with big data, data pipelines and monitoring, data quality, and data governance. Currently, she works as a Data Engineer at Voi in their Data Platform team. When not working, she loves to spend her time traveling, trying out adventure sports, and indulging her taste buds in exploring different cuisines.

What is PII and why it needs to be protected

Personally Identifiable Information (PII) is primarily a term used in the US, while GDPR has a broader definition and calls it personal data instead. For more information, refer to the official definition: https://gdpr.eu/eu-gdpr-personal-data/

In short, personal data is any information used to identify an individual directly or indirectly. It can be information like your name, phone, email, biometric data, or any other sensitive information.

PII implementation journey; from idea to reality

Firstly, how did we approach the problem? Data discovery

We knew that rolling out PII masking was not going to happen overnight, so here is how we approached the problem:

Before you can protect your data, you need to know what data to protect. The first step was to identify tables in Snowflake that contain PII fields, then we listed these separately.

Next, gather requirements for PII masking: Snowflake only?

Next, we had to gather requirements for masking PII.

For us, one of the main factors to consider was whether to mask PII in Snowflake only, or also in our BI tools, Tableau and Metabase, which in our case read data from Snowflake. Since we have many users in Tableau and Metabase, we decided to mask PII in the BI tools as well.

Then, PII access survey sent out to all Snowflake, Tableau, and Metabase users

To get an understanding of how many users handle PII data in their daily operations, we sent out a PII access survey. All Snowflake, Tableau, and Metabase users who had been active in the last 90 days received the survey and were asked to specify which tables/reports containing PII they use frequently.

We used this information to plan a non-disruptive PII masking rollout and to learn which user groups needed access. The idea was that this would minimize the data teams' workload in handling PII requests after masking was rolled out.

Time for brainstorming sessions

The survey was up and running for 2–3 weeks. In the meantime, we held several brainstorming sessions to explore the possible solutions for masking PII from Snowflake, Tableau, and Metabase.

POC on one table

Once the solution was finalized, we did a proof of concept (POC), masking PII on one of the tables in Snowflake. I’ll discuss the implementation details later in this article.

Then, after evaluating all responses from the survey, we granted PII access to the relevant people.

Then, PII access user guide

With a solution in place to mask sensitive information, we still had some remaining things to solve.

We needed a process in place for users to ask for PII access, so we started working on this PII access user guide to streamline the process of requesting and accessing PII data. As part of this initiative, we have introduced a Data Request Portal, which serves as our internal service desk. Within the portal, you can find a dedicated section specifically designed for submitting PII access requests. Here, you must provide mandatory information about why PII access is required, what PII fields you need access to, the name of the tables containing PII data, and in which service(s) you need to access PII data.

Once the request is approved by your line manager and our Head of Data Analytics, it moves to the relevant data team’s Jira board so that they can grant the access. Furthermore, we had to create documentation covering the solution implementation, the process for masking and tagging PII columns in new datasets, the PII access user guide, and the process for data teams to grant PII access in the different tools.

Finally, PII masking rolled out

Finally, when we had done our homework, we rolled out the masking gradually, table by table, over 2–3 weeks to minimize disruption to day-to-day operations.

We started off rolling out masking for less frequently used tables and then gradually moved towards the more frequently used tables.

Summary of steps:

  1. Data discovery
  2. Gather requirements: Snowflake only?
  3. PII access survey sent out to all Snowflake, Tableau, and Metabase users
  4. Brainstorming sessions
  5. Exploration of possible solutions
  6. POC on one table
  7. PII access user guide: Process for users to ask for PII access
    - Data Request Portal
    - Why do you need PII access?
    - PII data fields that you need access to
    - Name of tables/reports containing those data fields
    - Where do you need to access PII data?
  8. Documentation
  9. PII masking rolled out table by table over the next 2–3 weeks

Implementation details

Let’s discuss the actual implementation of PII masking.

During our investigation of PII masking, we found Snowflake’s dynamic data masking (DDM) feature, which is natively supported and available to customers on the Enterprise edition and above. In short, DDM is a column-level security feature that uses masking policies to mask sensitive data in a table at query time, based on the user’s role. This solution was a great fit for us: we didn’t have to make copies of the data, which would have grown our codebase, and we also saved on resources.

This flowchart shows a high-level overview of how to use Snowflake and dbt to mask sensitive data.

  1. First, we create masking policies in Snowflake for PII columns.
  2. Then, we create roles in Snowflake that can access sensitive data.
  3. Finally, we apply these policies to PII columns, which is done with the help of a dbt feature called macros.

What are column masking policies?

Masking policies are schema-level objects containing rules that control what different users see when looking at the same dataset. This means you only need one dataset: depending on the permissions set for their role, users view either masked or unmasked values.

With DDM, sensitive data is not modified in the existing table, and no static masking is used. Instead, when users execute a query to which a masking policy applies, the policy’s conditions determine whether unauthorized users see masked, partially masked, or unmasked data. You can apply masking policies to individual columns in any model. Depending on your use case, you can have different policies for different types of PII columns, or reuse the same policy on different columns, provided the columns have the same data type.

Here is the code snippet for the masking policy on the phone number column.

The code works like a CASE WHEN statement: if you have any role that inherits from PII, the actual value is visible; otherwise, you only see part of the number. The first 4 digits are visible, while the rest are masked with asterisks. In the ELSE condition, you have the option to display anything other than the actual value, meaning you can use any expression or logic, depending on how you want the masked column to look for an unauthorized user.
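A minimal sketch of such a policy, consistent with that description, could look like this (the policy name and the PII role name are illustrative, not necessarily Voi’s exact code):

```sql
-- Sketch of a phone-number masking policy.
-- Roles that inherit from the abstract PII role see the real value;
-- everyone else sees the first 4 digits followed by asterisks.
CREATE OR REPLACE MASKING POLICY phone_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN IS_ROLE_IN_SESSION('PII') THEN val
    ELSE LEFT(val, 4) || REPEAT('*', GREATEST(LENGTH(val) - 4, 0))
  END;
```

IS_ROLE_IN_SESSION checks the whole role hierarchy, which is what makes the abstract-role approach work: any role granted the PII role passes the check.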

Snowflake Roles Hierarchy

Now I’ll briefly talk about the new roles we use for column masking.

The “Snowflake Roles Hierarchy” image shows a simplified before-and-after picture of our role hierarchy in Snowflake. An arrow pointing towards a role indicates that this role inherits all privileges from every role beneath it, meaning it can do everything that those roles can do.

Let’s focus on the roles developer and reporter. The developer role is used for developing models or tables in Snowflake using dbt, and the reporter role is used by Tableau and Metabase to read data from Snowflake. To handle PII masking we created an abstract role called PII. This role in itself does not do anything, but any role that inherits from this role can see personal data.

Before implementing PII masking, we had developer and reporter roles with access to view all data. Now we have developer_pii and reporter_pii roles that can access sensitive data, while developer and reporter have no access to sensitive data by default. Anyone who needs access to personal data must send a request through our Data Request Portal, following the defined process.
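The hierarchy described above could be set up along these lines (a sketch under the stated assumptions; the exact grants at Voi may differ):

```sql
-- Abstract PII role: grants nothing by itself, but masking policies
-- check for it, so inheriting it unlocks unmasked values.
CREATE ROLE IF NOT EXISTS pii;

-- The PII-enabled variants inherit both the base role's privileges
-- and the abstract PII role.
CREATE ROLE IF NOT EXISTS developer_pii;
GRANT ROLE developer TO ROLE developer_pii;
GRANT ROLE pii       TO ROLE developer_pii;

CREATE ROLE IF NOT EXISTS reporter_pii;
GRANT ROLE reporter TO ROLE reporter_pii;
GRANT ROLE pii      TO ROLE reporter_pii;
```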

The data masking macro

After we decided to implement Snowflake’s DDM feature, we needed a way to scale it across all our dbt models. To do this, we created a simple, easy-to-use dbt macro called apply_column_masking. Macros are a dbt utility that help you write generic code that can be reused across your dbt project; they are analogous to functions in a programming language. You run the macro as a post-hook in the config section of a dbt model, passing the masking policies and desired PII columns as a dictionary: the keys are the names of the PII columns and the values are the names of the masking policies to apply.
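As a hedged sketch, the invocation from a model’s config section could look like this (model, source, and policy names are illustrative):

```sql
-- models/account.sql (illustrative)
{{ config(
    materialized='table',
    post_hook="{{ apply_column_masking({'name': 'name_mask',
                                        'phone': 'phone_mask',
                                        'email': 'email_mask'}) }}"
) }}

select id, name, phone, email
from {{ source('app', 'accounts') }}
```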

This part of the macro performs the PII masking for us. Essentially, it runs an ALTER command on each column in the macro’s input dictionary and applies the corresponding masking policy to it. In this example, the macro masks the name, phone number, and email columns.
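The core of such a macro could be sketched as follows (a hedged reconstruction of the idea, not Voi’s exact code):

```sql
-- macros/apply_column_masking.sql (sketch)
{% macro apply_column_masking(column_policies) %}
  {# Keys are column names, values are masking policy names. #}
  {% for column, policy in column_policies.items() %}
    {# `this` resolves to the model's relation in a post-hook. #}
    {% do run_query(
      "alter table " ~ this ~ " modify column " ~ column ~
      " set masking policy " ~ policy
    ) %}
  {% endfor %}
{% endmacro %}
```

Running each ALTER via run_query keeps the post-hook to a single macro call even when several columns need policies.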

What did the apply_column_masking macro do?

After running the model with a post-hook that invokes the apply_column_masking macro for the name, phone, and email fields, querying the table in Snowflake shows that only roles with access to sensitive data, in our case developer_pii and reporter_pii, can view the actual data. All other roles see masked values based on the logic set in the masking policy applied to each column. For example, where the phone number masking policy applies, unauthorized users can only view the first 4 digits of a number, while the rest are masked with asterisks. For a name, only the first letter is visible and the rest is masked with asterisks. For email, we chose to show only the domain name to unauthorized users.

Add meta tags in dbt to tag PII columns

With dbt you can generate documentation for your dbt project and render it as a website called dbt docs, which is the data catalog dbt offers out of the box.

We at Voi are looking into more advanced options like DataHub, Amundsen, etc., but this is what we have for now. To enrich dbt docs with PII information, we use meta tags to tag PII columns in our dbt models. The meta tag is a dbt property used to set metadata for a resource; once set, that metadata is visible in the generated dbt docs.

In the first screenshot, we have added meta tags to the account model in dbt and to the PII columns within it; after generating the dbt docs, these are visible as displayed in the second screenshot. The excerpt of dbt docs shows that the account table contains PII, and the same information is also shown at column level for the name, phone, and email fields.
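A schema.yml along these lines could carry the tags (a sketch; a meta key like contains_pii is a project convention, not a reserved dbt name):

```yaml
# models/schema.yml (illustrative)
version: 2

models:
  - name: account
    meta:
      contains_pii: true
    columns:
      - name: name
        meta:
          contains_pii: true
      - name: phone
        meta:
          contains_pii: true
      - name: email
        meta:
          contains_pii: true
```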

In addition, you can use meta tags for other use cases. At Voi, we use meta tags to tag model owners and team owners for individual dbt models. We also have a pipeline that gets this metadata into Snowflake and a dashboard built on top of it, that tracks team owners and model owners for individual dbt models. Going forward, we want to use this metadata to decentralize dbt alerts to team-specific Slack channels.

Areas of Improvement and Plans for the Future

This is an ongoing effort and there are still many areas where we can improve. A high priority is to make the entire setup more user-friendly. Currently, analysts must call the macro manually from the config section of a model and tag the PII columns of that model. Instead, we want the macro to be invoked automatically when PII columns of a dbt model are tagged, so that analysts don’t have to invoke it themselves.

We also want to set up some sort of PII monitoring on data to track what PII columns are being masked and in which tables.

In addition, we want to explore and evaluate the possibility of automating masking end-to-end, for example through Snowflake’s data classification feature, which automatically classifies and tags PII columns, combined with features that automatically mask PII based on those tags.

Something else worth looking into is Snowflake’s external tokenization feature, which can tokenize data before loading it into Snowflake and then detokenize it at query runtime. We think this could also help handle “right to be forgotten” GDPR requests. Time-based access for users who request PII access is also very interesting: currently, users receive indefinite access, which we want to restrict for security and data privacy compliance reasons.

Furthermore, we want to explore additional security features like row-level security for more granular access.

Finally, I would like to point out that if you have a use case at your company where you need to join or group by on a PII column, you can use a hash function instead of asterisks in your masking policy for PII columns. Asterisk-masked values don’t work with join and group by operations, since every masked value looks the same. At Voi we didn’t have such a use case, so we went with asterisks. If your use cases look different from ours, keep this in mind.
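As a sketch, such a policy could hash instead of redact (SHA2 is one option; a keyed hash would resist rainbow-table attacks better, and the policy name here is illustrative):

```sql
-- Hashed values stay consistent per input, so joins and group-bys
-- still line up across tables even for unauthorized roles.
CREATE OR REPLACE MASKING POLICY email_join_mask AS (val STRING)
RETURNS STRING ->
  CASE
    WHEN IS_ROLE_IN_SESSION('PII') THEN val
    ELSE SHA2(val, 256)
  END;
```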

Btw, we’re continuously on the hunt for talented data individuals! Take a look at our career site for vacancies within your area of expertise.
