6 Steps to Secure PII in Snowflake’s Cloud Data Warehouse
How to Use Built-In Snowflake Security Features to Secure Personally Identifiable Information (PII)
by Paul LaCrosse
In today’s data-driven business culture, we trust that the organizations and systems that store our Personally Identifiable Information (PII) are secure and that any data that could potentially identify us as an individual is locked away.
A wide variety of industries collect and use PII data as part of their day-to-day business process including retail, healthcare, transportation, government agencies (local, state, federal), and financial services to name a few.
While most businesses won’t ever come close to Amazon’s scale whose Prime Members Ordered More Than 2 Billion Products for 1-Day or Sooner Delivery in 2018, every time we order anything online, submit a credit application, open a financial services account, check into a hospital, pay taxes, schedule an Uber, see our doctor for a cold, or even take a Lime or Bird scooter ride, our PII data is provided to those organizations.
The Challenge of PII
The United States’ GSA states that “The definition of PII is not anchored to any single category of information or technology. Rather, it requires a case-by-case assessment of the specific risk that an individual can be identified. In performing this assessment, it is important for an agency to recognize that non-PII can become PII whenever additional information is made publicly available — in any medium and from any source — that, when combined with other available information, could be used to identify an individual.”
Any particular piece of this information may be sensitive, or non-sensitive, depending upon whether the unencrypted information may be simply discovered from publicly available sources. Further, sensitive PII is that which, if uncovered, may result in harm to an individual. Generally, things such as unique ids (SSNs, etc.), medical, biometric, and financial information fall under this category.
Snowflake Enterprise for Sensitive Data (ESD) provides a feature set which allows its users to easily protect sensitive PII, even allowing the possibility of meeting HIPAA regulations where required. Less stringent protections may be met using only the standard Snowflake DB offering.
Since security is a top priority in the Snowflake Architecture, every version of Snowflake fully encrypts all information on disk, always. It also uses industry-standard Transport Layer Security (TLS) to protect data during ingress and egress using any of their drivers.
On to the 6 Steps
For this post, we’re going to show how to convert a hypothetical business database, using only six steps. In order to focus only upon the relevant steps, we’ll simply state which tables have columns which have been determined by your organization to contain PII, or sensitive personal information (SPI).
Our demonstration begins with a database containing three schemas, each covering a particular business area. As would be the case in almost any business, assume that hundreds of queries are already in use, accumulated over years.
How can we best meet PII requirements during our transition to the Snowflake cloud database? And, can it be done in the least disruptive way possible? After all, nobody wants to have to completely rework hundreds of queries, while triaging endless business disruptions, during this transition.
Step 1: Locate All Columns Containing PII
By necessity, this step will require intimate knowledge of your database and business processes. For instance, do your tables all utilize a simple, sequential integer to act as the row ID (surrogate key), or do some of your tables make use of actual data for handling joins?
For our example, we’ve skipped ahead to the end result, highlighting above any table that we’ve determined contains any PII, whether all columns, or only a single one meets the definition.
Step 2: Segregate Tables Containing PII
Now that we know where the “problem data” lies, all we have to do is move the tables which contain it, out from their original schema, and into a new PII_SEC schema which we’ve just created to hold them.
The exact commands to do this for your situation will vary, of course, but I’m certain you get the idea. Make certain that your new schema does not have any default grants, as you want these new tables to be inaccessible unless you explicitly allow someone to interact with them.
After the tables have been moved, you’ve theoretically now protected all of your PII data. Congratulations! But, at this point, your database is now completely unusable for everyone, since nobody’s queries, which referenced these tables in any way, work anymore.
Step 3: Protect the Segregated Tables
Since completely inaccessible information isn’t what we’re going for (that would be a job for BleachBit), we’re now going to create a new role, inexplicably named PII_ROLE.
Grant permissions for this role to each table you’ve moved into the secure schema. In the alternate, you could’ve created the role and granted permissions on the schema, allowing them to be inherited as you created and populated your secured tables within.
Now, just grant this role to all of your Very Special Users, and you’re good to go! Right?
Not so fast. What about everyone else that doesn’t have a seat on the Lear?
Step 4: Replace Original Tables with Masked Data
Since we just broke that database for everyone that is not in the new PII_ROLE, we need to provide them with the secured substitute, where the PII columns show a mask string, instead of the real data.
Just as we created a schema for PII_SEC, we’re now going to create another one, PII_MASK. Now, we’ll add some data to this schema. But, it won’t be a limited copy of the tables in PII_SEC. Instead, we’re going to make use of the handy Snowflake feature called “Secure Views”.
The link provided expands upon when, and when not, to utilize a secure view. The short answer is to use it when protecting information, such as PII, but not for views which are merely conveniences for querying. All that is required, over creating a regular view, is to insert the “SECURE” keyword into the DDL, prefixing the “VIEW”. Example: CREATE SECURE VIEW …
You may be thinking to yourself “Why bother with this, isn’t a regular view good enough? What’s different?”
The simple answer is that some query optimizations are turned off, altering the order of evaluation for portions of the query. This in turn then prevents users from discovering data values, indirectly, via a trial and error process, by noting what errors are returned from Snowflake. More detailed examples are provided here. This feature of Snowflake is critical to providing maximum protection for your PII.
Step 5: Grant Regular Users Access
For our case, we’ll also create a new role, called NON_PII_ROLE, and add users to it — even the users already in PII_ROLE. This will allow them to switch their use just by sending a USE SCHEMA.
The key takeaway here is that we have two schemas, which both present the same data template.
The secure one has the real tables, with columns all containing the true data, whether PII or not. Instead of real tables in he other schema, secure views are substituted in their place. We took care that for each “real table” in the PII_SEC schema, we created in this alternative a secure view with the same name, emitting the same columns, in the same order, where each has the same data type as the real table.
Why is this so important to do? Ease of use, in that, from a SQL query point of view, we want the exact same SELECT statements to work in the same manner, whether we wish to view the actual values (and are members of a role allowing it), or, for general day-to-day purposes, the PII is masked, instead showing a replacement string, projected out by the view.
Step 6: Use Your Newly Compliant Database
At last, we arrive at the fun part! We’ve just used a simple, minor reorganization to a minimal portion of our database, to fully implement our PII protection plan.
As you can see illustrated above, the best part is that conversion of our existing queries will involve, at most, simply changing which schema is the session’s default, and removing the schema specification from the tables which contain any PII.
This means that every query, once transformed, should work for all users.
Just by changing their default schema, you can control whether they are able to view the real data, or a masked cover. Individual users that have permission for both can run their same queries, and control their output using only an intervening USE SCHEMA statement.
The amount of tedium is vastly reduced in your conversion, since the above process is very amenable to global search and replace (or egrep, xargs, and sed for you command-line aficionados).
Ah, SQL nirvana, you’ve never looked so good!
What About the Future
Just as you’re kicking back and celebrating your successful project, delivered early and under budget, someone bursts into your office with the newest emergency!
“Hey, you know that new database you brought up? It’s too restrictive! The V.P. wants to demonstrate what it looks like for a secure user, but without showing our real customer data. We can’t figure out what to do. Only XXXX is shown everywhere, or we get the real info. Oh, and the worst part is, the demo is already scheduled for next week. As they say, you need to make it happen”
The same people who swore last month that they’ve already provided you every possible business use case they can think of, are back again with another challenge. How much pain is going to be involved this time?
Luckily for you, the approach you just implemented is extendable, with almost zero extra effort, and with no disruption to any of the work you’ve already completed!
As alluded to in the illustration above, the “swappable default schema” approach isn’t limited simply to the binary choice of “real or masked”. Any number of additional alternatives may be put into place, at any time.
Have a user that doesn’t want to see protected column data as mask characters (ex: “XXX-XXX-XXXX”)? They want NULLs instead? Just create another schema named PII_NULL and create a set of secure views there which return NULLs instead of mask strings. Implement the emergency demo request in the same manner, with a PII_DEMO schema, where the secure views are written to select from some random, but realistic-looking, information.
To use these new options, the only change required is to the USE SCHEMA to utilize a different default. No database copies, SQL finagling, client code #IFs or equivalents required. All the existing client code, from simple SQL queries, to elaborate front-end applications, should work, as long as there is a means to have it use the right schema.
You’ll be back to your, now expanded, celebration before you know it, dreaming of your next office upgrade. At this point, maybe you’re even secretly hoping for more “emergencies.”
I hope that I’ve helped you, in six steps, learn how to secure your PII data using the built-in Snowflake security features, specifically Roles, default Schemas, and Secure Views.
Without the additional burdens of complexity and management that traditional solutions force customers to take on, you may turn the potential quagmire of PII into a simple, elegant, and extensible approach with Snowflake’s Cloud Data Warehouse.
Other Snowflake Stories from Hashmap
If you’d like insight into other Snowflake topics that the Hashmap team has blogged about recently, please check out both Snowflake’s Cloud Data Warehouse — What I Learned and Why I’m Rethinking the Data Warehouse and Quick Tips for Using Snowflake with AWS Lambda.
Also, at Hashmap, we provide consulting and migration services and solutions to customers across industries that leverage the simplicity, immediate elasticity, pay-as-you-go consumption model, and all-in-one cloud data warehouse approach that Snowflake delivers — we’d be glad to assist you in your Snowflake and cloud journey.
Feel free to share on other channels and be sure and keep up with all new content from Hashmap here.
Paul LaCrosse is a Consulting Systems and Software Architect, with more than 35 years of experience in solving some of industry’s most difficult problems. He provides Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for Hashmap customers.
Be sure to catch Hashmap’s Weekly IoT on Tap Podcast for a casual conversation about IoT from a developer’s perspective.