Secure your sensitive data with External Tokenization using Snowflake External Functions — Part 1

Ritabratasaha
6 min read · Oct 30, 2022


Every enterprise application faces the challenge of protecting sensitive data. In this article we will explore how a security admin can leverage Snowflake to protect that data. If you are already using Snowflake and have a requirement to implement tokenization as part of your data security initiative, please read through this before onboarding a new purpose-built tool into your IT landscape, because such tools come at a considerable price.

I have decided to split this article into two parts. The first part covers the basics of the system setup, and the second part will focus on making the setup more scalable and on bringing in enough automation to extend it to other objects and attributes in the database.

Data protection has so far been addressed by different cryptographic techniques, and encryption is the most widely used of them. When implemented correctly it is one of the most effective controls available to a security admin: it renders information unreadable to unauthorised users and protects data both at rest and in motion. It is worth mentioning that in Snowflake all data is encrypted by default. Tokenization is another method that complements encryption. It replaces the original sensitive data with non-sensitive, synthetic placeholders which, unlike encrypted values, are not reversible.

This article will not deep dive into the details of these techniques or debate which one is better than the other. When asked to choose between the two, a security professional will always respond with “it depends”, which is quite reasonable because both techniques have their own pros and cons depending on the use-case under consideration.

Tokenization is frequently applied in digital payment processing, where sensitive information such as credit card details or bank account numbers is stored, encrypted, in a separate token database rather than in the application database itself. With this use-case in mind I will design an architecture to tokenize credit card information using Snowflake External Functions, AWS Lambda and AWS DynamoDB as the token database. If you are new to creating external functions on Snowflake, please visit the product documentation for details.

Let’s look at a basic tokenization architecture below.

I am assuming that the process flow is self-explanatory, so in the next architecture diagram I will quickly map it onto the Snowflake way of implementing it.

Let’s review the infrastructure provisioned for the experiment:

I have generated sample customer data from this website. You can see that the credit card information is clearly visible. My aim is to tokenize this information using an external function and ensure that the original data stored in the token database is revealed only to authorised users.

Let’s create two external functions: 1. ext_write_token and 2. ext_read_token. The first function creates a token from the real value and replaces the original card number with the generated token, whereas the second retrieves the original card number from the token database and makes it visible to authorised users. To ensure that the original value of the card is revealed only to authorised users I am using column masking policies in Snowflake. For the purpose of my experiment I have used the popular uuid Python package to generate random values, which are well suited to be treated as unique tokens.

Here are the definitions of the two external functions:

Create Two External Functions for Tokenization and Detokenization
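In case it helps to see them as SQL, here is a minimal sketch of those definitions. The integration name, AWS role ARN and API Gateway endpoints are placeholders you would replace with your own; the function signatures mirror the row layout that the Lambda handlers shown further down expect.

-- Placeholders: replace the role ARN and the API Gateway URLs with your own.
CREATE OR REPLACE API INTEGRATION token_api_integration
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/snowflake_ext_fn_role'
  API_ALLOWED_PREFIXES = ('https://<api-id>.execute-api.<region>.amazonaws.com/prod')
  ENABLED = TRUE;

-- Tokenizer: takes the token table name, the row's primary key, the attribute name
-- and the real value, and returns the generated token.
CREATE OR REPLACE EXTERNAL FUNCTION ext_write_token(
    tablename VARCHAR, pkey VARCHAR, attrname VARCHAR, attrvalue VARCHAR)
  RETURNS VARCHAR
  API_INTEGRATION = token_api_integration
  AS 'https://<api-id>.execute-api.<region>.amazonaws.com/prod/tokenize';

-- Detokenizer: takes the primary key and attribute name and returns the original value.
CREATE OR REPLACE EXTERNAL FUNCTION ext_read_token(pkey VARCHAR, attrname VARCHAR)
  RETURNS VARCHAR
  API_INTEGRATION = token_api_integration
  AS 'https://<api-id>.execute-api.<region>.amazonaws.com/prod/detokenize';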

These external functions invoke an API Gateway endpoint which triggers the Lambda functions. One of them is the token creator (tokenizer) and the other is the original value retriever (detokenizer).

Here are the two Lambda functions:

## TOKENIZER ##
import json
import uuid
import boto3

def lambda_handler(event, context):
    return_text = []
    client_dynamo = boto3.resource('dynamodb')

    # Snowflake batches rows as [row_number, tablename, pkey, attrname, attrvalue]
    for rows in event['data']:
        tablename = rows[1]
        pkey = rows[2]
        attrname = rows[3]
        attrvalue = rows[4]

        # Generate a random, non-reversible token for this attribute value
        unique_key = str(uuid.uuid4())

        # Store the original value together with its token in the token database
        table = client_dynamo.Table(tablename)
        table.put_item(
            Item={
                'id': pkey,
                'CreditCard': attrvalue,
                'CreditCardToken': unique_key
            }
        )

        # Return the token to Snowflake against this row number
        return_text.append([rows[0], unique_key])

    return {
        'statusCode': 200,
        'data': return_text
    }
## DETOKENIZER ##
import json
import boto3

def lambda_handler(event, context):
    return_text = []
    client_dynamo = boto3.resource('dynamodb')

    # Snowflake batches rows as [row_number, pkey, attrname]
    for rows in event['data']:
        pkey = rows[1]
        attrname = rows[2]

        # Look up the original value for this key in the token database
        table = client_dynamo.Table('customer_attribute_tokens')
        response = table.get_item(
            Key={
                'id': pkey
            }
        )

        # Return the original attribute value to Snowflake against this row number
        return_text.append([rows[0], str(response['Item'][attrname]).strip("\"")])

    return {
        'statusCode': 200,
        'data': return_text
    }
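With the functions and Lambdas in place, tokenizing the source table is essentially a single UPDATE that routes each card number through ext_write_token. This is only a sketch: the customer table and its id and credit_card columns are the illustrative names from my sample dataset, while customer_attribute_tokens is the DynamoDB table used by the Lambdas above.

-- Replace the real card numbers in place with the generated tokens
UPDATE customer
   SET credit_card = ext_write_token('customer_attribute_tokens', id, 'CreditCard', credit_card);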

We are all set to see the effect of the tokenization process on the original data. However, before we do that, let’s define who is authorised to see the actual value. I have created a database role, “Masking_Admin”, which is authorised to see the original value; anyone else will always see the tokens and not the real data. Below is the definition of the policy that makes this happen.
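A sketch of that policy is below. I am assuming a conditional masking policy where the masked column stores the token and the customer’s id column is passed in as a second argument so the policy can call ext_read_token for the DynamoDB lookup; the policy, table and column names are illustrative.

-- Conditional masking policy: the first argument is the masked column (the token),
-- the second is the customer id that ext_read_token uses to look up the original value.
CREATE OR REPLACE MASKING POLICY credit_card_detokenize
  AS (cc_token VARCHAR, cust_id VARCHAR) RETURNS VARCHAR ->
  CASE
    WHEN CURRENT_ROLE() = 'MASKING_ADMIN' THEN ext_read_token(cust_id, 'CreditCard')
    ELSE cc_token
  END;

-- Attach the policy to the credit card column of the (illustrative) customer table
ALTER TABLE customer MODIFY COLUMN credit_card
  SET MASKING POLICY credit_card_detokenize USING (credit_card, id);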

Here is the final result. When you log in using the role “Masking_Admin” you can see the real credit card values, because as per the policy this role is authorised to see the real data.

However, on logging in with any other role (in this example it’s ACCOUNTADMIN) you see the credit card numbers replaced by tokens.
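To make that concrete, a verification would look roughly like this (same illustrative table and column names as above):

-- As MASKING_ADMIN the policy calls ext_read_token, so the real card numbers appear
USE ROLE MASKING_ADMIN;
SELECT id, credit_card FROM customer LIMIT 5;

-- As any other role (here ACCOUNTADMIN) the stored tokens are returned as-is
USE ROLE ACCOUNTADMIN;
SELECT id, credit_card FROM customer LIMIT 5;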

Now here is a snapshot from our token database, DynamoDB:

Before I end this part of the article, here are a few thoughts for the upcoming one.

  1. I experimented with 100,000 records to start with. However, I could see DynamoDB on-demand capacity throttling both writes and reads at around the 5,000-record mark. Though I don’t see a reason why an admin would want to pull more than 100 credit card numbers in one shot, the system should offer enough scalability to report on at least 5% of the whole dataset at any given point.
  2. The situation might be different for a data analyst who wants to see all the real values of the sensitive attribute for analytics. But this request need not be fulfilled in real time, as the analyst can wait a few minutes until the tokens are replaced by the original data in the background. This calls for asynchronous operations with the necessary error handling.
  3. You may have noticed that I have tokenized a single sensitive data element in the source dataset. What if there are a few more? Do I need more policies and more DynamoDB tables to address this, or is there a possibility of automation to make this scale dynamically, at least at a dataset level?
  4. The design of the DynamoDB table (the token database) depends on the read pattern of the data. Should we try to automate the design of new tables as well? As I type this, I don’t think so, because tokenization requirements are generally not very dynamic; in other words, you don’t onboard new sensitive data elements every day, do you? So trying to automate new table design might be overkill from an ROI perspective.

While I have these questions in mind, I am still looking for ways to address them. Let’s see how they can be tackled in the next part of the article.

References:

  1. https://cdn.ymaws.com/www.members.issa.org/resource/resmgr/journalpdfs/choosing_tokenization_or_enc.pdf
  2. https://securosis.com/assets/library/reports/Securosis_Understanding_Tokenization_V.1_.0_.pdf
