Data is one of the most valuable commodities a company has, and such a valuable commodity should be protected properly. Today’s world is full of data protection regulations (GDPR, CCPA, HIPAA, etc.) whose main aim is to protect users’ data. These regulations specify restrictions and rules to be followed by data platform providers and data processors. Data security should always be the number one priority in every data-related project.
Snowflake offers many security features by default, but we decided to take security to the next level from a data access perspective. We want to protect our users’ data in the open world of the Internet and ensure that it stays secure no matter what happens. We therefore decided to leverage more Snowflake features beyond Role-Based Access Control (RBAC) to increase our data security.
In this article I would like to point out the key aspects of our security model in relation to Snowflake features and show how you can increase the security of your data projects with features like reader accounts, external functions, custom encryption, or secure views.
Let’s start with a quick explanation of our use case and why we need this kind of security model. We have a data app which is publicly available on the Internet. Of course, users are first authenticated and then authorized to see the right content, but it is still not so common to have a BI application that is publicly available. Usually BI tools and apps are internal, hidden behind corporate firewalls and available only through a VPN or the internal corporate network. Very often data apps are used only by internal employees to support decision making, influence company business, or provide valuable insights from different domains. That is not our case. Our application lives in the open world of the Internet. We must be more concerned about user security, because any hack, leak, bug, or gap in specs, testing, or development could end in a serious incident where sensitive user data becomes publicly available to anyone (the worst nightmare) or to the wrong user (still a bad nightmare).
With the new security model we want to ensure that in case of any issue, whether human error or a machine/system failure, the data will remain secure and uncompromised.
Let’s dive into the solution, but first a quick recap of Snowflake’s built-in security features.
Snowflake offers the many security features a modern cloud data platform should have: encryption of all communication, multi-factor authentication, federated authentication & SSO, dynamic data masking, Time Travel, and many more. All of them make Snowflake a robust and secure platform ready for a wide range of use cases while fulfilling enterprise demands for a reliable and secure platform in the cloud. For the full list of Snowflake security features, please follow this topic in the documentation, and be aware that the availability of a particular security feature may depend on the Snowflake edition; not all of them are available in the Standard edition.
So why do we want to build something special and more advanced if Snowflake is already a robust and secure platform? Because we developers, or humans in general, are not perfect beings and we make mistakes. We want to improve application-level security and be sure that the data in our solution will always be safe and protected on the application level, not only on the platform level. Snowflake offers many other features which can help solve this. Let’s dive into our architecture and how we use features like custom encryption or reader accounts.
Our architecture leverages several features which do the trick when you put them together. The architecture itself is general enough to be reusable for different use cases. I can imagine the same approach being useful in similar scenarios:
- An internal DWH where you need to distribute parts of it (only some tables, or schemas) to different internal departments with a solid data isolation requirement
- An internal DWH where you need to share different parts with different external partners. Today’s DWH solutions are full of data distribution pipelines
- Generally, all use cases where you need reliable data isolation for distributed data pipelines on top of a common platform with standard access (native connectors, ODBC, etc.)
Reader Accounts & Data Sharing
Data separation was one of the first options we considered for increasing security. Our data warehouse consists of multiple areas, and not all of them are exposed through the web application. We also have multiple layers (from staging and working layers up to analytical tables), where only the top level is accessible to end users. We wanted to separate the publicly available analytical tables from the rest of the DWH as much as possible. Snowflake offers two features which make such separation easy, without any additional cost and with no administration or configuration overhead.
The first feature is Secure Data Sharing, where selected database objects can be shared with another Snowflake account. Data is shared, not copied, so there is no extra storage cost and no need to replicate the data. Data shares are read only, and the provider simply manages the grants for the objects available in the share. With a data share you create an isolated security domain where it is easy to disable all access, and you have a simple role-based access control design to maintain.
Because data sharing works only between Snowflake accounts, you need a secondary Snowflake account which will be the data consumer. And here comes the second feature, called a reader account.
The reader account is the data share consumer, and as such it is again managed by the provider. You get a new URL endpoint for the reader account, with access only to the objects which are part of the data share. This means your application connects to this new endpoint, and there is no relation to the provider account where the data from your other domains or DWH layers lives. All of that is completely invisible from the reader account.
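Setting this up takes only a few statements. A minimal sketch, with hypothetical object names (`analytics_db`, `orders_v`, `app_reader`) and placeholder values in angle brackets:

```sql
-- In the provider (primary) account: create a share and grant objects to it
CREATE SHARE analytics_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE analytics_share;
GRANT USAGE ON SCHEMA analytics_db.public TO SHARE analytics_share;
GRANT SELECT ON VIEW analytics_db.public.orders_v TO SHARE analytics_share;

-- Create a managed reader account; you get a new URL endpoint for it
CREATE MANAGED ACCOUNT app_reader
  ADMIN_NAME = 'reader_admin',
  ADMIN_PASSWORD = '<strong password>',
  TYPE = READER;

-- Add the reader account as a consumer of the share
ALTER SHARE analytics_share ADD ACCOUNTS = <reader_account_locator>;
```

From that point on, the application connects only to the reader account’s endpoint and sees nothing but the shared objects.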
Side Note 🔖
Can you imagine no more data exports and no more sharing data over cloud storage or SFTP servers? Data sharing eliminates the overhead of traditional legacy solutions, removing the need to create several copies of the data.
To wrap up, creating a secure data share and a reader account brings the following benefits:
- isolated security domain
- read only access to needed tables only
- simple RBAC design to maintain
- easy to disable all access
We exposed data from the primary account to our reader account in the form of views, because we do not want users to have access to the underlying tables and because we have additional decryption logic on top of those tables. Instead of “normal” views we use Snowflake secure views. You can’t even use normal views in data shares; they must be secure.
If a view is secure, its definition is not available to users: they can’t see the code behind the view, i.e. which tables, UDFs, or data transformations are part of it. This is another piece of the security and privacy puzzle, as it is not possible to backtrack anything from the reader account to your data in the primary account. The view definition is only visible to authorized users, and you do not even have to grant the underlying tables to the data share. In our case the view definition contains a call to a user-defined function for data decryption, so this functionality is completely hidden in the reader account. Users can’t see the function definition, its input parameters, etc. It is not even obvious that there is any data encryption/decryption in place.
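A sketch of what such a secure view could look like; all names, including the `decrypt_text` UDF, are hypothetical stand-ins for our actual objects:

```sql
-- In the primary account: a secure view that decrypts data via a UDF.
-- Neither the underlying table nor the UDF body is visible
-- from the reader account.
CREATE OR REPLACE SECURE VIEW analytics_db.public.orders_v AS
SELECT order_id,
       decrypt_text(customer_name_enc, account_id) AS customer_name,
       amount
FROM analytics_db.private.orders_enc;

-- Only the secure view is granted to the share; the table stays hidden
GRANT SELECT ON VIEW analytics_db.public.orders_v TO SHARE analytics_share;
```

Querying `orders_v` from the reader account returns plain, decrypted data, while `SHOW VIEWS` / `GET_DDL` reveal nothing about how it is produced.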
Custom data encryption
As I mentioned in the previous paragraph, we use additional custom data encryption to increase security even more. By custom encryption I mean we encrypt all data in our final analytical tables and store it encrypted. Encryption is done at the user account level: each account has its own encryption key, and based on the logged-in user, the correct key is retrieved from the key storage and used for on-the-fly decryption. This happens inside the secure views which expose the data in the reader account. During the encryption implementation we faced some challenges in achieving performance similar to non-encrypted data, which was one of our key requirements: encryption/decryption behind the scenes should not be noticeable to end users.
We use the ENCRYPT_RAW and DECRYPT_RAW functions. Snowflake also has ENCRYPT and DECRYPT, but they were not usable for us because their initialization vector is always generated randomly, so if you encrypt “Hello” twice with the same key, you get two different encrypted values. With tables containing hundreds of millions of records, this would have a significant impact on storage. With ENCRYPT_RAW and DECRYPT_RAW you can choose the initialization vector yourself, so “Hello” always produces the same encrypted value when the same encryption key and initialization vector are used.
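The idea, sketched with made-up hex values (real keys must of course never appear in scripts like this; see the section on AWS KMS below):

```sql
-- ENCRYPT_RAW takes binary value, key, and IV, so we can pin the IV ourselves
-- and get deterministic ciphertexts for equal plaintexts.
SET key_hex = '00112233445566778899AABBCCDDEEFF00112233445566778899AABBCCDDEEFF'; -- example 256-bit key
SET iv_hex  = '000000000000000000000000';                                       -- example fixed 96-bit IV

SELECT ENCRYPT_RAW(TO_BINARY(HEX_ENCODE('Hello'), 'HEX'),
                   TO_BINARY($key_hex, 'HEX'),
                   TO_BINARY($iv_hex, 'HEX'));
-- Running this twice yields the same ciphertext; ENCRYPT('Hello', ...) would not.
```

Keep in mind that a fixed IV is a deliberate trade-off: deterministic encryption leaks equality of values (two rows with the same plaintext have the same ciphertext), so it should be a conscious design decision.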
We also faced some unknown Snowflake errors when using the DECRYPT_RAW function directly in a view DDL. The errors had output similar to:
SQL execution internal error: Processing aborted due to error 300002:1467007814; incident 2723036.
We ended up filing a ticket with Snowflake support, and it seems to be an internal issue they were already working on. Meanwhile, we found a workaround: we created a UDF containing the DECRYPT_RAW call and used that UDF in the secure view DDL instead. It works this way, with no more internal errors.
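The workaround looked roughly like this; the names are illustrative, and the optional AAD/mode/tag arguments of DECRYPT_RAW depend on which encryption mode you use:

```sql
-- Wrap DECRYPT_RAW in a (secure) SQL UDF ...
CREATE OR REPLACE SECURE FUNCTION decrypt_text(val BINARY, key BINARY, iv BINARY)
RETURNS STRING
AS
$$
  TO_VARCHAR(DECRYPT_RAW(val, key, iv), 'UTF-8')
$$;

-- ... and call the UDF in the secure view DDL instead of DECRYPT_RAW directly,
-- which avoided the internal error for us.
```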
We faced another issue with hitting the Snowflake cache for repeated queries. After implementing all the changes (data sharing, secure views, encryption, session variables for storing user info, custom UDFs, etc.), we found out that we were no longer using the Snowflake cache, which had a significant impact on performance. Snowflake has several different caches (the query result cache and the warehouse cache). You can find more about it here 👉🏻 CACHING IN SNOWFLAKE DATA WAREHOUSE
It took quite a lot of time to find all the root causes and then fix them. Let’s summarize what was preventing us from using the different caches, because it might be useful to others as well:
- Using session variables prevents Snowflake from using the query result cache. This is because the system internally checks the time when a particular session variable’s value was set. If it has changed since the last run, the query result cache will not be used, even if the second run uses the same session values.
- Using UDFs also prevents Snowflake from using the query result cache.
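The first point is easy to reproduce: run the same query twice, but re-set the session variable in between, and the second run cannot be served from the result cache. A sketch, assuming a hypothetical shared `orders_v` view:

```sql
-- Each SET refreshes the variable's "last set" timestamp, which feeds into
-- the result cache check -- so the second run below is a cache miss.
SET account_id = '42';
SELECT * FROM orders_v WHERE account = $account_id;

SET account_id = '42';                              -- same value, new timestamp
SELECT * FROM orders_v WHERE account = $account_id; -- not served from cache
```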
Because using UDFs was blocking the cache, and we need UDFs due to the internal Snowflake issue with DECRYPT_RAW in secure view DDL, we made some optimizations at the app level: using separate data sources for filters and dashboards, more specific data sources rather than general ones, etc. Thanks to those optimizations we were able to get to query times similar to those before the changes.
Have you noticed?💡
An important point which might already be clear from the previous description of the user-defined function for data decryption: it is created in the primary account (the same account as the secure views inside which it is used), and the function is also secure! It is shared with the reader account in order to decrypt data there, but it is completely hidden from the user’s perspective because it is only used inside the secure views.
What next?
The first phase is done. By implementing all the mentioned features, we have pushed our app security to the next level and hopefully also increased data security. But there is more that can be done to make it even better or easier to maintain. So what do we plan to do next? Generally, we want to tightly integrate Snowflake with various AWS services to automate certain tasks.
Snowflake has a new feature (available in open preview) called external functions. An external function calls executable code that is developed and executed outside of Snowflake. First, we would like to integrate AWS KMS for encryption key management. Next, we want to integrate certain internal APIs for user authorization and make them available to Snowflake UDFs through AWS Lambda and AWS API Gateway.
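To give an idea of what this will look like, here is a sketch with placeholder URL, role ARN, and names: an external function needs an API integration pointing at AWS API Gateway, and the function itself then maps to a Lambda behind a gateway endpoint.

```sql
-- API integration: lets Snowflake call our (hypothetical) API Gateway endpoint
CREATE OR REPLACE API INTEGRATION aws_api_int
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/snowflake-ext-fn-role'
  API_ALLOWED_PREFIXES = ('https://<gateway-id>.execute-api.<region>.amazonaws.com/prod/')
  ENABLED = TRUE;

-- External function backed by a Lambda that fetches a key, e.g. from AWS KMS
CREATE OR REPLACE EXTERNAL FUNCTION get_encryption_key(account_id STRING)
  RETURNS STRING
  API_INTEGRATION = aws_api_int
  AS 'https://<gateway-id>.execute-api.<region>.amazonaws.com/prod/get-key';
```

Such a function could then be called from our secure views or UDFs just like any other Snowflake function.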
This is a slightly more technical topic, and I would like to go through it in a separate post, as this one is already quite long. If you are interested, please stay tuned. 🤘🏻