Automate user onboarding and permission grants to Snowflake using Okta and Terraform

Published in Super.com · Apr 4, 2022

by: Jay Bujala and Jonathan Talmi

At Snapcommerce, we have a modern data stack consisting of tools like Airflow, Airbyte, dbt, and Looker. The central hub of our data platform is our data warehouse, Snowflake, which powers important operational, analytical, and BI workloads across the company. Interacting with Snowflake, either directly or through a BI tool, is a part of many teams' daily workflows. Financial reporting is also largely driven by data in Snowflake.

Two of Snapcommerce's core company values are "Data-Driven" and "Open and Transparent." We believe in giving team members a high degree of visibility into our data, while still keeping controls in place to prevent improper access. Nearly every new employee gains access to Snowflake and/or Looker and is encouraged to explore data on their own.

By the latter half of 2021, Snapcommerce began facing some challenges related to access control as the company grew both in size and data surface area. These included:

  1. Manually provisioning and de-provisioning Snowflake user accounts to support our rapidly growing headcount was time-consuming and inefficient.
  2. Some roles were scoped too broadly or too narrowly, and developers often did not know which roles to use to access which data.
  3. A lack of version control and automation around access control led to mistakes, inconsistencies, and bottlenecks on the data team.
  4. Snowflake permissions lacked the traceability and auditability needed for compliance.

As the company matured, it became increasingly important for us to set up an auditable, code-first system for managing access to data. Snowflake, to its credit, has a powerful, if sometimes convoluted, role-based access control (RBAC) framework. This blog post covers four initiatives Snapcommerce undertook to address the challenges above using Snowflake's RBAC:

  1. Migrating user role management to a SCIM-compatible identity provider (Okta)
  2. Creating a role hierarchy following best practices
  3. Automating access control using Terraform
  4. Monitoring and alerting on permissions updates

Migration to Okta

The first initiative undertaken by the data team in this project was automating the provisioning and de-provisioning of users using Okta. Until that point, the process for managing access to Snowflake was manual. Let’s go over the steps needed to give a new employee access to Snowflake before Okta.

Gary, the new operations manager at Snapcommerce, wants to go deep into the weeds of customer support tickets and requests. He requests access to Snowflake by sending a Slack message to the #analytics channel. A few hours later, a data engineer with admin access sees the message and runs a SQL script in the Snowflake UI to create a user account for him. After a brief discussion about the level of access required, Gary is granted the appropriate role. Meanwhile, the team realizes it never deprovisioned the former operations manager's account, which is promptly deactivated.

As Snapcommerce grew its headcount, timely provisioning and de-provisioning of Snowflake accounts became a challenge. The manual process outlined above was inefficient, time-consuming, and risky. At the same time, the company was beginning to outgrow its existing password management tool and was looking to onboard an Identity and Access Management (IAM) SaaS that could simplify access management across many different internal and SaaS applications. After going through a vendor selection process and proof of concept (POC), the company settled on Okta. Other solutions that support SCIM integration with Snowflake, including GSuite, OneLogin, and Auth0, were also considered.

Okta facilitates user creation, profile updates, password syncing, and user deactivation in Snowflake. A data steward can perform all of these user administration tasks without manually running SQL commands. Okta's Push Groups functionality automatically creates roles in Snowflake named after the Okta groups and grants those roles to the groups' users. Now, Gary's Snowflake account is created automatically when he joins the company, and he receives the same level of access as the rest of the operations team by default.
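Under the hood, Okta provisions users and roles through Snowflake's SCIM integration, which is configured once on the Snowflake side. A rough sketch of that one-time setup (the role and integration names here are illustrative; consult Snowflake's SCIM documentation for the authoritative steps):

```sql
-- Illustrative SCIM setup; the role and integration names are hypothetical.
-- Create the role Okta will assume when provisioning users and roles.
create role if not exists okta_provisioner;
grant create user on account to role okta_provisioner;
grant create role on account to role okta_provisioner;
grant role okta_provisioner to role accountadmin;

-- Create the SCIM security integration for Okta.
create or replace security integration okta_provisioning
  type = scim
  scim_client = 'okta'
  run_as_role = 'OKTA_PROVISIONER';

-- Generate the token to paste into the Okta Snowflake application.
select system$generate_scim_access_token('OKTA_PROVISIONING');
```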

Snowflake Push Groups in Okta

Roles and Privileges

Okta vastly simplified the process of provisioning and deprovisioning Snowflake user accounts and mapping Snowflake roles to different teams. With that in place, we began revamping our role hierarchy according to best practices.

In Snowflake, privileges on key database objects like tables, views, and schemas are granted to roles, which are then assigned to users. One of the most important goals of this project was a clear and intuitive role hierarchy that made accessing data and setting up new tools and services simpler.
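In plain SQL, that chain looks like this (the object, role, and user names are illustrative):

```sql
-- Privileges are granted to a role, and the role is granted to a user.
grant usage on database db_prod to role snowflake_analyst;
grant usage on schema db_prod.core to role snowflake_analyst;
grant select on all tables in schema db_prod.core to role snowflake_analyst;
grant role snowflake_analyst to user gary;
```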

Allie, a software engineer on the Platform Engineering team, often writes data pipelines or builds services that publish raw data to Snowflake. During data pipeline development, she often had to change roles to access staging data. When she set up a new Kafka service, the data team created a new role for her and granted it privileges on specific objects manually in the UI. Not only was this process manual, it also created inconsistencies and confusion about which roles granted access to which data.

We used several frameworks as inspiration when designing our new role hierarchy, including dbt Labs’ “How we configure Snowflake”, GitLab’s open-source analytics repo, and Snowflake’s own documentation.

We had a few guiding principles:

  • Users should have read access to most datasets to enable self-serve analytics wherever possible.
  • Applications and services should have read and write access to the least amount of data required to serve their functions.
  • Expanding or shrinking the scope of roles, or adding new ones, should be straightforward.
  • Roles should not be reused across applications and services.

Data Access Roles

Every employee receives the SNOWFLAKE_ANALYST role by default, which is owned and managed by Okta. This role has read access to all dbt models and, where appropriate, raw data. Additionally, it has write access to sandbox environments like the ones used by analytics engineers and data scientists. This role can easily be subdivided and expanded upon as our business continues to evolve and grow.

Service Roles

So far, we have focused on user-level data access. But what about the plethora of internal and external services and applications that also interact with Snowflake and require different access profiles? Service roles are used by non-human application and service accounts, like ELT SaaS tools, BI tools, and any internal services that move data in or out of Snowflake.

We used the principle of least privilege when designing service roles. These roles can be granted read access on a narrower set of resources than human users, which can have the byproduct of enforcing desirable development patterns. For example, the SNOWFLAKE_REPORTER role (Looker) could be limited to data transformed with dbt, so that all reporting and dashboards come from version-controlled data models. Similarly, the SNOWFLAKE_HIGHTOUCH role (reverse ETL) could be limited to modelled data as well, which means reverse ETL pipelines can be efficiently tracked as exposures. Finally, the SNOWFLAKE_ATLAN role (lineage and data cataloging) should have access to all raw and transformed data, including user-level sandboxes to assist in the development workflow.

A different set of services requires write access to some databases and schemas, primarily for data ingestion and transformation. ELT tools like Fivetran and Airbyte require schema creation privileges when setting up new connectors, but do not require read access outside of these schemas. dbt requires read access to all raw data and write access within the transformed database. Each of these services receives a distinct role with only the privileges required to perform its core functions.
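As an illustration of these two profiles (the database, warehouse, and role names are hypothetical), a write-scoped role for an ELT tool might receive little more than usage and schema creation on the raw database:

```sql
-- Hypothetical ELT service role: it can create and populate schemas in the
-- raw database but has no read access outside of it.
create role if not exists snowflake_airbyte;
grant usage on database db_raw to role snowflake_airbyte;
grant create schema on database db_raw to role snowflake_airbyte;
grant usage on warehouse wh_loading to role snowflake_airbyte;
grant role snowflake_airbyte to user airbyte_service_user;
```

A read-scoped role like SNOWFLAKE_REPORTER would instead receive only usage and select on the transformed (dbt) database.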

Managing Roles and Permissions using Terraform

Now that we had defined our role hierarchy, with support for both users and services, we still wanted to make it easier for data platform users like Allie to work within this framework. We looked at two popular tools for automating RBAC in Snowflake: Permifrost, a Python library created by GitLab and currently maintained by Pedram Navid, and a Terraform provider created by the Chan Zuckerberg Initiative.

Terraform is widely used to manage cloud infrastructure, both at Snapcommerce and across the industry as a whole. Terraform's "plan" command is a non-destructive feature that lets a user dry-run a configuration and produce a plan of proposed infrastructure changes before execution. Managing permissions with Terraform has the following benefits:

  • Consistent, code-driven creation and deletion of the resources Snowflake supports, such as users, roles, databases, schemas, warehouses, tables, views, and grants (yes, a grant is also a resource in Terraform).
  • The ability to dry-run and see which resources will be created, updated, and deleted before applying a config. This is typically done and displayed in a CI job for every pull request.
  • Easy replication of a Snowflake configuration within the same account (e.g. for a staging/test environment) or to a new account (e.g. setting up an account in a new region).

Permifrost has a similar feature set, including dry runs, but we felt Terraform was a more natural choice to integrate into a CI/CD pipeline, given that the company uses Terraform heavily in other projects and we wanted to align with its overall DevOps direction.

Our Approach

Since we had a large number of existing schemas and tables, we initially used the Chan Zuckerberg Initiative provider for permissions only, not for the creation of the Snowflake objects themselves. In the future, we will import existing resources like tables, schemas, and databases into Terraform so that it becomes a more holistic solution.

We started out writing Terraform code to declare database grants, like the one below:

Terraform resource code block for database grant
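A minimal sketch of such a resource, assuming the Chan Zuckerberg Initiative provider (the database and role names are illustrative):

```hcl
# Illustrative database grant; the database and role names are hypothetical.
resource "snowflake_database_grant" "db_prod_usage" {
  database_name = "DB_PROD"
  privilege     = "USAGE"

  roles = [
    "SNOWFLAKE_ANALYST",
    "SNOWFLAKE_REPORTER",
  ]

  with_grant_option = false
}
```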

While it's fairly straightforward to write a config file declaring the target state of Snowflake resources, maintaining plain Terraform configs becomes challenging as the number of resources (in this case, roles and grants) grows, and it is hard to keep the code DRY.

We changed our approach to a human-friendly YAML config representing permissions, which Terraform parses to declare resources (similar to the Permifrost approach). This allows team members to self-serve access grants by changing the config through a PR, without needing to know Terraform.
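Mechanically, this can be done with Terraform's built-in yamldecode and for_each. A minimal sketch follows; the file name and YAML structure are assumptions, not our exact implementation:

```hcl
locals {
  # Parse the human-friendly permissions file into Terraform data structures.
  permissions = yamldecode(file("${path.module}/permissions.yml"))

  # Flatten role -> database pairs for read grants.
  database_read_grants = flatten([
    for role, cfg in local.permissions.roles : [
      for db in lookup(lookup(cfg, "databases", {}), "read", []) : {
        role     = role
        database = db
      }
    ]
  ])
}

resource "snowflake_database_grant" "read_usage" {
  for_each = { for g in local.database_read_grants : "${g.role}.${g.database}" => g }

  database_name = each.value.database
  privilege     = "USAGE"
  roles         = [each.value.role]
}
```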

Here is a sample config that declares permissions for an existing role (pushed through Okta), and the designation of admin users.

Permissions Config
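A simplified, illustrative version of such a config (the keys, role, and database names are hypothetical rather than our exact schema):

```yaml
# Illustrative permissions config; structure and names are hypothetical.
admins:
  users:
    - DATA_ADMIN_1
    - DATA_ADMIN_2

roles:
  SNOWFLAKE_ANALYST:      # role created in Snowflake and granted to users via Okta
    warehouses:
      - WH_ANALYST
    databases:
      read:
        - DB_PROD
        - DB_DEV
      write:
        - DB_DEV
    schemas:
      read:
        - "DB_PROD.*"
        - "DB_DEV.*"
      write:
        - "DB_DEV.SANDBOX"
```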

While it's recommended to follow the principle of least privilege, doing so requires a very granular configuration. For our initial setup, we chose a design that offers high-level "read" and "write" permission sets for each Snowflake resource type. Each role can define its own "read" and "write" lists per object type, so permissions are granted at the role level rather than to specific Snowflake database objects.

In the example config above, we are granting the following permissions on databases to SNOWFLAKE_ANALYST:

  • USAGE and MONITOR on DB_PROD and DB_DEV
  • CREATE SCHEMA and MODIFY on DB_DEV

The same applies to schemas and warehouses. In our current implementation, “future” grants are applied by default on each object type.

One limitation of the Snowflake Terraform provider is that it does not support grants on "all tables" within a schema or database. This means the out-of-the-box provider requires adding a snowflake_table_grant resource for every table in the configuration, which is prohibitive for a fast-moving data team with a growing data surface area. We explored generating a list of tables on the fly during each CI pipeline and looping through it to create table_grant resources, but this takes a long time for thousands of tables and views.

To solve this missing piece of the puzzle, we implemented the following workarounds:

  • Future grants to "select" on tables, views, and materialized views are automatically generated for databases and schemas that have a "USAGE" grant in the config.yml (sketched below).
  • A post-hook CI job runs a dbt macro that grants select on existing tables and views (also sketched below). The trigger for this deploy job is set to manual, and it is run only when needed.
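Roughly, the two workarounds look like the sketches below. First, a future grant declared with the provider (names are illustrative):

```hcl
# Illustrative future grant: SELECT is applied automatically to tables
# created later in this schema.
resource "snowflake_table_grant" "future_select" {
  database_name = "DB_PROD"
  schema_name   = "CORE"
  privilege     = "SELECT"
  on_future     = true

  roles = ["SNOWFLAKE_ANALYST"]
}
```

Second, a dbt macro that back-fills select grants on tables and views that already exist, invoked from the manual CI job via dbt run-operation. This is a hypothetical sketch, not our exact macro:

```sql
-- macros/grant_select_on_existing.sql (hypothetical)
{% macro grant_select_on_existing(database, role) %}
  {% set schemas = run_query(
      "select schema_name from " ~ database ~ ".information_schema.schemata " ~
      "where schema_name != 'INFORMATION_SCHEMA'"
  ).columns[0].values() %}
  {% for schema in schemas %}
    {% do run_query("grant select on all tables in schema " ~ database ~ "." ~ schema ~ " to role " ~ role) %}
    {% do run_query("grant select on all views in schema " ~ database ~ "." ~ schema ~ " to role " ~ role) %}
  {% endfor %}
{% endmacro %}
```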

Here is a glimpse of a Terraform plan generated for a merge request granting several roles read access to the airbyte_schema database.

Terraform Plan

Let's work through an example. The product team wants to load event data from Amplitude into Snowflake so it can be combined with transactional data, enabling deeper analytics. To set this up, Allie creates a new Snowflake role specifically for this workload. She makes changes to the Terraform configuration file, which is managed in the same repository as our dbt project, and submits a merge request, as shown below.

The CI pipeline runs integration tests and generates a new plan, which Allie and her reviewers verify in the UI. Once approved, she merges the changes and manually triggers a "terraform apply" job in the CI pipeline to apply the new plan.

CI Pipeline for the PR to update permissions config
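For reference, a minimal sketch of what such a pipeline might look like, assuming GitLab CI (the job names, stages, and image tag are illustrative, not our exact configuration):

```yaml
# Illustrative .gitlab-ci.yml excerpt; names and versions are hypothetical.
stages:
  - plan
  - apply

terraform_plan:
  stage: plan
  image: hashicorp/terraform:1.1.7
  script:
    - terraform init
    - terraform plan -out=tfplan
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'

terraform_apply:
  stage: apply
  image: hashicorp/terraform:1.1.7
  script:
    - terraform init
    - terraform plan -out=tfplan
    - terraform apply tfplan
  rules:
    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
      when: manual   # triggered by hand after the merge request is merged
```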

Monitoring Changes to Permissions

With grants now managed through code, we on the data team could move fast to unblock ourselves and others. We wanted visibility into how the framework was being used, as well as alerting when improper permissions are granted. To achieve this, we set up a simple Airflow DAG that sends Slack notifications whenever permissions are granted or revoked.

The following query was used to detect changes to permissions.
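An illustrative version of such a query (not necessarily the exact one we ran) scans the query history for successful grant and revoke statements; it assumes the role running it has access to the SNOWFLAKE.ACCOUNT_USAGE share, and the 24-hour lookback is an assumption:

```sql
-- Illustrative detection query over Snowflake's query history.
select
    query_id,
    user_name,
    role_name,
    start_time,
    query_text
from snowflake.account_usage.query_history
where start_time >= dateadd('hour', -24, current_timestamp())
  and execution_status = 'SUCCESS'
  and (query_text ilike 'grant %' or query_text ilike 'revoke %')
order by start_time desc;
```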

While these alerts are simplistic, they can easily be adapted to fit more specific use cases, such as alerting when ACCOUNTADMIN or SECURITYADMIN is granted to new users.
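For instance, a narrower check along these lines (again illustrative) could flag recent grants of highly privileged roles:

```sql
-- Illustrative: surface ACCOUNTADMIN or SECURITYADMIN grants from the last day.
select
    grantee_name,
    role,
    granted_by,
    created_on
from snowflake.account_usage.grants_to_users
where role in ('ACCOUNTADMIN', 'SECURITYADMIN')
  and deleted_on is null
  and created_on >= dateadd('day', -1, current_timestamp());
```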

Key Takeaways

Defining and granting permissions for all users at scale was a challenging exercise. Together, Okta and Terraform made it easy to automate and scale the provisioning and deprovisioning of Snowflake user accounts, the granting of permissions to roles, and the granting of roles to service accounts. The main outcomes of this project were:

  • Automated provisioning and deprovisioning of user accounts using Okta.
  • A comprehensive, intuitive, and scalable RBAC framework informed by best practices.
  • Creation and maintenance of roles and their access privileges through human-friendly YAML configuration (IaC).
  • Better auditability around data access: amendments to a role's privileges go through a pull request (PR) that is subject to review and approval from the data owners.
  • Visibility into permissions updates through monitors and alerts on top of the query log.

This framework massively reduced the manual effort of the data team, while adding extra layers of security and auditability to our data platform.

Acknowledgements

Thanks to Gage Sonntag, Alex Ianus, and Kyle Pericak for reviewing this blog post and offering great suggestions for improving it. Additional thanks to Kevin Chan for numerous code reviews while developing this framework, and to our awesome DevOps team for getting us set up with Okta.

Passionate about tackling challenging data problems using a modern data stack? Please visit our careers page for more information on currently available Data and Analytics Engineering roles.

If you want to read more — check out and follow our Snapcommerce publication here on Medium.
