Terraforming Snowflake (the Easy Way)

Graham Place · Published in Open House · 8 min read · Sep 10, 2020

At Opendoor, data is central to everything we do. It follows, then, that the platform that manages all of that data is mission critical. Our analytics platform is built on Snowflake and runs hundreds of thousands of queries per day, answering questions and crunching through terabytes of data for our users.

Snowflake is a great tool for this job, as it manages this load with ease and can instantly scale up and down to handle spiky demand. But how do you effectively manage Snowflake itself?

This post outlines how and why we moved from configuring our Snowflake account manually to configuring it via Terraform, an Infrastructure as Code tool. We’ll cover how we worked around some of the limitations of Terraform to bring hundreds of pre-existing Snowflake resources under Terraform management — with no downtime and no hand-written config required.

Background

Snowflake describes itself as a “Near-Zero Management” platform, with minimal knobs to turn. It’s certainly true that, relative to some of its competitors, Snowflake requires very little tuning to achieve stellar performance.

However, even if it doesn’t require much performance tuning, any sufficiently complex data platform will manage hundreds or even thousands of objects. In Snowflake, some of the key objects include Warehouses, Users, Roles, Databases, Schemas, Tables, and Views — not to mention the Grants that control access to those objects.

Creating, modifying, and deleting these Snowflake objects can be done in the Snowflake web interface or by running Snowflake SQL.
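For instance, manual administration might look like the following SQL (the object and role names here are illustrative, not from our actual account):

CREATE WAREHOUSE demo_wh WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60;
CREATE USER example_user DEFAULT_ROLE = 'PUBLIC' MUST_CHANGE_PASSWORD = TRUE;
GRANT ROLE analyst TO USER example_user;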

As we migrated our analytics platform onto Snowflake in 2019, we did all of our Snowflake account configuration by hand. In the best case, we wrote runbooks with template SQL snippets; in the worst case, the process of configuring some aspect of our account was “tribal knowledge” that lived only in one engineer’s head.

This left us with a Snowflake account that was working well, but was hard to adjust and even harder to understand.

Problems

Our hand-crafted Snowflake account had a few problems:

  • Auditability: Which users were granted which roles? Which datasets could each role access? Answering these questions was difficult and required a highly-privileged Snowflake user.
  • Standardization: Which role or user should be used to create new objects? The creating role matters because it becomes the object owner and defines the default privileges on the object. Manual creation of objects is error-prone, since an engineer might accidentally create an object using leftover settings from an unrelated previous session.
  • Collaboration: A privileged user modifying a resource in the Snowflake console or by running SQL does so in a vacuum. While we trust that our teammates know what they’re doing, we’d prefer to be able to have a formal code review process of these changes.
  • History: Technically, all account changes are recorded and accessible via the QUERY_HISTORY view. However, parsing those logs requires both SQL and Snowflake knowledge, and we’d much prefer the changes be version-controlled with something universal like Git.
  • Speed: It’s annoying and slow to change these resources by hand. Our engineers’ time is valuable; we shouldn’t spend it on tedious tasks like writing the SQL command to adjust the default query timeout of a given warehouse.
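As a concrete example, that warehouse timeout tweak is a single statement (the warehouse name is illustrative): easy to type, but tedious to track down, invisible to reviewers, and absent from version control.

ALTER WAREHOUSE demo_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 3600;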

Borrowing from DevOps: Infrastructure as Code

Thankfully, the DevOps world has a solution to these problems: Infrastructure as Code.

The basic premise of Infrastructure as Code (IaC) is that the state of your cloud resources should be defined by configuration files (typically checked into a version-controlled repository). Changes to the resources are made by updating those config files and “deploying” the change using a CLI. (It’s best practice to run this step in a CI/CD pipeline.)

When using IaC, you can grok the state of your resources by simply reading the config files, and you can collaborate on changes to your resources in the same way as you would other code: with git history, pull requests, code reviews, etc.

IaC tools were originally focused on managing the servers in a cloud computing environment, but today you can manage the configuration of basically any third-party tool your team uses. Consider our Snowflake use case: we don’t need to configure the actual virtual machines our account uses (Snowflake does that for us!), but we do want to manage key Snowflake resources like Users and Databases.

There are a number of IaC tools out there (Chef, Puppet, Ansible), but we settled on Terraform for two reasons:

  1. Opendoor was already using Terraform to manage our AWS resources, so we had engineers with experience, and a Terraform release step was already included by default in all services’ CI/CD pipelines. (You can read more about pippy, our Python service generator, here.)
  2. Terraform supports the major cloud providers (AWS, GCP, etc.) out of the box; for everything else, you’ll need to install Community Providers. There are a couple of open-source Terraform Snowflake providers; we settled on the Chan Zuckerberg Initiative (CZI) Snowflake provider, since its GitHub repo had the most activity and recent commits showed that it was being actively maintained and developed.

Terraform: a Brief Introduction

We won’t cover Terraform in great detail here, but there are a few key concepts you’ll need to understand:

(1) Terraform Resources and .tf Files
The Terraform configuration language declares resources that map to components of your infrastructure. For example, a Snowflake user is defined by the snowflake_user resource in the CZI provider.

The configuration of these resources lives in .tf config files — a given .tf file can house many resource definitions, and you can have many .tf files in your terraform directory.
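For instance, a warehouse might be declared like this (a minimal sketch; the attribute names assume the CZI provider’s snowflake_warehouse resource):

resource "snowflake_warehouse" "demo_wh" {
  name           = "DEMO_WH"
  warehouse_size = "XSMALL"
  auto_suspend   = 60
}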

(2) The .tfstate File
The .tfstate file represents the current state of your “live” infrastructure. You will generally not edit this file by hand — instead, Terraform CLI commands will modify it for you.
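For illustration, an abridged .tfstate entry for a user might look something like this (simplified; the real file contains more metadata):

{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "snowflake_user",
      "name": "example_user",
      "instances": [ ... ]
    }
  ]
}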

(3) terraform plan and terraform apply
Running terraform plan will compare your current .tf files to your .tfstate file, and output a report describing how the configuration declared by your .tf files differs from the state of your current production infrastructure.

Running terraform apply actually makes the changes outlined by terraform plan in your live infrastructure. Be careful using this one!
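In practice, the core loop looks like this:

$ terraform plan    # preview: diff your .tf files against the .tfstate file
$ terraform apply   # execute: apply the planned changes to live infrastructure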

Terraforming an Existing Snowflake Account

Let me pause here and offer some hard-earned advice: if you’re a data team setting up a new Snowflake account, drop everything and install Terraform and the CZI Snowflake provider. It is much, much easier to use Terraform from Day 1, making 100% of your account changes through your .tf files.

We didn’t use Terraform from the get-go; changes to the resources in our account were made manually in the Snowflake console or using Snowflake SQL. To bring our Snowflake account under the management of Terraform, we needed to wrangle hundreds of pre-existing Users, Warehouses, Schemas, and other objects into our .tf and .tfstate files.

The process of retrofitting Terraform onto an existing Snowflake account had two main challenges:

(1) Importing Existing Resources into the .tfstate File

The .tfstate file is meant to be kept in sync with your actual infrastructure. Thus, the first step in bringing an existing Snowflake account under Terraform management is to generate a .tfstate file that reflects the current state of your resources.

Thankfully, Terraform provides a solution here — terraform import pulls existing resources into your .tfstate file. The CZI Snowflake provider supports this command. For example, you can import your PUBLIC role like so:

$ terraform import snowflake_role.public "PUBLIC"

Unfortunately, terraform import can only import one resource at a time; there is no way of batch importing all resources in a provider.

(2) Generating Resource Definitions

terraform import is handy, but it only solves half of the problem. It pulls existing resources into your .tfstate, but it does not generate resource definitions in your .tf files.

In fact, you won’t be able to import a resource unless there is already a matching resource definition in a .tf file.
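For example, before the PUBLIC role import above will succeed, your .tf files must already contain a matching definition, if only a minimal stub:

resource "snowflake_role" "public" {
  name = "PUBLIC"
}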

This means that every time you terraform import an existing resource, you first have to manually write a resource definition for that resource. From the Terraform docs:

“While this may seem tedious, it still gives Terraform users an avenue for importing existing resources.”

Given that a pre-existing Snowflake account might have hundreds or thousands of resources defined, “tedious” is right!

Note: At the time of writing, the latest version of Terraform (v0.13) is unable to automatically generate resource definitions via terraform import; according to the docs, “a future version of Terraform will also generate configuration.”

Snowglobe: Programmatically Importing Snowflake Resources

Thankfully, we figured out a way to avoid running terraform import hundreds of times and writing thousands of lines of resource definitions by hand.

Our solution to these problems is snowglobe, an internal Python package we wrote that programmatically generates resource definitions for each existing Snowflake resource, appending the result to a .tf file and calling terraform import to pull the resource into the .tfstate file.

The key to this workflow is that all Snowflake objects in an account can be fetched programmatically using a privileged Snowflake user and Snowflake SQL. We can use that to our advantage by automating the process of fetching the current state of a resource, generating a resource config block, and running terraform import for the object.

Let’s consider Users as an example:

  • In snowglobe, we wrote a class for each resource type we import; here, that class is CZISnowflakeUser. The attributes of this class match the properties of the corresponding resource type in the Terraform provider, in this case snowflake_user (name, default_role, etc.):
from typing import Optional

class CZISnowflakeUser:
    """Mirrors the properties of the CZI provider's snowflake_user resource."""

    def __init__(
        self,
        name: str,
        comment: Optional[str] = None,
        default_namespace: Optional[str] = None,
        ...
    ):
        self.name = name
        self.comment = comment
        self.default_namespace = default_namespace
        ...

    def alias_resource(self) -> str:
        # The Terraform resource alias, e.g. snowflake_user.example_user
        return self.name.lower()

    def snowflake_provider_resource(self) -> str:
        # The provider resource type this class maps to
        return "snowflake_user"
  • Using the Snowflake Python connector, we can execute SHOW USERS and load all Snowflake users into a Pandas dataframe.
  • The CZISnowflakeUser class includes a method to parse a row from that dataframe into a CZISnowflakeUser object.
  • The parsed CZISnowflakeUser object is passed through a generic obj_to_terraform parser, which coerces the object into a Terraform-friendly resource definition string and appends it to the users.tf file, e.g.
resource "snowflake_user" "example_user" {  
name = "EXAMPLE_USER"
comment = "This is an example user"
default_namespace = "PUBLIC"
default_role = "PUBLIC"
default_warehouse = "DEMO_WH"
disabled = false
login_name = "EXAMPLE_USER"
must_change_password = false
}
  • A terraform import command is generated and executed, importing the user into the .tfstate:
$ terraform import snowflake_user.example_user "EXAMPLE_USER"

The classes and parsing logic for all resource types are packaged into a single main function that runs the whole pipeline and generates the final .tfstate file and the corresponding .tf files.
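To make the flow concrete, here is a condensed, illustrative sketch of the per-resource-type loop. The connection parameters are placeholders, and the wiring (file names, the obj_to_terraform signature) is simplified relative to the real snowglobe internals:

import subprocess
import pandas as pd
import snowflake.connector

# Connect as a privileged user (parameters are placeholders)
conn = snowflake.connector.connect(
    account="your_account", user="ADMIN_USER", password="...",
)

# 1. Fetch the current state of all users
cur = conn.cursor()
cur.execute("SHOW USERS")
users_df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])

for _, row in users_df.iterrows():
    # 2. Parse the row into a resource object
    user = CZISnowflakeUser(name=row["name"], comment=row["comment"] or None)

    # 3. Render a Terraform resource definition and append it to users.tf
    with open("users.tf", "a") as f:
        f.write(obj_to_terraform(user))

    # 4. Import the live resource into the .tfstate file
    subprocess.run(
        [
            "terraform",
            "import",
            f"{user.snowflake_provider_resource()}.{user.alias_resource()}",
            user.name,
        ],
        check=True,
    )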

These files represent the current state of the existing Snowflake account in the Terraform language and can be used as the starting point for a fully Terraform-managed Snowflake project.

From that point on, any change to the .tf files, followed by a terraform apply, will make the corresponding change to the live Snowflake resources.
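For example, adjusting a warehouse’s default query timeout (the tedious manual task from earlier) becomes a small, reviewable config change, assuming the provider exposes a statement_timeout_in_seconds attribute for the warehouse resource:

resource "snowflake_warehouse" "demo_wh" {
  name           = "DEMO_WH"
  warehouse_size = "XSMALL"
  auto_suspend   = 60

  # Assumed attribute name; raises the timeout from the previous value
  statement_timeout_in_seconds = 3600
}

After the change is merged, terraform apply updates the live warehouse, and the diff lives forever in Git history.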

Results and Next Steps

With snowglobe, we successfully brought our existing Snowflake account under Terraform management. We programmatically generated thousands of lines of config and imported hundreds of Snowflake resources without having to manually run terraform import once.

Using the snowglobe-generated .tfstate and .tf files, we were able to start making all subsequent changes to these resources via Terraform, with no downtime or user disruption required.

Managing the configuration of our Snowflake account through Terraform has solved the problems outlined earlier, but there is still some work to be done. There are a few additional Snowflake resources we want to implement in snowglobe and bring under Terraform management, the most important of which is Network Policies. The CZI provider doesn’t currently support Network Policies, so we’re planning on developing the feature and contributing it back to the project.

Sound like something you’d be interested in working on? We’re hiring! We’re looking for talented, passionate teammates to help build the future of real estate.
