
Managing Snowflake Infrastructure with Python

What? Managing Snowflake infrastructure? There is supposed to be no infrastructure to manage. That is one of Snowflake's selling points: almost zero maintenance and no infrastructure to manage. True, but you can also consider things like users, roles, privileges, and warehouses as "infrastructure", and they need to be managed somehow. Basically, account-related objects can be treated as the infrastructure of your account.

Infra as a code!

You can do it manually and get lost once you have 10+ objects to manage, or you can use code for it. Managing infrastructure as code is a very popular approach in cloud environments and one of the essential parts of DevOps and DataOps. There are multiple technologies that can be used for it, such as Terraform or AWS CloudFormation.

Managing infrastructure as code has many benefits and can significantly improve your delivery. Let's highlight some of them:

Automation

Once you keep your infrastructure in code, you can automate deployments. This brings further benefits like speed, cost reduction, and the possibility to test your setup as part of the deployment process.

Improved reliability

By automating infrastructure deployments, you eliminate human errors.

Versioning

Keeping infrastructure as a simple text file allows us to version it in a Git repository and easily compare changes over time. In case of any issue, it is not a big deal to revert to a previous version.

Configuration traceability

It is also easy to trace configuration changes when you keep them in a file under version control.

Cost savings

Last but not least, you save costs because you remove manual effort. People can focus on other tasks instead of managing the infrastructure.

Snowflake and infra as a code

How can you manage Snowflake objects as code? There are multiple options, and the first one has already been mentioned: Terraform. It offers a Snowflake provider, so if you are familiar with Terraform, you can manage Snowflake objects with it as well.

If you would like to test it and get your hands dirty, Snowflake offers a quickstart guide for it. You can find it here: Terraforming Snowflake

But maybe you are not familiar with Terraform, or you do not want to introduce another technology into your stack. What are the other options? I would like to introduce our own solution, which uses custom Python code instead of Terraform and manages Snowflake objects in a similar way. Let's dive in!

Python as an infrastructure master

We were looking for an easy way to manage Snowflake account-related objects programmatically, and we did not want to use Terraform for it. The reason was our limited knowledge of it; we were looking for a simple solution to start with, something that could be implemented quickly. As far as I remember, there was not much information about the Terraform Snowflake provider in the middle of 2020 when we started with this activity.

The initial idea was to keep warehouses, roles, and users (account objects), plus databases and schemas (high-level DB objects), in code in order to have visibility, version control, and traceability of the current setup. We also wanted automation in place instead of managing all of it manually. We decided to use JSON files as the storage for object definitions and a Python script as a handler. The script prepares the DDL commands for Snowflake and then runs them via the Snowflake Python Connector. As an orchestrator, we use GitHub Actions because we use GitHub as our Git environment. You can see the high-level schema of the whole setup below, followed by a minimal sketch of the flow.

High-level schema of the setup
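
To make the flow more concrete, here is a minimal sketch of what such a handler can look like. It assumes a resources.json file in the structure shown later in this post; the connection parameters and function names are illustrative, not our exact production code.

import json

import snowflake.connector  # pip install snowflake-connector-python

# Map JSON attribute names to Snowflake parameter names where they differ.
KEY_MAP = {"size": "warehouse_size"}


def load_resources(path="resources.json"):
    """Read the object definitions from the JSON file."""
    with open(path) as f:
        return json.load(f)


def build_ddl(resources):
    """Translate the JSON definitions into a list of DDL statements."""
    ddl = ["USE ROLE SYSADMIN;"]
    for name, cfg in resources.get("warehouses", {}).items():
        ddl.append(f"CREATE WAREHOUSE IF NOT EXISTS {name} initially_suspended=true;")
        options = " ".join(f"{KEY_MAP.get(k, k)}={v}" for k, v in cfg.items())
        ddl.append(f"ALTER WAREHOUSE {name} SET {options};")
    # ... roles, users, databases, and schemas are handled in the same way
    return ddl


def deploy(ddl):
    """Run the generated statements via the Snowflake Python Connector."""
    with snowflake.connector.connect(
        account="my_account",  # illustrative connection parameters;
        user="deploy_user",    # real values would come from CI secrets
        password="***",
    ) as conn:
        cur = conn.cursor()
        for stmt in ddl:
            cur.execute(stmt.rstrip(";"))  # one statement per execute() call


if __name__ == "__main__":
    deploy(build_ddl(load_resources()))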

The central part of the setup is the GitHub repository where everything takes place. When there is a need to update the infrastructure (e.g. a new user, or a grant update for an existing role), admins write the changes into the JSON file. Once they commit the changes into the repository, they can run a GitHub Actions workflow to deploy the changes into the DEV environment. I should also add that changes must be made in a non-master branch. Master is updated via pull request as part of the deployment, once the changes have been verified and tested in DEV. Another GitHub Actions workflow is then automatically triggered when the pull request is closed.

JSON file structure

Let's have a look at how the resources file is structured. We have a separate section for each main object type: warehouses, roles, users, and databases. Each section then contains different parameters, because you need to configure different attributes for a warehouse than for a role. Below you can find an example of such a JSON file for a small project.

{
  "warehouses": {
    "LOAD_WH": {
      "size": "XLARGE",
      "max_cluster_count": "1",
      "auto_suspend": "5",
      "scaling_policy": "ECONOMY"
    }
  },
  "roles": {
    "MY_ADMIN": {
      "granted_to": ["SYSADMIN"],
      "use_warehouse": [],
      "admin_warehouse": ["LOAD_WH"],
      "imported_roles": []
    },
    "MY_DEVELOPER": {
      "granted_to": ["MY_ADMIN"],
      "use_warehouse": ["LOAD_WH"],
      "admin_warehouse": []
    }
  },
  "databases": {
    "MY_DB": {
      "db_owner": "MY_ADMIN",
      "object_owner": "MY_DEVELOPER",
      "schemas": {
        "MY_SCHEMA": {
          "transient": "true",
          "role_grants": {
            "rw": ["MY_DEVELOPER"],
            "r": ["MY_ANALYST"]
          }
        }
      }
    }
  },
  "users": {
    "my_user_id": {
      "firstname": "John",
      "lastname": "Doe",
      "system_user": "false",
      "roles": ["MY_ADMIN"]
    }
  }
}

A big benefit of JSON is that it is schema-less. If we need to add a new attribute, we can just add it without breaking the parts that are already covered.
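
For example, if the handler reads optional attributes with a default, older definitions that lack a new attribute keep working. The "comment" attribute below is purely hypothetical:

# Hypothetical new attribute: a "comment" added later to warehouse definitions.
warehouses = {"LOAD_WH": {"size": "XLARGE"}}  # older definition, no "comment"

ddl = []
for name, cfg in warehouses.items():
    # dict.get() with a default means older JSON files stay valid
    comment = cfg.get("comment", "")
    if comment:
        ddl.append(f"ALTER WAREHOUSE {name} SET COMMENT = '{comment}';")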

Python handler

The Python script then reads this configuration from the JSON file and prepares DDL statements for Snowflake. Part of the process is also context switching, as some commands need to run under the SECURITYADMIN role while SYSADMIN is enough for others.

The generated DDL script contains statements like these:

USE ROLE SYSADMIN;
CREATE WAREHOUSE IF NOT EXISTS LOAD_WH initially_suspended=true;
ALTER WAREHOUSE LOAD_WH SET WAREHOUSE_SIZE=XLARGE max_cluster_count=1 auto_suspend=5 scaling_policy=ECONOMY;
USE ROLE SECURITYADMIN;
CREATE OR REPLACE ROLE MY_ADMIN;
GRANT ROLE MY_ADMIN TO ROLE SYSADMIN;
GRANT ALL ON WAREHOUSE LOAD_WH TO ROLE MY_ADMIN;
...
...
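
To illustrate the context switching, here is a hedged sketch of how the roles section could be turned into DDL. It assumes the JSON structure shown above; the function name and grant mapping are illustrative, not the exact production code.

def build_role_ddl(roles):
    """Turn the "roles" section of the JSON file into DDL statements.

    Role and grant management needs SECURITYADMIN, so the script switches
    context first (SYSADMIN is enough for warehouses and databases).
    """
    ddl = ["USE ROLE SECURITYADMIN;"]
    for role, cfg in roles.items():
        ddl.append(f"CREATE OR REPLACE ROLE {role};")
        for parent in cfg.get("granted_to", []):
            ddl.append(f"GRANT ROLE {role} TO ROLE {parent};")
        for wh in cfg.get("admin_warehouse", []):
            ddl.append(f"GRANT ALL ON WAREHOUSE {wh} TO ROLE {role};")
        for wh in cfg.get("use_warehouse", []):
            ddl.append(f"GRANT USAGE ON WAREHOUSE {wh} TO ROLE {role};")
    return ddl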

Logging

Logging is also part of the deployment process. The whole DDL script is always stored in a Snowflake logging table, so in case of any issue we can check what infrastructure has been deployed into the account and what changes have been made.
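
Conceptually, the logging step can be as simple as inserting the script together with a timestamp. The table and column names below are illustrative; a real logging table might also record who triggered the deployment:

import datetime


def log_deployment(cursor, ddl_script):
    """Store the executed DDL script in a logging table (names illustrative)."""
    cursor.execute(
        "INSERT INTO ADMIN_DB.LOGS.INFRA_DEPLOYMENTS (DEPLOYED_AT, DDL_SCRIPT) "
        "VALUES (%s, %s)",
        (datetime.datetime.now(datetime.timezone.utc), ddl_script),
    )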

What we have learned — current solution limitations

The current solution has some limitations. They were not a big issue in the early stages, when we just wanted to quickly build the environment for the first time or completely rebuild it from scratch. But once we moved to regular infrastructure updates with a growing number of users, they became limiting. The biggest issue right now is the way roles are managed. As you can see in the example above, roles are created or replaced with every run. This has several consequences:

  • Users who are currently logged into Snowflake and using one of the roles need to reconnect, because the current role is dropped and replaced by a new one (even though it has the same name). They get the following error message:
The role activated in this session no longer exists. Login again to create a new session.
  • Because roles are replaced during the DDL script run, objects owned by those roles get a new owner: the role that created the replacement, SECURITYADMIN. This means the DDL script also needs to contain commands that transfer ownership back to the desired roles.

Why has it been done this way? Simplicity. We do not need to care about what already exists in the Snowflake account or what privileges are assigned to a particular role. Every run creates everything from scratch: roles and their grants. To solve this issue, we would have to compare the roles that already exist in the Snowflake account with the content of the JSON file, and create only the new ones. But we would have to take care of grants as well, adding logic that compares the currently implemented grants with the content of the JSON. This is not needed now because everything is replaced every time. Handling this comparison would add complexity to the whole process, but on the other hand, deployments would cause less disturbance to end users. A possible direction is sketched below.
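
As a hedged sketch of that direction (we have not implemented it yet), the handler could list the existing roles via SHOW ROLES and emit CREATE statements only for the missing ones:

def roles_to_create(cursor, desired_roles):
    """Create only the missing roles instead of replacing everything.

    A sketch of the comparison logic described above; grants would need
    the same treatment (e.g. via SHOW GRANTS TO ROLE <name>).
    """
    cursor.execute("SHOW ROLES;")
    # "name" is the second column in the SHOW ROLES output
    existing = {row[1].upper() for row in cursor.fetchall()}

    ddl = []
    for role in desired_roles:
        if role.upper() not in existing:
            ddl.append(f"CREATE ROLE IF NOT EXISTS {role};")
    return ddl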

I think rebuilding everything from scratch can work when you have a small team with a few users, but once the solution grows and more users are onboarded, it should be changed so that deployments do not affect users' work.

This is my current goal: make it a more general solution that causes as little disturbance to end users as possible whenever it runs. It is a learning process. We started with the simplest solution possible, identified its limitations over time, and are now pushing it further. 💪🏻

CI/CD pipeline integration

The whole process could be included in a CI/CD pipeline. Some time ago I wrote a blog post about CI/CD for Snowflake using GitHub Actions and a Python library called schemachange. The part covering Snowflake infrastructure could be integrated into the same workflow, giving you a single process that handles both Snowflake infrastructure (account objects) and database objects. This is just another small piece of the complete CI/CD pipeline for Snowflake. ❄️

Summary

So that's all! We have covered how Snowflake infrastructure can be managed programmatically with Python. It could be an option for those of you who are looking for a lightweight approach. The presented solution is neither perfect nor bulletproof. It was created according to our own project needs, and yours might be different. But it is flexible and could easily be modified to fit another project setup. Anyway, I also have Terraform for Snowflake on my to-do list, because you always need to explore new possibilities and think about how to move forward!
