
Automating Your Snowflake Database Cloning with GCP

Phil Goerdt
7 min read · Mar 28, 2019

Why you shouldn’t be working with old data, and how to change it

Out with the old

I recently wrote a blog post arguing against the old ways of thinking about and working with data. Specifically, the notion that developers must use stale, old or bad data to get their jobs done. Not only that, but traditionally developers have been hamstrung by having to share the same data for different purposes. Why is it that we're still seeing this development and design paradigm persist as we move to cloud-native environments? Maybe more importantly, should we continue to allow it?

This blog focuses on cloning databases in Snowflake so that Snowflake customers can break free of those constraints. By using Snowflake's zero-copy cloning feature, we can give developers and testers production-level parity on a regular basis with near-zero cost implications. We can also automate and schedule these refreshes so developers are always working with data that is representative of the real world, which enhances quality throughout the entire SDLC. Seems like a no-brainer, right?

In with the new

Let’s talk about how to get this done. I realize that most Snowflake customers will be AWS or Azure customers. That’s all well and good, and what I am proposing can be done on those platforms as well. However, I’m going to do this on GCP for a few reasons:

  1. I want to show that you don't have to use either of those platforms.
  2. I have already done something similar to this in another blog for BigQuery.
  3. I like using GCP the most.

With those pesky details out of the way, let’s talk about the solution.

Getting down to business

The basic premise of this solution is pretty simple: clone a database on a schedule. For this we will need:

  1. A way to execute a command to tell Snowflake what to clone
  2. A way to trigger that command on a schedule

How do we go about doing that in GCP? Thankfully there is a pretty easy way to do this.

GCP to Snowflake. Who would have thought that was possible?

We'll start at the beginning, with Cloud Scheduler. This is an easy way to set up an event timer that can publish messages to Pub/Sub, trigger events in App Engine, or even hit HTTP endpoints. We'll be sending messages to Pub/Sub in this example.

Pub/Sub is GCP's messaging and streaming solution. If you're familiar with AWS, think Kinesis, SNS or SQS. If you're an Azure user, think Service Bus. If you're crazy about open source, think Kafka. Regardless of which flavor you prefer, the idea is to have Cloud Scheduler send a message to Pub/Sub, to which the Cloud Function is a subscriber listening for messages.

Cloud Functions are similar to AWS Lambda functions or Azure Functions, which are all serverless ways to execute code. It's best to think of Functions as the glue that holds your cloud ecosystem together: they are managed, redundant, and able to respond to pretty much any event. The function we create is going to connect to Snowflake and then run the command to clone the database.

If you wanted something similar in AWS, you could use CloudWatch's cron functionality to kick off a Lambda function that runs this code. (Don't worry, I have another blog showcasing how to do this in AWS on the way.) I'm sure there is something similar in Azure, but I'm not familiar enough with that platform to make a recommendation. (I'll leave that to you, Azure gurus, to figure out… but I would imagine it would be a similar setup.)

What this functionality could look like in AWS.
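For the curious, here is a rough, hypothetical sketch of what that Lambda side could look like, reusing the same snowflake-sdk logic we'll build below. The handler name and the idea of wiring it to a CloudWatch Events cron rule are my own assumptions, not the setup from the upcoming AWS post:

//import the Snowflake driver, just as in the GCP version
var snowflake = require('snowflake-sdk');

//Lambda entry point; a CloudWatch Events cron rule would invoke this on a schedule
exports.handler = (event, context, callback) => {
  var connection = snowflake.createConnection({
    account: '<your snowflake account here>',
    username: '<your username>',
    password: '<your password>'
  });

  connection.connect(function(err, conn) {
    if (err) {
      return callback(err);
    }
    //run the same zero-copy clone statement
    conn.execute({
      sqlText: 'create or replace database phils_dev_database clone demo_db;',
      complete: function(err, stmt, rows) {
        callback(err);
      }
    });
  });
};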

Getting our hands dirty

I’m going to use Node.js in this example, but I’m confident that you could use Python for this as well.

Setting up the function

Let's start by looking at the code we are going to implement. This is a simple example, and I am including the account, username and password directly in the function. (This could be worked around by using a config file or environment variables, or maybe your security team has recommendations; there is a sketch of the environment-variable approach after the walkthrough below.)

//import the necessary package
var snowflake = require('snowflake-sdk');

//create the connection using your account details
var connection = snowflake.createConnection({
  account: '<your snowflake account here>',
  username: '<your username>',
  password: '<your password>'
});

//connect, and only run the SQL statement once the connection is established
connection.connect(function(err, conn) {
  if (err) {
    console.error('Unable to connect: ' + err.message);
    return;
  }
  console.log('Successfully connected as id: ' + conn.getId());

  //run the zero-copy clone statement
  conn.execute({
    sqlText: 'create or replace database phils_dev_database clone demo_db;',
    complete: function(err, stmt, rows) {
      if (err) {
        console.error('Failed to execute statement due to the following error: ' + err.message);
      } else {
        console.log('Successfully executed statement: ' + stmt.getSqlText());
      }
    }
  });
});

What this code does is take the account details, reach out to Snowflake through the Node.js Snowflake driver, and make the connection. Once the connection is established, it runs the create or replace ... clone ... command. This is a small, simple example; just replace the database names with your own and you'll be ready to rock.
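As noted above, hard-coding credentials isn't ideal. One lightweight alternative is to read them from the function's environment variables. Here's a minimal sketch, assuming variables named SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER and SNOWFLAKE_PASSWORD have been set on the Cloud Function (the variable names are my own, not part of the original example):

//read credentials from environment variables instead of hard-coding them
//(SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER and SNOWFLAKE_PASSWORD are hypothetical names)
var snowflake = require('snowflake-sdk');

var connection = snowflake.createConnection({
  account: process.env.SNOWFLAKE_ACCOUNT,
  username: process.env.SNOWFLAKE_USER,
  password: process.env.SNOWFLAKE_PASSWORD
});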

Since we have our clone code ready, let’s take a look at setting up the Cloud Function. Log into the GCP console, click on “Cloud Functions”, and then click on “New Function”.

Here are the basic parameters we’ll use for our function. Go ahead and enter a name for the function. Since the function is lightweight, select 128MB for the memory. Then select “Cloud Pub/Sub” for the trigger, and create a new Topic via the dropdown menu. If you already have a topic you’d like to use, feel free to select that. Finally, make sure you’re using the inline editor and the Node.js 8 runtime.

After you’ve set up those options, you’ll notice that there is a code window below the runtime selector, with some default code. The default code may look something like this:

exports.myFirstFunction = (event, context, callback) => {
  const pubsubMessage = event.data;

  // <function logic goes here>

  callback();
};

This wrapper is what makes the function run correctly: it responds to an event (the Pub/Sub message), runs whatever logic we put inside it (the Snowflake code we created above), and then invokes the callback (telling Cloud Functions it is finished). Go ahead and copy the Snowflake logic and paste it where the placeholder comment sits, between the pubsubMessage line and the callback. Then make sure you update the function name and the function-to-execute box below the editor.
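To make that concrete, here is a rough sketch of what the assembled function could look like once the Snowflake logic is pasted in. The entry point name snowflakeClone is my own choice, and moving callback() into the statement's complete handler is an assumption I'd make so the function isn't marked finished before the clone statement returns:

//import the Snowflake driver
var snowflake = require('snowflake-sdk');

exports.snowflakeClone = (event, context, callback) => {
  //the Pub/Sub payload arrives base64-encoded; here it is only used as a trigger
  const pubsubMessage = Buffer.from(event.data, 'base64').toString();
  console.log('Triggered by message: ' + pubsubMessage);

  var connection = snowflake.createConnection({
    account: '<your snowflake account here>',
    username: '<your username>',
    password: '<your password>'
  });

  connection.connect(function(err, conn) {
    if (err) {
      console.error('Unable to connect: ' + err.message);
      return callback(err);
    }

    //run the zero-copy clone, then signal completion
    conn.execute({
      sqlText: 'create or replace database phils_dev_database clone demo_db;',
      complete: function(err, stmt, rows) {
        if (err) {
          console.error('Failed to execute statement: ' + err.message);
        } else {
          console.log('Successfully executed statement: ' + stmt.getSqlText());
        }
        callback(err);
      }
    });
  });
};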

We're almost there, but there is one last thing to do: make sure the dependencies are correctly configured. Click on the "package.json" tab in the inline editor, and list the dependencies in that editor box. Mine looked like this:

{
  "name": "snowflake-clone",
  "version": "1.0.0",
  "dependencies": {
    "@google-cloud/pubsub": "^0.18.0",
    "snowflake-sdk": "^1.1.11"
  }
}

Once we have that set up, click on “Create Function”. This can take a minute, but once it is finished, you can test the function in the Cloud Functions page on the console.
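If you use the console's test option, keep in mind that a Pub/Sub-triggered function expects its payload base64-encoded. A triggering event along these lines should do the trick (the string below decodes to "clone time!", which is just a placeholder):

{
  "data": "Y2xvbmUgdGltZSE="
}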

It works!

Congrats on getting your function set up and working! Let’s work on scheduling it now.

Right on schedule

From the GCP Console navigate over to the Cloud Scheduler page. Then click on “Create Job”. This brings you to a pretty simple set up screen.

Here we enter the name of our job, the description, and the frequency we want it to run at (in cron format). Then select the timezone and the target. We are using Pub/Sub, so select that as the target, and then type in the topic name that we created for our function. Put something in the payload (adapted movie quotes are always a good and fun idea, in my opinion).
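For example, assuming you want the clone refreshed before the team starts work each day, a frequency like the following (in the standard unix-cron format Cloud Scheduler uses) would run the job every day at 6:00 AM in the timezone you selected:

0 6 * * *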

Once you've completed the configuration of the Scheduler job, click "Create". If you want to test it right away, once it is deployed you can click on "Run Now" from the Cloud Scheduler page in the console.

Congrats! You’ve just automated your Snowflake clone strategy!

Conclusion

Just because we've always done something one way doesn't mean we have to continue to do it that way. If you've made the switch to cloud computing and Snowflake, make sure you harness the full capability of the cloud by changing how you work, not just what tools you use to do that work.

Phil Goerdt is the founder of Erteso, a consultancy and Snowflake partner that focuses on cloud and data solutions. He is certified in Snowflake and is a GCP Certified Professional Cloud Architect and Professional Data Engineer. You can contact him at phil.goerdt@erteso.com.
