Quick Tips for Using Snowflake with AWS Lambda

A Working Example Using AWS Lambda Serverless Compute and Snowflake’s Cloud Data Warehouse Together

by Venkatesh Sekar

Snowflake and Serverless — both technologies are getting a great deal of attention, and deservedly so.

For those who may not be as familiar with one or both, I'll provide a very quick summary of each technology, but my main focus for this post is demonstrating how to code a simple AWS Lambda serverless compute function that communicates with Snowflake's Cloud Data Warehouse.

I’ll provide a working example as well as some points to consider as you go down this path. All code samples referenced can be accessed in GitHub at lambdasnowflake.

Snowflake Cloud Data Warehouse

The Snowflake Cloud Data Warehouse is a cloud-native, fully relational ANSI SQL data warehouse service available in both AWS and Azure with patented technology providing tremendous performance, unlimited scalability, and a consumption-based usage model for a wide variety of analytic workloads.

Some key features of Snowflake are listed below:

  • Cloud native
  • Zero management
  • Time travel
  • Scale however, wherever (Data, Compute, Users)
  • Natively load and optimize both structured and semi-structured data such as JSON, Avro, or XML — all available via SQL
  • Easy connections to BI Tools
  • Pay as you go usage (per second)

…and much more.

For a great review of Snowflake’s Cloud Data Warehouse in action, take a look at Ed Fron’s post linked below:

AWS Lambda Serverless Computing

AWS Lambda lets you focus on the code and logic you want to execute rather than on how that logic runs or what the underlying infrastructure looks like. You don't worry about provisioning or managing servers, concurrency, and so on, and you pay only for the compute time you consume; there is no charge when your code is not running.

There are a variety of use cases that AWS references for Lambda, including:

  • Real-time file processing
  • Real-time stream processing
  • ETL
  • IoT backends
  • Mobile backends
  • Web apps

If you have further questions on Serverless (what it is, when to use it, when not to use it), I'd suggest you check out this recent Hashmap IoT on Tap Podcast where Serverless was the focus.

Putting it to Work — Our Scenario

Typically when loading data into Snowflake, the preferred approach is to collect a large amount of data in an S3 bucket and load it in bulk (for example, via the COPY command). The same is true of loading mini-batched streaming data of at least 100 MB. The Snowflake blog provides a couple of great examples of this, which are linked below:

  • Snowflake Blog — DESIGNING BIG DATA STREAM INGESTION ARCHITECTURES USING SNOWFLAKE — Part 1 and Part 2

There are some use cases where the data obtained is smaller in size or the record count is low (think fewer than 1,000 records). In these cases a direct write operation can be a better approach.

This write-up demonstrates that type of scenario.

Kijiji

I live in eastern Canada, and here we have a social platform called Kijiji for buying and selling pretty much anything. Think of it as very similar to the well-known Craigslist. Kijiji offers an RSS feed of the posted ads (see the Kijiji RSS link), and it covers ads across Canada and across all ad categories. Typically on each retrieval you receive roughly 20 hits (records), and the feed is updated frequently.

For our Kijiji example, the AWS Lambda function retrieves the feed, parses the content, extracts some information and stores the data into Snowflake.

Main Entry Function

The entry point to the Lambda function is implemented in the class KijijiLoaderFn, method handle. The Lambda is triggered by an AWS CloudWatch scheduled event every 5 minutes.
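To give a feel for the shape of that entry point, here's a minimal sketch in Scala. The class name and the 5-minute CloudWatch trigger match the description above; the KIJIJI_FEED_URL environment variable and the logging-only body are illustrative assumptions, not the repository's actual code.

```scala
import com.amazonaws.services.lambda.runtime.Context
import com.amazonaws.services.lambda.runtime.events.ScheduledEvent
import scala.io.Source

// Sketch of the entry point (handler string: com.hashmap.blog.KijijiLoaderFn::handle).
// A CloudWatch scheduled rule delivers a ScheduledEvent roughly every 5 minutes.
class KijijiLoaderFn {

  def handle(event: ScheduledEvent, context: Context): Unit = {
    val log = context.getLogger
    log.log(s"Triggered at ${event.getTime}")

    // KIJIJI_FEED_URL is a hypothetical environment variable holding the RSS feed URL.
    val src = Source.fromURL(sys.env("KIJIJI_FEED_URL"), "UTF-8")
    val feedXml = try src.mkString finally src.close()

    // Parsing the feed and writing the records to Snowflake are handled by the rest of
    // the project (connection pool, DB writer); here we only log the payload size.
    log.log(s"Fetched ${feedXml.length} characters from the Kijiji feed")
  }
}
```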

Service Account

To communicate with Snowflake you typically create an application service account with a user/password. Snowflake also supports key pair authentication, with the ability to rotate keys.

In a normal (non-Lambda) execution environment, the private key is stored in a secure location and the program reads it from that path.

In a Lambda environment, since the container is dynamic, you cannot rely on a fixed file location. One way to adjust is to have the function load the private key file from a protected S3 bucket.
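If you go the S3 route, the retrieval itself is small; a sketch using the AWS SDK for Java could look like this (the bucket and object key names are placeholders):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder

// Sketch: read a PEM-encoded private key straight from a protected S3 bucket.
// Bucket/object names are illustrative; the Lambda role needs s3:GetObject on the object.
object PrivateKeyFromS3 {
  private val s3 = AmazonS3ClientBuilder.defaultClient()

  def load(bucket: String, objectKey: String): String =
    s3.getObjectAsString(bucket, objectKey)
}
```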

AWS Secrets Manager

A better approach, though, is to use AWS Secrets Manager, which lets you store all the connection details for a specific Snowflake service in a single secret. In this example I have stored the following:

  • URL
  • Snowflake account
  • Private key passphrase
  • Private key file content

At runtime, a quick API call for this specific secret returns the entire record as JSON. The program parses the JSON and goes about authenticating with Snowflake. The retrieval is demonstrated in the trait DBConnPoolInitializer, method retrieveDBConnInfo, and the JSON parsing in the method parseDBSecretInfo.

NOTE: In the example above the secret name is hardcoded, but in a production scenario I recommend retrieving awsSecretSnowflakeKey via an environment variable. The environment variable can be defined as part of the deployment process.
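As a rough sketch of that lookup (the retrieval and parsing are collapsed into one method here; the secret's field names and the AWS_SECRET_SNOWFLAKE_KEY environment variable are assumptions, not the repository's exact names):

```scala
import com.amazonaws.services.secretsmanager.AWSSecretsManagerClientBuilder
import com.amazonaws.services.secretsmanager.model.GetSecretValueRequest
import com.fasterxml.jackson.databind.ObjectMapper

// Sketch: fetch the Snowflake connection record from AWS Secrets Manager and parse it.
// The secret name comes from an environment variable, with a fallback for local runs.
case class SnowflakeConnInfo(url: String, account: String, passphrase: String, privateKeyPem: String)

object SecretsLookup {
  private val client = AWSSecretsManagerClientBuilder.defaultClient()
  private val mapper = new ObjectMapper()

  def retrieveDBConnInfo(): SnowflakeConnInfo = {
    val secretName = sys.env.getOrElse("AWS_SECRET_SNOWFLAKE_KEY", "dev/snowflake/kijiji")
    val secretJson = client
      .getSecretValue(new GetSecretValueRequest().withSecretId(secretName))
      .getSecretString

    // Field names below are assumptions; align them with how the secret was created.
    val node = mapper.readTree(secretJson)
    SnowflakeConnInfo(
      url           = node.get("url").asText(),
      account       = node.get("account").asText(),
      passphrase    = node.get("private_key_passphrase").asText(),
      privateKeyPem = node.get("private_key").asText()
    )
  }
}
```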

Connection Pool

The code instantiates and sets up a connection pool via the well-known library HikariCP. The commonly adopted pattern of connecting to a data source with a user/password works; however, there is currently no out-of-the-box implementation for connecting to a data source using the key pair method mentioned above.

To address this I created a simple implementation that can be configured in HikariCP. The data source implementation is the class SnowflakeKeypairDataSource.

The parsing of the private key and the connection logic are in the methods getEncryptedPrivateKey and getConnection. Configuring the Hikari library is done in the class DBConnPoolInitializer, in the method instantiateDataSource.
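The sketch below approximates that flow: decrypt a passphrase-protected PKCS#8 private key and hand it to the Snowflake JDBC driver through the privateKey connection property. It's a simplified stand-in for SnowflakeKeypairDataSource, not the repository's exact code, and depending on how your key was encrypted (e.g. PBES2) you may need an extra JCE provider such as Bouncy Castle.

```scala
import java.security.{KeyFactory, PrivateKey}
import java.sql.{Connection, DriverManager}
import java.util.{Base64, Properties}
import javax.crypto.{Cipher, EncryptedPrivateKeyInfo, SecretKeyFactory}
import javax.crypto.spec.PBEKeySpec

object SnowflakeKeypairSketch {

  // Decrypt a passphrase-protected PKCS#8 private key (PEM headers stripped first).
  def getEncryptedPrivateKey(pem: String, passphrase: String): PrivateKey = {
    val der = Base64.getMimeDecoder.decode(
      pem.replaceAll("-----(BEGIN|END) ENCRYPTED PRIVATE KEY-----", "").trim)
    val info   = new EncryptedPrivateKeyInfo(der)
    val pbeKey = SecretKeyFactory.getInstance(info.getAlgName)
      .generateSecret(new PBEKeySpec(passphrase.toCharArray))
    val cipher = Cipher.getInstance(info.getAlgName)
    cipher.init(Cipher.DECRYPT_MODE, pbeKey, info.getAlgParameters)
    KeyFactory.getInstance("RSA").generatePrivate(info.getKeySpec(cipher))
  }

  // Open a connection using the Snowflake JDBC driver's key pair authentication:
  // the decrypted key is passed via the "privateKey" connection property.
  def getConnection(jdbcUrl: String, user: String, key: PrivateKey): Connection = {
    val props = new Properties()
    props.put("user", user)
    props.put("privateKey", key)
    DriverManager.getConnection(jdbcUrl, props)
  }
}
```

In the actual project this logic sits behind a javax.sql.DataSource implementation so it can be handed to HikariCP via HikariConfig.setDataSource when the pool is built.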

Lambda Startup

A Lambda execution container is allocated whenever AWS decides to allocate one, and you have no control over how that happens. There are no lifecycle events to tell your application code that the container has been initialized; by the time your Lambda function executes, the container is already initialized. This article from Yan Cui provided a good explanation for some of my initial work.

Under these conditions, creating and maintaining a connection pool is non-deterministic; however, it's well demonstrated that AWS looks to reuse an existing Lambda container when invocations are frequent.

Implementing a static block in the Lambda class (in Scala, an object) provides a good spot for such one-time initialization. I adopted this approach via the object ContainerInitializer (KijijiLoaderLambda/src/main/scala/com/hashmap/blog/initializer/ContainerInitializer.scala), which is initialized as part of the main class KijijiLoaderFn.
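A minimal sketch of the pattern: a Scala object's body is the JVM equivalent of a Java static block, so it runs exactly once per container. The JDBC URL and pool settings below are placeholders, not the project's actual configuration, which wires in the key pair data source instead.

```scala
import com.zaxxer.hikari.{HikariConfig, HikariDataSource}

// Runs once per Lambda container: the object body executes the first time the class
// loads, and the resulting pool is reused across warm invocations.
object ContainerInitializer {
  val pool: HikariDataSource = {
    val cfg = new HikariConfig()
    // Placeholder settings; the real project registers SnowflakeKeypairDataSource
    // through DBConnPoolInitializer rather than a plain JDBC URL.
    cfg.setJdbcUrl("jdbc:snowflake://<account>.snowflakecomputing.com")
    cfg.setMaximumPoolSize(1) // a single-threaded Lambda rarely needs more than one connection
    new HikariDataSource(cfg)
  }
}
```

The handler then simply grabs a connection from ContainerInitializer.pool on each invocation.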

NOTE: Do not use this approach to store application states as the container could be destroyed at any point in time by AWS.

The ‘/tmp’ Folder

As part of communicating with Snowflake and ensuring ideal performance, you want to set a folder where Snowflake can store its cache. For this we can use the '/tmp' folder of the Lambda execution environment. We set the relevant environment variables to this folder at runtime, as demonstrated in DBConnPoolInitializer, method initializeConfigurationToDB.
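As a sketch of that setup, assuming the system property names below (they should be verified against your Snowflake JDBC driver version's documentation):

```scala
// Point the Snowflake JDBC driver's local caches at Lambda's writable /tmp folder.
// The property names are assumptions to confirm against the driver documentation.
object SnowflakeCacheConfig {
  def initializeConfigurationToDB(): Unit = {
    System.setProperty("java.io.tmpdir", "/tmp")
    System.setProperty("net.snowflake.jdbc.temporaryCredentialCacheDir", "/tmp")
    System.setProperty("net.snowflake.jdbc.ocspResponseCacheDir", "/tmp")
  }
}
```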

NOTE: Using the /tmp folder within its limits is fine; however, I would generally avoid misusing it to store large amounts of processing data, as its storage is currently limited to 512 MB and that limit could change.

Execution Time limits

Currently an AWS Lambda function can run for a maximum of 15 minutes, and such limits should be kept in mind when implementing the function. The limits are well documented and you can review them here. This code, however, executes in 3–5 seconds on average, well below the 15 minute limit.

NOTE: The functionality demonstrated here does not mean that Lambda is slow (compared to certain use cases). SLAs were not a consideration; the goal is merely to demonstrate connecting to Snowflake from Lambda.

AWS SAM and AWS SAM CLI

When developing, it is usually a pain to build the code, deploy it to AWS, and run it. It's time consuming, and those frequent uploads could also cost you.

I recommend developing with the AWS SAM CLI tool. Using it, I could test locally and also deploy to AWS via CloudFormation. The template for local testing is local_samcli.yaml, and when deploying to AWS use the template deploy.yaml.

Policies

For proper functioning the Lambda function needs the following policies:

  • S3 bucket get
  • AWS secrets manager read
  • Lambda basic execution

For this demo I used the existing pre-defined managed policies shown below.

Feel free to ignore the SNS permission in the definition, as it was used for another demo.

Observations

Below is the monitoring screenshot after this Lambda ran for a couple of hours:

The Lambda gets triggered every 5 minutes, and each invocation is typically 3 seconds in duration.

Instance count

Even though I set the max concurrency to 5, since the function finishes well within the limit, AWS opts to reuse the same instance on each scheduled run. So there is normally only 1 execution, as observed in the Concurrent executions graph.

Initialization

Below is from the AWS CloudWatch extract log:

You can confirm that initialization happened only once, while the normal function that writes to Snowflake ran on every invocation, demonstrating that AWS tends to reuse the Lambda container between executions.

Gotchas

You might be wondering: what's that big spike around 11:00, and why did the execution time go up? Well, those are the perils of a budget. I ran out of the Snowflake credits (more on that later) that I was using for this example.

This does demonstrate that when you are designing for this type of mini-batch data load, you should consider cost. If your use case allows you to delay loading to a later time (say, every hour) and there is a way to stage the data in the meantime (maybe in a Kafka queue), it might be better to load data every hour with the COPY command instead of writing every few minutes.

Other Scenarios

While this walk-through focused on interfacing with Snowflake and doing a small batch data load, you could extend what we've demonstrated above and build a Lambda function coupled with an AWS API layer (such as Amazon API Gateway) to host a REST endpoint.

From there you could invoke the endpoint to provide real-time data retrieval (while doing a massive calculation in the Snowflake cloud data warehouse) and display in your UI of choice (web page, dashboard, or other).

Final Notes

As demonstrated above, connecting Snowflake and AWS Lambda is not only possible, it can be very beneficial depending on your use case. I’ve really focused on the basics and certainly there are other considerations when looking to take this approach.

Lastly, I’ve also had good success developing AWS Lambda in Python and communicating with Snowflake for various client use cases. More on that in later posts.

If you’d like to give Snowflake a test run, the company provides $400 in free credits to get you started.

Also, you can sign up for the AWS Lambda service and get 1,000,000 free requests per month and up to 3.2 million seconds of compute time per month.

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here, including other Snowflake stories.

Venkat Sekar is Regional Director for Hashmap Canada and is an architect and consultant providing Data, Cloud, IoT, and AI/ML solutions and expertise across industries with a group of innovative technologists and domain experts accelerating high value business outcomes for our customers.

Be sure to catch Hashmap’s Weekly IoT on Tap Podcast for a casual conversation about IoT from a developer’s perspective.
