Snowflake for Data Science

Anthony Carminati
Pandera Labs
May 17, 2018


Anyone working on a cloud-friendly data science, data engineering, or data warehousing team has surely heard the name Snowflake come up over the past couple of years.

For those who haven’t, Snowflake is a relatively new database solution that is genuinely innovative in some ways, yet familiar in the ways that matter.

The product is essentially a SaaS database built with cloud-native features that we 21st century data enthusiasts (fine… nerds) have come to expect from off-the-shelf products. Some of these features include:

  • the ability to quickly and automatically scale computing power — responding to highly variable workloads in a way that enables high-velocity data to flow in, unimpeded by ingestion bottlenecks
  • the separation of storage and compute, both technically and on billing statements — making cold data retention much more cost-effective
  • the ability to segregate and securely share chunks of data — reducing the amount of maintenance and management that data ops teams need to invest in shared data assets (a minimal sketch of what that looks like follows this list)
  • automatic query and data optimization — this one speaks for itself!
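
The sharing point is worth a concrete illustration. Here is a minimal sketch, assuming the snowflake-connector-python package and purely hypothetical names (the analytics database, the events table, the consumer account), of how a producer account might expose a table to another account without copying any data:

```python
import snowflake.connector

# Connect as a role that is allowed to create shares.
# All connection parameters below are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    role="ACCOUNTADMIN",
)

# Create a share, expose a database/schema/table through it, and grant a
# consumer account access. The consumer queries the data in place; nothing
# is duplicated or shipped around.
statements = [
    "CREATE SHARE IF NOT EXISTS analytics_share",
    "GRANT USAGE ON DATABASE analytics TO SHARE analytics_share",
    "GRANT USAGE ON SCHEMA analytics.public TO SHARE analytics_share",
    "GRANT SELECT ON TABLE analytics.public.events TO SHARE analytics_share",
    "ALTER SHARE analytics_share ADD ACCOUNTS = consumer_account",  # hypothetical account
]

cur = conn.cursor()
try:
    for statement in statements:
        cur.execute(statement)
finally:
    cur.close()
    conn.close()
```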

These updates aside, the best part of it all — and what I believe to be one of the biggest influences on Snowflake’s success — is the fact that data is accessible via the same SQL interfaces that so many people in enterprise data science are already familiar with (and have already used to build apps).

In some ways, the product is similar to a Hadoop-style architecture where data is stored in well-organized flat files (“well organized” being the weakness of most failed data lakes) with something like a Hive access layer to enable JDBC connections. The main difference between the Hive example and Snowflake is the engine that executes queries and computations against those flat files. In the case of Hive, it would normally be accompanied by a host of other à la carte services that would need to be properly planned and maintained, and it would use a SQL-like interface (not true ANSI SQL).

This means that, when using Snowflake, you don’t need to hire a team of experts to maintain your Hadoop ecosystem, and your data scientists don’t need to learn new skills to store and extract data.
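
To make that concrete, here is a minimal sketch of the day-to-day workflow, assuming the snowflake-connector-python package and pandas. The connection parameters, the ANALYTICS_WH warehouse, and the sensor_readings table are all hypothetical placeholders; the point is simply that it is ordinary SQL over a standard connector.

```python
import pandas as pd
import snowflake.connector

# Ordinary connector parameters; nothing Hadoop-specific to learn.
# Every value below is a placeholder.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="IOT",
    schema="PUBLIC",
)

# Plain SQL, exactly what most enterprise data scientists already write.
query = """
    SELECT device_id,
           DATE_TRUNC('day', reading_ts) AS day,
           AVG(temperature)              AS avg_temp
    FROM sensor_readings
    WHERE reading_ts >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY device_id, day
    ORDER BY device_id, day
"""

cur = conn.cursor()
try:
    cur.execute(query)
    # Pull the result set into a pandas DataFrame for downstream analysis.
    df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
finally:
    cur.close()
    conn.close()

print(df.head())
```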

So all of that is fine and dandy, but why should data scientists care?

Quick and Automatic Scaling

While the typical data scientist isn’t super concerned with infrastructure costs, budgets certainly do exist, and they affect the data scientist’s ability to access high-performing systems and tools. Given the elastic nature of Snowflake’s computing resources, teams can let their warehouses go dormant when not in use and then have them available very quickly once they’re ready to run queries again.

We regularly see provisioning and connection times around one second when querying against a dormant warehouse.

This means that instead of running a medium-sized server for a data science team 100% of the time, it’s just as cost-effective to run a much more powerful server only when needed.
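
As a rough sketch of what that looks like (the warehouse name, size, and timeout here are hypothetical, not a recommendation), a warehouse can be configured to suspend itself when idle and to resume automatically as soon as a query arrives, so you pay for the larger size only while it is actually working:

```python
import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)

cur = conn.cursor()
try:
    # A comparatively large warehouse that costs nothing while idle:
    # it suspends after 60 seconds without queries and resumes on demand.
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS DS_WH
          WITH WAREHOUSE_SIZE = 'XLARGE'
               AUTO_SUSPEND = 60
               AUTO_RESUME = TRUE
               INITIALLY_SUSPENDED = TRUE
    """)

    # The first query against the suspended warehouse triggers the resume;
    # in our experience that round trip is on the order of a second.
    cur.execute("USE WAREHOUSE DS_WH")
    cur.execute("SELECT COUNT(*) FROM iot.public.sensor_readings")  # hypothetical table
    print(cur.fetchone())
finally:
    cur.close()
    conn.close()
```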

Cold Data Retention

What’s the most common request from data scientists conducting exploration or training machine learning models?

We want more data!

By separating storage and compute (where storage is extremely cheap), companies don’t need to think twice about retaining cold data, whether for compliance purposes or for more complete retroactive analytics at some later date.
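
As a sketch of how that can look in practice (the stage, table, warehouse, and database names here are all hypothetical), raw history can be landed continuously into a table and simply left there, with compute only spun up when someone finally wants to analyze it:

```python
import snowflake.connector

# Placeholder connection details; LOAD_WH, IOT, and RAW are hypothetical names.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="LOAD_WH",
    database="IOT",
    schema="RAW",
)

cur = conn.cursor()
try:
    # Keep every raw event in a single VARIANT column. Storage is billed
    # separately from compute, so retaining cold history stays cheap even if
    # nobody queries it for months.
    cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")

    # Bulk-load newline-delimited JSON from a pre-configured external stage
    # (for example, one backed by an S3 bucket). The stage name is a placeholder.
    cur.execute("""
        COPY INTO raw_events
        FROM @iot_landing_stage
        FILE_FORMAT = (TYPE = 'JSON')
    """)
finally:
    cur.close()
    conn.close()
```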

At Pandera, we do a lot of work in the Internet of Things (IoT) space. These are normally greenfield projects with very few data assets existing at project kickoff. Software engineers work with data scientists to design pipelines around data that is likely to provide value (subject to exploratory data analysis) once there is sufficient volume. With a traditional database solution, this means that a lot (read: a lot) of unnecessary money would be spent on resources that would largely remain dormant until the data scientists are ready to step in. With Snowflake, that’s no longer a concern.

One point to consider, particularly when speaking about IoT, is that there is likely to be a constant inbound trickle of data from one or many “streams”, which you wouldn’t normally see in a typical data warehouse architecture. This use case requires some extra planning to maintain the cost-effectiveness of Snowflake, but one of my colleagues plans to address this in a subsequent post, so stay tuned!

All in all, as you can tell from this post (and from about half a dozen other posts on Pandera Labs’ blog), we’ve had a great experience using Snowflake as a data warehouse, as an analytical data store, and as a historical system of record for streaming data architectures. I look forward to seeing how they continue tailoring their product to data scientists, and especially to IoT use cases!

At Pandera Labs, we’re always exploring new ways to build products and we enjoy sharing our findings with the broader community. To reach out directly about the topic of this article or to discuss our offerings, visit us at panderalabs.com.
