Supercharging Data Science with Snowflake

Sudha Regmi
Cervello, a Kearney Company
5 min read · Feb 25, 2020

Anyone who works in a data science department knows that the work is complex, that good people are hard to find, and that they are expensive to retain. What isn’t as widely appreciated is how much time most data scientists spend on low-value data engineering work.

To understand why a team of highly trained analytics experts would have to spend so much time preparing and cleaning data — the rudiments of data engineering — I’ll use a relatively common example. Suppose the team has just been asked to predict the likely sales of a new product. An assignment like this isn’t usually accompanied by a clean data set; instead, the data science team must go out and find the data they need, typically from multiple sources. With the data coming in a variety of formats from a variety of sources, the job of getting it ready for advanced analytics — a prerequisite to generating actual insights — is often very complicated.

At Cervello, one way we address these challenges is by using Snowflake, a cloud data warehouse. We like Snowflake precisely because it gives us quick access to governed data and allows us to blend it with other non-standard or loosely governed data. By doing so, we are able to cut down on data prep time and focus on high-value problems — namely, building machine-learning models to enable better decision-making.

The diagram below, which uses the example of a company trying to generate a highly refined sales forecast, shows many of Snowflake’s benefits. Snowflake acts as the data warehouse, holding a large store of governed internal data — most of it sales history — and allows the use of SQL to pre-process that data and get it ready for the higher-value tasks of data science. The pre-processing steps include data cleansing, normalization, profiling, and imputing missing values.
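As a concrete sketch of that SQL pre-processing step, the snippet below normalizes inconsistent text and imputes missing values with a column mean. The table and column names are hypothetical, and an in-memory SQLite database stands in for Snowflake so the example is self-contained; the same ANSI-SQL constructs run unchanged on Snowflake.

```python
import sqlite3

# Hypothetical raw sales history with inconsistent casing and a gap.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_sales (region TEXT, units INTEGER);
    INSERT INTO raw_sales VALUES
        ('east', 100), ('EAST', 120), ('west', NULL), ('West', 80);
""")

# Typical pre-processing in plain SQL: normalize text and impute
# missing values with the column mean (AVG ignores NULLs).
cleaned = conn.execute("""
    SELECT UPPER(region) AS region,
           COALESCE(units, (SELECT AVG(units) FROM raw_sales)) AS units
    FROM raw_sales
""").fetchall()

print(cleaned)
```

On Snowflake, a query like this runs against the governed sales-history tables directly, so the cleaned result is ready for modeling without an extra export step.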

Operationalizing Data Science with Snowflake: data flow between Snowflake and data science tools

From there, other strengths of Snowflake kick in. These include an end-to-end connection for building a machine-learning pipeline, and the ability to combine structured and semi-structured data. A native Python connector lets queries issued from a Jupyter Notebook run on Snowflake’s scalable engine for data prep and quick data exploration. Because Jupyter Notebook is open source, data scientists can switch between SQL and Python depending on the pre-modeling task at hand, and then run machine-learning algorithms on the prepared data.
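A minimal sketch of that connector workflow might look like the following. The account details, warehouse, and table names are placeholders, and `fetch_pandas_all` requires installing the connector with its pandas extra; this is one way to wire it up, not the only one.

```python
def daily_sales_query(table: str) -> str:
    """Build a simple aggregation that Snowflake executes server-side."""
    return (
        f"SELECT order_date, SUM(units) AS total_units "
        f"FROM {table} GROUP BY order_date ORDER BY order_date"
    )

def fetch_training_frame(conn, table: str):
    """Run the query on Snowflake's engine; pull only the result into pandas."""
    cur = conn.cursor()
    cur.execute(daily_sales_query(table))
    return cur.fetch_pandas_all()  # needs the connector's [pandas] extra

if __name__ == "__main__":
    # Placeholder credentials -- replace with your own account details.
    import snowflake.connector  # pip install "snowflake-connector-python[pandas]"
    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="ANALYTICS_WH", database="SALES_DB", schema="PUBLIC",
    )
    df = fetch_training_frame(conn, "SALES_HISTORY")
    print(df.head())
```

The heavy lifting (the `GROUP BY`) happens on Snowflake’s warehouse; only the small aggregated frame lands in the notebook for modeling.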

THE SNOWFLAKE SANDBOX

One thing that’s especially appealing about Snowflake is its sandbox environment. There are four aspects to this:

  1. Easy access to data. Data scientists get what they need via either read-only access to a governed data layer or by leveraging Snowflake’s zero-copy cloning feature.
  2. Security model. Snowflake’s Role-Based Access Control (RBAC) security model is granular enough to make it easy to operationalize the sharing of different schemas and databases across teams.
  3. Separation of compute and storage. This aspect of Snowflake’s design means that the data science department’s workloads will not create concurrency issues for other workloads, such as data processing needs, end-user queries, ad-hoc requests, etc. Workloads can be configured to be isolated from one another yet share the same data store.
  4. Scalability. In Snowflake’s case, scalability means that data scientists can process millions of rows of data and execute complex joins without having to worry about Hadoop or Spark clusters. It also means they can profile data easily using extended ANSI SQL; create stored procedures in JavaScript; and consume the output of data profiling in Jupyter Notebooks via summary tables and Python-based visualizations.
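Provisioning such a sandbox comes down to a handful of SQL statements. The database, role, and team names below are hypothetical; a small helper can generate the zero-copy clone and the RBAC grants for a team:

```python
def sandbox_statements(team: str, source_db: str = "PROD_DB") -> list:
    """Generate hypothetical Snowflake DDL for a data science sandbox:
    a zero-copy clone of governed data plus role-based grants."""
    clone = f"{source_db}_{team.upper()}_SANDBOX"
    role = f"DS_{team.upper()}_ROLE"
    return [
        # Zero-copy clone: created instantly, consumes no extra storage
        # until the sandbox data diverges from the source.
        f"CREATE DATABASE {clone} CLONE {source_db};",
        # RBAC: read-only on the governed source, full use of the clone.
        f"GRANT USAGE ON DATABASE {source_db} TO ROLE {role};",
        f"GRANT SELECT ON ALL TABLES IN DATABASE {source_db} TO ROLE {role};",
        f"GRANT USAGE ON DATABASE {clone} TO ROLE {role};",
    ]

for stmt in sandbox_statements("forecasting"):
    print(stmt)
```

Because compute and storage are separated, a warehouse dedicated to this sandbox can run these teams’ workloads without touching anyone else’s.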

HIGHLIGHTS OF SNOWFLAKE’S ARCHITECTURE

Perhaps the biggest benefit of Snowflake is that data scientists and their teams can seamlessly use the tools and technologies they rely on every day in conjunction with the platform. The native Python connector makes it possible to push SQL queries down to Snowflake. This speeds up simple data engineering operations such as filters, joins, and group-bys without resorting to Pandas (which can be slow at scale). Also, the auto-scaling of Snowflake’s virtual warehouses is a powerful answer to the problem of query complexity.
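To make the pushdown point concrete, the sketch below runs the same filter-join-aggregate once in SQL and once in pandas. Against Snowflake, the SQL version executes on the warehouse and only the small aggregated result travels to the notebook; here SQLite stands in so the example runs anywhere, and the table names are made up.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (product_id INTEGER, units INTEGER);
    CREATE TABLE products (product_id INTEGER, category TEXT);
    INSERT INTO orders VALUES (1, 10), (1, 5), (2, 7);
    INSERT INTO products VALUES (1, 'widgets'), (2, 'gadgets');
""")

# Pushdown: filter, join, and GROUP BY run inside the database; only
# the aggregate comes back over the wire.
pushed = pd.read_sql(
    """SELECT p.category, SUM(o.units) AS total
       FROM orders o JOIN products p USING (product_id)
       WHERE o.units > 5 GROUP BY p.category""",
    conn,
)

# Equivalent pandas pipeline: every raw row is transferred first, then
# joined and aggregated client-side -- fine here, slow at scale.
orders = pd.read_sql("SELECT * FROM orders", conn)
products = pd.read_sql("SELECT * FROM products", conn)
local = (
    orders[orders.units > 5]
    .merge(products, on="product_id")
    .groupby("category", as_index=False)["units"].sum()
    .rename(columns={"units": "total"})
)
```

Both paths produce the same totals; the difference is where the work happens and how much data moves.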

Other benefits include flexible code-writing (in either SQL or Python); the avoidance of inefficient data replication; the ease with which results can be stored directly on Snowflake; and the ability to feed the output of a machine-learning model into business-intelligence tools. Such tools (including Tableau and Power BI) are used to create deliverables, like quarterly sales dashboards, that business audiences can use right away. Models can be scaled up by leveraging partner technologies such as Amazon SageMaker and Databricks.
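A hedged sketch of that last step: model output written back to a Snowflake table that Tableau or Power BI can query directly. `write_pandas` ships with the connector’s pandas extras; the connection, column, and table names here are all placeholders, and `auto_create_table` assumes a recent connector version.

```python
import pandas as pd

def forecast_frame() -> pd.DataFrame:
    """Hypothetical model output: one forecast row per region/quarter."""
    return pd.DataFrame({
        "REGION": ["EAST", "WEST"],
        "QUARTER": ["2020-Q2", "2020-Q2"],
        "FORECAST_UNITS": [1240.0, 980.0],
    })

def publish_forecast(conn, df: pd.DataFrame, table: str = "SALES_FORECAST"):
    """Store model output on Snowflake so BI tools can read it directly."""
    from snowflake.connector.pandas_tools import write_pandas
    success, _, nrows, _ = write_pandas(conn, df, table,
                                        auto_create_table=True)
    return success, nrows

if __name__ == "__main__":
    # conn = snowflake.connector.connect(...)  # placeholder credentials
    # publish_forecast(conn, forecast_frame())
    pass
```

Once the forecast lands in a table, the BI layer needs no extra plumbing: a dashboard simply points at `SALES_FORECAST` like any other governed table.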

Snowflake’s end-to-end connections, scalability, easy sandbox provisioning, strong SQL support, and Python integration have made it a powerful tool for us. We think it would be equally powerful for any data science function looking to spend less time organizing data and more time generating the business insights that matter.

If interested in learning more, feel free to comment, ask questions below, or reach out!

About Cervello, a Kearney company

Cervello is a data and analytics consulting firm and part of Kearney, a leading global management consulting firm. We help our leading clients win by offering unique expertise in data and analytics, and in the challenges associated with connecting data. We focus on performance management, customer and supplier relationships, and data monetization and products, serving functions from sales to finance. We are a Solution Partner of Snowflake due to its unique architecture. Find out more at Cervello.com.

About Snowflake

Snowflake delivers the Data Cloud — a global network where thousands of organizations mobilize data with near-unlimited scale, concurrency, and performance. Inside the Data Cloud, organizations unite their siloed data, easily discover and securely share governed data, and execute diverse analytic workloads. Wherever data or users live, Snowflake delivers a single and seamless experience across multiple public clouds. Join Snowflake customers, partners, and data providers already taking their businesses to new frontiers in the Data Cloud. snowflake.com.
