Snowflake’s Cloud Data Warehouse — What I Learned and Why I’m Rethinking the Data Warehouse

Achieving Performance and Simplicity in the Cloud for ORC Data Loads and Tableau Visualization

Hashmap
Jun 28, 2018

Why the Need to Rethink?

After spending significant time recently sizing, balancing, and tuning an on-premises data warehouse environment, I found that performance just wasn’t where it needed to be for the Tableau workbooks meant to analyze consumer device hardware data.

You Have to Consider the Cloud

Hang on a minute — it’s no small task to research and conduct a comprehensive assessment of all the data warehouse options out there. Where do you start? How do you select a solution that can outpace the existing platform or other traditional data warehousing solutions based on performance, scalability, elasticity, flexibility and affordability? For this project, three cloud-specific advantages stood out:

  1. As-needed scale-up / scale-down capability
  2. Outsourcing the challenging operations tasks of data warehousing management and security to the cloud vendor
  3. Potential to pay for only the storage and computing resources actually used, when you use them

Selecting Snowflake Cloud Data Warehouse

For this particular implementation, the existing data ingestion process was stable and mature. Daily, an ETL script loads the raw JSON file from the file system and inserts its contents into a SQL table stored in ORC format with Snappy compression.

Snowflake Architecture

Snowflake is a cloud data warehouse built on top of the Amazon Web Services (AWS) cloud infrastructure and is a true SaaS offering. There is no hardware (virtual or physical) to select, install, configure, or manage, and no software to install, configure, or manage. All ongoing maintenance, management, and tuning is handled by Snowflake. Its architecture is composed of three key layers:

  1. Database Storage — Snowflake stores all loaded data in a centralized, compressed, columnar format in cloud storage. This layer is fully managed; the data is accessed through SQL rather than directly.
  2. Query Processing — Snowflake provides the ability to create “Virtual Warehouses,” which are essentially compute clusters in EC2 that are provisioned behind the scenes. Virtual Warehouses can be used to load data or run queries and are capable of doing both concurrently. They can be scaled up or down on demand and can be paused when not in use to reduce the spend on compute.
  3. Cloud Services — Coordinates and handles all other services in Snowflake including sessions, authentication, SQL compilation, encryption, etc.

Connecting to Snowflake

Snowflake makes connecting to databases very easy and provides a few different methods to do so. One method is to use any of the supported ODBC drivers for Snowflake. Additionally, the SnowSQL CLI can be leveraged (install instructions are available in the Snowflake documentation), or you can use the web-based worksheet within your Snowflake account.
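
For reference, a typical SnowSQL session starts with something like the following (the account identifier and username are placeholders, and the CLI prompts for a password):

snowsql -a <account_identifier> -u <username>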

Setting Up A Virtual Warehouse

Snowflake defines a virtual warehouse as a cluster of compute resources. This warehouse provides all the required resources, such as CPU, memory, and temporary storage, to perform operations in a Snowflake session.

CREATE WAREHOUSE ETL_WH
WITH WAREHOUSE_SIZE = 'SMALL'
WAREHOUSE_TYPE = 'STANDARD'
AUTO_SUSPEND = 600
AUTO_RESUME = TRUE;
CREATE WAREHOUSE QUERY_WH
WITH WAREHOUSE_SIZE = 'MEDIUM'
WAREHOUSE_TYPE = 'STANDARD'
AUTO_SUSPEND = 600
AUTO_RESUME = TRUE;
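
Because compute is decoupled from storage, a warehouse can be resized or suspended at any point without touching the data. A few illustrative commands against the warehouses created above:

ALTER WAREHOUSE QUERY_WH SET WAREHOUSE_SIZE = 'LARGE'; -- scale up ahead of a heavy reporting window
ALTER WAREHOUSE QUERY_WH SET WAREHOUSE_SIZE = 'MEDIUM'; -- scale back down afterwards
ALTER WAREHOUSE ETL_WH SUSPEND; -- stop paying for compute while the loader is idle
ALTER WAREHOUSE ETL_WH RESUME;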

Databases, Tables and Views within Snowflake

All data in Snowflake is maintained in databases. Each database consists of one or more schemas, which are logical groupings of database objects, such as tables and views. Snowflake does not place any hard limits on the number of databases, schemas (within a database), or objects (within a schema) that you can create.

Snowflake Databases

As we go through creating a database, keep in mind that our working project constraints were as follows:

  1. Maintain backward compatibility with the existing Tableau workbooks

With that constraint in mind, two databases were created, one for each table design, each with a POC schema:

CREATE DATABASE MULTI_COLUMN_DB;
CREATE DATABASE SINGLE_VARIANT_DB;
CREATE SCHEMA "MULTI_COLUMN_DB".POC;
CREATE SCHEMA "SINGLE_VARIANT_DB".POC;

Multiple Column Tables

The multiple column table specifies the same columns and types as the existing DW schema.

CREATE TABLE "MULTI_COLUMN_DB"."POC".DEVICEINFO (
id string,
manufacturer string,
serialNumber string,
etimestamp timestamp,
model string,
accuracy int,
batch_id bigint,
event_date date)
cluster by (event_date);

Single Variant Column Tables

A table with a single VARIANT column is needed to hold the raw ORC records; the view and COPY statements below reference it as DEVICEINFO_VAR.
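
A minimal sketch of that DDL, assuming the entire ORC record lands in a single VARIANT column (the column name src is illustrative):

CREATE TABLE "SINGLE_VARIANT_DB"."POC".DEVICEINFO_VAR (
src variant -- each ORC record is stored whole in this column
);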

Views

A view is required to meet the backward compatibility constraint, so that the existing queries can run against the single variant table unchanged.

CREATE "SINGLE_VARIANT_DB"."POC”.VIEW DEVICEINFO as
SELECT
$1:_col0::string as id,
$1:_col1::string as manufacturer,
$1:_col2::string as serialNumber ,
$1:_col3::timestamp as etimestamp ,
$1:_col4::string as model ,
$1:_col5::number as accuraccy,
$1:_col6::number as batch_id ,
$1:_col4::date as event_date,
FROM DEVICEINFO_VAR;

Loading Data

Now that the database and table have been created, it’s time to load the 6 months of ORC data. Snowflake assumes that the ORC files have already been staged in an S3 bucket. I used the AWS upload interface/utilities to stage the 6 months of ORC data which ended up being 1.6 million ORC files and 233GB in size.
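
For reference, an upload like this can also be scripted with the AWS CLI instead of the console (the local path below is illustrative; the bucket and prefix match the stage and COPY statements later in this post):

aws s3 sync ./orc_export/tblorc_deviceinfo s3://ntgr-snowflake/tblorc_deviceinfo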

Create Stage Object

When setting up a stage you have a few choices, including loading the data locally, using Snowflake-managed internal staging, or pointing at your own S3 bucket. I chose the latter, as it provides long-term retention of the data for future needs.

CREATE STAGE "MULTI_COLUMN_DB"."POC".ORC_SNOWFLAKE
URL = 's3://ntgr-snowflake'
CREDENTIALS = (AWS_KEY_ID = 'xxxxxx' AWS_SECRET_KEY = '******');
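
Before loading, it is worth confirming that Snowflake can actually see the staged files; the path here matches the one used in the COPY statements below:

LIST @"MULTI_COLUMN_DB"."POC".ORC_SNOWFLAKE/tblorc_deviceinfo;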

Create File Format Objects

A named file format object provides a convenient means to store all of the format information required for loading data from files into tables.

CREATE FILE FORMAT "MULTI_COLUMN_DB"."POC".ORC_AUTO
TYPE = 'ORC'
COMPRESSION = 'AUTO';

Copy Data Into the Target Table

Execute COPY INTO table to load your staged data into the target table. Below is a code sample for both a Multiple Column Table and a Single Variant Column Table.

Multiple Column Table

copy into "MULTI_COLUMN_DB"."POC".DEVICEINFO FROM
(SELECT
$1:_col0::string, --id string
$1:_col1::string, --manufacturer string
$1:_col2::string, --serialnumber string
$1:_col3:: timestamp, --etimestamp string
$1:_col4:: string, --model string
$1:_col5:: number, --accuraccy string
$1:_col6::number, --batch_id bigint
date_trunc('DAY',$1:_col3::timestamp) --event_date date
from @ORC_SNOWFLAKE/tblorc_deviceinfo
(FILE_FORMAT=>ORC_AUTO) ON_ERROR = 'continue');

Single Variant Column Table

COPY INTO "SINGLE_VARIANT_DB"."POC”.DEVICEINFO_VAR
from @ORC_SNOWFLAKE/tblorc_deviceinfo
FILE_FORMAT = ORC_AUTO
ON_ERROR = 'continue';

Verify the Loaded Data

Execute a SELECT query to verify that the data was loaded successfully.

select * from "MULTI_COLUMN_DB"."POC".DEVICEINFO;
select * from "SINGLE_VARIANT_DB"."POC".DEVICEINFO;

Tableau Workbook Results

Once the data was loaded, I started Tableau, created a connection to Snowflake, and then set the data source to point to the correct Snowflake database, schemas, and tables.

Establishing a Baseline

To establish a baseline for the query performance benchmarks, I’m using a medium-sized warehouse (the QUERY_WH created earlier) for the first round of testing. It will be interesting to see how the initial query benchmarks compare to the current DW at that size; I expect the results to be similar.
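
One way to capture comparable timings on the Snowflake side is the QUERY_HISTORY table function; a minimal sketch (the filter and limit are illustrative):

SELECT query_text, warehouse_name, total_elapsed_time/1000 AS elapsed_seconds
FROM table(information_schema.query_history())
WHERE warehouse_name = 'QUERY_WH'
ORDER BY start_time DESC
LIMIT 20;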

Concurrency with Snowflake

When it comes to concurrency, resizing a warehouse alone will not address queuing issues. For that, Snowflake’s multi-cluster warehouses are designed specifically to handle queuing and performance problems caused by large numbers of concurrent users and/or queries.
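
A multi-cluster warehouse is declared with minimum and maximum cluster counts, so Snowflake can add and remove clusters as concurrent demand changes. A sketch using this project’s query warehouse as a template (the QUERY_WH_MC name is illustrative, and multi-cluster warehouses require the Enterprise edition or above):

CREATE OR REPLACE WAREHOUSE QUERY_WH_MC
WITH WAREHOUSE_SIZE = 'MEDIUM'
WAREHOUSE_TYPE = 'STANDARD'
MIN_CLUSTER_COUNT = 1 -- run a single cluster when demand is light
MAX_CLUSTER_COUNT = 4 -- allow up to four clusters under heavy concurrency
SCALING_POLICY = 'STANDARD'
AUTO_SUSPEND = 600
AUTO_RESUME = TRUE;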

Snowflake’s Unique Advantages

Working on this Snowflake project was a great experience. The Snowflake architecture was designed from the ground up to take advantage of the cloud, and it adds some unique benefits that make for a very compelling solution.

Final Thoughts

Data warehousing is rapidly moving to the cloud and solutions such as Snowflake provide some distinct advantages over legacy technologies as outlined above.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages, and would be glad to work through your specifics.

