First Data Warehouse Experience with Snowflake

# Tutorial: Using sample datasets on QueryPie

Published in

QueryPie

6 min readJan 22, 2020

It’s not surprising that 2025 will be the age of data. But what’s really amazing is the amount of data. IDC reports 163ZB of data will be generated in 2025. In an era with so much data to analyze, how do we create the best data analytics environment?

Seven or eight years ago, the company I worked at built a Hadoop Cluster on hundreds of machines and used Hadoop as a data store. We then connected Hive to Hadoop and extracted the data from the Hive Query. But loading and analyzing hundreds of gigabytes of data took quite a while, and we had to continue managing the cluster as the size of data continued to grow.

What is Snowflake?

Snowflake is the new leader in the data warehouse industry that has emerged to solve the challenges of this data analytics environment. Founded in 2013, the company emerged from a two-year stealth development mode in 2014 and quickly encroached on the data warehouse market with the keyword data warehouse as a service. According to Datanyze, Snowflake is ranked fourth in the Data Warehouse market, beating traditional database companies like Oracle and IBM.

The core value Snowflake delivers is the ability to continuously scale data without requiring a separate D/W administrator by eliminating all administrative elements. Best of all, all data access is based on SQL, the standard language in the data space.

In the end, Snowflake provided a fully SaaS-based approach to building, scaling and managing a D/W for the data analytics process. Because they were able to focus on the inherent capabilities of data analytics, they quickly domineered the market.

Now let’s look at how Snowflake uses SQL to analyze your data.

Getting Started with Snowflake — (trial account)

First, you need to create an account to access Snowflake. You can create a trial account on the page below.

Snowflake (Free trial): https://trial.snowflake.com/

Once you’ve created an account and logged in, you’ll find a basic sample data set. Let’s use this data set to run some easy queries. Once connected to the SNOWFLAKE_SAMPLE_DATA database, you’ll see schemas that start with TPC-H and TPC-DS. Let’s use the data from the TPCH schema, which is composed of customer data.

The difference between built-in schemas is basically the data size difference (the number on the end). Because Snowflake often handles a vast amount of data, it appears that we have these datasets in place to test the difference between different amounts of data.

TPC-H comes with various data set sizes to test different scaling factors. For demonstration purposes, we’ve shared four versions of the TPC-H data. The data is provided in the following schemas in the
SNOWFLAKE_SAMPLE_DATA shared database:
- TPCH_SF1: Consists of the base row size (several million elements).
- TPCH_SF10: Consists of the base row size x 10.
- TPCH_SF100: Consists of the base row size x 100 (several hundred million elements).
- TPCH_SF1000: Consists of the base row size x 1000 (several billion elements).

(Reference: https://docs.snowflake.net/manuals/user-guide/sample-data-tpch.html#database-and-schemas)

Now let’s start with a simple query using a sample data set.

Sample Query: Pricing Summary Report Query

Let’s say we want to execute a query that summarizes the price information of all goods shipped on a specific date (within 60–120 days). To get this information, we can use returnflag and linestatus to get:

total quantity, price, discount price, additional tax
average quantity, price, and discounted price

These queries can also be recalled immediately by clicking on the drop-down menu at the top of the worksheet tab. (Tutorial 1: Sample queries on TPC-H data)

use schema snowflake_sample_data.tpch_sf1;
-- or snowflake_sample_data.{tpch_sf10 | tpch_sf100 | tpch_sf1000}select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1-l_discount)) as sum_disc_price,               sum(l_extendedprice * (1-l_discount) * (1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
        lineitem
where
        l_shipdate <= dateadd(day, -90, to_date('1998-12-01'))
group by
        l_returnflag,
        l_linestatus
order by 
        l_returnflag,
        l_linestatus;

Run queries within the Snowflake Web Interface

Because the query is provided by snowflake as a sample query, the worksheets tab can be called up automatically and conveniently executed. However, when comparing the structure of the table, columns, or object items, it’s a little inconvenient in its UI.

So let’s take a look at how easy it is to connect to Snowflake through QueryPie and quickly write SQL as well as easily view table information.

Snowflake Sample Query on QueryPie

(1) Useful Object Panel

The inconvenience with querying in the Snowflake interface is largely in part due to not being able to write a query by looking at the information in the columns required in the table.

For example, let’s say we want to limit queries to a specific date and check the most recent sent date of l_shipdate. Snowflake’s Web Interface allows the user to view the table information through Data Preview, but because it limits the number of data to the first 100 rows, it is difficult to find the results we want.

Limited rows in Data preview — Snowflake Web Interface

On the other hand, QueryPie makes it easy to query information through the object panel on the right when you click on the table. And because there is no limit on the number of rows, you can quickly apply a filter to see the maximum/minimum value of the column you want.

(2) Executing Multi-Queries at once

Another inconvenience with the Snowflake Web Interface is that when running multiple queries, only the result of the last syntax is visible in the window. This is really inconvenient when running multiple queries at the same time.

Thankfully, QueryPie allows you to query individual query results simultaneously in the results panel. For example, suppose you look up the results in a comparison based on a particular variable. The Web Interface allows you to run individual queries and then view the results for them one at a time. But when using QueryPie, you can run two separate queries and view both results at the same time.

You can run queries simultaneously in the Snowflake Run window.

But when running a concurrent query, only the result statements of the last query are available.

Another example: if you run the query below, running the query itself is simultaneously possible within the Web Interface, but you cannot view the results at once. On the other hand, QueryPie makes it easy to see results at a glance because it allows you to compare results queries to individual queries at the same time.

USE SCHEMA "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF001";# Result based on TOTAL
SELECT C.C_CUSTKEY, COUNT(O.O_ORDERKEY) AS NUMBER, SUM (O.O_TOTALPRICE) AS TOTALFROM CUSTOMER C
INNER JOIN ORDERS O ON C.C_CUSTKEY=O.O_CUSTKEY
GROUP BY C.C_CUSTKEY
ORDER BY TOTAL DESC;# Result based on NUMBER
SELECT C.C_CUSTKEY, COUNT(O.O_ORDERKEY) AS NUMBER, SUM (O.O_TOTALPRICE) AS TOTAL
FROM CUSTOMER C
INNER JOIN ORDERS O ON C.C_CUSTKEY=O.O_CUSTKEY
GROUP BY C.C_CUSTKEY
ORDER BY NUMBER DESC;

QueryPie allows you to view the results of multiple queries at the same time

As such, QueryPie makes it easier and faster to run SQL than the Snowflake Web Interface. It also provides a convenient and easy way to look at the table. For those of you who had an uncomfortable Snowflake Web Interface environment, try using QueryPie to take a look at your Snowflake data!

📌 Download QueryPie: https://www.querypie.com/download
📌 See how to connect Snowflake on QueryPie: http://bit.ly/querypie_dw_updates