The Ungifted Amateur’s Guide to Snowflake

Where the Magic of Data Meets the Power of Snow!

I have been using Snowflake on a daily basis for the past couple of years, and I absolutely enjoy every minute I spend on the platform. I am also passionate about building Streamlit apps and have developed some fun projects in the past. Today, I’m excited to share my latest project with you: my personal Snowflake cheat sheet. Rather than building a mundane document, I’ve decided to create this cheat sheet as an interactive wonderland powered by the magic of Streamlit. After all, what better way to make a cool interactive cheat sheet than by harnessing the power of Streamlit? ❄️🤝🎈

I use this cheat sheet for my own work regularly, and I hope it will help you unlock the full potential of Snowflake in your awesome data projects as well.

🔗 Check out the cheat sheet here:

https://snow-flake-cheat-sheet.streamlit.app/

As you venture into the world of Snowflake, its feature-rich ecosystem can be both exciting and overwhelming, especially for new users. The Ungifted Amateur’s Guide to Snowflake can serve as a quick reference guide designed to assist users — old and new alike — in their day-to-day workflow.

This companion blog post is not intended to provide a comprehensive walkthrough of all the capabilities Snowflake has to offer — the depth of Snowflake’s features is nothing short of astounding, and no single post can do justice to them all. Instead, I aim to provide a swift beginner’s guide to some of the most essential and commonly used objects within the Snowflake platform, with a focus on data science and data engineering workflows. Whether you’re dealing with tables, views, stages, or snowpipes, the cheat sheet and this companion article will help you grasp the basics and navigate your Snowflake journey with ease.

Let’s dive right in! 🤿

❄️ Snowflake

The cool place where all the data magic happens

Snowflake is a cloud-based full-stack data platform that allows users to securely store, manage, and analyze their data at scale. What makes Snowflake truly stand out is its architecture, which decouples storage and compute, providing unparalleled flexibility and scalability. This unique separation of storage and compute resources means that users can efficiently scale their data operations up or down without any hassle, all while maintaining the highest level of data security and access control.

Snowflake’s ease of use, reliability, and ability to support a wide range of data workloads make it a great choice for data engineers, data analysts, and data scientists worldwide. The platform simplifies data integration, supports diverse data formats, and enhances collaboration among teams by providing a unified and easy-to-use interface for managing, sharing, and analyzing data. It has also recently added the native capability to build data-powered apps using Streamlit directly on the platform. While this topic is outside the scope of this article, I encourage you to explore this new feature on your own and discover what you can do with your new superpower!

In the coming sections of this blog post, we’ll dive into the fundamental concepts that every Snowflake user should be familiar with. Let’s start with the most fundamental of them all!

🗄 Database

The gigantic storage drawer that holds many collections of your data together

In Snowflake, a database is a logical container for organizing and managing data effectively. It acts as a central repository for secure storage and manipulation of large volumes of data. With Snowflake’s databases, you can easily organize data into different collections or datasets based on your needs.

Databases in Snowflake offer advantages like logical separation of data sets, maintaining data isolation, and avoiding conflicts. This is particularly useful in multi-tenant environments. Additionally, Snowflake databases provide high security and access control. You can define granular permissions and access controls, protecting sensitive data and maintaining privacy. The architecture allows for separate storage and compute resources, optimizing query performance by allocating the right resources for specific databases or workloads.

Now let’s kick off our example by creating a database to serve as a central repository for all the magical data and artifacts we’ll be working with.

-- Create a database to hold all the relevant artifacts for the wizarding world
CREATE DATABASE WIZARDING_WORLD;
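Since granular permissions came up above, here’s a quick, hedged sketch of what access control on a database can look like. The WIZARDS role below is a hypothetical role, used purely for illustration:

-- Allow a hypothetical WIZARDS role to use the database
GRANT USAGE ON DATABASE WIZARDING_WORLD TO ROLE WIZARDS;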

Now let’s dive in and unlock the magic of data!

🗃 Schema

The individual drawers in your storage unit that help you organize your data

A schema in Snowflake is a way to organize and manage data effectively within a database. It acts as an individual drawer in your database storage unit, providing logical separation of objects within the database. Schemas are useful for categorizing and organizing data based on different criteria, such as departments, projects, or data types.

In Snowflake, you can create a schema by specifying its name and associating it with a database. Once created, you can use the schema as a container to hold tables, views, and other database objects. This provides a structured approach to organizing your data and makes it easier to locate and manage specific data sets.

For example, let’s create a schema called HOGWARTS within the WIZARDING_WORLD database:

-- Create a schema to organize the relevant Hogwarts artifacts in one place
CREATE SCHEMA WIZARDING_WORLD.HOGWARTS;

This makes it easier to manage and work with the data associated with Hogwarts students within the overall WIZARDING_WORLD database.
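One small tip before we move on: instead of typing the fully qualified path in every statement, you can set the session context once. This is also why some of the later examples reference objects like STUDENT without the full WIZARDING_WORLD.HOGWARTS prefix.

-- Set the session context so unqualified object names resolve to this schema
USE SCHEMA WIZARDING_WORLD.HOGWARTS;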

📊 Table

The big spreadsheet that holds all your data and organizes it in rows and columns

Tables in Snowflake are a fundamental component of the platform and serve as magical spreadsheets to store and organize your data in rows and columns. They provide a structured and organized way to store and query your data, making it easier to analyze and retrieve information.

Tables can hold a wide range of data, including numerical values, text, dates, semi-structured data, geospatial data, and more. See here for the full list of supported data types.

To illustrate how Snowflake tables are used, let’s create an example table for students in the magical world.

-- Create a table for Hogwarts students
CREATE TABLE WIZARDING_WORLD.HOGWARTS.STUDENT (
    id INT PRIMARY KEY,
    name VARCHAR,
    main_characteristic VARCHAR,
    start_date DATE,
    end_date DATE,
    house VARCHAR,
    current_location VARCHAR
);

In this example, the STUDENT table will hold information about the students, such as their unique ID (primary key), name, main characteristic, start and end dates, house affiliation, and current location. This table allows you to efficiently store and manage the data, making it easy to query and analyze information about the students at Hogwarts.

Although manual insertion is not the most common way to populate a table in a data project, it can still be useful in certain situations. Here is how you can insert five new rows into the STUDENT table manually:

-- Insert five new students from different houses into the table
INSERT INTO WIZARDING_WORLD.HOGWARTS.STUDENT
(id, name, main_characteristic, start_date, end_date, house, current_location)
VALUES
(1, 'Harry Potter', 'Bravery', '1991-09-01', '1998-05-02', NULL, 'Gryffindor Common Room'),
(2, 'Hannah Abbott', 'Loyalty', '1991-09-01', '1998-05-02', 'Hufflepuff', 'Great Hall'),
(3, 'Padma Patil', 'Creativity', '1991-09-01', '1998-05-02', NULL, 'Ravenclaw Tower'),
(4, 'Draco Malfoy', 'Ambition', '1991-09-01', '1998-05-02', 'Slytherin', 'Slytherin Dungeon'),
(5, 'Siavash Yasini', 'Creativity', '2023-09-01', NULL, 'Ravenclaw', 'Library')
;

For larger datasets (which is basically any dataset these days) or for automated data loading, it’s more efficient to use Snowflake’s data loading capabilities. In a later section, we will explore how to load data into a table automatically from a stage in batches using a COPY INTO command or incrementally using a PIPE.

🔎 View

The thing that lets you see a segment of your data in a special way

Sometimes, we want to see only a part of our data as is, or a transformed version of it. That’s where views come in. In Snowflake, views allow you to define a customized perspective of your data, showing only the information you need. They act as virtual tables that are based on the underlying data in other tables or views. Views can be used to simplify complex queries, hide sensitive data, or provide a specific subset of data for analysis or reporting purposes.

For example, let’s create a view called GREAT_HALL that shows all the students currently located at the Great Hall:

CREATE OR REPLACE VIEW GREAT_HALL AS
SELECT
    CONCAT(name, ' from ', house) AS description
FROM
    STUDENT
WHERE
    current_location = 'Great Hall'
;

Now, as the underlying data changes, we can simply run a query against this view to always see the students who are at the Great Hall at any given time.

SELECT
    *
FROM
    WIZARDING_WORLD.HOGWARTS.GREAT_HALL
;
Results returned by querying the GREAT_HALL view

When querying a view, the original query specified in the view definition is executed in the background. Therefore, repeatedly querying a view can be costly if the query is resource-intensive and complex. Additionally, if the underlying data remains unchanged, it is inefficient to rerun the same query and obtain identical results. But don’t worry, because materialized views take care of both these issues.

📸 Materialized View

The thing that captures a snapshot of your data and stores it (similar to a view, but more permanent)

A materialized view in Snowflake is a powerful tool that allows you to capture a snapshot of your data and store it as a separate object. Unlike regular views, which execute the underlying query each time they are queried, materialized views persist the result of the query. This means that the result is stored in the materialized view and can be accessed quickly without having to recalculate the query every time. This makes materialized views particularly useful for scenarios where the underlying data is relatively static, and the result of the query remains valid for a longer period of time.

To illustrate this with an example, suppose we want to retrieve a list of current Ravenclaw students from the WIZARDING_WORLD.HOGWARTS.STUDENT table. The query involves filtering based on the house, start date, and end date of the students. Since the result is likely to remain the same for an entire academic year, it makes sense to use a materialized view to store the result and avoid recalculating the query every time.

-- Create a materialized view to store the list of current Ravenclaw students
CREATE MATERIALIZED VIEW CURRENT_RAVENCLAW AS
SELECT
    id,
    name,
    main_characteristic,
    start_date,
    end_date,
    house,
    current_location
FROM
    STUDENT
WHERE TRUE
    AND house = 'Ravenclaw' -- member of the Ravenclaw house
    AND DATEDIFF('year', start_date, '2023-01-01') <= 7 -- started within the past 7 years
    AND (end_date IS NULL OR end_date > '2023-01-01') -- doesn't have an end date or hasn't finished yet as of 2023
;

As cool as they are, materialized views come with quite a few limitations. For example, in the script above, we can’t use CURRENT_DATE() to get the current date, since it’s a non-deterministic function. Hardcoding a date like I just did is also bad practice and makes your results prone to errors down the road. You’re also not allowed to use JOINs in materialized views; they can only query a single base table. Window functions, the HAVING clause, and the LIMIT clause are also beyond the limit—pun intended! If you need something like a materialized view that relies on any of these, you may have to either set up tasks and streams that update a table regularly or use a DYNAMIC TABLE. The details of those are outside the scope of this blog post, but a small taste of dynamic tables follows below.
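Here is a minimal, hedged sketch of what a dynamic table version of our query might look like, just to whet your appetite. TARGET_LAG controls how stale the results are allowed to get, and the hogwarts_wh warehouse (which also appears in the alert example later) is assumed to exist:

-- A minimal sketch of a dynamic table alternative
-- (assumes a warehouse named hogwarts_wh exists)
CREATE OR REPLACE DYNAMIC TABLE CURRENT_RAVENCLAW_DT
TARGET_LAG = '1 day'    -- results may be at most one day stale
WAREHOUSE = hogwarts_wh -- compute used for automatic refreshes
AS
SELECT
    id,
    name,
    house,
    current_location
FROM
    STUDENT
WHERE
    house = 'Ravenclaw'
;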

Okay, now let’s run a select query on our materialized view and see what we get.

SELECT
    *
FROM
    CURRENT_RAVENCLAW
;
Results returned by querying the CURRENT_RAVENCLAW view

Hmm… Looks like I’m the only one here so far!

🚉 Stage

The platform where the data sits before moving in and out of Snowflake

Before data moves in and out of Snowflake, it sits on a stage. In Snowflake, a stage acts as a platform where data is temporarily stored, ready for loading or unloading. It serves as an intermediary step between your data source and Snowflake, allowing you to efficiently transfer data into and out of the platform.

Stages in Snowflake can be either internal or external. An internal stage is a location managed by Snowflake, while an external stage is a location outside of Snowflake (such as an Amazon S3 bucket, Azure Blob, or Google Cloud Storage). Both types of stages offer advantages depending on your specific use case.
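To make the external flavor concrete, here’s a hedged sketch of a stage pointing at a hypothetical S3 bucket. The bucket URL and storage integration name below are made up for illustration; in practice, you’d authenticate through a storage integration set up by your admin.

-- A sketch of an external stage over a hypothetical S3 bucket
-- (the URL and storage integration name are illustrative assumptions)
CREATE OR REPLACE STAGE OWL_POST_DEPOT
URL = 's3://hogwarts-owl-post/'
STORAGE_INTEGRATION = hogwarts_s3_integration
FILE_FORMAT = (TYPE = 'CSV')
;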

A stage in Snowflake provides benefits such as allowing data to be loaded from various sources without directly moving it into a table, improving performance and flexibility. It also enables storing intermediate data files during data transformations or movements, which is useful for performing complex operations before loading into a table, or for temporarily staging data for further processing or analysis.

To illustrate how a stage is used in Snowflake, let’s create an example. We will create a stage called Platform 9¾ and add a CSV file of students to sit there while they’re waiting to be transferred to Hogwarts.

-- Create a stage called PLATFORM_9_3Q
CREATE OR REPLACE STAGE PLATFORM_9_3Q
FILE_FORMAT = (
    TYPE = 'CSV'
    FIELD_DELIMITER = ','
)
;

There are various ways to put files on a stage. Here, for simplicity, we use the Snowflake UI to upload files to the stage.

Snowflake UI for uploading a CSV file
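If you prefer the command line, you can also upload files from a SnowSQL session with the PUT command. Here’s a minimal sketch; the local file path is a placeholder, and AUTO_COMPRESS = FALSE keeps the file name unchanged so it matches the COPY INTO example in the next section:

-- Upload a local CSV file to the stage from a SnowSQL session
-- (the local path is a hypothetical example)
PUT file:///tmp/hogwarts_students.csv @PLATFORM_9_3Q AUTO_COMPRESS = FALSE;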

🚚 Copy Into

The thing you do to load data into Snowflake tables

There are two main ways to copy data into Snowflake tables from a stage: using the COPY INTO command for batch transfer and using PIPEs for incremental transfer, which is more suitable for streaming data.

In our use case, we will use the COPY INTO command to transfer data from the stage to our STUDENT table. Here is an example of how we can clear all the existing data and move new data into the table:

-- Remove all existing data from the student table
TRUNCATE TABLE STUDENT;

-- Copy data from the stage into the student table
COPY INTO WIZARDING_WORLD.HOGWARTS.STUDENT
FROM @PLATFORM_9_3Q
FILES = ('hogwarts_students.csv')
FILE_FORMAT = (
    TYPE = CSV,
    SKIP_HEADER = 1,
    FIELD_DELIMITER = ','
)
ON_ERROR = ABORT_STATEMENT
;

This COPY INTO command will load the data from the specified files in the PLATFORM_9_3Q stage into the STUDENT table in the HOGWARTS schema of the WIZARDING_WORLD database. The file format is specified as CSV, with a header row to be skipped and a comma (,) as the field delimiter. If any error occurs during the copy process, the command will abort the statement.

Using the COPY INTO command provides a straightforward and efficient way to transfer data in batches from a stage to a Snowflake table. Now let's check out the data we just loaded into the table.

SELECT
    *
FROM
    STUDENT
LIMIT 10
;
Results returned by querying the head of the STUDENT table
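Before we move on, it’s worth noting that COPY INTO also works in the opposite direction: you can unload query results from a table back onto a stage. Here’s a minimal sketch (the exports/gryffindor path prefix is just an example):

-- Unload the Gryffindor students from the table back onto the stage as CSV
COPY INTO @PLATFORM_9_3Q/exports/gryffindor
FROM (
    SELECT * FROM STUDENT WHERE house = 'Gryffindor'
)
FILE_FORMAT = (TYPE = CSV)
OVERWRITE = TRUE
;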

🚰 Pipe

The magic portal that helps move data into Snowflake tables

An alternative way to move data into Snowflake tables is through pipes, which seamlessly and efficiently load data from a stage into a table. Unlike the COPY INTO command, which is suited to batch transfers, pipes are designed for incremental transfers, making them ideal for scenarios where data is continuously streaming in or updating.

For example, let’s say we have daily visitor data for Hogwarts, and we want to avoid running the COPY INTO command on a daily basis. We can set up a pipe called 🚂 HOGWARTS_EXPRESS that continuously moves data from the PLATFORM_9_3Q stage into a VISITORS table.
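Since the VISITORS table hasn’t shown up in our example yet, let’s first sketch a minimal version of it (the columns here are illustrative assumptions):

-- Create a minimal table to hold daily visitor data
-- (the columns are illustrative assumptions)
CREATE TABLE IF NOT EXISTS WIZARDING_WORLD.HOGWARTS.VISITORS (
    visitor_name VARCHAR,
    visit_date DATE,
    purpose VARCHAR
);

With the table in place, we can create the pipe: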

-- Create a pipe to seamlessly transfer data from the stage into the table
CREATE PIPE HOGWARTS_EXPRESS
AS
COPY INTO WIZARDING_WORLD.HOGWARTS.VISITORS
FROM @PLATFORM_9_3Q
FILE_FORMAT = (
    TYPE = CSV,
    SKIP_HEADER = 1,
    FIELD_DELIMITER = ','
)
;

With this pipe in place, new data files added to the PLATFORM_9_3Q stage can be loaded into the VISITORS table without rerunning the COPY INTO logic by hand. One caveat worth knowing: fully automatic ingestion (AUTO_INGEST = TRUE) is supported for external stages backed by cloud event notifications, while pipes on internal stages like ours are triggered through the Snowpipe REST API. Either way, the table stays up to date with the latest visitor information, providing near-real-time insights into the daily activities at Hogwarts.
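For completeness, here’s a hedged sketch of what an auto-ingesting pipe over a hypothetical external stage might look like (the EXTERNAL_PLATFORM stage is made up for illustration and would need cloud event notifications configured on its bucket):

-- A sketch of an auto-ingesting pipe over a hypothetical external stage
-- (EXTERNAL_PLATFORM is an illustrative assumption)
CREATE PIPE HOGWARTS_EXPRESS_AUTO
AUTO_INGEST = TRUE
AS
COPY INTO WIZARDING_WORLD.HOGWARTS.VISITORS
FROM @EXTERNAL_PLATFORM
FILE_FORMAT = (TYPE = CSV, SKIP_HEADER = 1)
;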

🪜 Procedure

The thing that follows a set of steps to transform your data

When you need to perform a set of steps to manipulate your data, you can use a procedure in Snowflake. A procedure is a collection of SQL statements that automate a series of actions, bringing your data transformations to life. Procedures can be created, executed, and scheduled within the Snowflake platform, allowing you to easily automate repetitive tasks and ensure consistent data processing.

Procedures are particularly useful when you need to perform complex data operations or when you want to ensure consistent data quality and integrity. They allow you to encapsulate your logic into a reusable code block, making it easier to manage and maintain your data transformations.

Let’s illustrate the power of Snowflake procedures with an example. Imagine that you have a new batch of students who entered the WIZARDING_WORLD.HOGWARTS.STUDENT table, and since they're in their first year, their house information is all missing. To rectify this, we can write a procedure that performs the 🧙 SORTING_CEREMONY(), updating the house of each student based on their main characteristic.

-- Create a procedure to perform the Sorting Ceremony and update student houses
CREATE OR REPLACE PROCEDURE SORTING_CEREMONY()
RETURNS STRING
LANGUAGE SQL
COMMENT = 'Sort students into houses based on their main characteristic.'
AS $$
BEGIN
    UPDATE STUDENT
    SET HOUSE = (
        CASE
            WHEN MAIN_CHARACTERISTIC = 'Bravery' THEN 'Gryffindor'
            WHEN MAIN_CHARACTERISTIC = 'Loyalty' THEN 'Hufflepuff'
            WHEN MAIN_CHARACTERISTIC = 'Creativity' THEN 'Ravenclaw'
            WHEN MAIN_CHARACTERISTIC = 'Ambition' THEN 'Slytherin'
        END
    );
    RETURN 'Ta Da!';
END
$$
;

The procedure uses a SQL command to UPDATE the HOUSE column of the STUDENT table based on each student's main characteristic. The CASE statement is used to map each main characteristic to the corresponding house.

To execute the procedure, you can simply call it using:

CALL WIZARDING_WORLD.HOGWARTS.SORTING_CEREMONY();

This will trigger the Sorting Ceremony, updating the houses of the students in the WIZARDING_WORLD.HOGWARTS.STUDENT table based on their main characteristic. You can schedule the procedure to run at specific intervals (e.g. once a year at the beginning of the new school year) or call it manually whenever needed.
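The scheduling itself is done with a task. As a hedged sketch, a task that runs the Sorting Ceremony at midnight UTC every September 1st might look like this (assuming the hogwarts_wh warehouse that also appears in the alert example below):

-- A sketch of a task that runs the Sorting Ceremony every September 1st
-- (assumes a warehouse named hogwarts_wh exists)
CREATE OR REPLACE TASK YEARLY_SORTING
WAREHOUSE = hogwarts_wh
SCHEDULE = 'USING CRON 0 0 1 9 * UTC' -- midnight UTC on September 1st
AS
CALL WIZARDING_WORLD.HOGWARTS.SORTING_CEREMONY()
;

-- Tasks are created in a suspended state; resume to activate
ALTER TASK YEARLY_SORTING RESUME;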

💡 Question for the observant reader:
As you may have noticed, this procedure updates the “house” column for the entire dataset. However, this would be overkill for large datasets, as all students typically have a determined house except for the first-year students. Can you think of a modification to this procedure that would only update the “house” column for the first-year students?
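If you want to check your answer, here is one possible modification, sketched as a standalone UPDATE: add a predicate so only students without an assigned house are touched.

-- One possible answer: only update students whose house is still unassigned
UPDATE STUDENT
SET HOUSE = (
    CASE
        WHEN MAIN_CHARACTERISTIC = 'Bravery' THEN 'Gryffindor'
        WHEN MAIN_CHARACTERISTIC = 'Loyalty' THEN 'Hufflepuff'
        WHEN MAIN_CHARACTERISTIC = 'Creativity' THEN 'Ravenclaw'
        WHEN MAIN_CHARACTERISTIC = 'Ambition' THEN 'Slytherin'
    END
)
WHERE HOUSE IS NULL -- first-years are the only ones without a house yet
;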

🚨 Alert

The thing that lets you know when your data has met certain criteria

Alerts in Snowflake are a powerful feature that helps you stay informed about important events in your data. With alerts, you can set up notifications that trigger when specific conditions are met, allowing you to take timely action and address any issues or opportunities.

To set up an alert, you define the conditions that need to be met and specify the action to be taken when those conditions are satisfied. The conditions can be based on various factors such as data values, query results, or system events. The actions can include sending an email, executing a stored procedure, or invoking a webhook.

To illustrate this, let’s look at an example of an alert that Professor McGonagall has set up for Professor Dumbledore:

-- Create an alert named SEND_OWL
CREATE OR REPLACE ALERT SEND_OWL
WAREHOUSE = hogwarts_wh
SCHEDULE = '60 MINUTE'
IF (
    EXISTS (
        SELECT *
        FROM STUDENT
        WHERE TRUE
            AND name = 'Tom Marvolo Riddle'
            AND current_location = 'Chamber of Secrets'
    )
)
THEN
    CALL SYSTEM$SEND_EMAIL(
        'hogwarts_email',
        'albus.dumbledore@hogwarts.edu',
        'Owl Alert: Situation in the Chamber of Secrets',
        'Attention: Tom Riddle was detected in the Chamber of Secrets. Please take appropriate action.'
    )
;

Now, let’s activate it so Dumbledore can monitor the data for any suspicious activities at the Chamber of Secrets, especially those of the person who shall not be named!

ALTER ALERT SEND_OWL RESUME;

Alerts in Snowflake provide a powerful mechanism for proactive monitoring and notification of important events in your data. By setting up alerts, you can stay on top of critical conditions and respond promptly to any changes or occurrences that require attention. For example, alerts can be used to detect data drifts, unexpected missingness in certain fields, sudden changes in data patterns, or anomalies in data distribution.
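To confirm that an alert is actually firing, you can inspect its execution history with the ALERT_HISTORY table function. A minimal sketch:

-- Check recent executions of the SEND_OWL alert
SELECT *
FROM TABLE(INFORMATION_SCHEMA.ALERT_HISTORY())
WHERE name = 'SEND_OWL'
ORDER BY scheduled_time DESC
LIMIT 10
;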

🦉 Conclusion

In this blog post, we explored the various functionalities of Snowflake, such as databases, schemas, tables, views, materialized views, stages, and more. Each of these objects plays a critical role in organizing, managing, and analyzing data within Snowflake.

As demonstrated in this post, Snowflake proves to be a powerful ally in any data project, providing a versatile environment for end-to-end workflows. Now, I encourage you to delve into the details of these objects and more in the “Ungifted Amateur’s Guide to Snowflake” app. There, you can learn about dynamic tables, user-defined functions, data manipulation, tasks, streams, and other exciting features.

Embark on your own data adventure with Snowflake and unlock the potential of your data projects!

Stay tuned for more exciting content and insights from the Ungifted Amateur!

I am incredibly grateful to my amazing Snowflake mentor, Sang Hai, who is always sharing his extensive knowledge about the exciting and innovative features of Snowflake and guiding me in implementing them in my work. I would also like to express my heartfelt appreciation to Kathryn Reck Harris and Varun Chavakula, my awesome Snowflake buddies, who always share the exhilarating ride of exploring Snowflake and provide invaluable insights and support.

Lastly, a special thanks to Jessica Smith, a true champion of the Streamlit @ Snowflake platform, for always encouraging me to create fun things in Streamlit and for her continuous support within the vibrant Streamlit community.
