Loading data using Snowflake

Natana Pedroso
Indicium Engineering

--

In this article, we explore the options for loading data into Snowflake and the data formats that it supports.

At the beginning of a data project, the data is not always ready and available, and it often needs to be uploaded to the cloud first. In Snowflake, the location where data files are held in cloud storage is known as a stage. Stages are classified as External Stages (hosted by an external provider such as AWS, GCP, or Azure) and Internal Stages (managed by Snowflake).

External Stages

When the data is not yet in Snowflake, it is necessary to create an External Stage, which tells Snowflake where the data is made available. Assuming the data is hosted on AWS, the creation process would look like this:

Creating an External Stage in AWS S3.
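
As a minimal sketch, assuming a storage integration has already been configured (the stage name, bucket URL, and integration name below are placeholders):

    -- Placeholder names; assumes the storage integration already exists
    CREATE OR REPLACE STAGE my_s3_stage
      STORAGE_INTEGRATION = my_s3_integration
      URL = 's3://my-bucket/raw-data/'
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);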

Once the data files are available at the location the External Stage points to, load them into a Snowflake table by executing the COPY INTO command, using the FROM clause to designate the external stage as the data source.
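
Assuming the stage and a target table already exist, the load might be sketched as follows (all names are placeholders):

    -- Load all matching CSV files from the external stage into a table
    COPY INTO raw_orders
      FROM @my_s3_stage
      PATTERN = '.*orders.*[.]csv';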

Internal Stages

When the data is hosted in Snowflake itself, there are three internal stage types to consider. The first is the User Stage; each user is automatically allocated one for storing their files.

This stage holds files staged and managed by a single user, although those files can be loaded into multiple tables. User stages cannot be altered or dropped.

For every table created in Snowflake, there is an available Table Stage. This stage stores files staged and managed by one or more users, but the files can be loaded only into the table the stage belongs to. Like user stages, table stages cannot be altered or dropped.

A Named Stage is a database object created within a schema using the CREATE STAGE command. It can store files managed by one or more users, and those files can be loaded into one or more tables.
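
A minimal sketch of creating one (the stage name and file format are illustrative):

    -- Named internal stage with a default file format for its files
    CREATE STAGE my_named_stage
      FILE_FORMAT = (TYPE = 'JSON');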

Internal stages are referenced with the LIST command, using a different prefix for each stage type:

Internal Stages reference commands.
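
As a sketch, each internal stage type is addressed with a different prefix (my_table and my_named_stage are placeholders):

    LIST @~;               -- user stage of the current user
    LIST @%my_table;       -- table stage of the table my_table
    LIST @my_named_stage;  -- named stage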

Bulk Loading

In Snowflake, data loading can be achieved through various methods, each suited to specific use cases. The most common loading types are bulk loading and continuous loading.

Bulk loading ingests batches of data from files already stored in cloud storage, or transfers data files from a local machine to an internal Snowflake stage before loading the data into tables with the COPY INTO command.
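
For example, a local file can be staged with the PUT command, which runs from a client such as SnowSQL rather than the web interface (the path and stage name are placeholders):

    -- Upload a local file to a named internal stage
    PUT file:///tmp/orders.csv @my_named_stage;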

It is possible to load data in various file formats, such as CSV, JSON, Avro, Parquet, and ORC, directly into tables using the COPY INTO command. This method is efficient for large-scale data loading.
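
The file format can be declared on the stage or in the COPY statement itself; a sketch for the CSV file staged above (names are placeholders):

    -- Load staged CSV files into a table
    COPY INTO raw_orders
      FROM @my_named_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);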

Snowflake also supports transforming data while loading it into a table with the COPY INTO command, including column reordering, column omission, casts, and truncating text strings that exceed the target column length.
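
A sketch of such a transformation, using positional references ($1, $2, ...) to the columns of the staged files (all names are placeholders):

    -- Reorder columns, cast, and truncate oversized strings during the load
    COPY INTO orders (order_id, customer_name)
      FROM (
        SELECT $2::INTEGER, $1
        FROM @my_named_stage
      )
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
      TRUNCATECOLUMNS = TRUE;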

Bulk Loading costs.

Continuous Loading using Snowpipe

This method enables near-real-time data ingestion by continuously loading data from files stored in cloud storage, such as Google Cloud Storage, Amazon S3, or Azure Blob Storage, into Snowflake tables.

It is designed for small volumes of data (micro-batches) and is particularly beneficial when quick access to fresh data is required, as in real-time analytics and data-driven decision-making.

The COPY statement within a pipe definition supports the same transformation options as bulk loading.
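
A minimal Snowpipe sketch, assuming the external stage from earlier and event notifications configured on the bucket (names are placeholders):

    -- Pipe that loads new files automatically as they land in the stage
    CREATE PIPE orders_pipe
      AUTO_INGEST = TRUE
      AS
      COPY INTO raw_orders
        FROM @my_s3_stage
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);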

Continuous Loading using Snowpipe costs.

Continuous Loading using Snowpipe Streaming

In this case, the Snowpipe Streaming API writes rows of data directly into tables, with no need to stage files first. This results in lower latency and lower cost for loading data of any size, so organizations can achieve near-real-time analytics and make timely decisions based on the latest streaming data.

Some examples of use cases are monitoring IoT devices, financial transactions, and social media data analysis.

Continuous Loading using Snowpipe Streaming costs.

By learning the concepts presented in this article and understanding how to bring in and query the data, you can determine the next steps for transforming the data within cloud storage.

Snowflake external tables serve as a connection point between Snowflake and externally stored data, facilitating easy access and streamlined data management without the need to physically relocate the data.
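
A minimal sketch over the external stage from earlier (names and path are placeholders); without explicit column definitions, the data is exposed through the VARIANT column named VALUE:

    -- External table over Parquet files in cloud storage; data stays in place
    CREATE EXTERNAL TABLE ext_orders
      LOCATION = @my_s3_stage/orders/
      FILE_FORMAT = (TYPE = 'PARQUET')
      AUTO_REFRESH = TRUE;  -- assumes event notifications are configured

    -- Query a field from the raw VARIANT column
    SELECT value:order_id FROM ext_orders;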

Each of these data loading methods has its advantages and is indicated for different scenarios depending on factors such as data size, frequency of updates, real-time requirements, and infrastructure.

Get to know Indicium

Would you like to learn more about the applications of Snowflake? Get to know Indicium! We have the best data-based solutions.

And if you want to learn how to use Snowflake, check out Indicium Academy.

There, we offer courses that will enhance your data routine, as we always aim to provide instruction on the best tools and most modern skills.

In the Analytics Engineering Program, for example, you’ll learn to use Snowflake along with other data tools like dbt and Power BI.

Visit the Indicium and Academy websites to explore our solutions and products.

