Snowflake Snowpro Core Preparation Part 13 — Data Loading & COPY INTO, Snowpipe, Data Unloading & COPY INTO

Ganapathy Subramanian.N
12 min read · Jun 5, 2024


In this blog, we are going to discuss data loading into Snowflake using PUT & COPY INTO <table>, copy options, data loading best practices, Snowpipe, and data unloading from Snowflake using COPY INTO <location> & GET.

Data Loading Types:

Batch processing — Automatically processing a large amount of data (bulk) that has accumulated over a specified period.

Micro-batch processing — Near-real-time data loading. Small amounts of data are accumulated and loaded at roughly 60-second intervals; the time between loads can be greater than 60 seconds but is usually no more than a few minutes.

Stream/Real-time/Continuous processing — Processing data in real time. Streaming tools such as Kafka, together with Snowpipe, can be used to stream data into Snowflake.

Data Loading Tools:

Direct Loading — INSERT INTO commands.

Snowflake’s Web UI — For loading small files — Less than 50 MB.

SnowSQL — PUT & COPY INTO commands.

Snowpipe — For continuous data loading. Can be used in conjunction with Kafka topics as well.

Data Pipeline — Complex transformations. Snowpipe -> Staging -> Streams & Tasks -> Table.

Data Loading in Snowflake (All Options):

PUT Command:

● Uploads data files from a local file system to one of the following Snowflake stages:

> A named internal stage.

> A specified table’s internal stage.

> The current user’s internal stage.

● PUT does not support uploading files to external stages.

● Duplicate files are ignored: an uploaded file with the same name as an already staged file and unchanged content is skipped.

Upload the file mydata.csv to the internal stage my_int_stage
Use a wildcard to upload multiple files to the table stage of the orderstiny_ext table
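A minimal sketch of the two commands described by these captions; the local path /tmp/data is assumed for illustration.

```sql
-- Upload mydata.csv from the local file system to the named internal stage my_int_stage.
PUT file:///tmp/data/mydata.csv @my_int_stage;

-- Use a wildcard to upload multiple order files to the table stage of orderstiny_ext.
PUT file:///tmp/data/orders_*.csv @%orderstiny_ext;
```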

COPY INTO <table>:

● Loads data from staged files to an existing table. The files must already be staged in one of the following locations:

> Named internal stage (or table/user stage). Files can be staged using the PUT command.

> Named external stage that references an external location (AWS S3, GCS, or Azure Blob).

> External location (AWS S3, GCS, or Azure Blob).

● Snowflake supports transforming data while loading it into a table using the COPY command:

> Column reordering.

> Column omission.

> Casts.

> Truncating text strings that exceed the target column length.

● Requires a virtual warehouse.

COPY INTO <table> Examples:
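Two hedged sketches (the table, stage, and column names are placeholders): a straight load from a named internal stage, and a load with a simple transformation (column reordering, column omission, and a cast).

```sql
-- Load all staged CSV files from a named internal stage into an existing table.
COPY INTO mytable
FROM @my_int_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- Load with a transformation: pick columns by position, reorder them, and cast while loading.
COPY INTO mytable (id, amount)
FROM (SELECT $2, $1::NUMBER(10,2) FROM @%mytable);
```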

COPY Options:

Examples:

Reloading using FORCE
Purge
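Minimal sketches of these two copy options, reusing the same placeholder table and stage names as above.

```sql
-- FORCE = TRUE ignores the load metadata and reloads files that were already loaded
-- (this can duplicate rows in the target table).
COPY INTO mytable
FROM @my_int_stage
FILE_FORMAT = (TYPE = CSV)
FORCE = TRUE;

-- PURGE = TRUE deletes files from the stage after they are loaded successfully.
COPY INTO mytable
FROM @my_int_stage
FILE_FORMAT = (TYPE = CSV)
PURGE = TRUE;
```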

COPY INTO <table> VALIDATION_MODE:

● String (constant) that instructs the COPY command to validate the data files instead of loading them into the specified table.

● COPY command tests the files for errors but does not load them.

● The command validates the data to be loaded and returns results based on the validation option specified.

Examples:

COPY command in validation mode to see all errors

COPY command in validation mode for a specified number of rows:
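Sketches of both validation runs (placeholder names again); neither statement loads any data.

```sql
-- Validate the staged files and return all errors found.
COPY INTO mytable
FROM @my_int_stage
FILE_FORMAT = (TYPE = CSV)
VALIDATION_MODE = 'RETURN_ERRORS';

-- Validate only the first 10 rows of each file and return those rows.
COPY INTO mytable
FROM @my_int_stage
FILE_FORMAT = (TYPE = CSV)
VALIDATION_MODE = 'RETURN_10_ROWS';
```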

Data Loading Best Practices:

File Sizing — The number of load operations that run in parallel cannot exceed the number of data files to be loaded.

> To optimize the number of parallel operations for a load, aim to produce compressed data files roughly 100–250 MB (or larger) in size.

> Aggregate smaller files to minimize the processing overhead for each file.

> Split larger files into a greater number of smaller files to distribute the load among the compute resources in an active warehouse.

> Semi-structured data — A VARIANT value has a maximum size of 16 MB of uncompressed data.

  • If the data exceeds 16 MB, enable the STRIP_OUTER_ARRAY file format option to remove the outer array structure.
  • Snowflake extracts a maximum of 200 elements per partition, per table.

> Snowpipe — A file size of 100–250 MB or larger offers a good balance between cost (i.e., resources spent on Snowpipe queue management and the actual load) and performance (i.e., load latency).

  • 1 Minute interval — Keeping the buffer interval setting at 60 seconds (the minimum value) helps avoid creating too many files or increasing latency.

Load Plan — Dedicate separate warehouses to loading and querying operations to optimize performance for each.

Staging Data — When staging regular data sets, partition the data into logical paths that include identifying details such as geographical location or other source identifiers, along with the date when the data was written.

> Organizing your data files by path lets you copy any fraction of the partitioned data into Snowflake with a single command.

> This allows you to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations.

Loading Data — Snowflake maintains load metadata for each table; this metadata expires after 64 days. If the LAST_MODIFIED date for a staged data file is 64 days old or less, the COPY command can determine its load status for the table and prevent reloading (and data duplication).

> To load files whose metadata has expired, set the LOAD_UNCERTAIN_FILES copy option to true.

> In a VARIANT column, NULL values are stored as a string containing the word “null,” not the SQL NULL value — Recommendation: set the file format option STRIP_NULL_VALUES to TRUE.

> If external software exports fields enclosed in quotes but inserts a leading space before the opening quotation character for each field, Snowflake reads the leading space rather than the opening quotation character as the beginning of the field — Use the TRIM_SPACE file format option to remove undesirable spaces during the data load.

Managing Regular Data Loads:

> Partitioning staged data files — When planning regular data loads such as ETL processes or regular imports of machine-generated data, it is important to partition the data in internal stages or external locations using logical, granular paths.

> Removing loaded data files — When data from staged files is loaded successfully, consider removing the staged files to ensure the data is not inadvertently loaded again (duplicated); this also reduces the number of files that COPY commands must scan.

> Files that were loaded successfully can be deleted from the stage during a load by specifying the PURGE copy option in the COPY INTO <table> command.

> Alternatively, after the load completes, use the REMOVE command to remove the files from the stage (see the sketch below).
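A short sketch tying the staging-path and cleanup recommendations together; the stage, path, and table names are made up for illustration.

```sql
-- Stage files under a logical path (source plus load date).
PUT file:///tmp/data/sales/*.csv @my_int_stage/sales/2024/06/05/;

-- Load only that partition of the stage; PURGE removes files that load successfully.
COPY INTO sales
FROM @my_int_stage/sales/2024/06/05/
FILE_FORMAT = (TYPE = CSV)
PURGE = TRUE;

-- Alternatively, clean up explicitly after the load completes.
REMOVE @my_int_stage/sales/2024/06/05/;
```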

All the details about data loading into Snowflake are available in this YouTube video:

Snowpipe

● Snowpipe enables loading data from files as soon as they’re available in a stage.

● Load data from files in micro-batches, making it available to users within minutes, rather than manually executing COPY statements on a schedule to load larger batches.

● Snowpipe uses a serverless compute model, so users can initiate loads of any size without managing a virtual warehouse.

● The data is loaded according to the COPY statement defined in a referenced pipe.

● A pipe is a named, first-class Snowflake object that contains a COPY statement used by Snowpipe.

● The COPY statement identifies the source location of the data files (i.e., a stage) and a target table.

● All data types are supported, including semi-structured data types such as JSON and Avro.

Automating Snowpipe using cloud messaging

> Automated data loads leverage event notifications for cloud storage to inform Snowpipe of the arrival of new data files to load.

> Snowpipe polls the event notifications from a queue.

> By using the metadata in the queue, Snowpipe loads the new data files into the target table in a continuous, serverless fashion based on the parameters defined in a specified pipe object.

● Automating Snowpipe using cloud messaging — Example:
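A minimal sketch of an auto-ingest pipe; the stage and table names are placeholders, and the cloud event notifications (for example, S3 event notifications to the pipe's queue) are assumed to be configured separately.

```sql
-- The pipe wraps the COPY statement that Snowpipe runs for each new file.
CREATE OR REPLACE PIPE mypipe
  AUTO_INGEST = TRUE
AS
  COPY INTO mytable
  FROM @my_ext_stage
  FILE_FORMAT = (TYPE = JSON);

-- SHOW PIPES displays the notification channel to configure on the cloud provider side.
SHOW PIPES;
```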

Calling Snowpipe REST endpoints

> The client app calls a public REST endpoint (insertFiles) with the name of a pipe object and a list of data filenames.

> If new data files matching the list are discovered in the stage referenced by the pipe object, they are queued for loading.

> Snowflake-provided compute resources load data from the queue into a Snowflake table based on parameters defined in the pipe.

Snowpipe vs. Bulk Loading:

Snowpipe Billing:

● Snowflake provides and manages the compute resources, automatically growing or shrinking capacity based on the current Snowpipe load.

● Accounts are charged based on their actual compute resource usage, unlike virtual warehouses, which consume credits whenever they are active.

● Snowflake tracks the resource consumption of loads for all pipes in an account, with per-second/per-core granularity, as Snowpipe actively queues and processes data files.

● Per-core refers to the physical CPU cores in a compute server.

● The utilization recorded is then translated into familiar Snowflake credits, which are listed on the bill for your account.

● In addition to resource consumption, an overhead is included in the utilization costs charged for Snowpipe — 0.06 credits per 1,000 files.

● Estimating Snowpipe charges — Given the number of factors that can differentiate Snowpipe loads, it is very difficult for Snowflake to provide sample costs. Experiment by performing a typical set of loads to estimate future charges.

● Snowpipe utilization can be viewed using:

> PIPE_USAGE_HISTORY table function (in the Information Schema).

> PIPE_USAGE_HISTORY view (in Account Usage).

Snowpipe cost history over the last 30 days, broken out by day and by pipe
Average daily credits consumed by Snowpipe, grouped by week over the last year — helps to identify anomalies
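Hedged sketches of the two queries, using the ACCOUNT_USAGE view mentioned above.

```sql
-- Snowpipe cost history over the last 30 days, broken out by day and pipe.
SELECT TO_DATE(start_time) AS usage_date,
       pipe_name,
       SUM(credits_used) AS credits_used
FROM snowflake.account_usage.pipe_usage_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;

-- Average daily credits, grouped by week over the last year, to spot anomalies.
WITH daily AS (
  SELECT TO_DATE(start_time) AS usage_date,
         SUM(credits_used) AS credits_used
  FROM snowflake.account_usage.pipe_usage_history
  WHERE start_time >= DATEADD(year, -1, CURRENT_TIMESTAMP())
  GROUP BY 1
)
SELECT DATE_TRUNC('week', usage_date) AS usage_week,
       AVG(credits_used) AS avg_daily_credits
FROM daily
GROUP BY 1
ORDER BY 1;
```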

Snowpipe Streaming API:

● Calling the Snowpipe Streaming API (“API”) prompts the low-latency loading of streaming data rows using the Snowflake Ingest SDK and your own managed application code.

● This API removes the need to create files to load data into Snowflake tables and enables the automatic, continuous loading of data streams into Snowflake as the data becomes available.

● This architecture results in lower load latencies with lower costs for handling real-time data streams.

● The API is intended to complement Snowpipe, not replace it.

● Use the Snowpipe Streaming API in streaming scenarios where data is streamed via rows (for example, Apache Kafka topics) instead of written to files.

● The API fits into an ingest workflow that includes an existing custom Java application that produces or receives records.

● API ingests rows through one or more channels.

● A channel represents a logical, named streaming connection to Snowflake for loading data into a table.

● A single channel maps to exactly one table in Snowflake; however, multiple channels can point to the same table.

● The Client SDK can open multiple channels to multiple tables; however, the SDK cannot open channels across accounts.

Snowpipe vs. Snowpipe Streaming:

Snowpipe Streaming API Kafka Example:

All the details about Snowpipe are available in this YouTube video:

Data Unloading:

● Snowflake supports bulk export (i.e., unload) of data from a database table into flat, delimited text files.

Bulk Unloading process:

> Step 1: Use the COPY INTO <location> command to copy the data from the Snowflake database table into one or more files in a Snowflake or external stage.

> Step 2: Download the file from the stage –

  • From a Snowflake stage, use the GET command to download the data file(s).
  • From cloud storage, use the interfaces/tools provided by the cloud service provider to get the data file(s).

Bulk unloading using queries:

> Snowflake supports specifying a SELECT statement instead of a table in the COPY INTO <location> command.

> The results of the query are written to one or more files as specified in the command and the file(s) are stored in the specified location (internal or external).
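A sketch of a query-based unload to a named stage; the stage, table, and column names are placeholders.

```sql
-- Unload the result of a query as gzip-compressed CSV files under the result/ path.
COPY INTO @my_unload_stage/result/
FROM (SELECT id, name, amount FROM mytable WHERE amount > 100)
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);
```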

Bulk unloading into single or multiple files:

> The COPY INTO <location> command provides a copy option (SINGLE) for unloading data into a single file or multiple files.

> When unloading data into multiple files, use the MAX_FILE_SIZE copy option to specify the maximum size of each file created.
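Sketches of both variants with the same placeholder names; MAX_FILE_SIZE is specified in bytes.

```sql
-- Unload everything into a single file.
COPY INTO @my_unload_stage/single/mydata.csv.gz
FROM mytable
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
SINGLE = TRUE;

-- Unload into multiple files of at most ~50 MB each.
COPY INTO @my_unload_stage/multi/
FROM mytable
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
SINGLE = FALSE
MAX_FILE_SIZE = 52428800;
```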

Partitioned data unloading:

> The COPY INTO <location> command includes a PARTITION BY copy option for partitioned unloading of data to stages.

> Transform data for output to a data lake.

> Partitioning unloaded data into a directory structure in cloud storage can increase the efficiency with which third-party tools consume the data.
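A sketch of a partitioned unload; the stage name, table, and sale_date column are assumptions for illustration.

```sql
-- Write files into a date=YYYY-MM-DD directory structure on the stage.
COPY INTO @my_unload_stage/sales/
FROM sales
PARTITION BY ('date=' || TO_VARCHAR(sale_date, 'YYYY-MM-DD'))
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);
```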

Data Unloading From Snowflake(All Options):

COPY INTO <location>:

● Unloads data from a table (or query) into one or more files in one of the following locations.

> Named internal stage (or table/user stage). The files can then be downloaded from the stage/location using the GET command.

> Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).

> External location (Amazon S3, Google Cloud Storage, or Microsoft Azure).

COPY INTO <location> Examples:

COPY Options:

GET Command:

● Downloads data files from one of the following internal stage types to a local directory or folder on a client machine:

> Named internal stage.

> Internal stage for a specified table.

> Internal stage for the current user.

● Use this command to download data files after unloading data from a table onto a Snowflake stage using the COPY INTO <location> command.

● GET does not support the following actions:

> Downloading files from external stages. To download files from external stages, use the utilities provided by your cloud service.

> Downloading multiple files with divergent directory paths. The command does not preserve stage directory structure when transferring files to your client machine.

● The command cannot be executed from the Worksheets page in the Snowflake web interface.

● The command does not rename files.

Examples:

Download all files in the stage for the mytable table to the /tmp/data local directory
Download files from the myfiles path in the stage for the current user to the /tmp/data local directory
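The two GET commands described by these captions, assuming /tmp/data as the local target directory.

```sql
-- Download all files from the table stage for mytable.
GET @%mytable file:///tmp/data/;

-- Download files under the myfiles path in the current user's stage.
GET @~/myfiles file:///tmp/data/;
```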

Unloading Relational Table To JSON:

● Use the OBJECT_CONSTRUCT function combined with the COPY command to convert the rows in a relational table to a single VARIANT column and unload the rows into a file.
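A minimal sketch, assuming a table mytable with id, name, and order_date columns and a named stage my_unload_stage.

```sql
-- Build one JSON object per row and unload the result as JSON files.
COPY INTO @my_unload_stage/json/
FROM (
  SELECT OBJECT_CONSTRUCT('id', id, 'name', name, 'order_date', order_date)
  FROM mytable
)
FILE_FORMAT = (TYPE = JSON);
```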

Unloading Relational Table To Parquet:

● Unload data in a relational table to a multi-column Parquet file by using a SELECT statement as input to the COPY statement.

● The SELECT statement specifies the column data in the relational table to include in the unloaded file.

● Use the HEADER = TRUE copy option to include the column headers in the output files.
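A minimal sketch with the same placeholder names.

```sql
-- Unload selected columns to multi-column Parquet files, keeping the column names.
COPY INTO @my_unload_stage/parquet/
FROM (SELECT id, name, amount FROM mytable)
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE;
```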

All the details about Snowflake’s data unloading are available in this YouTube video:

Photo Courtesy:

https://docs.snowflake.com

Book — Snowflake: The Definitive Guide

Reference Links:

Data loading to Snowflake (end to end):

https://docs.snowflake.com/en/guides-overview-loading-data

PUT:

https://docs.snowflake.com/en/sql-reference/sql/put

COPY INTO:

https://docs.snowflake.com/en/sql-reference/sql/copy-into-table

Data loading best practices:

https://docs.snowflake.com/en/user-guide/data-load-considerations

Snowpipe:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-intro

Snowpipe Auto Ingest:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto

Snowpipe REST Endpoint:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-rest-overview

Snowpipe billing:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-billing

Snowpipe streaming:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-overview

Kafka with Snowpipe and Snowpipe Streaming:

https://docs.snowflake.com/en/user-guide/data-load-snowpipe-streaming-kafka

Data unloading:

https://docs.snowflake.com/en/user-guide/data-unload-overview

Copy into location:

https://docs.snowflake.com/en/sql-reference/sql/copy-into-location

GET:

https://docs.snowflake.com/en/sql-reference/sql/get

Catch you in the next blog — Snowflake Snowpro Core Preparation Part 14 — Estimation Functions, System Functions & Table Functions — https://medium.com/@trustngs/snowflake-snowpro-core-preparation-part-14-estimation-functions-system-functions-table-0f4e9f243b3b
