Snowflake Cliff Notes: COPY INTO

(This is one of a series of quick read articles outlining some functionality in Snowflake.)

Zakary LeBlanc
Apr 19, 2023

COPY INTO is one of my favorite and most used Snowflake SQL commands for loading and unloading data. It comes in two sister forms:

  1. COPY INTO <table> — Load data into a table from a staged (internal or external) file
  2. COPY INTO <location> — Unload data from a table into one or more staged (internal or external) files

Let’s take a look at the definition provided in the Snowflake Documentation:

COPY INTO <table>:

Loads data from staged files to an existing table. The files must already be staged in one of the following locations:

  1. Named internal stage (or table/user stage). Files can be staged using the PUT command.
  2. Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
  3. External location (Amazon S3, Google Cloud Storage, or Microsoft Azure).

COPY INTO <location>:

Unloads data from a table (or query) into one or more files in one of the following locations:

  1. Named internal stage (or table/user stage). The files can then be downloaded from the stage/location using the GET command.
  2. Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
  3. External location (Amazon S3, Google Cloud Storage, or Microsoft Azure).

This is kind of a big deal… you can connect to blob storage (a.k.a. your data lake) and pull information into a table in Snowflake with a simple SQL statement. No pipelines need to be built and the data does not need to pass through any intermediary servers to get into Snowflake. This means you are less likely to run into random network irregularities that can cause pipelines to break.
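For example, assuming a storage integration named MY_S3_INTEGRATION and a hypothetical bucket path, a load straight from S3 might look something like this (the same statement works against a named external stage that already wraps the bucket and credentials):

COPY INTO MY_TABLE
FROM 's3://my-bucket/daily-exports/'
STORAGE_INTEGRATION = MY_S3_INTEGRATION
FILE_FORMAT = (TYPE = 'CSV');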

The COPY INTO <location> command can automatically split your table’s data into separate files based on a size limitation or it can create a single large file. Unless you have a reason not to, I highly recommend allowing Snowflake to split the data into multiple files. It is crazy fast. For example, I have unloaded 200+ million rows of data in around 2 minutes this way to an external stage using an X-Small warehouse.
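As a rough sketch, the split behavior is controlled with the MAX_FILE_SIZE (in bytes) and SINGLE copy options; the size below is just an illustrative value:

-- split the unload into multiple files of roughly 100 MB each
COPY INTO @MY_STAGE/unload_
FROM MY_TABLE
MAX_FILE_SIZE = 104857600;

-- or force a single (potentially very large) file
COPY INTO @MY_STAGE/unload_
FROM MY_TABLE
SINGLE = TRUE;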

The reverse is also true: it is extremely fast to load multiple files into a target table using the COPY INTO <table> command. Snowflake opens multiple threads to process the files in parallel. Best practice is to keep your files between 100–250 MB in size; larger files run into single-thread performance constraints.

COPY INTO Syntax

So let’s take a look at the simplest form of the two statements:

COPY INTO MY_TABLE 
FROM @MY_STAGE;

COPY INTO @MY_STAGE
FROM MY_TABLE; -- or FROM (SELECT … FROM MY_TABLE) to unload a query result

Simple, right? Of course, these statements only work if you have a number of prerequisites set up. To take full advantage of these commands it is best to have a stage (either internal or external) and one or more file formats created. The full setup of a stage, and how to push your files to it, is beyond the scope of this article.
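For orientation only, here is a minimal sketch of an internal stage plus an upload via the PUT command (the local file path is hypothetical):

CREATE STAGE MY_STAGE;

-- PUT runs from a client such as SnowSQL, not from a worksheet
PUT file:///tmp/my_data.csv @MY_STAGE;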

While you can specify the file format details as part of the COPY INTO command, if you intend to use the same format for multiple loads it makes more sense to store these details in a File Format. This format can be directly associated with the stage itself, so no format is even needed in the COPY INTO command. This can make things easier if you only expect to receive files in a single format. Since the COPY INTO commands are closely related to FILE FORMATs, let's take a look at how to create one.
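For example, assuming the CSV file format defined in the next section already exists, attaching it to the stage might look like this; afterwards a bare COPY INTO MY_TABLE FROM @MY_STAGE picks up the format automatically:

-- attach the named file format to an existing stage
ALTER STAGE MY_STAGE SET FILE_FORMAT = (FORMAT_NAME = 'CSV');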

FILE FORMAT

Use the CREATE FILE FORMAT command to have Snowflake store your file format options for repeat use. Here's one I currently use for generic CSV files with no header row (change SKIP_HEADER to 1 to handle CSVs with a header row):

CREATE FILE FORMAT CSV
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'  -- octal escape for the double quote character
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'  -- octal escape for the backslash character
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('NULL');

Here is how the above file format would fit into a COPY INTO statement:

COPY INTO MY_TABLE
FROM @MY_STAGE
FILE_FORMAT = (FORMAT_NAME = 'CSV');

File Formats are not constrained to CSV. Snowflake also supports JSON, XML, Avro, ORC, and Parquet formats.
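For instance, a minimal JSON file format might look like the following (the name MY_JSON is just a placeholder; STRIP_OUTER_ARRAY unwraps a top-level JSON array so each element lands in its own row):

CREATE FILE FORMAT MY_JSON
TYPE = 'JSON'
STRIP_OUTER_ARRAY = TRUE;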

Caveats and Examples

As with anything, there are a few important caveats.

  1. The COPY INTO <table> command only works with existing tables. Additionally, in the basic form, the file's fields need to match, in order, the fields in the table (a transform query can reorder them; see the sketch after this list).
  2. A COPY INTO <table> command loading data into a table with non-text data types (e.g. NUMBER, DATE, etc.) will error if blank values are encountered (e.g. you cannot insert "" into a DATE field).
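The first caveat can be softened with a transform during the load: reference the staged file's columns positionally ($1, $2, …) and select them in the order the target table expects. A rough sketch, assuming a three-column file whose first two columns arrive swapped relative to MY_TABLE:

COPY INTO MY_TABLE
FROM (SELECT $2, $1, $3 FROM @MY_STAGE)
FILE_FORMAT = (FORMAT_NAME = 'CSV');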

With this in mind, the simplest design pattern I have found is to create a staging or intake table with only VARCHAR columns to avoid type-mismatch errors during the load. For example:

CREATE OR REPLACE TABLE LOAD_TEST (
COL1 VARCHAR(100),
COL2 VARCHAR(100),
COL3 VARCHAR(100),
...
COLN VARCHAR(100)
);

On top of this table we can then create a view to expose the table’s appropriate data types:

CREATE OR REPLACE VIEW LOAD_TEST_V AS (
SELECT TRY_TO_NUMBER(COL1, 38, 0) AS COL1,
CAST(NULLIF(COL2, '') AS VARCHAR(50)) AS COL2,
TRY_TO_DATE(COL3) AS COL3,
...
CAST(NULLIF(COLN, '') AS VARCHAR(75)) AS COLN
FROM LOAD_TEST);
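Loading into this pattern is then just the basic statement from earlier, pointed at the all-VARCHAR table, while downstream consumers query the view:

COPY INTO LOAD_TEST
FROM @MY_STAGE
FILE_FORMAT = (FORMAT_NAME = 'CSV');

SELECT * FROM LOAD_TEST_V;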

Unloading data from Snowflake is relatively painless. One thing to keep in mind: if you allow Snowflake to split the table data into multiple files, each file will have X_X_X appended to its name, where each X is an incrementing number. These numbers correspond to <partition>_<thread>_<file sequence>.

Here is an example that unloads data:

COPY INTO @MY_STAGE/unload_data_ 
FROM MY_TABLE
FILE_FORMAT = (FORMAT_NAME = 'CSV');

This example creates files prefixed with "unload_data_" (e.g. unload_data_0_0_0.csv, unload_data_0_0_1.csv, etc.) in the stage.

How does this compare to Snowpipe?

In case you are not already familiar, Snowpipe allows data to be automatically loaded into a target table as soon as files are available on a stage. A Snowpipe is an object associated with a COPY statement, essentially an automated COPY INTO task. It takes advantage of Snowflake's serverless architecture, which reduces the administrative overhead of managing virtual warehouses by letting Snowflake manage the compute for you.
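As a rough sketch (object names are placeholders, and auto-ingest also requires an external stage wired up to cloud event notifications), a pipe wrapping the earlier COPY statement looks something like this:

CREATE PIPE MY_PIPE
AUTO_INGEST = TRUE
AS
COPY INTO MY_TABLE
FROM @MY_STAGE
FILE_FORMAT = (FORMAT_NAME = 'CSV');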

Now this may sound great to some of you, but it comes at a price. Snowflake has a vested interest in helping you burn through your credits, and allowing them to manage anything compute-related is asking for surprise bills.

Conclusion

The COPY INTO command is a great way to move data into and out of Snowflake. The above article only dips its toes into the statement and its many options and permutations. I would recommend a deeper dive into these options if you are planning to do any serious loading/unloading.

Thank you for reading! Please leave some feedback and follow if you enjoy talking about Snowflake or SQL topics.

Zakary LeBlanc

Bloom Analytics founder and chief data architect. Enthusiastic about modern data platforms (especially Snowflake) and passionate about writing good SQL.