HOW TO UPLOAD DATA FROM AWS S3 TO SNOWFLAKE IN A SIMPLE WAY


These days, importing data from a source into a destination is usually a trivial task. With a proper tool, you can easily upload and transform a complex set of data in your data processing engine. For example, to add data to the Snowflake cloud data warehouse, you may use ELT or ETL tools such as Fivetran, Alooma, Stitch, or others. Snowflake itself also offers a Load Data Wizard to help you ingest data. These tools are advanced and sometimes require a learning curve and practice. Most importantly, these tools are very effective, so you will pay for that convenience. But what if you simply need to glance into a file and upload it for a preview? Perhaps you do not need the full feature set of a Swiss Army knife when a simple razor will do the job. With this in mind, I decided to take a lighter-weight approach and create a prototype that ingests data from your local PC or AWS S3. The application needs to know how to read a file, create a database table with the appropriate data types, and copy the data into the Snowflake data warehouse.

I started with CSV. I estimated the project would take half a day if I could find a proper library to convert a CSV structure into an SQL table. I spent a few hours googling, reading blogs and Stack Overflow, and checking different libraries in Java, Python, and C#. I was very surprised by my findings and could see that many others were also looking for such a solution. This post describes many different approaches to working with CSV files, starting with Python and special libraries, plus Pandas, plus PySpark, and still there was no perfect solution. This is just a simple project to show that it is possible to create your own CSV and Parquet 'importer'. It may not cover ALL (100%) of the scenarios in CSV, but we can improve it later. To run this application you need Java (a recent version) and a Snowflake account. Go here to get a free trial account.

Importing Data into Snowflake Data Warehouse

There are many ways to import data into Snowflake. One way is using the Snowflake Wizard. The wizard is a simple and effective tool, but has some limitations. I prefer an easier, “developer” approach like running a short command line and checking the result; others may prefer the wizard to walk them through the process.

Here are the processing steps for my project: point to a CSV or Parquet file, read the header, create a destination SQL table schema in the Snowflake database, and populate the table.

Immediately I was faced with a simple question: should a file have a header? I am guessing detecting this automatically is an ML/AI task, and maybe resolving it will land in my queue in the future. For now, it is a simple decision: I let the application know about the header's presence through the config file. I also added a line to the config for a default column name to use when there is no header in the file. Since Snowflake supports full ANSI SQL and SQL types, for this demonstration I can process the following types: INTEGER, FLOAT, BOOLEAN, DATE, VARCHAR, TIMESTAMP. In the configuration file, you may also specify how many rows to sample when evaluating the data types.
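As an illustration, the header-related settings might be read from the config roughly like this (a minimal sketch; the property names hasHeader, defaultColumnName, and rowsToEvaluate are hypothetical, not taken from the project):

// Hypothetical config keys for header handling and type evaluation
// (needs java.util.Properties and java.io.FileInputStream; load() throws IOException)
Properties config = new Properties();
config.load(new FileInputStream("config.properties"));
boolean hasHeader = Boolean.parseBoolean(config.getProperty("hasHeader", "true"));
String defaultColumnName = config.getProperty("defaultColumnName", "COLUMN"); // used when the file has no header
int rowsToEvaluate = Integer.parseInt(config.getProperty("rowsToEvaluate", "100")); // rows sampled to infer types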

Executing the Application

The application connects to your Snowflake account, reading all the properties from the config file. It then creates a table in your selected database/schema, using the file name as the table name. Next, it copies the file to an intermediate location, a temporary stage.
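Under the hood, the connection and table-creation steps might look roughly like this (a sketch, assuming the Snowflake JDBC driver is on the classpath; the connection keys below and the detectedColumnsSql string are illustrative assumptions, not the project's actual names):

// Needs java.sql.Connection, java.sql.DriverManager, java.sql.Statement, java.util.Properties
Properties props = new Properties();
props.put("user", connParams.get("user"));
props.put("password", connParams.get("password"));
props.put("db", connParams.get("database"));
props.put("schema", connParams.get("schema"));
props.put("warehouse", connParams.get("warehouse"));
Connection connection = DriverManager.getConnection(
        "jdbc:snowflake://" + connParams.get("account") + ".snowflakecomputing.com/", props);
Statement statement = connection.createStatement();
// Table named after the file, with columns and types detected from the sampled rows:
statement.execute("create or replace table " + paramsInCSV.get("fileName") + " (" + detectedColumnsSql + ");");

With the connection open, the file is uploaded to the staged location with a PUT command: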

statement.execute("put file://" + paramsInCSV.get("fileNameWithPath") + " @~/staged");

If your file has a date column, a date format option is added to the COPY command:

statement.execute("copy into " + paramsInCSV.get("fileName") + " from @~/staged file_format = (type = csv field_delimiter = '" + paramsInCSV.get("delimiter") + "' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY='\"' NULL_IF = ('','NULL', 'null', '\\N') EMPTY_FIELD_AS_NULL = true " + sDateFormatSql + ");");

This command is very important and does most of the work in the import process: it applies the right delimiter, skips the header row, strips the enclosing quotes from quoted values when needed, defines which values are treated as NULL, and specifies the date format.
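For reference, the sDateFormatSql fragment might be built roughly like this (a sketch; the hasDateColumn flag and the dateFormat config key are hypothetical):

// Hypothetical construction of the optional date-format clause used above;
// DATE_FORMAT is a standard Snowflake CSV file-format option
String sDateFormatSql = hasDateColumn
        ? " date_format = '" + paramsInCSV.get("dateFormat") + "'"
        : "";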

To copy CSV or CSV.gz data from AWS S3, we need to create an external stage that points to S3 with credentials:

statement.execute("create or replace stage my_csv_stage url = 's3://" + paramsInCSV.get("bucketName") + "' credentials = (aws_key_id='" + connParams.get("accessKey") + "' aws_secret_key='" + connParams.get("secretKey") + "');");
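Once the external stage exists, the COPY command is essentially the same as before, just pointing at @my_csv_stage instead of the internal stage (a sketch, reusing the file-format options shown above):

statement.execute("copy into " + paramsInCSV.get("fileName") + " from @my_csv_stage file_format = (type = csv field_delimiter = '" + paramsInCSV.get("delimiter") + "' skip_header = 1);");
// in practice you would also restrict the load to the specific file, e.g. with a files = (...) or pattern = '...' option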

Getting Data from a Parquet File

To get the columns and types from a Parquet file, we simply connect to the S3 bucket. The easiest way to get the schema from a Parquet file is to use the ParquetFileReader class. I have seen a few projects use Spark to get the file schema. That is possible but very inefficient, since we plan to run the application from the desktop and not from AWS.
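A minimal sketch of that step, assuming parquet-hadoop and hadoop-aws are on the classpath and the file is addressed through the s3a filesystem (the flat, primitive-only schema is also an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.Type;

Configuration conf = new Configuration();
conf.set("fs.s3a.access.key", connParams.get("accessKey"));
conf.set("fs.s3a.secret.key", connParams.get("secretKey"));
Path path = new Path("s3a://" + paramsInCSV.get("bucketName") + "/" + paramsInCSV.get("fileName"));
try (ParquetFileReader reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
    MessageType schema = reader.getFooter().getFileMetaData().getSchema();
    for (Type field : schema.getFields()) {
        // asPrimitiveType() assumes a flat schema with no nested groups
        System.out.println(field.getName() + " : " + field.asPrimitiveType().getPrimitiveTypeName());
    }
}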

Now, a few words about type conversion. Very often the string values in Parquet files are stored as 'Binary' with a UTF8 option; I decided to detect that and cast it to VARCHAR. Another trick is that INT96 often stores a timestamp value, so the application automatically casts that as well. BYTE_ARRAY is cast to the Snowflake BINARY type. All other Parquet types are compatible as-is; a small mapping sketch follows the list below.

Here are the Parquet types:

BOOLEAN: 1 bit boolean
INT32: 32 bit signed ints
INT64: 64 bit signed ints
INT96: 96 bit signed ints
FLOAT: IEEE 32-bit floating point values
DOUBLE: IEEE 64-bit floating point values
BYTE_ARRAY: arbitrarily long byte arrays
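The conversion rules described above could be expressed roughly like this (a sketch; note that in the parquet-mr Java API the physical BYTE_ARRAY type appears as PrimitiveTypeName.BINARY):

import org.apache.parquet.schema.OriginalType;
import org.apache.parquet.schema.PrimitiveType;

static String toSnowflakeType(PrimitiveType field) {
    boolean isUtf8String = field.getOriginalType() == OriginalType.UTF8; // 'Binary' with a UTF option
    switch (field.getPrimitiveTypeName()) {
        case BOOLEAN: return "BOOLEAN";
        case INT32:
        case INT64:   return "INTEGER";
        case INT96:   return "TIMESTAMP";   // INT96 usually carries a legacy timestamp
        case FLOAT:
        case DOUBLE:  return "FLOAT";
        case BINARY:  return isUtf8String ? "VARCHAR" : "BINARY";
        default:      return "BINARY";      // FIXED_LEN_BYTE_ARRAY and anything else
    }
}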

You may find more details on how to merge Parquet with Snowflake here. For a Parquet file, we need to specify the column names and casts. Two tips here: first, SQL is case-insensitive, but the column names in the query must match the names exactly as they are specified in the Parquet file. Second, the cast can sometimes be skipped. Here is a sample COPY command to upload data from an S3 Parquet file:

COPY INTO userdata1 FROM (SELECT $1:registration_dttm::TIMESTAMP, $1:id::INTEGER,  $1:first_name::VARCHAR, $1:last_name::VARCHAR, $1:email::VARCHAR,  $1:gender::VARCHAR, $1:ip_address::VARCHAR, $1:cc::VARCHAR, $1:country::VARCHAR, $1:birthdate::VARCHAR, $1:salary::DOUBLE,  $1:title::VARCHAR, $1:comments::VARCHAR FROM @my_csv_stage/userdata1.parquet)  on_error = 'skip_file' file_format = (type = parquet);

You may validate the statement for errors before executing it with validation_mode = 'RETURN_ERRORS'; more info here.
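For example, the CSV load shown earlier could be dry-run first (a sketch; VALIDATION_MODE applies to plain COPY statements, not to loads that transform data with a SELECT, so it is shown here on the CSV COPY):

statement.execute("copy into " + paramsInCSV.get("fileName") + " from @~/staged file_format = (type = csv field_delimiter = '" + paramsInCSV.get("delimiter") + "' skip_header = 1) validation_mode = 'RETURN_ERRORS';");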

And the last step is to DROP the stage. Now you may check your table in Snowflake.
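The cleanup might look like this (a sketch; REMOVE clears the uploaded file from the internal stage path, and DROP STAGE removes the external stage created earlier):

statement.execute("remove @~/staged;");
statement.execute("drop stage if exists my_csv_stage;");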

This is a prototype, so many things could be added to improve it and make the project more 'pretty': adding ORC support, getting a list of Parquet files to process them all or specifying a file name pattern, adding support for complex types, splitting and compressing huge local CSV files, adding proper logging, and many, many more.

Here is the project to download. Take a look and let me know what you think.
