Simple task II: CSV structure from AWS S3 to Snowflake

More snow, more fun, more Snowflake.

I like snow, I love skiing, I admire Snowflakes. After my previous post I received a few valuable suggestions on how to improve the application, and there is a good chance I will implement them in version 1.3 or so. In the current version (something like 1.1.9…), the app can read a CSV file from an AWS S3 location, either plain or GZIPed. After reading the file header, it creates the destination table schema in a Snowflake database and populates the table with the data from the S3 file.

For development I use the Eclipse IDE with the AWS plugin installed. While setting up the plugin you add your credentials to get access to AWS services; the credentials are stored in a separate hidden location. If you build this application and later want to use it somewhere without the Eclipse IDE, you need to install the AWS CLI, which adds all the required tools and asks you to store your AWS credentials as well.
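
For illustration, here is a rough sketch of what reading the header from a plain or gzipped CSV object on S3 can look like, assuming the AWS SDK for Java v1; the class and method names are mine, not the project's, and the real app reads a configurable chunk of bytes rather than a single line:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3HeaderReader {

    // Sketch: read just the header line from a plain or gzipped CSV object on S3.
    public static String readHeader(String bucketName, String key, String region) throws IOException {
        // The builder picks up credentials from the standard AWS credentials file / provider chain.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion(region).build();

        try (InputStream raw = s3.getObject(bucketName, key).getObjectContent()) {
            // Fork: unwrap GZIP first if the object looks compressed, then treat it as plain CSV.
            InputStream csv = key.toLowerCase().endsWith(".gz") ? new GZIPInputStream(raw) : raw;
            BufferedReader reader = new BufferedReader(new InputStreamReader(csv));
            return reader.readLine();   // first line = CSV header used to build the CREATE TABLE statement
        }
    }
}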

This version of the app requires two parameters to run: the first is ‘local’ or ‘aws’; the second, for ‘local’, is a path to a CSV file and, for ‘aws’, just the bucket plus the prefix and the object/file name. Something like:

aws "my-first-s3-bucket-oh-my/star0000_h.csv.gz"
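
As a sketch (the class, method and variable names here are illustrative, not necessarily the project's), the two arguments might be handled roughly like this:

public class AppEntry {

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: app <local|aws> <path | bucket/prefix/file>");
            System.exit(1);
        }
        String mode = args[0];      // "local" or "aws"
        String location = args[1];  // local file path, or "bucket/prefix/file" for S3

        if ("aws".equalsIgnoreCase(mode)) {
            // split "my-first-s3-bucket-oh-my/star0000_h.csv.gz" into bucket and key
            int slash = location.indexOf('/');
            String bucketName = location.substring(0, slash);
            String key = location.substring(slash + 1);
            // ... hand bucketName and key over to the S3 reading logic
        } else {
            // ... process the local CSV file as in the previous version
        }
    }
}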

The Snowflake JDBC driver is now part of the project. In the previous post I covered the logic of how the app processes the CSV format; now, using the same approach, we will read the structure and the data from a file on AWS S3. AWS credentials are a very valuable piece of information and should never be stored in the code. If you use GitHub for your repository, it may alert you about this mistake when credentials are committed, but malicious people also act very fast to grab such valuable data. The moral: do not be a “too lazy coder”, add logic to obtain your AWS credentials properly.

Assuming you have some CSV or CSV.gz files stored on AWS S3 and have the AWS credentials for your account, the app first reads the credentials from the hidden credentials file and keeps them for later use. Of course, there is a fork in the logic for plain CSV versus GZIP CSV files: if it is GZIP, unwrap it and process it as a CSV file, read a chunk of the file to get the CSV header, and build an SQL statement to create the table. Compared to the previous post, the process of loading data into Snowflake is almost the same, with small differences; you can find more details here on how to copy data from AWS S3 into Snowflake tables. At this point we have the SQL to create the table in Snowflake and are ready to connect to the Snowflake environment. The app connects to the database and schema with the user and data warehouse specified in the app config. Connected, and with a statement, we create the table:

statement.execute(paramsInCSV.get("createTableSql"));
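
For context, here is a minimal sketch of where that statement object can come from with the Snowflake JDBC driver; the config key names and the account placeholder are my assumptions, as the real app reads these values from its config file:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;

public class SnowflakeConnect {

    // Sketch: open a JDBC connection to Snowflake and return a statement for the SQL above.
    public static Statement openStatement(Properties cfg) throws SQLException {
        Properties props = new Properties();
        props.put("user", cfg.getProperty("user"));
        props.put("password", cfg.getProperty("password"));
        props.put("db", cfg.getProperty("database"));
        props.put("schema", cfg.getProperty("schema"));
        props.put("warehouse", cfg.getProperty("warehouse"));

        // e.g. jdbc:snowflake://xy12345.us-east-1.snowflakecomputing.com/
        String url = "jdbc:snowflake://" + cfg.getProperty("account") + ".snowflakecomputing.com/";

        Connection connection = DriverManager.getConnection(url, props);
        return connection.createStatement();
    }
}

The db, schema and warehouse connection properties set the session context up front, so the create and copy statements run against the right place without extra USE commands.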

Now we need to create an External Stage that points to AWS S3 with credentials:

statement.execute("create or replace stage my_csv_stage url = 's3://"+paramsInCSV.get("bucketName")+"' credentials = (aws_key_id='"+connParams.get("accessKey")+"' aws_secret_key='"+connParams.get("secretKey")+"');");
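
As a sketch of the “get your AWS credentials properly” idea from above (assuming the AWS SDK for Java v1; treating connParams as a plain map is my assumption, not necessarily the project's structure), the key pair can be pulled from the standard credentials file like this:

import java.util.Map;

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;

public class AwsCreds {

    // Sketch: read the key pair from the standard ~/.aws/credentials file instead of hard-coding it,
    // and keep it in the same map the stage-creation call uses.
    public static void loadInto(Map<String, String> connParams) {
        AWSCredentials creds = new ProfileCredentialsProvider().getCredentials();
        connParams.put("accessKey", creds.getAWSAccessKeyId());
        connParams.put("secretKey", creds.getAWSSecretKey());
    }
}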

The bucket name is cut from the S3 path you provided, and the credentials come from your AWS credentials file. If the file contains a date column, it is validated against the formats you provided in the app config file, and if it matches, the date format is placed into the copy command:

String sLine2CopyS3 = "copy into "+paramsInCSV.get("fileName")+" from @my_csv_stage/"+paramsInCSV.get("key")+" on_error = 'skip_file' file_format = (type = csv field_delimiter = ',' skip_header = 1 FIELD_OPTIONALLY_ENCLOSED_BY='\"' NULL_IF = ('','NULL', 'null', '\\N') EMPTY_FIELD_AS_NULL = true "+sDateFormatSql+");\n";

This string is explained in detail in the previous post, and more detailed Snowflake info is here. The last step deletes the External Stage. Copying a big file from S3 may take some time, so have patience and wait :).
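
Put together, those last two steps might look roughly like this (a sketch reusing the statement and sLine2CopyS3 from the snippets above; error handling is omitted):

statement.execute(sLine2CopyS3);                         // run the COPY INTO; may take a while for large files
statement.execute("drop stage if exists my_csv_stage");  // remove the temporary External Stage
statement.close();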

The updated config file has two new lines: ‘regionS3’ is the region of the AWS S3 file location, which should be the same region as your Snowflake environment, and ‘bytesFromFile’ is how many bytes will be read from the file to get the header and the data for validation.
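
Purely as an illustration (the config file name, format and sample values are placeholders; only the two key names come from the actual config), reading those two entries could look like this:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

public class AppConfig {

    public static void main(String[] args) throws IOException {
        Properties config = new Properties();
        try (FileInputStream in = new FileInputStream("config.properties")) {
            config.load(in);
        }
        String regionS3 = config.getProperty("regionS3");                          // e.g. "us-east-1"
        int bytesFromFile = Integer.parseInt(config.getProperty("bytesFromFile")); // e.g. 65536
        System.out.println("Reading " + bytesFromFile + " bytes from S3 region " + regionS3);
    }
}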

And again, many things could be added to improve this project and make it more ‘pretty’: split and compress huge local CSV files, add proper logging, move the support functions into a separate class, add multi-file support, and many, many, many more.

Here is the project to download.

If you have any questions about Snowflake or have a question about the data warehouse built for the cloud, join the Snowflake Lodge Community and get help from experts.

For the next version, I’m planning to add reading from Azure. Stay tuned…
