Snowflake & S3 Buckets for Reading CSV Files

Working with CSV Files (Simple Format) and Optimization Processes

Sol M. Lozano
Globant
10 min read · Jan 19, 2023

Photo by Claudio Schwarz on Unsplash

The goal is to show the process of reading CSV files from an S3 bucket using different tools/methodologies offered by Snowflake.

To achieve this goal, our main exercise consists of reading a couple of CSV files with simple information (neither JSON nor ARRAY as column content). The files describe trips made by members of a cycling community between points A and B, including member information, the start and end times, and the stations involved; they come from an S3 bucket that is currently running and managed by Snowflake.

During this work with these files, the idea is to learn about the following:

  • Creation of a database/stage/file format
  • Permissions on roles and components
  • Use of and switching between two fundamental roles in Snowflake, SYSADMIN and ACCOUNTADMIN
  • Application of different configurations to our warehouse

And, as extra content, we will learn how to implement and evaluate warehouse optimizations depending on the capacity and response times the business needs.

The result of this exercise is hands-on knowledge of reading and processing files such as CSVs from remote storage such as Amazon S3 using a powerful data warehouse such as Snowflake.

First, let us start with an overview of the tools used, the details, and the steps involved in the exercise.

A brief about the tools to handle

The introductory material on the tools that the reader will handle during the exercise consists of two elements:

  • Snowflake is the data warehouse to read files, process them, and save the result in a tangible table.
  • Amazon Simple Storage Service (S3), as the initial storage that holds the CSV files used during the exercise.

Snowflake

Snowflake Data Cloud is an advanced data platform provided as Software-as-a-Service (SaaS). It includes many data storage, processing, and analytic solutions that are faster, easier to use, and more flexible than traditional offerings.

Snowflake Architecture
Extracted from https://www.snowflake.com/en/

As shown in the previous image, this platform offers many ways and tools to create a collaborative environment between business data areas such as Engineering, Cybersecurity, Data Science & ML, and Backend/Applications.

To create and maintain this environment, data is used as input to generate outcomes such as insights, predictions, monetization reports, and data products, which allows a continuous flow between areas and processes.

Amazon Simple Storage Service (S3 Bucket)

Amazon Simple Storage Service (Amazon S3) is an object storage service for storing and protecting large amounts of data for various use cases. An S3 bucket can store data in any format (records, images, videos, and complex data) and of any size.

This storage provides excellent control over data access and cost optimization, while adding various features such as region replication, VPC integration, data protection, etc.

Amazon S3 Bucket
Extracted from https://aws.amazon.com/s3/

About the exercise

In addition to the information mentioned in the introduction to the exercise, the objectives, and the conceptual terms to be applied and learned, the reader needs the following prerequisites to understand and complete this task:

Pre-requirements

  1. The exercise needs a Snowflake account to work on the sample data. If the reader does not have an account, Snowflake offers a way to create a trial account.
  2. The reader needs initial knowledge about these concepts in Snowflake or any other database/data warehouse: ROLES, WAREHOUSE, STAGE, COPY INTO, and FILE FORMAT.
  3. The Snowflake account needs permission to use the SYSADMIN and ACCOUNTADMIN roles (see the quick check after this list).
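
As an optional sketch to verify this last point, the following query uses standard Snowflake context functions to show the current role and the roles available to the logged-in user; the output should include SYSADMIN and ACCOUNTADMIN:

-- Optional check: the roles available to the current user
-- (CURRENT_AVAILABLE_ROLES() should include SYSADMIN and ACCOUNTADMIN).
SELECT CURRENT_USER()            AS USER_NAME,
       CURRENT_ROLE()            AS ACTIVE_ROLE,
       CURRENT_AVAILABLE_ROLES() AS AVAILABLE_ROLES;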

Let us get down to work

After understanding and complying with these prerequisites, you can start developing the exercise.

1- Access to the Snowflake console

In this step, we enter the Snowflake console we have at our disposal, or the one created as indicated in Step 1 of the prerequisites. This gives us access to the main console so that the exercise can begin.

Snowflake Console.

Then, we can open a worksheet:

Open a new worksheet.

And we can create a new one anywhere, either inside a folder or not:

Create a new worksheet or folder.

And as a result, we have our worksheet ready to continue with the code part:

The new worksheet to save our statements.

2- Creation of CIBA database

In this step, it is necessary to create a database to save our information in the following steps. For this, the process begins with the SYSADMIN role for database creation (this role is designated to create components inside the database without admin permissions):

USE ROLE SYSADMIN;

CREATE OR REPLACE DATABASE CIBA COMMENT = 'Test database for CIBA company';

After the execution, the database is created and ready to use by the SYSADMIN role.

Database created with SYSADMIN role
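
Optionally, we can double-check the result with a SHOW command (standard Snowflake syntax, not part of the original script); the new database should appear in the listing:

-- Optional verification: the CIBA database should appear in the output.
SHOW DATABASES LIKE 'CIBA';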

3- Adjustments in the COMPUTE_WH warehouse configuration

In this part, we assign an intermediate configuration to the warehouse in order to check query performance with a setup that generates little cost.

For that, the ACCOUNTADMIN role grants usage permission to SYSADMIN so it can work with the warehouse configuration in the next steps.

Also, the warehouse is sized MEDIUM with a maximum cluster count of 2 to evaluate the data loading/unloading performance.

USE ROLE ACCOUNTADMIN;

GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE SYSADMIN;

CREATE OR REPLACE WAREHOUSE COMPUTE_WH
WITH WAREHOUSE_SIZE = 'MEDIUM'
WAREHOUSE_TYPE = 'STANDARD'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
MIN_CLUSTER_COUNT = 1
MAX_CLUSTER_COUNT = 2
SCALING_POLICY = 'STANDARD';

After the execution, all changes have been applied to our warehouse, called COMPUTE_WH.
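
As an optional check (not part of the original script), the new settings can be reviewed with a SHOW command; the output includes the size, the minimum/maximum cluster counts, and the auto-suspend value:

-- Optional verification of the warehouse settings applied above.
SHOW WAREHOUSES LIKE 'COMPUTE_WH';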

4- Creation of a trips table

In this step, and with the previous overview of the data in the files, we can identify the data types needed to fill this table: the trip duration with its start and end times; the general information about the initial and final stations of the trip; and the general information about the bike used, the user, and the membership details.

The following DDL was created for the trips table used in this exercise.

Note: Setting a role, a warehouse, a database, and a schema is necessary to generate the table.

USE ROLE SYSADMIN;
USE WAREHOUSE COMPUTE_WH;
USE DATABASE CIBA;
USE SCHEMA PUBLIC;


CREATE OR REPLACE TABLE TRIPS
(TRIPDURATION INTEGER,
STARTTIME TIMESTAMP,
STOPTIME TIMESTAMP,
START_STATION_ID INTEGER,
START_STATION_NAME STRING,
START_STATION_LATITUDE FLOAT,
START_STATION_LONGITUDE FLOAT,
END_STATION_ID INTEGER,
END_STATION_NAME STRING,
END_STATION_LATITUDE FLOAT,
END_STATION_LONGITUDE FLOAT,
BIKEID INTEGER,
MEMBERSHIP_TYPE STRING,
USERTYPE STRING,
BIRTH_YEAR INTEGER,
GENDER INTEGER);

After the execution, the table called TRIPS was created in the CIBA.PUBLIC database/schema.

TRIPS table created into CIBA as database and PUBLIC as schema.
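
As a small optional sketch, the table definition can be reviewed from the worksheet to confirm that the columns and data types match the DDL above:

-- Optional verification: review the columns and data types of the TRIPS table.
DESCRIBE TABLE TRIPS;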

5- Creation of CIBA_TRIPS stage

With the database and table created in the steps above, this step covers the creation of the stage needed to continue the loading process using the designated S3 bucket. The statements to achieve this are the following:

CREATE STAGE "CIBA"."PUBLIC".CIBA_TRIPS 
URL = 's3://snowflake-workshop-lab/citibike-trips-csv'
COMMENT = 'EXTERNAL STAGE FOR CIBA LOADING PROCESS';

In detail, the bucket used in this process is maintained by Snowflake and comes from an earlier lab session. This bucket holds many types of sample data to work with.

After the execution, our stage was created in our database.

CIBA_TRIPS stage created.

Then, just to confirm that the connection is correct, it is good practice to execute the LIST command to see all files in the bucket:

LIST @CIBA_TRIPS;
List of files in the stage created.

It is also a good idea to check the number and sizes of the files returned by the last execution using the following query:

SELECT FLOOR(SUM($2)/POWER(1024, 3),1) TOTAL_COMPRESSED_STORAGE_GB,
FLOOR(AVG($2)/POWER(1024, 2),1) AVG_FILE_SIZE_MB,
COUNT(*) AS NUM_FILES
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

After this, we can see the total storage consumed by the files in GB, the average file size in MB, and the total number of files.

Details in sizes and amount of files in the STAGE created.
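
If only a subset of the files is of interest, the LIST command also accepts a regular-expression PATTERN. The expression below is an illustrative assumption based on the file names seen in the listing (for example, trips_2018_7_2_0.csv.gz) and filters only the 2018 files:

-- Optional: list only the 2018 trip files using a regular-expression pattern.
LIST @CIBA_TRIPS PATTERN = '.*trips_2018.*';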

6- Creation of CSV_FORMAT file format

For the CSV reading process, it is essential to create a file format object in Snowflake, because this element defines how Snowflake parses the data in the files before saving it into a table.

However, as a good practice, we will create a new schema to manage these types of components: thinking ahead, it is easier to control and share access to them between users when they live in a dedicated schema. Thus, the lines to execute are the following ones:

CREATE OR REPLACE SCHEMA FILE_FORMAT;

USE SCHEMA FILE_FORMAT;

And finally, the file format to read a simple CSV file is the following:

CREATE FILE FORMAT "CIBA"."FILE_FORMAT".CSV_FORMAT 
TYPE = 'CSV' COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 0
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO' NULL_IF = ('')
COMMENT = 'FILE FORMAT TO LOAD CSV - CIBA';

After this, our CSV_FORMAT was created in our FILE_FORMAT schema:

CSV_FORMAT file format created in FILE_FORMAT schema.
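
Optionally, we can confirm that the new object lives in the dedicated schema (standard SHOW syntax, not part of the original script):

-- Optional verification: list the file formats stored in the dedicated schema.
SHOW FILE FORMATS IN SCHEMA CIBA.FILE_FORMAT;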

The work done so far in this exercise shows us how to generate the STAGE component and use it to connect to an S3 bucket containing many CSV files.

Also, as a recommendation, it is a good practice to create a new schema to handle FILE FORMAT and other structures generated in Snowflake other than tables and data.

So, through this practical part, we are now prepared to create different kinds of elements in Snowflake and to validate the files extracted from an S3 bucket.

Now, as the final step, we can start looking for the right warehouse configuration, considering the necessary cost aspects while always fulfilling the business needs for data availability and agility.

What needs to be evaluated next?

Let's imagine that, in a production case, the company needs to load many files into a table in Snowflake and optimize resources and costs during these tasks.

And, following the introduction to the steps of this exercise, we need to start evaluating our response times for this data flow and its upload to the database under different configurations. So, is this exercise a good approach to help us reach a final decision on this part? Yes, it is.

The best way to test this is to simulate a typical load with different warehouse sizes, validate whether the configuration is the most appropriate for the need, and try several options until you find the best time/cost trade-off for the business and for the rest of the processes that consume the final table and data.

Let's continue with the exercise

Now, with this new focus, we need to use the PUBLIC schema again to query the elements exposed through the stage we created; the following script shows us a file from the STAGE:

USE SCHEMA PUBLIC;

SELECT $1, $2, $3, $4, $5
FROM @CIBA_TRIPS/trips_2018_7_2_0.csv.gz
(FILE_FORMAT => "CIBA"."FILE_FORMAT".CSV_FORMAT )
LIMIT 100;

After the execution, we can see the different columns and the information in the file called trips_2018_7_2_0.csv.gz.

Content in a file obtained from a STAGE.

Then, after this validation, we can execute a COPY INTO to save the information located in the S3 bucket, through the STAGE, as records in the TRIPS table.

COPY INTO TRIPS 
FROM @CIBA_TRIPS/trips
FILE_FORMAT="CIBA"."FILE_FORMAT".CSV_FORMAT;

After this, we can find all the records from the files in the STAGE in the TRIPS table.

Files loaded in TRIPS table.
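
As an optional sketch, Snowflake's VALIDATE table function can confirm that no rows were rejected during the load; the '_last' job ID refers to the most recent COPY INTO executed in the current session:

-- Optional check: return any rows rejected by the last COPY INTO (an empty result means a clean load).
SELECT * FROM TABLE(VALIDATE(TRIPS, JOB_ID => '_last'));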

Then, we can confirm the total number of records in the table with the following statement:

SELECT COUNT(*) FROM TRIPS;
Query details related to the COUNT statement.

We can see the time that Snowflake took to process these files and save them in the table with COPY INTO (13 seconds), and the time taken by a COUNT(*) (0.118 seconds) to return the total number of records in this table.

Is this a good configuration for loading these files in terms of cost? No, because the warehouse has a MEDIUM size, and the data load into this table needs to be quicker.

So, we can test with the following configuration:

TRUNCATE TABLE TRIPS;

ALTER WAREHOUSE "COMPUTE_WH"
SET WAREHOUSE_SIZE = 'LARGE'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE;

-- Reload the data with the new warehouse size before counting
COPY INTO TRIPS
FROM @CIBA_TRIPS/trips
FILE_FORMAT="CIBA"."FILE_FORMAT".CSV_FORMAT;

SELECT COUNT(*) FROM TRIPS;

In this case, the LARGE size of our warehouse provides more resources, allowing more parallel threads during the load. Also, an AUTO_SUSPEND equal to 60 keeps the resource costs of this warehouse under control (after 60 seconds without activity, the warehouse suspends automatically).

Content in a file obtained from a STAGE with this new warehouse configuration.
Query details related to the COUNT statement with this new warehouse configuration.

In this case, our results are 10 seconds for the loading process using COPY INTO with the table as target (3 seconds less than the first attempt) and 0.0075 seconds to get the total count of the table (the best time compared to the first attempt).
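
To compare the two attempts without reading the timings from the UI, a sketch like the following uses the INFORMATION_SCHEMA.QUERY_HISTORY table function to list the recent loads of the TRIPS table with their warehouse size and elapsed time (the ILIKE filter on the query text is an assumption made to isolate the loads of this exercise):

-- Sketch: compare recent COPY INTO runs on TRIPS by warehouse size and elapsed time.
SELECT START_TIME,
       WAREHOUSE_NAME,
       WAREHOUSE_SIZE,
       TOTAL_ELAPSED_TIME / 1000 AS ELAPSED_SECONDS,
       QUERY_TEXT
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE QUERY_TEXT ILIKE 'COPY INTO TRIPS%'
ORDER BY START_TIME DESC;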

Conclusion

With these examples, we can see:

  1. How to work with files from remote storage, and how Snowflake, with its stack of data warehouse components, allows processing them into readable and easily accessible data.
  2. How the testing process protects a company's cost/benefit interests, since Snowflake is a tool designed to adapt to any type of business: from a small company that does not need much processing power or service stability, to a company that handles high-volume streaming data with strict availability requirements.
