Snowflake: Setting up, loading data and connecting from R

Connor Higgins
Jul 19, 2020

This is a short exploration of virtual warehouses in Snowflake, with a particular emphasis on integrating them into R workflows. For that reason we use R for cleaning data, write the R-processed data to the warehouse using SnowSQL (which supports script files, so it can be easily integrated with Python or R), and ultimately pull the data back into R for future analysis. Naturally this can largely be extended to Python as well.

Snowflake is a cloud data warehouse that has been seeing increasing use across the field, offering high performance, cost effectiveness, and easy scalability. Snowflake also offers a trial version, which we can use to throw together a virtual warehouse of our own.

For sample data, this time around we will use health insurance data stored in .csv format; however, Snowflake also supports loading from AWS S3 buckets, Azure containers and a number of other sources. Further, part of Snowflake’s appeal is a growing ecosystem of partners that enables ETL from a range of other sources, depending on your needs.

For this example we will create a warehouse, load in two CSVs (with some slight data cleaning in R), then connect to the Snowflake warehouse from R to query the data. Snowflake supports a web UI for limited data loading; however, we will be using SnowSQL here, as it allows finer control, handles larger amounts of data, and makes it easier to maintain a script file.

There are only a few differences between SnowSQL and standard SQL, mostly relating to setting up a warehouse; data loading also involves the extra step of “staging” files. The differences are minimal, though: anyone with SQL experience will already be familiar with almost all of the available commands.

Sources (both maintained by the US Dept. of Health and Human Services): ‘states.csv’ and ‘rates.csv’.

‘states.csv’ could be loaded with the web UI, if that is your preference. Once you reach the size of ‘rates.csv’, though, you should start considering SnowSQL or perhaps one of Snowflake’s partner services. We will use SnowSQL for both.

Creating Our Warehouse, Database and Schema:

With warehouse creation you have two options: the web UI (perfectly serviceable for this purpose) or SnowSQL. We will create an extra-large warehouse with otherwise default options; it is generally recommended to start with a default configuration and then scale up or down as needed:
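In SnowSQL, that might look something like the following sketch. The warehouse name INSURANCE_WH is a placeholder of my own, and the suspend/resume settings simply spell out the usual defaults:

-- Warehouse name is a placeholder; pick whatever fits your account
CREATE WAREHOUSE IF NOT EXISTS INSURANCE_WH
  WAREHOUSE_SIZE = 'XLARGE'  -- scale up/down later with ALTER WAREHOUSE if needed
  AUTO_SUSPEND   = 600       -- suspend after 10 idle minutes
  AUTO_RESUME    = TRUE;     -- wake automatically when a query arrives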

Except where the terminal output matters, I’ll display the SQL text in Atom from here on out for readability. I highly recommend a good text editor such as Atom or Sublime for anyone maintaining SnowSQL script files.

Creating the database and schema is the same as setting up any other SQL database. Here we’ll create an “Insurance” database and store our tables under the “Tabs” schema:
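A sketch of that setup (only the database and schema names come from the article; the warehouse name is again my placeholder):

CREATE DATABASE IF NOT EXISTS INSURANCE;
CREATE SCHEMA IF NOT EXISTS INSURANCE.TABS;

-- Point the current SnowSQL session at the new objects
USE WAREHOUSE INSURANCE_WH;
USE DATABASE INSURANCE;
USE SCHEMA INSURANCE.TABS;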

Loading Data:

  • States.csv

This is the smaller of the two but it does require some slight data cleaning.

Step 0: Data Cleaning (R)

In R, we can strip off the formatting from the percentage and currency columns so we can load the data in as numeric data for future analysis.
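A sketch of that cleaning step is below. The file path is an assumption, and since we only know that every column except the state name holds percentages or dollar amounts, the code simply cleans every column after the first:

# Assumed path; adjust to wherever states.csv lives locally
library(dplyr)
library(readr)

states <- read_csv("states.csv")

states_clean <- states %>%
  # Strip "%", "$" and thousands separators, then coerce to numeric.
  # Assumes the state name is the first column and stays character.
  mutate(across(-1, ~ as.numeric(gsub("[%$,]", "", .x))))

glimpse(states_clean)                        # confirm the resulting column types
write_csv(states_clean, "states_clean.csv")  # cleaned file we will stage with PUT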

This article is focused on Snowflake, not R, so I’ll avoid getting too technical. The above simply performs some quick data cleaning, stripping out the “%” and “$” signs as applicable, then converting the character columns to numeric data types (which is what we want in our database). The exception, of course, is the States column, which should remain a character column.

Next we switch back to SnowSQL. With the data cleaned, we can load it into the warehouse we created.

Step 1: Create the table

This step will be very familiar to anyone who has used SQL. We can use the ‘glimpse’ output from the cleaning step to select appropriate data types. We’ll put the table under the Tabs schema we created at the start:
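A sketch of the statement is below. I do not have the original column list, so the names and types here are placeholders; substitute whatever ‘glimpse’ reports for your cleaned file:

-- Column names below are illustrative placeholders only
CREATE OR REPLACE TABLE INSURANCE.TABS.STATES (
    STATE                VARCHAR,  -- the one column left as character in R
    UNINSURED_RATE_2010  FLOAT,    -- former percentage column, cleaned to numeric
    UNINSURED_RATE_2015  FLOAT,
    AVG_MONTHLY_PREMIUM  FLOAT     -- former currency column, cleaned to numeric
);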

Creating the table itself does not take a significant amount of time.

Step 2: Stage and load

We will need to first stage the file, then load it into the table. To stage it, we create a named stage called ‘US_States’, use the ‘PUT’ command to load the file into the stage, then copy the data from the stage to the table. Finally we do some cleanup, dropping the stage.
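Roughly, the script looks like this; the local file path is an assumption, and the file format options assume a standard comma-delimited file with a header row:

-- 1. Create a named stage
CREATE OR REPLACE STAGE US_States;

-- 2. Push the local (cleaned) file into the stage
PUT file:///tmp/states_clean.csv @US_States AUTO_COMPRESS = TRUE;

-- 3. Copy from the stage into the table
COPY INTO INSURANCE.TABS.STATES
  FROM @US_States
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

-- 4. Clean up the stage
DROP STAGE US_States;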

Running those commands in SnowSQL reports the elapsed time for each step.

This takes longer than creating the table itself, but the time is still reasonable. We could increase the warehouse size to improve performance, but it would also be valid to hold off for cost savings; it doesn’t seem strictly necessary to upgrade yet.

Querying the first 5 rows shows that our data loaded successfully:
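In standard SQL fashion:

SELECT *
FROM INSURANCE.TABS.STATES
LIMIT 5;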

  • Rate.csv

We will be much more brief with this file: the steps are the same and no data cleaning is necessary. But it is informative to see just how much longer a large file takes to load. At well over 1.5 million rows, this file is far larger than our ‘states’ data, so we should expect a drastic difference in load time.

Step 1: Create the table

A glance at the data for reference:
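One quick way to take that glance from R (the file name and path are assumptions) is data.table::fread, which copes well with a file of this size:

library(data.table)

# Read only a handful of rows to inspect column names and types
rates_preview <- fread("rates.csv", nrows = 5)
str(rates_preview)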

First of course we create the table:
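As with the states table, I do not have the original column list, so the columns below are placeholders to match against the actual header of the file:

-- Placeholder columns; adjust names and types to the real rates file
CREATE OR REPLACE TABLE INSURANCE.TABS.RATES (
    BUSINESS_YEAR   INTEGER,
    STATE_CODE      VARCHAR,
    PLAN_ID         VARCHAR,
    AGE             VARCHAR,
    INDIVIDUAL_RATE FLOAT
);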

And, as should be familiar by now, stage and load the data:
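It is the same pattern as before, just with a different stage, file and target table (the stage name and path are again placeholders):

CREATE OR REPLACE STAGE US_Rates;

PUT file:///tmp/rates.csv @US_Rates AUTO_COMPRESS = TRUE;

COPY INTO INSURANCE.TABS.RATES
  FROM @US_Rates
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"');

DROP STAGE US_Rates;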

As we might have feared, this operation took far longer than before:

This took approximately 660% more time than the previous operation, coming in at a little under 5 minutes. Depending on the use case, this may be the point where we need to consider upgrading the warehouse if speed matters, especially once we start looking at queries. In a business setting, we would want a set of representative queries and timed trials to see whether they run fast enough for our purposes.

Bonus: Connecting R and Snowflake

Snowflake does have an R package available on GitHub, based on dbplyr. However, this package is still in development, and if you have experience with SQL (as you likely do if you are using Snowflake), I might actually recommend avoiding it and taking a route that gives you finer control over your queries.

I will use the ODBC driver that Snowflake makes available (a JDBC driver is also available). As a note for Python users, Snowflake also provides a Python connector.

When setting up the connection through your preferred means, a simple way to check your account name is to log in to Snowflake and look at the URL. Your account name, along with region information, appears at the start of the URL: for a URL like xyz.us-east-1.snowflakecomputing.com, the account is ‘xyz.us-east-1’.

After setting up the driver and DSN, you will be able to connect from a tool of your choice. In R we’d start by creating a connection object (called “con”), which we can feed our queries to.
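A minimal sketch of that connection, assuming the Snowflake ODBC driver is installed and registered under a DSN named “snowflake”; the credentials shown are placeholders:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 dsn       = "snowflake",      # DSN configured for the ODBC driver
                 uid       = "MY_USER",        # placeholder credentials
                 pwd       = "MY_PASSWORD",
                 warehouse = "INSURANCE_WH",
                 database  = "INSURANCE",
                 schema    = "TABS")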

How you evaluate queries is largely a personal preference. If you prefer dplyr and would like to avoid SQL, you have dbplyr or the official Snowflake GitHub package available. As a personal preference, I use the DBI package directly and immediately convert the results to a data.table.

Using DBI directly gives you all the control you’d have writing a direct SQL query (you essentially are writing one), and the results are then converted to a data.table for rapid and easy manipulation (the package has a deserved claim to fame for its speed and terse programming style).
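The pattern is just a plain SQL string handed to DBI, with the result wrapped in as.data.table(); the WHERE clause (and the STATE_CODE column it filters on) is purely illustrative:

library(data.table)

# Write the SQL yourself, let DBI run it, convert the result to a data.table
ny_rates <- as.data.table(
  dbGetQuery(con, "SELECT * FROM INSURANCE.TABS.RATES WHERE STATE_CODE = 'NY'")
)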

We’ll keep the R section short, as this article is getting long, but let’s continue benchmarking by seeing how long it takes to pull each of our tables.
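A simple way to do that is to wrap the pull in system.time(); the exact numbers will of course vary with warehouse size and network conditions. Starting with the small table:

# Time the full pull of the states table
system.time(
  states_dt <- as.data.table(dbGetQuery(con, "SELECT * FROM INSURANCE.TABS.STATES"))
)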

States, as expected, is quite fast to pull, being a fairly small table.

But what about our “problem” table, Rates…
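The same timing wrapper, now against the 1.5M+ row rates table:

# Time the full pull of the rates table
system.time(
  rates_dt <- as.data.table(dbGetQuery(con, "SELECT * FROM INSURANCE.TABS.RATES"))
)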

As expected, the query times are quite long, just as with loading. There are some further considerations: is this simply for internal analytics, where speed is not a tremendous concern? Do we even need to pull the entire table, or can we optimize our SQL queries before upgrading the warehouse? Increasing performance by increasing the warehouse size would be simple from a technical perspective; the concern would come from cost, since larger warehouses consume credits at a quicker rate and so incur greater costs over the same timespan. As a free-trial user I am content to be patient with my queries, but in a business setting time may very well be money.

Conclusion

This article offered a quick introduction to Snowflake, with a particular emphasis on setup and integration with R workflows. We worked through setting up a warehouse, database and schema, loading in CSVs, and connecting to the Snowflake warehouse from R. We saw how data cleaning in R still fits into a Snowflake workflow, how load performance differs between small and large files, and how to start using the data from the warehouse in R (or Python).

Snowflake supports many more data sources than flat files, though. Future articles will likely focus on the ecosystem Snowflake has cultivated and the many options those tools give to users. Additionally, Snowflake allows loading data from traditional cloud services, most notably AWS and Azure. Though this series is meant to focus more on infrastructure than analysis, an article analyzing the health insurance database we are essentially building here is not at all out of the question.

Connor Higgins

Current graduate student at Northeastern University, pursuing a career in data science. Also an avid reader of speculative fiction!