Using Hex’s new R support with Snowflake

Hex and Snowflake have been partnering closely for some time in support of our joint customers, first with Snowflake SQL and more recently with our Snowpark Python functionality, to enable exploratory analysis and data-science workloads from Hex notebooks.

Snowflake also has a sizeable community of R users connecting from various clients and tools, and it became obvious that support for R + Snowflake from within Hex would be beneficial to our joint customers. This functionality is now available in Public Preview within Hex, and in this blog we will introduce how you can query Snowflake from Hex using different methods, with SQL and R code, depending on your preferences.

The Data

First we will need some data to demonstrate with. There are a few sample datasets provided by Snowflake within all customer accounts [ database: snowflake_sample_data ] to enable users to get started quickly ^1. One of these is TPC.org's TPCH benchmark data. The benchmark has been around for many years and consists of 22 business-oriented queries over a data model representing a typical wholesale supplier business with customers, orders, products, etc. The data is available in several scale factors in Snowflake, so you can use it to see how Snowflake performs at different data volumes. Within the snowflake_sample_data database, the schemas tpch_sf1, tpch_sf10, tpch_sf100 and tpch_sf1000 provide the increasing scale factors (roughly 1 GB up to 1 TB of data).
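If you want to confirm that the sample database and schemas are visible in your account, a quick check from a Snowflake worksheet or a Hex SQL cell might look something like this:

SHOW SCHEMAS IN DATABASE snowflake_sample_data;
SELECT count(*) FROM snowflake_sample_data.tpch_sf1.lineitem;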

The Query

We will use Query 1 from the benchmark as the basis for our various examples. You can see the standard query below:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem
WHERE
    l_shipdate <= dateadd(day, -90, to_date('1998-12-01'))
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

We will adapt this query slightly to add a join and an additional filter on the ORDERS table. This join isn't functionally needed for the query, but it will enable us to demonstrate joins in our code examples below. The SQL for the adapted query looks like this:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem JOIN orders ON (l_orderkey = o_orderkey)
WHERE
    l_shipdate <= dateadd(day, -90, to_date('1998-12-01'))
    AND o_orderstatus = 'O'
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;

First Hex R Project

Assuming you already have a Hex account, sign in to Hex. Alternatively, you can sign up for a Hex trial account here. We need to create a new R project, so click on the drop-down arrow next to New project to get the New R project option:

Select New R project from the drop down.

You should be presented with a screen that looks something like this:

Initial Project view

We can start creating SQL and R code cells in the notebook, but to connect to Snowflake we will need a Snowflake account and credentials to create a database connection. You can sign up for a Snowflake trial account here if you don’t already have one.

You may already have some Snowflake database connections set up in your Hex Workspace to choose from. If not, on the left-hand side of the screen click on:

Data Sources

Click on + Add, select Create data connection…, and select the Snowflake option. Complete the form with your relevant account details and user credentials. It should look something like this when you're done:

Be sure to de-select the option to share this connection with my workspace unless you want to share the connection with other users and projects in your Workspace.

Click on the SQL Cell option to get your first SQL cell in the notebook. Make sure that the database connection you just created is selected for the cell, and paste the adapted TPCH query from above into the Notebook.

In the database connection form above there is no field to add the default schema we want to use for the queries we will run. There are a few ways that we can handle this.

  1. Hex supports multi-statement SQL cells with Snowflake, with statements separated by ;. We can add a USE SCHEMA statement for the schema (scale factor) we want to use [e.g. USE SCHEMA TPCH_SF1;] before every query.
  2. You can fully qualify the table names within the query text with the schema name [e.g. TPCH_SF1.lineitem ].
  3. Hex supports templated queries. We can define an R variable containing the schema name as a string (e.g. schema = 'TPCH_SF1') in an R cell, and use it within the SQL cell using Jinja-style templating (e.g. {{ schema }}). As this is a Snowflake identifier we need to identify it as such within the Snowflake SQL, e.g. identifier({{ schema + ".lineitem" }}). You can see the two cells using this approach sketched below; this is the approach we will use within the notebook for SQL, as it enables us to change the schema name in one place to test at different scale factors if we so choose.
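As a rough sketch of that third approach, the R cell holds the schema name and the FROM clause of the adapted SQL query references it via identifier(); the rest of the query is unchanged:

# R cell: define the schema (scale factor) once, as a string
schema = 'TPCH_SF1'

-- SQL cell: only the FROM clause changes
FROM identifier({{ schema + ".lineitem" }}) JOIN identifier({{ schema + ".orders" }}) ON (l_orderkey = o_orderkey)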
First Query cell

You can run the cell with command+enter, or via the play symbol in the top right of the cell.

It takes about 3 seconds to execute in Snowflake and return the result to Hex on a Snowflake XSMALL virtual warehouse.

Note: at the bottom of the cell, a dataframe variable has appeared which assigns the result from the query to a variable in R as a dataframe. You can click on it and rename it to something more meaningful (e.g. tpch_q1). Let's create a new R cell and use it to get the dimensions of our prior SQL result, for example.
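Assuming you renamed the result dataframe to tpch_q1, that R cell is as simple as:

# rows and columns of the SQL cell's result dataframe
dim(tpch_q1)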

We can use this method to retrieve result sets, or whole tables, from Snowflake into the Hex R session for further processing with R, e.g. [ SELECT * FROM identifier({{ schema + ".lineitem" }}) and SELECT * FROM identifier({{ schema + ".orders" }}) ].

Processing Data in Hex with R

Let’s test that out now by converting the TPCH SQL query into a functional equivalent in R. R’s base dataframe functionality supports this, but R’s tidyverse packages, and the dplyr package specifically, extend this functionality and make it far easier to construct and maintain, so we will use these. They also have the advantage of supporting databases, allowing processing to be pushed down, as we will see later. These R packages, and many others, are pre-installed within the Hex environment, so there is no need to install them ourselves. You can see the list of pre-installed packages by clicking on the Environment icon on the right-hand side.

Environment & Packages

Create two new SQL cells that contain the SELECT * queries above and execute them. They may take a little while to run, as they are returning the full tables into the Hex R environment. In my case lineitem and orders took 30 and 7 seconds respectively, but now we have the data in R's memory and subsequent local processing within Hex should be much faster. Rename the dataframe outputs from the cells to lineitem and orders.

Our next R cell introduces dplyr syntax and constructs the ‘sub-query’ against the orders table to filter by order status.

library(dplyr) # Load the dplyr package for dataframe manipulations

system.time(

  # `orders` references the SQL cell output we created above
  dplyr_R_df_o <- orders %>%
    select(O_ORDERKEY, O_ORDERSTATUS) %>%
    filter(O_ORDERSTATUS == "O")

)

Note: We are referencing the orders dataframe that was returned via the SQL cell.

dplyr’s syntax is similar to SQL, and the pipe (%>%) operator enables us to chain operations together, passing the dataframe output of one operation into the input of the next. This enables nice composability and testing for data pipelines. We’ve wrapped the code with the system.time function so we can capture the processing time for the operation (0.1 seconds). We could have written this filter as SQL and run it as a SQL cell, instead of our SELECT * FROM orders. That would have reduced the volume of data returned to Hex, but perhaps we know we will need the full orders data for additional processing in R within the notebook, in which case it makes sense to return it in full. We can intermingle SQL and R code here depending on what we want to achieve, but a good principle is to only return to R what you know you need for subsequent processing within the notebook!
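For example, a sketch of that equivalent filter written as a SQL cell, reusing the schema templating variable from earlier, would look something like:

SELECT o_orderkey, o_orderstatus
FROM identifier({{ schema + ".orders" }})
WHERE o_orderstatus = 'O';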

We will use the dataframe dplyr_R_df_o that we just created above, join it with the lineitem dataframe, and produce the aggregated result. Create another R cell and add the following code.

system.time(

  dplyr_R_df_tpch_result <- lineitem %>%
    select(
      L_ORDERKEY, L_SHIPDATE, L_RETURNFLAG, L_LINESTATUS, L_QUANTITY,
      L_EXTENDEDPRICE, L_DISCOUNT, L_TAX
    ) %>%
    filter(L_SHIPDATE <= (as.Date("1998-12-01") - 90)) %>%
    inner_join(dplyr_R_df_o, by = c("L_ORDERKEY" = "O_ORDERKEY")) %>%
    select(L_RETURNFLAG, L_LINESTATUS, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX) %>%
    group_by(L_RETURNFLAG, L_LINESTATUS) %>%
    summarise(
      sum_qty = sum(L_QUANTITY),
      sum_base_price = sum(L_EXTENDEDPRICE),
      sum_disc_price = sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)),
      sum_charge = sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)),
      avg_qty = mean(L_QUANTITY),
      avg_price = mean(L_EXTENDEDPRICE),
      avg_disc = mean(L_DISCOUNT),
      count_order = n(),
      .groups = "drop"
    ) %>%
    ungroup() %>%
    arrange(L_RETURNFLAG, L_LINESTATUS) %>%
    collect()

)

This takes 1.8 seconds to execute in my environment. The code should be relatively easy to understand, as the function names are self-explanatory. It also looks similar to Snowflake's Snowpark Python API, if you're familiar with that!

We can create another R cell to return and inspect the result, e.g.

dplyr_R_df_tpch_result

It should be identical to our earlier SQL cell based result.

This will work for smaller datasets, in this case the smaller TPCH scale factors where the result sets fit in the memory of the Hex workspace, but it will not work for much larger datasets, and it might not be efficient to return and process large datasets in R. My notebook is running on a Medium Hex compute, which provides 8GB of RAM and a single CPU. We could scale this up to an Extra Large with 32GB of RAM and 4 CPUs, but at some point we will reach the memory and compute limits of a single Hex notebook environment if our data is large enough. We can also see from this example that the query which took around 4 seconds as SQL takes a combined total of around 40 seconds to run in R, including the time to return the needed Snowflake source data into R dataframes: roughly 10x slower than using Snowflake SQL to process the result, even at the lowest TPCH scale factor.
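If you want to gauge how much of that notebook memory the returned tables are occupying, base R's object.size gives a rough indication, for example:

# approximate in-memory size of the dataframes returned from Snowflake
format(object.size(lineitem), units = "MB")
format(object.size(orders), units = "MB")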

What if I prefer to write my data analysis and transformation code in R, but get the performance and scalability advantages of running the processing in Snowflake? This is the joy of the dplyr package ecosystem. With some minor changes we can use dplyr syntax and push down processing as SQL into Snowflake. Let’s see…

Processing Data in Snowflake with Hex and R

With dplyr's add-on dbplyr package we can connect to Snowflake via ODBC, using R's odbc package, and implicitly convert our R-based data transformations to SQL that runs in Snowflake. First we load the necessary packages.

library(odbc)
library(dbplyr)

We now need to connect to Snowflake via ODBC. The Snowflake ODBC driver is pre-installed in the Hex environment, which saves us the steps of installing and configuring it ourselves. The following R code creates the connection:

# Adjust the settings below for your Snowflake account, user, warehouse etc.
server    <- "DEMO.SNOWFLAKECOMPUTING.COM"
database  <- "SNOWFLAKE_SAMPLE_DATA"
schema    <- "TPCH_SF1"
role      <- "ACCOUNTADMIN"
warehouse <- "XS"

## Obtain ODBC Connection
odbcCon <- dbConnect(
  odbc::odbc(),
  .connection_string = sprintf(
    "Driver={SnowflakeDSIIDriver};server={%s};uid={%s};pwd={%s};database={%s};schema={%s};warehouse={%s};role={%s}",
    server, uid, pwd, database, schema, warehouse, role
  ),
  timeout = 10
)

The observant among you might notice that we are using two R variables, uid and pwd, that have not previously been defined anywhere. Hex enables us to store secrets and credentials securely and use them within our R session. Click on the Variables icon on the left-hand side of the screen:

Variables

We can now add the two secrets with the same variable names for our user-id uid and password pwd. You can create these as Project or Workspace secrets depending on whether you want to reuse and share them across the Workspace or keep them specific to the project.

Now that we have done this, we can run the cell with the ODBC connection code above and get a database connection, odbcCon. Let's test the connection and also demonstrate that we can run SQL embedded within R cells:

# Check Connection
sqlString <- 'Select current_user(), current_role(),
current_database(), current_schema(),
current_warehouse()'
dbGetQuery(odbcCon, sqlString)

So we have two methods of running SQL within Hex, either directly as SQL cells with a Hex database connection, or using SQL text-strings within R via an odbc connection.

Let's ensure that the Snowflake ODBC session we have just created does not make use of previously executed queries (result caching) by running the following in an R cell. This will ensure that the timings we capture for executing queries and dplyr in-database work are accurate:

dbGetQuery(odbcCon, 'ALTER SESSION SET USE_CACHED_RESULT = FALSE')

Let's run our TPCH SQL query again and capture the timing:

tpch_R_SQL = r'(
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(*) as count_order
FROM
    lineitem JOIN orders ON (l_orderkey = o_orderkey)
WHERE
    l_shipdate <= dateadd(day, -90, to_date('1998-12-01'))
    AND o_orderstatus = 'O'
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,
    l_linestatus;
)'

system.time(

  tpch_R_SQL_result <- dbGetQuery(odbcCon, tpch_R_SQL)

)

It takes 2.5 seconds to run. Almost identical to the timing when running as a SQL cell.

So what do we need to do if we want our dplyr code to execute within Snowflake? Firstly, we need to set up references to our Snowflake tables using our new ODBC connection, which we do with tbl(<connection>, <tablename>) to create tibbles, a refined type of dataframe. Here I have used new variable names so we can illustrate some further points below, but I could have reused the variable names (lineitem and orders) that we used previously, and no further code changes would be required to run our prior dplyr code.

orders_sf_tbl   <- tbl(odbcCon, "ORDERS")
lineitem_sf_tbl <- tbl(odbcCon, "LINEITEM")

Let's get the dimensions of the two different lineitem objects to check they are the same.

cat('Dim of Tibble:', dim(lineitem_sf_tbl), '   Dim of Dataframe:', dim(lineitem))

Note: We are missing the record count (NA) for the Snowflake table! Using dim on database tibbles does not return row counts as the cost of computing them may be large for big data sets or complex query expressions. To get the row count we can instead run:

lineitem_sf_tbl %>% count() %>% as.data.frame()

Note: we have piped the tibble returned from the dplyr count() operation into as.data.frame here so that it is displayed nicely with Hex's dataframe viewer, which expects a dataframe class, not a tibble, to be returned.

Let's run and time the two dplyr transforms we executed previously. We’ve adjusted the dataframe/tibble source and target names as appropriate.

system.time(

  dplyr_snowflake_o <- orders_sf_tbl %>%
    select(O_ORDERKEY, O_ORDERSTATUS) %>%
    filter(O_ORDERSTATUS == "O")

)

system.time(

  dplyr_snowflake_TPCH_Q <- lineitem_sf_tbl %>%
    select(
      L_ORDERKEY, L_SHIPDATE, L_RETURNFLAG, L_LINESTATUS, L_QUANTITY,
      L_EXTENDEDPRICE, L_DISCOUNT, L_TAX
    ) %>%
    filter(L_SHIPDATE <= (as.Date("1998-12-01") - 90)) %>%
    inner_join(dplyr_snowflake_o, by = c("L_ORDERKEY" = "O_ORDERKEY")) %>%
    select(L_RETURNFLAG, L_LINESTATUS, L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX) %>%
    group_by(L_RETURNFLAG, L_LINESTATUS) %>%
    summarise(
      sum_qty = sum(L_QUANTITY),
      sum_base_price = sum(L_EXTENDEDPRICE),
      sum_disc_price = sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT)),
      sum_charge = sum(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)),
      avg_qty = mean(L_QUANTITY),
      avg_price = mean(L_EXTENDEDPRICE),
      avg_disc = mean(L_DISCOUNT),
      count_order = n(),
      .groups = "drop"
    ) %>%
    ungroup() %>%
    arrange(L_RETURNFLAG, L_LINESTATUS)

)

Note: We have removed the collect() from the second cell's statement and output the result to dplyr_snowflake_TPCH_Q. This creates a reference to the operation but, due to dplyr's lazy evaluation, does not run the query in Snowflake yet. Hence, the timing for these two operations is fast: 0.2 and 0.268 seconds respectively. We need to pipe to collect() or another materialisation method to execute the query. For example:

system.time(

  dplyr_snowflake_tpch_result <- dplyr_snowflake_TPCH_Q %>% collect()

)

This takes circa 2 seconds to run and returns the result into dplyr_snowflake_tpch_result, showing that we can use R's dplyr syntax whilst still getting the performance of Snowflake for processing. When we return the result we can see it is identical to our results from the SQL and prior dplyr operations:

dplyr_snowflake_tpch_result

We can also see the generated SQL that gets executed in Snowflake, if we need to, using:

dplyr_snowflake_TPCH_Q %>% show_query()

Which returns:

<SQL>
SELECT
  "L_RETURNFLAG",
  "L_LINESTATUS",
  SUM("L_QUANTITY") AS "sum_qty",
  SUM("L_EXTENDEDPRICE") AS "sum_base_price",
  SUM("L_EXTENDEDPRICE" * (1.0 - "L_DISCOUNT")) AS "sum_disc_price",
  SUM(("L_EXTENDEDPRICE" * (1.0 - "L_DISCOUNT")) * (1.0 + "L_TAX")) AS "sum_charge",
  AVG("L_QUANTITY") AS "avg_qty",
  AVG("L_EXTENDEDPRICE") AS "avg_price",
  AVG("L_DISCOUNT") AS "avg_disc",
  COUNT(*) AS "count_order"
FROM (
  SELECT
    "L_RETURNFLAG",
    "L_LINESTATUS",
    "L_QUANTITY",
    "L_EXTENDEDPRICE",
    "L_DISCOUNT",
    "L_TAX"
  FROM (
    SELECT
      "L_ORDERKEY",
      "L_SHIPDATE",
      "L_RETURNFLAG",
      "L_LINESTATUS",
      "L_QUANTITY",
      "L_EXTENDEDPRICE",
      "L_DISCOUNT",
      "L_TAX"
    FROM "LINEITEM"
    WHERE ("L_SHIPDATE" <= (CAST('1998-12-01' AS DATE) - 90.0))
  ) "LHS"
  INNER JOIN (
    SELECT "O_ORDERKEY", "O_ORDERSTATUS"
    FROM "ORDERS"
    WHERE ("O_ORDERSTATUS" = 'O')
  ) "RHS"
    ON ("LHS"."L_ORDERKEY" = "RHS"."O_ORDERKEY")
) "q01"
GROUP BY "L_RETURNFLAG", "L_LINESTATUS"
ORDER BY "L_RETURNFLAG", "L_LINESTATUS"

The SQL is functionally identical to our original SQL, although sub-queries have been used to separate the query and filter operations on LINEITEM and ORDERS. If you’ve looked at Snowpark and the SQL it generates via the DataFrame.queries method, you will see it constructs its SQL in a similar fashion. We could also go to Snowflake's Snowsight UI and examine the query history and query profile.
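Alternatively, without leaving the notebook, a sketch like the following (using Snowflake's QUERY_HISTORY_BY_SESSION information-schema table function) should show the most recent statements run by this ODBC session, including the query that dplyr generated:

dbGetQuery(odbcCon, "
  SELECT query_text, total_elapsed_time
  FROM table(information_schema.query_history_by_session())
  ORDER BY start_time DESC
  LIMIT 5")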

Grand Finale…

The new R support in Hex, combined with Snowflake and R’s tidyverse packages, provides a flexible development and analysis platform for executing R and SQL code. The built-in support for Snowflake's ODBC driver and the pre-installed R packages make it quick and easy to get started, without having to concern ourselves with infrastructure, package installation or configuration.

With dplyr we get the benefits of working with this popular data-manipulation package, such as data-pipeline composability, whilst retaining the performance and scale advantages of processing data in Snowflake. As the final proof, let's test that out by running the same dplyr pipeline on the largest 1TB TPCH scale-factor dataset, using the following code in an R cell.

dbGetQuery(odbcCon, 'USE SCHEMA TPCH_SF1000')
dbGetQuery(odbcCon, 'ALTER WAREHOUSE XS SET WAREHOUSE_SIZE = LARGE WAIT_FOR_COMPLETION = TRUE')

system.time(
  dplyr_snowflake_tpch_result <- dplyr_snowflake_TPCH_Q %>% collect()
)

dbGetQuery(odbcCon, 'ALTER WAREHOUSE XS SET WAREHOUSE_SIZE = XSMALL WAIT_FOR_COMPLETION = TRUE')
dbGetQuery(odbcCon, 'USE SCHEMA TPCH_SF1')

The code sizes up the virtual warehouse from an XSmall to a Large, increasing the compute resource by a factor of 8x, to query 1000x the data! It sets the schema to the largest scale-factor dataset, runs the dplyr pipeline, and resets everything back once completed. The entire cell runs in under a minute, and the dplyr code takes 47 seconds of that time. Pretty cool: instantly scalable compute for R using Hex and Snowflake.

What Next?

R support is already available in Hex, along with the Snowflake ODBC driver integration and pre-installed R packages. Please give it a try and let us know your feedback as we continue to enhance R support in Snowflake and Hex.

During Snowflake's annual Summit 2023 user conference in the last week of June, we announced Snowpark Container Services in Private Preview, enabling both partners and customers to deploy applications within containers inside the Snowflake service and security perimeter. Hex has been a close partner in this new initiative and also announced its support for this last week. If you're interested in trying out the above examples using Snowpark Container Services with Hex and R, please reach out to your Hex or Snowflake representative to get added to the Private Preview.

In the next blog in this series we will see how we can make use of some of the new Snowflake ML-powered functions, which were also announced at Summit and are available from Snowflake SQL, to perform applied and scalable machine learning in Snowflake from Hex with R.

^1. This data is provided in Snowflake in the snowflake_sample_data database. In the unlikely event that you don't have this database already defined in your account you can add it with these instructions.
