Analyze Data in R with Posit Workbench and Snowflake

Using the Snowflake Native Posit Workbench App

Architecture diagram for the Snowflake Native Posit Workbench App

This guide demonstrates how we used the Snowflake Native Posit Workbench App to analyze web traffic data, resulting in a Quarto report and Shiny app. We will cover how to:

  • Open an RStudio Pro IDE R session from within Snowflake, so your data never leaves Snowflake as you analyze it.
  • Handle Snowflake credentials automatically in the native app, making it easy to connect to Snowflake from R.
  • Connect to a Snowflake database using the {odbc} and {DBI} R packages.
  • Use {dbplyr} to manipulate database tables as if they are normal R data frames.

First, let’s introduce Posit Workbench and Snowflake Native Apps.

💡 Want to follow along? You’ll need to:
1. Add the web traffic data to your Snowflake account (it’s free!)
2. Pull our Quarto file into Posit Workbench.

You can see details of how to do so in our step-by-step Quickstart Guide.

Posit Workbench

Posit Workbench is a development environment that allows you to code in both R and Python using Jupyter, VS Code, or the RStudio IDE.

Learn more about Posit Workbench.

Snowflake Native Apps

A Snowflake Native App is an application installed within Snowflake’s environment. Developers can create Native Apps to make their applications easily consumable by Snowflake users.

The Snowflake Native Posit Workbench App allows users to develop in their preferred IDE in Posit Workbench alongside their Snowflake data, all while adhering to Snowflake’s security and governance protocols.

❄️ Snowflake Native Posit Workbench App

We’ll assume that you already have the Snowflake Native Posit Workbench App installed. If you’d like more details about how to install the app, see our Quickstart Guide or accompanying video.

Now, let’s take a look at the Snowflake Native Posit Workbench App. Here’s what the app looks like when it opens:

The Snowflake Native Posit Workbench App homepage

To launch a new Posit Workbench Session, click New Session.

You’ll then be prompted to select an IDE. We’ll use the RStudio Pro IDE, but you can also use Jupyter Notebook, JupyterLab, or VS Code. You’ll also need to log into your Snowflake account. When the Snowflake authentication is successful, you will be able to start your session.

Log into Snowflake and select an IDE

Once we launch the Workbench session, we’ll see the familiar RStudio IDE. Since the IDE is provided by the Posit Workbench Native App, we’ll be able to conduct our entire analysis securely within Snowflake.

The RStudio IDE inside the native app

🔌 Connect to Snowflake from R

We’ll explore the Cybersyn web traffic data from the Snowflake Marketplace.

We used a Quarto document to analyze the data. You can see our full analysis in our GitHub repository.

First, we need to connect to the data. We’ll use the {DBI} and {odbc} R packages to connect to the database. This lets us write R code in our session and execute it within the Snowflake database.

Build connection code with {DBI}

{DBI} is an R package that allows us to connect to databases with DBI::dbConnect(). First, we need to load the packages, including {DBI}, that we’ll use in our analysis.

# Libraries
library(DBI)
library(dbplyr)
library(odbc)
library(tidyverse)
library(gt)
library(gtExtras)

To connect to our Snowflake database, we’ll use a driver provided by the {odbc} package. We need to provide a warehouse for compute and a database to connect to. We also provide a schema here to make connecting to specific tables and views easier.

# Connect to the database
conn <-
  DBI::dbConnect(
    drv = odbc::snowflake(),
    warehouse = "DEVREL_WH_LARGE",
    database = "WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL",
    schema = "CYBERSYN"
  )

📦 The Snowflake driver is available with the latest version of the {odbc} package. See the Quickstart Guide for more details.

Now, our connection is stored as the object conn.

Notice that we didn’t need to provide any credentials in the connection code. When we logged into Snowflake before launching the session, we provided Workbench with our credentials. Now, Posit Workbench manages the credentials for us.

Once we build a connection, we can see the databases, schemas, and tables available to us in the RStudio IDE Connections pane.

The Connections pane in the RStudio IDE
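
We can also explore the connection from the console. Here’s a couple of quick checks with {DBI}; both run against Snowflake, and nothing is downloaded beyond the rows we ask for:

# List the tables and views visible through this connection
DBI::dbListTables(conn)

# Preview a few rows with raw SQL
DBI::dbGetQuery(conn, "SELECT * FROM WEBTRAFFIC_SYNDICATE_TIMESERIES LIMIT 5")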

Create tbls that correspond to tables in the database

Now that we’ve built a connection, we can use dplyr::tbl() to create tbls. A tbl is an R object that represents a table or view accessed through a connection.

The following code creates tbls from two of the web traffic database’s views.

timeseries <- tbl(conn, "WEBTRAFFIC_SYNDICATE_TIMESERIES")
attributes <- tbl(conn, "WEBTRAFFIC_SYNDICATE_ATTRIBUTES")
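
Nothing is downloaded when we create a tbl; it is a lazy reference to the view. A quick sanity check runs in the database rather than in R:

# Peek at a few rows (translates to a LIMIT query in Snowflake)
timeseries |> head(5)

# Count the rows without pulling them into R
timeseries |> count()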

🔁 Use {dbplyr} to translate R to SQL

Now, we can use the {dbplyr} package to query the data with R, without having to write raw SQL. Let’s take a look at how this works.

{dbplyr} lets you manipulate a tbl with {dplyr} code, so you can use familiar {dplyr} functions with database tables.

For example, we can filter rows and select columns from our data.

facebook <-
  timeseries |>
  filter(
    DOMAIN_ID == "facebook.com",
    lubridate::month(DATE) == 5
  ) |>
  select(DATE, VARIABLE_NAME, VALUE)

When we use {dplyr} functions with tbls, {dbplyr} translates the R code into SQL queries. Think of any object made downstream from a tbl like a view: it contains SQL that represents a table built from the tbl. If we want to see the SQL code that {dbplyr} generates, we can run dbplyr::show_query().

facebook |>
  show_query()
SELECT "DATE", "VARIABLE_NAME", "VALUE"
FROM "WEBTRAFFIC_SYNDICATE_TIMESERIES"
WHERE ("DOMAIN_ID" = 'facebook.com') AND (EXTRACT('month', "DATE") = 5.0)

To save compute, R waits to execute a query until we request the data that it represents.

To save memory, the results stay in the database (instead of in R’s global environment). When we inspect the data, R retrieves only the first few rows to display.

facebook
# Source:   SQL [?? x 3]
# Database: Snowflake 8.16.0[@Snowflake/WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL]
   DATE       VARIABLE_NAME                                                           VALUE
   <date>     <chr>                                                                   <dbl>
 1 2021-05-02 Sessions (Count) | Week | All Devices | Model version 2024.01.25  10016599466
 2 2021-05-09 Sessions (Count) | Week | All Devices | Model version 2024.01.25  10024864464
 3 2021-05-16 Sessions (Count) | Week | All Devices | Model version 2024.01.25   9948687298
 4 2021-05-23 Sessions (Count) | Week | All Devices | Model version 2024.01.25   9913216616
 5 2021-05-30 Sessions (Count) | Week | All Devices | Model version 2024.01.25   9886247565
 6 2022-05-01 Sessions (Count) | Week | All Devices | Model version 2024.01.25  12502175558
 7 2022-05-08 Sessions (Count) | Week | All Devices | Model version 2024.01.25  12140393116
 8 2022-05-15 Sessions (Count) | Week | All Devices | Model version 2024.01.25  12446181861
 9 2022-05-22 Sessions (Count) | Week | All Devices | Model version 2024.01.25  12436995354
10 2022-05-29 Sessions (Count) | Week | All Devices | Model version 2024.01.25  12418458176
# ℹ more rows
# ℹ Use `print(n = ...)` to see more rows
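
When we do want the full result in R, for example to hand it to a package that needs a local data frame, we ask for it explicitly with dplyr::collect(). (The name facebook_local is introduced here just for illustration.)

# Execute the query and copy the result into R's memory as a tibble
facebook_local <- facebook |> collect()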

In summary

This {dbplyr} workflow:

1. Keeps our data in the database, saving memory in the R session
2. Pushes computations to the database, saving compute in the R session
3. Evaluates queries lazily, saving compute in the database

Among other things, this allows us to work with much larger data sets than we’d typically load into R. We don’t need to manage the process; it happens automatically behind the scenes.

Learn more about {dbplyr} at dbplyr.tidyverse.org.

📊 Create plots and tables

Now, let’s create a visualization and table to display aspects of the web traffic data.

Visualize Data with {ggplot2}

We can use {ggplot2} to visually compare the web traffic of two domains over time. Here we can see that, since 2021, airbnb.com has always been more popular than vrbo.com, but page views for both really started to take off in 2024.
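
Note that the plotting code below uses lowercase date, domain_id, and pageviews columns, while the raw view stores each measure as a VARIABLE_NAME/VALUE pair. Here’s a minimal sketch of that reshaping step, assuming a recent {dbplyr} that translates str_detect() and tidyr::pivot_wider() for lazy tables (the VARIABLE_NAME patterns are assumptions based on the output shown earlier):

# Reshape the raw view so each measure is its own column (runs in-database)
timeseries <-
  timeseries |>
  filter(str_detect(VARIABLE_NAME, "All Devices")) |>
  mutate(
    measure = case_when(
      str_detect(VARIABLE_NAME, "Pageviews") ~ "pageviews",
      str_detect(VARIABLE_NAME, "Users") ~ "users",
      str_detect(VARIABLE_NAME, "Sessions") ~ "sessions"
    )
  ) |>
  rename(domain_id = DOMAIN_ID, date = DATE) |>
  pivot_wider(
    id_cols = c(domain_id, date),
    names_from = measure,
    values_from = VALUE
  )

With the data in this shape, the comparison plot is: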

domains <- c("airbnb.com", "vrbo.com")

timeseries |>
  filter(domain_id %in% domains) |>
  ggplot(aes(date, pageviews, color = domain_id)) +
  geom_line() +
  scale_y_log10() +
  theme_minimal() +
  theme(legend.position = "bottom") +
  labs(
    x = "",
    y = "",
    color = "",
    title = "Pageviews"
  )

Pageview comparison for airbnb.com and vrbo.com

📘 Take a look at the Quickstart Guide for a more in-depth analysis of the web traffic data.

Make publication-ready tables with {gt}

We can use the {gt} package to create a polished table that displays elements of the web traffic data.

First, we need to prepare a summary of the metrics by domain, collected into a local tibble so that {gt} can work with it.

comparison <-
  timeseries |>
  group_by(domain_id) |>
  summarize(
    across(
      c("pageviews", "users", "sessions"),
      \(x) median(x, na.rm = TRUE),
      .names = "avg_{.col}"
    ),
    .groups = "drop"
  ) |>
  # Bring the summarized rows into R; {gt} needs a local data frame
  collect() |>
  arrange(desc(avg_pageviews))

Now, we can build our {gt} table.

comparison |>
  gt(rowname_col = "domain_id") |>
  fmt_number(suffixing = TRUE, decimals = 0) |>
  cols_label(
    avg_pageviews = "Pageviews",
    avg_users = "Users",
    avg_sessions = "Sessions"
  ) |>
  gt_highlight_rows(
    rows = (domain_id == "airbnb.com"),
    fill = "#ffce67"
  ) |>
  gt_highlight_rows(
    rows = (domain_id == "vrbo.com"),
    fill = "#78c2ad"
  )

{gt} table with highlighted rows

Learn more about {gt} at https://gt.rstudio.com/.

Now that we’ve accumulated some insights, let’s think about how we might present the results of our data analysis to our colleagues.

📄 Build Reports and Dashboards with Quarto

Conveniently, we’ve already written our analysis in a Quarto document. Quarto is an open-source publishing system that makes it easy to create data products such as documents, presentations, dashboards, websites, and books.

By placing our work in a Quarto document, we interweave our code, results, and prose into a single literate programming document. This way, everything travels together in one reproducible data product.

We can render this data product into any of the asset types above with a simple command. In the Quickstart Guide, we render our work into an HTML report, shown below.

Rendered Quarto report
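
From R, rendering is a one-liner via the {quarto} package (the filename below is a placeholder for your own document):

# Render the Quarto document to an HTML report (placeholder filename)
quarto::quarto_render("web-traffic-analysis.qmd", output_format = "html")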

Learn More About Quarto

Learn more about Quarto at quarto.org, where you can find documentation for all of its output formats. Quarto works with R, Python, and Observable JavaScript code out of the box, and it’s a great tool for communicating your data science analyses.

✨ Shiny Application

Our visual analysis compared two specific websites, airbnb.com and vrbo.com, but we could reuse the code to compare any combination of websites. An efficient way to do this is with an interactive Shiny app.

Take a look at our example Shiny app in our GitHub repository.
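
The core of such an app is small. Here’s a minimal sketch, not our actual app; it assumes the reshaped web traffic data has been collected into a local data frame named traffic with date, domain_id, and pageviews columns:

library(shiny)
library(tidyverse)

# traffic: assumed local data frame with date, domain_id, and pageviews
ui <- fluidPage(
  selectInput(
    "domains", "Domains",
    choices = sort(unique(traffic$domain_id)),
    selected = c("airbnb.com", "vrbo.com"),
    multiple = TRUE
  ),
  plotOutput("pageviews")
)

server <- function(input, output) {
  output$pageviews <- renderPlot({
    traffic |>
      filter(domain_id %in% input$domains) |>
      ggplot(aes(date, pageviews, color = domain_id)) +
      geom_line() +
      scale_y_log10() +
      theme_minimal() +
      theme(legend.position = "bottom")
  })
}

shinyApp(ui, server)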

Shiny app displaying the web traffic data

Learn More About Shiny

You can learn more about Shiny at https://shiny.posit.co/. This example uses Shiny for R, but Shiny for Python is also available!

If you’re new to Shiny, you can try it online with Shinylive. It, too, comes in a Python version.

✅ Conclusion

R is beloved by data scientists for its intuitive, concise syntax. You can now combine this syntax with the power and peace of mind of Snowflake. The Posit Workbench Native Application provides an IDE for R within Snowflake. You can then use R’s existing database packages — {DBI}, {odbc}, {dbplyr} — to access your Snowflake databases.

To learn more, take a look at our step-by-step Quickstart Guide or the accompanying video.
