How to create cross-AWS account RStudio connectivity with Amazon S3 & Redshift

Published in Cognizant AI · Jun 1, 2021

By Syed Abdul Sathar Syed Allaudeen, Data Scientist & AI Lead

Establishing connectivity from RStudio to Amazon S3 and Amazon Redshift is essential for getting the best ROI from existing analytical investments (R models) and for staying relevant with the technology shift to AWS (S3/Redshift).

R is one of the oldest and most powerful languages/environments for statistical computing and graphics. Many companies have developed predictive models using R and continue to use them in their operations. As part of cloud migration, these models need to read and write data from Amazon S3 and Amazon Redshift for training, predicting and storing the results for downstream applications such as Tableau, Power BI and web apps. When it comes to displaying meaningful and actionable results with sub-second response, a cloud-based data warehouse like Redshift is critical.

This document outlines the steps required, along with the necessary libraries and code chunks, to establish connectivity from RStudio to S3 (using AWS IAM roles) and to Redshift (using the JDBC driver).

The following steps are also applicable when the S3/Redshift and RStudio are in two different AWS accounts.

NOTE: Please ensure you have a cross-account role that allows access to the S3 buckets from RStudio.

Step 1: Environment Setup

a. Install the following packages:

install.packages("RJDBC")
install.packages("devtools")
devtools::install_github("RcppCore/Rcpp")
devtools::install_github("rstats-db/DBI")
install.packages("aws.s3")
devtools::install_github("sicarul/redshiftTools")
install.packages("RPostgres")
install.packages("aws.signature")
install.packages("RAthena")

b. Load the following packages:

library(RJDBC)
library(aws.s3)
library(redshiftTools)
library(RPostgres)
library(aws.signature)
library(RAthena)

You may see warnings when these packages are loaded; they can be safely ignored.

Step 2: Secure the credentials and set environment variables

a. Secure the credentials to access the S3 bucket “my-test-bkt” using the role “ReadWrite-myrole” created for RStudio.

s3_role <- assume_role(profile_name = NULL,
                       region_name = "us-east-1",
                       role_arn = "arn:aws:iam::<AWSAccountID>:role/ReadWrite-myrole",
                       role_session_name = sprintf("RAthena-session-%s", as.integer(Sys.time())),
                       duration_seconds = 3600L,
                       set_env = FALSE)

b. Set the environment variables for the R session to establish the connection with S3 using the variable “s3_role” created above.

Sys.setenv("AWS_ACCESS_KEY_ID" = s3_role$AccessKeyId,
           "AWS_SECRET_ACCESS_KEY" = s3_role$SecretAccessKey,
           "AWS_DEFAULT_REGION" = "us-east-1",
           "AWS_SESSION_TOKEN" = s3_role$SessionToken)

Step 3: Read, write & append data in S3

a. Read the file “Sales_Syed.csv” available inside the S3 bucket “my-test-bkt” using the environment variables defined above.

temp_sales <- s3read_using(FUN = read.csv, bucket = "my-test-bkt",
                           object = "Sales_Syed.csv")
nrow(temp_sales)
head(temp_sales)

b. Write the data (dataframe) above as a new file in the same S3 bucket.

s3write_using(temp_sales, FUN = write.csv, row.names = FALSE,
bucket = "my-test-bkt",
object = "data_write_new_append_Role_R.csv")

c. Read the file to verify the row count and contents.

new_data <- s3read_using(FUN = read.csv, bucket = "my-test-bkt",
                         object = "data_write_new_append_Role_R.csv")
nrow(new_data)
head(new_data)

d. Since S3 does not allow objects to be modified in place, an "append" must be simulated with a read, append and replace workaround: read the existing file into a dataframe, append the new data within the R session, and write the combined dataframe back to S3, replacing the old file.

append_data = new_data
append_data$orderid = append_data$orderid + 100
full_data = rbind(new_data, append_data)
nrow(full_data)
s3write_using(full_data, FUN = write.csv, row.names = FALSE,
bucket = "my-test-bkt",
object = "data_write_new_append_Role_R.csv")

e. Read the appended file to verify the row count and contents.

new_data <- s3read_using(FUN = read.csv, bucket = "my-test-bkt",
                         object = "data_write_new_append_Role_R.csv")
nrow(new_data)

As expected, the appended file has 20 records — 10 old & 10 new

Step 4: Connect to Redshift from RStudio

a. Download the Redshift JDBC driver by executing the code below. The right version of the JDBC driver is essential for the RJDBC library to work.

download.file('http://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.9.1009.jar',
              'RedshiftJDBC41-1.1.9.1009.jar')

b. Create a JDBC connection using the code snippet below. Replace the placeholders within “<>” with values specific to your environment.

driver <- JDBC("com.amazon.redshift.jdbc41.Driver",
               "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote = "`")
url <- "<JDBCURL>:<PORT>/<DBNAME>?user=<USER>&password=<PW>"
conn <- dbConnect(driver, url)

Step 5: Read Redshift tables from RStudio

Once the connection is established, read one of the tables available in “SYED_TEST” schema.

temp_sales_rs = dbGetQuery(conn, "SELECT * FROM SYED_TEST.TEMP_SALES")
nrow(temp_sales_rs)
head(temp_sales_rs, 3)

Step 6: Write the data from RStudio to Redshift

The write scenario can be classified into three categories:

  1. Record-by-record insert
  2. Bulk data insert
  3. CDC insert, i.e. inserting only changed data

Scenario-1: Record-by-record insert

The code snippet below can be used to insert data into Redshift one record at a time.
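
A minimal sketch of such an insert uses RJDBC's dbSendUpdate with a parameterized INSERT over the same JDBC connection. The column names orderid and amount are assumptions for illustration only; replace them with the actual columns of your target table.

# Record-by-record insert via a parameterized INSERT statement.
# NOTE: the columns (orderid, amount) are assumed for illustration only.
for (i in seq_len(nrow(temp_sales_rs))) {
  dbSendUpdate(conn,
               "INSERT INTO SYED_TEST.TEMP_SALES (orderid, amount) VALUES (?, ?)",
               temp_sales_rs$orderid[i],
               temp_sales_rs$amount[i])
}

This works for a handful of rows, but it makes one round trip per record, which is why the bulk options below are preferred for larger dataframes.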

Scenario-2: Bulk data insert

As record-by-record insert is not always practical, it's essential to find a way to bulk load data into Redshift tables. Unfortunately, JDBC does not support bulk load directly, and bulk loading over an ODBC connection is slow and not recommended. Our recommended approach for bulk loading data into Amazon Redshift from RStudio is as follows.

  1. Split the dataframe into multiple smaller chunks & write small temporary files to S3.
  2. Copy all the smaller files from S3 to Redshift using the COPY command.
  3. Delete all the temporary files from S3.

Note: All of the above steps can be achieved through the “redshiftTools” library along with “RPostgres”.

my_data = dbGetQuery(conn, "SELECT * FROM TAB_TEST.TEMP_SALES")
head(my_data)
my_data$orderid = my_data$orderid + 1
rs_replace_table(my_data, dbcon = conn,
                 table_name = "SYED_TEST.TEMP_SALES",
                 bucket = "my-test-bkt")

Scenario-3: Inserting only changed records (CDC insert)

There could be situations where only new or changed records need to be applied to the table: rows whose keys already exist are replaced, and rows with new keys are inserted. In that case, use the following code.

my_data = dbGetQuery(conn, "SELECT * FROM SYED_TEST.TEMP_SALES")
head(my_data)
my_data$orderid = my_data$orderid + 1
rs_upsert_table(my_data, dbcon = conn,
                table_name = "SYED_TEST.TEMP_SALES",
                bucket = "my-test-bkt", keys = c("orderid"))

Finding an efficient way for R programs to interact with cloud data sources like Amazon S3 and Amazon Redshift is essential to keep existing R-based machine learning programs and models working and delivering value.

Organizations are already reaping the benefits of cloud migration through reduced infrastructure costs. By following the steps explained above, teams with existing R-based analytical investments can fast-track their migration, save cost, and continue to get ROI from their R-based analytical programs and models.

About the Author

Syed Abdul Sathar Syed Allaudeen is an experienced data scientist and analytics project manager with a demonstrated history of working in the information technology and services industry. You can reach him on LinkedIn.
