IBM Watson Studio Spark executions using RStudio IDE

Mahesh Kurapati
5 min read · Sep 28, 2016


In this post I will show you how to run Spark executions in IBM Watson Studio using the RStudio IDE, which is launched from Watson Studio projects.

RStudio IDE

RStudio is the premier integrated development environment (IDE) for R programmers. Watson Studio provides a convenient way of loading and executing R scripts.

Spark service

Watson Studio lets you create Spark execution environments inside projects so that you can execute Spark programs in the cloud.

Launch RStudio IDE

Refer to the RStudio launch tutorial. This documentation targets RStudio with projects, which is not yet available in all data centers.

Running Spark programs from RStudio

RStudio uses the new sparklyr package (http://spark.rstudio.com/index.html) to connect to the Spark kernel gateway in the cloud through the Spark-as-a-Service interactive APIs. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark’s distributed machine learning pipelines.

You can use your existing Spark instances from RStudio. To use this feature, follow these steps:

1. Load and display the available Spark instances

2. Connect to a Spark instance

3. Run dplyr APIs and Spark’s distributed machine learning libraries

4. Display tables for Spark-loaded data sets

5. View logs of the Spark kernel interaction

6. View the Spark connection status and connect or disconnect

List available Spark instances

When you start RStudio, two files are created in the working directory (don’t delete them!):
1) config.yml file — Lists all of your available Spark instances.
2) .Rprofile file — Configures your Spark environment.

These files are created under your home directory, /home/rstudio. If your working directory is different from the home directory, copy the config.yml and .Rprofile files to your current working directory.
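For example, you can copy them from the R console (a minimal sketch, assuming the default /home/rstudio home directory):

# copy the Spark configuration files into the current working directory
file.copy(from = c("/home/rstudio/config.yml", "/home/rstudio/.Rprofile"),
          to = getwd())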

You can load and display Spark instances by using the load_spark_kernels() and display_spark_kernels() R functions. For example:
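# load the list of Spark kernels defined in config.yml
kernels <- load_spark_kernels()

# display the available Spark instances
display_spark_kernels()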

These functions list only your currently available Spark instances. If you need another Spark instance, create it in Watson Studio Environments.

Connect with the selected Spark instance

To connect to Spark, run the spark_connect R function. For example:

sc <- spark_connect(config = kernels[1])

After this Spark context is created, all subsequent operations are executed on this Spark instance. Once connected, you can see the connection status in the Spark view.

Run dplyr APIs and Spark’s distributed machine learning libraries

To run dplyr functions, load the dplyr package and then run the copy_to function using the Spark context. For example:

library(dplyr) 
localDF <- data.frame(name=c("John", "Smith", "Sarah", "Mike", "Bob"), age=c(19, 23, 18, 25, 30))
sampletbl <- copy_to(sc, localDF, "sampleTbl")

This creates a Spark data frame on the remote kernel based on the local R data frame, and displays the local reference in the Spark view.
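With the table on the cluster, standard dplyr verbs are translated to Spark SQL and run remotely. A minimal sketch using the sampletbl reference from above:

# filter and aggregate on the Spark side; collect() pulls the result into R
sampletbl %>%
  filter(age > 20) %>%
  summarise(avg_age = mean(age)) %>%
  collect()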

View the table for Spark loaded data sets

The Spark view shows all of the remote Spark data frames. You can click the table icon to see a sample view of each table.

View the log for Spark kernel interaction

You can select the Logs icon to view all of the calls to the Spark instance.

View Spark connect status and connect or disconnect a service

You can view the connection status on the Spark View, and you can connect to or disconnect from a Spark service.

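You can also do this from code with the sparklyr API; for example, to close the connection when you are finished:

# disconnect from the Spark instance
spark_disconnect(sc)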

Read Project files in Spark

RStudio provides the utility function get_project_asset_path() to simplify access to project asset files from Spark jobs. Here is an example that loads a project file into Spark and creates a Spark data frame:

# R interface for Apache Spark
library(sparklyr)

library(dplyr)

# load kernels
kernels <- load_spark_kernels()

# display kernels
display_spark_kernels()

# connect to our spark kernel
sc <- spark_connect(config = kernels[1])

# resolve the path to the project asset file
path <- get_project_asset_path("airline15krows.csv")

# read the CSV into a Spark data frame using sparklyr
airline15krows_tbl <- spark_read_csv(sc, name = "airline15krows", path = path, delimiter = "|", infer_schema = FALSE)

# list all tables registered with the Spark connection
src_tbls(sc)

# preview the first four rows
head(airline15krows_tbl, 4)

Examples

You can find the example R script files in the /ibm-sparkaas-demos folder under your home directory. These examples demonstrate scenarios you can run with Spark in RStudio.

spark-kernel-basic.R

Creates simple R data frames and generates remote Spark data frames based on the local R data frames. Also runs some basic filters and DBI queries.
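As a flavor of those DBI queries, here is a minimal sketch against the connection from earlier (the table name assumes the sampleTbl created above; see the script itself for the real queries):

library(DBI)

# run a SQL query on the remote Spark table; the result is a local R data frame
dbGetQuery(sc, "SELECT name, age FROM sampleTbl WHERE age > 20 LIMIT 3")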

sparkaas_mtcars.R

Loads the popular mtcars R data frame and then generates a Spark data frame for the mtcars data frame. It then does transformations to create a training data set and runs a linear model on the training data set.
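In outline, the script does something like the following. This is a condensed sketch modeled on the sparklyr documentation, not the script verbatim:

library(sparklyr)
library(dplyr)

# copy the mtcars data set to Spark
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# partition into training and test sets
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 1099)

# fit a linear model on the training partition
fit <- partitions$training %>%
  ml_linear_regression(mpg ~ wt + cyl)

summary(fit)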

sparkaas_flights.R

Loads some larger data sets, creates a ggplot of flight delays, and runs window functions. See the sparklyr site (R interface for Apache Spark) for more information.
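The delay analysis follows the well-known sparklyr flights example. A sketch, assuming a flights_tbl created with copy_to(sc, nycflights13::flights, "flights"):

library(dplyr)
library(ggplot2)

# summarize per-plane delays on the Spark side, then collect the small result
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot distance against average delay locally
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth()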

For more examples, see the sparklyr examples page.

Mahesh Kurapati is an Advisory Software Engineer with the IBM Analytics team. Mahesh’s primary focus is on the development of various micro-services for IBM Data Science Experience. Mahesh is involved in the development of various Sparkling.data features and SparkaaS integration with RStudio. With more than 20 years of experience in software development, Mahesh has contributed key functionalities to IBM products including SPSS Statistics, SPSS Modeler, and SPSS Analytics Server.

Originally published at datascience.ibm.com on September 28, 2016.
