IBM Watson Studio Spark executions using RStudio IDE

In this post, I will show you how to run Spark executions in IBM Watson Studio from the RStudio IDE, which is launched from a Watson Studio project.

RStudio IDE

Spark service

Launch RStudio IDE

Running Spark programs from RStudio

You can use your existing Spark instances from RStudio. To use this feature, follow these steps:

1. Load and display available Spark instances

2. Connect to a Spark instance

3. Run dplyr APIs and Spark’s distributed machine learning libraries

4. Display tables for Spark loaded data sets

5. View logs for Spark kernel interaction

6. View Spark connection status and connect or disconnect

List available Spark instances

The config.yml and .Rprofile files are created under your home directory, /home/rstudio. If your working directory is different from the home directory, copy these files to your current working directory.
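For example, a minimal sketch from the R console (assuming the files are in /home/rstudio and your session's working directory is elsewhere):

# copy the Spark configuration files into the current working directory
file.copy(from = c("/home/rstudio/config.yml", "/home/rstudio/.Rprofile"),
          to = getwd(), overwrite = TRUE)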

You can load and display Spark instances by using the load_spark_kernels() and display_spark_kernels() R functions.
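For example, a minimal sketch following the same pattern as the full script later in this post (which instances are listed depends on the Spark services associated with your project):

# R interface for Apache Spark
library(sparklyr)

# load the Spark kernels (instances) available to this project
kernels <- load_spark_kernels()

# display a summary of each available Spark kernel
display_spark_kernels()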

These functions list only your currently available Spark instances. If you need another Spark instance, create it in Watson Studio Environments.

Connect to the selected Spark instance

sc <- spark_connect(config = kernels[1])

After the Spark context is created, all subsequent operations are executed on this Spark instance.

Once connected to Spark, the connection status is shown in the Spark view in RStudio.
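You can also confirm the connection from code; for example, sparklyr's spark_version() reports the Spark version of the connected instance:

# query the Spark version of the connected instance to confirm the connection is live
spark_version(sc)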

Run dplyr APIs and Spark’s distributed machine learning libraries

library(dplyr)
# create a small local R data frame
localDF <- data.frame(name = c("John", "Smith", "Sarah", "Mike", "Bob"), age = c(19, 23, 18, 25, 30))
# copy it to the connected Spark instance as the table "sampleTbl"
sampletbl <- copy_to(sc, localDF, "sampleTbl")

This creates a Spark data frame on the remote kernel from the local R data frame, and displays the local reference in the Spark view.
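From here you can manipulate the table with dplyr verbs, and sparklyr's ml_* functions for Spark's distributed machine learning libraries operate on the same table references. A minimal sketch of the dplyr side (sampletbl is the reference created by copy_to() above):

# filter and aggregate with dplyr; the computation is pushed down to the Spark instance
avg_age_tbl <- sampletbl %>%
  filter(age > 20) %>%
  summarise(avg_age = mean(age))

# bring the (small) result back into local R memory
collect(avg_age_tbl)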

View the table for Spark loaded data sets

View the log for Spark kernel interaction

View Spark connection status and connect or disconnect a service

Connect

Disconnect
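The same actions are available from code; a minimal sketch using sparklyr (sc and kernels are the objects created earlier):

# close the connection to the Spark instance
spark_disconnect(sc)

# reconnect later with the same kernel configuration
sc <- spark_connect(config = kernels[1])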

Read Project files in Spark

# R interface for Apache Spark
library(sparklyr)
library(dplyr)

# load the available Spark kernels
kernels <- load_spark_kernels()

# display the available Spark kernels
display_spark_kernels()

# connect to a Spark kernel
sc <- spark_connect(config = kernels[1])

# get the path to the project data asset
path <- get_project_asset_path("airline15krows.csv")

# read the CSV file into Spark using the sparklyr package
airline15krows_tbl <- spark_read_csv(sc, name = "airline15krows", path = path, delimiter = "|", infer_schema = FALSE)

# list all tables registered in Spark
src_tbls(sc)

# preview the first four rows
head(airline15krows_tbl, 4)

Examples

spark-kernel-basic.R

sparkaas_mtcars.R

sparkaas_flights.R

See also the sparklyr Examples for more.

Mahesh Kurapati is an Advisory Software Engineer with the IBM Analytics team. Mahesh’s primary focus is on the development of various micro-services for IBM Data Science Experience. Mahesh is involved in the development of various Sparkling.data features and SparkaaS integration with RStudio. With more than 20 years of experience in software development, Mahesh has contributed key functionalities to IBM products including SPSS Statistics, SPSS Modeler, and SPSS Analytics Server.

Originally published at datascience.ibm.com on September 28, 2016.
