Access IBM Analytics for Apache Spark from RStudio

In this post I will show you how to use the IBM Analytics for Apache Spark service from the RStudio IDE, which is integrated into the Data Science Experience.

RStudio IDE

RStudio is the premier integrated development environment (IDE) for R programmers. Data Science Experience provides a convenient way of loading and executing R scripts.

IBM Analytics for Apache Spark service

The IBM Analytics for Apache Spark service is a managed service that lets you run Spark programs in the cloud.

Running Spark programs from RStudio

RStudio uses the new sparklyr package (http://spark.rstudio.com/index.html) to connect to the Spark kernel gateway in the cloud through the Spark-as-a-Service interactive APIs. The sparklyr package includes a dplyr interface to Spark data frames as well as an R interface to Spark’s distributed machine learning pipelines.

You can use your existing Spark instances from RStudio. To use this feature, complete the following steps:

1) List your available Spark instances.
2) Connect to a selected Spark instance.
3) Run dplyr APIs and Spark’s distributed machine learning libraries.
4) Display tables for Spark-loaded data sets.
5) View logs for Spark kernel interaction.
6) View the Spark connection status and connect or disconnect.

List available Spark instances

When you start RStudio, two files are created in the working directory (don’t delete them!): 
 1) config.yml file — Lists all of your available Spark instances. 
 2) .Rprofile file — Configures your Spark environment.

These files are created under your home directory, /home/rstudio. If your working directory is different from the home directory, you can copy the config.yml and .Rprofile files to your current working directory.

You can list your Spark instances by using the list_spark_kernels() R function and store the result in a variable called kernels. For example:
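
A minimal sketch of that step follows; it assumes list_spark_kernels() is already available in your session, as set up by the .Rprofile described above:

# List the Spark instances available to your account
kernels <- list_spark_kernels()
kernels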

This function lists only your currently available Spark instances. If you need another Spark instance, create it in Data Science Experience.

Connect with the selected Spark instance

To connect to Spark, run the spark_connect R function. For example:

sc <- spark_connect(config = kernels[1])

After this Spark context is created, all subsequent operations are executed on this Spark instance.
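
For example, one quick way to confirm that the connection works is to ask the remote instance for its Spark version (spark_version() is a standard sparklyr utility; this is just an illustrative check):

library(sparklyr)

# Query the Spark version of the connected instance
spark_version(sc)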

Run dplyr APIs and Spark’s distributed machine learning libraries

To run dplyr functions, load the dplyr package and then run the copy_to function using the Spark context. For example:

library(dplyr)

localDF <- data.frame(name = c("John", "Smith", "Sarah", "Mike", "Bob"),
                      age = c(19, 23, 18, 25, 30))

sampletbl <- copy_to(sc, localDF, "sampleTbl")

This creates a Spark data frame on the remote kernel based on the local R data frame, and the local reference appears in the Spark view.
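
As a quick illustration of the dplyr interface on that remote table (a sketch building on the sampletbl reference created above; dplyr verbs are translated to Spark SQL and run on the Spark instance):

# Filter and sort on the Spark side
sampletbl %>%
  filter(age > 20) %>%
  arrange(desc(age))

# collect() pulls the result back into a local R data frame
adults <- sampletbl %>% filter(age > 20) %>% collect()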

View the table for Spark loaded data sets

The Spark view shows all of the remote Spark data frames. You can click the table icon to show a sample view of each table.

View the log for Spark kernel interaction

You can select the Logs icon to view all of the calls to the Spark instance.

View Spark connect status and connect or disconnect a service

You can view the connection status on the Spark View, and you can connect to or disconnect from a Spark service.

Connect

Disconnect
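
You can also manage the connection from code; spark_disconnect() is the standard sparklyr call for this (a sketch reusing the sc and kernels objects from earlier):

# Close the connection to the Spark instance
spark_disconnect(sc)

# Reconnect later by calling spark_connect() again
sc <- spark_connect(config = kernels[1])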

Examples

You can find the example R script files in the /ibm-sparkaas-demos folder under your home directory. These examples demonstrate scenarios you can run with Spark in RStudio.

spark-kernel-basic.R

Creates simple R data frames and generates remote Spark data frames based on the local R data frames. Also runs some basic filters and DBI queries.
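
As a rough sketch of the kind of DBI call that script makes (dbGetQuery() is the standard DBI function supported by sparklyr connections; the table name here is the one registered with copy_to() earlier, not necessarily what the script uses):

library(DBI)

# Run a SQL query against a table registered on the Spark instance
dbGetQuery(sc, "SELECT name, age FROM sampleTbl WHERE age > 20")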

sparkaas_mtcars.R

Loads the popular mtcars R data frame, generates a Spark data frame from it, transforms the data to create a training set, and fits a linear model to that training set.
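
A sketch of that flow, loosely following the mtcars example on the sparklyr site (the split ratio and feature columns are illustrative, not necessarily what the script uses):

# Copy mtcars to the Spark instance
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# Partition into training and test sets on the remote data frame
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 1099)

# Fit a linear model with Spark MLlib through sparklyr
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

summary(fit)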

sparkaas_flights.R

Loads some larger data sets, creates a ggplot of flight delays, and runs window functions. See sparklyr — R interface for Apache Spark for more information.
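
A sketch of the delay plot portion, modeled on the flights example in the sparklyr documentation (the nycflights13 data set and column names are assumptions about what the script uses):

library(ggplot2)
library(nycflights13)

# Copy the flights data to the Spark instance
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# Aggregate average delay per aircraft on the Spark side, then collect locally
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# Plot delay against distance flown
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 0.5) +
  geom_smooth()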

See also the sparklyr examples for more.

Mahesh Kurapati is an Advisory Software Engineer with the IBM Analytics team. Mahesh’s primary focus is on the development of various micro-services for IBM Data Science Experience. Mahesh is involved in the development of various Sparkling.data features and SparkaaS integration with RStudio. With more than 20 years of experience in software development, Mahesh has contributed key functionalities to IBM products including SPSS Statistics, SPSS Modeler, and SPSS Analytics Server.

Originally published at datascience.ibm.com on September 28, 2016.