Read and Write Data To and From Bluemix Object Storage in RStudio

In this article, you will learn how to bring data from Bluemix Object Storage into RStudio on Data Science Experience, and write data from RStudio back to Bluemix Object Storage, using sparklyr and ibmos2sparklyr to work with Spark.

Using sparklyr and ibmos2sparklyr (Stocator’s swift2d) to work with Spark

First, connect to the Spark service using sparklyr’s spark_connect function. You can refer to this post for details.

# Connect to Spark
library(sparklyr)
library(dplyr)
sc <- spark_connect(config = "Apache Spark-ic")
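
Optionally, you can confirm that the connection succeeded before going further. The sketch below uses two standard sparklyr helpers as a quick sanity check.

# Optional: verify the connection is open and check the Spark version
connection_is_open(sc)
spark_version(sc)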

Install and load the ibmos2sparklyr library to connect to Object Storage.

library(devtools) 
devtools::install_url("https://github.com/ibm-cds-labs/ibmos2spark/archive/0.0.7.zip", subdir = "r/sparklyr/", dependencies = FALSE)
library(ibmos2sparklyr)

Next, set up credentials for Object Storage. Get the credentials generated from a Jupyter notebook, or replace the values in the following snippet with your own Object Storage credentials.

creds <- list(auth_url = "https://identity.open.softlayer.com",
              project = "object_storage_216c032f_3f57_4763_ae97_5c6a83a0d523",
              project_id = "e097bbd898534ed1ad0e45c82baedb2d",
              region = "dallas",
              user_id = "36d94b2086de4caf8852289eb4594691",
              domain_id = "da5b6dd1c8374f67b1050172badbef8c",
              domain_name = "837523",
              username = "member_0d7d372387d59ef4ae1a08ae3d74dc955ef9c38b",
              password = "XXXXXX",
              container = "testObjectStorage")

The configuration name can be any name you like, which allows for multiple configurations. Call bluemix() to set your Object Storage credentials in the Hadoop configuration. Note that the third argument is the credentials list initialized above.

configurationname = "keystone"
bmconfig = bluemix(sparkcontext=sc, name=configurationname, credentials = creds)
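
Because each configuration is registered under its own name, you can set up more than one against the same Spark context. The sketch below assumes a hypothetical second credentials list, creds2, for another Object Storage instance.

# Hypothetical second configuration; creds2 would be a list built
# the same way as creds above.
bmconfig2 = bluemix(sparkcontext=sc, name="keystone2", credentials = creds2)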

Use sparklyr’s spark_read_csv function to read from IBM Bluemix Object Storage into the Spark context in RStudio. The first argument is the Spark connection you created. The second argument is the name of the table you can refer to within Spark. The third argument is the path to your Object Storage file, which you can generate with bmconfig$url(). You can also pass the repartition argument to parallelize reads.

# Change the container name if you want to use
# another existing container.
container = creds$container
objecttoread = "Advertising.csv"
sparkobject_name = "dataFromSwift"
data = sparklyr::spark_read_csv(sc, sparkobject_name, bmconfig$url(container, objecttoread))
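
To parallelize the read, pass the repartition argument mentioned above. This is a sketch; the partition count of 4 is an arbitrary choice you would tune to your cluster.

# Same read, but ask Spark to repartition the data into 4 partitions
data_par = sparklyr::spark_read_csv(sc, "dataFromSwiftPar",
                                    bmconfig$url(container, objecttoread),
                                    repartition = 4)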

Use src_tbls to check that the table was read into Spark. Use head to preview the dataframe.

src_tbls(sc)
head(data,4)
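
Because the table is registered in Spark, you can also query it with dplyr verbs, which sparklyr translates to Spark SQL. The column names TV and Sales below are assumptions based on the classic Advertising dataset; adjust them to match your file.

# dplyr verbs run against Spark; TV and Sales are assumed column names
tbl(sc, "dataFromSwift") %>%
  filter(TV > 100) %>%
  summarise(avg_sales = mean(Sales))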

You can also write the dataframe back to Object Storage using spark_write_csv.

# Write using the sparklyr package
objecttowrite = "OutputAdvertisement.csv"
sparklyr::spark_write_csv(data, bmconfig$url(container, objecttowrite))
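
As a quick check, you can read the object you just wrote back into a new Spark table, reusing the configuration from above.

# Read the freshly written file back into Spark to verify the write
check = sparklyr::spark_read_csv(sc, "dataWrittenToSwift",
                                 bmconfig$url(container, objecttowrite))
head(check, 4)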

For more information, see ibmos2sparklyr.