Efficient way to connect to Object storage in IBM Watson Studio — Spark Environments

Rachit Arora
3 min read · Dec 7, 2018


Watson Studio has integrated with a number of Spark engines since it was first released as Data Science Experience in 2016, including IBM Apache Spark as a Service, IBM Analytics Engine, and AWS EMR. These services are still integrated with Watson Studio, but Watson Studio now also supports a new Spark engine that is available by default for all Watson Studio users. If you want to try Spark environments, you can get more details here.

Watson Studio Spark environments offer many benefits:

  • Spark kernels on-demand — save time and energy so you can focus on your analysis; create a Spark environment in Watson Studio and launch directly into a notebook.
  • Configurable, elastic compute — configure your Spark environment and choose your kernel hardware configurations from Watson Studio.
  • Easily share your environment — Spark environments are project assets, so they can easily be used by your collaborators.
  • Multiple language support — choose from the most popular languages for your Spark kernels (Python 2, Python 3, R, Scala).

It is recommended to use an object storage service such as IBM Cloud Object Storage to store your data. This service lets you store, manage, and access your data through a self-service portal and RESTful APIs, and lets applications connect directly to object storage while integrating with other IBM Cloud services. Cloud Object Storage makes it possible to store practically limitless amounts of data, simply and cost effectively. It is commonly used for data archiving and backup, for web and mobile applications, and as scalable, persistent storage for analytics. Flexible storage class tiers with a policy-based archive let you manage costs effectively while meeting data access needs.

Spark environment users can bring their data from IBM Cloud Object Storage (COS) for analytics. There are many connectors and ways to create a DataFrame directly from objects stored in COS or to persist data back to COS.

One of the optimized connectors to use is the Stocator library.
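
In Watson Studio Spark environments, Stocator is already configured as the handler for the cos:// scheme, so the examples below work out of the box. If you run Spark elsewhere, the minimal sketch below shows the Hadoop properties that wire Stocator in; the property and class names are taken from the Stocator project documentation, so double-check them against the version you use.

# Only needed when Stocator is NOT preconfigured (Watson Studio Spark
# environments already ship with it). Property and class names follow
# the Stocator project documentation.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")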

Reading from COS

Here is the sample configuration in Python 3 to read from COS using an access key and secret key:

# Get the Hadoop configuration used by the Spark context
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.servicename.endpoint", "<BUCKET_ENDPOINT>")
hconf.set("fs.cos.servicename.access.key", "****")
hconf.set("fs.cos.servicename.secret.key", "*********")
# Read a CSV object from the bucket into a DataFrame
df = spark.read.csv("cos://<bucket>.servicename/filename.csv")

Here is the sample configuration in Python 3 to read from COS using IAM credentials:

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.servicename.endpoint", "<BUCKET_ENDPOINT>")
# With IAM, only the API key is needed instead of an access/secret key pair
hconf.set("fs.cos.servicename.iam.api.key", "****")
df = spark.read.csv("cos://<bucket>.servicename/filename.csv")

In the above code, “servicename” can be any name you want to give; it just has to match between the configuration keys and the cos:// URL. BUCKET_ENDPOINT is the endpoint of your COS bucket, which depends on the bucket’s region. You can get the endpoint from this link.

It is recommended to use the private endpoint of the COS bucket for better performance, and <bucket> is the name of the bucket you want to read from or write to.
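
For example, assuming a hypothetical service name mycos and a hypothetical bucket my-bucket in the us-south region, the IAM-based read would look like the sketch below. The endpoint shown is only an illustration of the private endpoint pattern; always use the value listed for your bucket.

# Illustrative values only: "mycos", "my-bucket" and the endpoint are
# placeholders; substitute the real endpoint listed for your bucket.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.mycos.endpoint", "s3.private.us-south.cloud-object-storage.appdomain.cloud")
hconf.set("fs.cos.mycos.iam.api.key", "<YOUR_IAM_API_KEY>")
df = spark.read.csv("cos://my-bucket.mycos/filename.csv", header=True, inferSchema=True)

Passing header and inferSchema tells Spark to treat the first row as column names and to detect the column types.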

Writing to COS

Here is the sample configuration in Python 3 to write a DataFrame to COS using an access key and secret key:

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.servicename.access.key", "**********")
hconf.set("fs.cos.servicename.secret.key", "********")
hconf.set("fs.cos.servicename.endpoint", "<BUCKET_ENDPOINT>")
# Write the DataFrame as CSV objects under the given bucket path
df.write.format("csv").save("cos://<bucket>.servicename/filename.csv")

Here is the sample configuration in Python 3 to write a DataFrame to COS using IAM credentials:

hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.servicename.iam.api.key", "**********")
hconf.set("fs.cos.servicename.endpoint", "<BUCKET_ENDPOINT>")
df.write.format("csv").save("cos://<bucket>.servicename/filename.csv")
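
By default, Spark fails the save if the target path already exists, so you may want to set the write mode explicitly. The sketch below reuses the hypothetical mycos service name and my-bucket bucket from above and assumes the credentials have already been set as shown earlier.

# Sketch using the hypothetical "mycos" service name and "my-bucket" bucket.
# Overwrite any existing output and include a header row in the CSV files.
df.write.mode("overwrite").option("header", "true").csv("cos://my-bucket.mycos/output/data_csv")
# The same pattern works for a columnar format such as Parquet:
df.write.mode("overwrite").parquet("cos://my-bucket.mycos/output/data_parquet")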

Read and write examples in Scala

Here is sample code to read data from COS in Scala using IAM credentials:

// Set the COS endpoint and IAM API key on the Hadoop configuration
sc.hadoopConfiguration.set("fs.cos.servicename.endpoint", "<BUCKET_ENDPOINT>")
sc.hadoopConfiguration.set("fs.cos.servicename.iam.api.key", "***")
// Read the object as a text RDD and count its lines
val rdd = sc.textFile("cos://<bucket>.servicename/filename.csv")
rdd.count()

Here is sample code to write data to COS in Scala using IAM credentials:

sc.hadoopConfiguration.set("fs.cos.servicename.endpoint", "<BUCKET_ENDPOINT>")
sc.hadoopConfiguration.set("fs.cos.servicename.iam.api.key", "***")
df.write.format("csv").save("cos://<bucket>.servicename/filename.csv")

Try out Spark environments from IBM, and after trying this feature, please give your feedback on this article in a comment below or in a tweet to Rachit Arora.
