IBM Object Storage 2 Spark Library
Read and write data between Spark and IBM’s Object Storage services
Editor’s note: This article was written before the Nov. 2017 rebranding of “IBM Bluemix” to “IBM Cloud.” Please excuse any discrepancies in this article and in current versions of various documentation sites.
When the IBM Spark and Object Storage services were first offered via IBM Bluemix a number of years ago, before IBM Data Science Experience (DSX), one had to manually set values in Spark's underlying Hadoop Configuration object in order to set up a Swift-protocol connection to Object Storage.
Instructions to do this were provided in various locations on Bluemix (and later DSX). The recipe consisted of manually looking up your credentials in Bluemix and then copying and pasting a small block of code into your Jupyter PySpark or Scala notebook.
That approach is still perfectly valid today, but it became tedious. So, I wrote ibmos2spark.
This library does one simple thing: it encapsulates the code that previously had to be copied and pasted. It's not magic or impressive in any way, but it makes things a bit easier. It works with any Spark service (even a local standalone installation) and with Python, Scala, and R.
Multiple Object Storage offerings
IBM offers multiple Object Storage services through Bluemix and Softlayer. Connections to all of these services are supported by
ibmos2spark. These services differ in their authentication methods, underlying APIs, pricing structures, and regional resiliency options. Two of the services are based on the Swift protocol (from OpenStack), and the other two adhere to the Amazon S3 protocol. Here are the four services IBM currently offers:
- Cloud Object Storage: S3-compatible APIs hosted on Bluemix
- Cloud Object Storage (IaaS): S3-compatible APIs and authentication hosted on SoftLayer
- OpenStack Object Storage (IaaS): Swift protocol API offered through SoftLayer
- OpenStack Object Storage: Swift protocol API offered through Bluemix
There are only two things you can do with the ibmos2spark library:
- Configure a connection
- Get the path to a data object stored in a container or bucket
For example, the following Python code snippet shows you how to connect to an OpenStack Object Storage instance provisioned in Bluemix.
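Here is a sketch of that snippet, patterned after the library's Python README (the credential values are placeholders you copy from Bluemix, and sc is the SparkContext already available in the notebook):

```python
import ibmos2spark

# Placeholder credentials, copied from your Object Storage instance in Bluemix.
credentials = {
    'auth_url': 'https://identity.open.softlayer.com',
    'project_id': '...',
    'region': '...',
    'user_id': '...',
    'password': '...',
}

# Any name you like; it tags this connection in the Hadoop configuration.
configuration_name = 'my_bluemix_os'
bmos = ibmos2spark.bluemix(sc, credentials, configuration_name)

# bmos.url(container, object) now yields paths that Spark can read directly.
```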
Documentation for connecting to each service is provided in the
ibmos2spark READMEs for the different languages.
You might notice slight, though trivial, variations when using ibmos2spark with the different Object Storage types. This is due to the different authentication protocols, Object Storage instance types, and the ways that credentials are provided to IBM customers via dashboards for those services (such as obtaining credentials via IBM Data Science Experience, Bluemix, or a SoftLayer account). Please take care to read the documentation and examples.
The stocator library (required for
ibmos2spark) is now pre-installed on all IBM Apache Spark™ services, which support IBM DSX and IBM Analytics Engine. Additionally, the
ibmos2spark library is pre-installed when interacting with your Spark service via a Python or Scala kernel. (If you’re using an R kernel, you’ll need to follow the installation instructions, which are trivial.)
When you create a new project in DSX, an Apache Spark service and Object Storage instance are associated to that project. These two services provide the backbone for most data science projects: compute and data storage. Either a previously existing Object Storage instance and container may be associated with your project, or a new instance and container will be created.
Data can be added to your Object Storage container in various ways with different tools, depending on your service. If you’re using Cloud Object Storage on Bluemix, a Bluemix command-line tool (`bx`) is provided. The Cloud Object Storage (IaaS) hosted on SoftLayer uses an Amazon S3-compatible API. You can use the
aws command-line tools to push data to your instance. If you're using either of the OpenStack Object Storage types, you can use the
swift client library.
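For example, here is a minimal boto3 sketch of pushing a local file to an S3-compatible bucket (the endpoint URL, keys, and bucket name are all placeholders you would take from your service credentials and the Endpoints page):

```python
import boto3

# All values below are placeholders; copy yours from the service credentials.
s3 = boto3.client(
    's3',
    endpoint_url='https://<your-cos-endpoint>',
    aws_access_key_id='<access_key>',
    aws_secret_access_key='<secret_key>',
)

# Upload a local CSV into a bucket.
s3.upload_file('data.csv', '<bucket_name>', 'data.csv')
```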
For demonstration purposes, you can also add data from within a DSX Project or Notebook in your browser. Find the data tab on the right-side pane, and then drag and drop local data files into your Object Storage.
For any data objects found in the data pane, you can use a pull-down menu to insert code into the currently selected Notebook cell. That code will facilitate retrieval of data into your Spark service environment. In order to use the
ibmos2spark library, select
Insert to code -> Insert SparkSession Setup.
This code imports the
ibmos2spark package, inserts your credentials, then sets up the Hadoop configuration by calling the necessary
ibmos2spark functions. Then it provides you with a path to your object in Object Storage. For the Swift-based services, that path uses stocator's swift2d URL scheme, and in general it looks something like this:
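swift2d://<container_name>.<configuration_name>/<object_name>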
In the example above, I configured a container called
ibmos2sparkdemo, so the full path, assuming the configuration name my_bluemix_os from the sketch above and a hypothetical object named data.csv, would be:
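swift2d://ibmos2sparkdemo.my_bluemix_os/data.csv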
You can then use that path to load the data in Spark.
rdd = sc.textFile(path_1)
If your file is a
.csv, you can read it directly into a DataFrame.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load(path_1)
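If you're working against a Spark 2.x kernel, the built-in CSV reader does the same job without the external package (this assumes the spark session variable that those kernels provide):

df = spark.read.csv(path_1, header=True, inferSchema=True)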
Dealing with multiple objects and multiple Object Storage instances
The Insert to code menu in DSX is convenient because it can save you a bit of typing if your data source is simple. However, it assumes that the Object Storage instance and container are associated with your particular DSX project.
The Insert to code menu can still be used if you're accessing a large number of files within the same container or bucket, because Spark can read data from a path containing wildcards. For example:
path_1 = bmos.url('ibmos2sparkdemo', '*.txt')
Spark will now retrieve all the objects in that container whose names end in .txt:
rdd = sc.textFile(path_1)
Multiple Object Storage instances
If you need to access data in a different Object Storage instance or container, then you'll still need to log in to Bluemix to find the appropriate credentials. Furthermore, if you're working in a situation with multiple Object Storage instances and multiple users, you should protect your data by appropriately restricting each user's privileges and creating backup containers to prevent data loss.
Finding your credentials
Unfortunately, getting the correct credentials for your Object Store from Bluemix isn’t totally straightforward. The following example demonstrates how to do this with a Cloud Object Storage instance hosted on Bluemix (now, “IBM Cloud”).
After provisioning a new Cloud Object Storage instance in Bluemix, navigate to that instance’s dashboard. From there, find the
Service Credentials tab. Then click on the
New Credentials button. In this example, I choose the
Role to be
Manager in order to generate credentials with full privileges. I also choose to
Auto Generate my
Service ID value (as you will see, it may be useful for you to create your own
Service ID explicitly). You’ll then see your new credentials in the Dashboard. You can click to show them as JSON.
Working in Python, I copy and paste those credentials into a Jupyter Notebook running on DSX. But we're not done yet. Those credentials aren't exactly what you need for
ibmos2spark. From your Cloud Object Storage dashboard in Bluemix, select the
Endpoint tab. You’ll find a number of different URLs that may be used, depending upon the regional resiliency that was selected when creating your buckets.
If your Object Storage instance is an OpenStack Object Storage instance hosted on Bluemix, it's easier. Again, you'll find credentials (possibly more than one set) in the
Service Credentials tab from your instance’s dashboard. However, you can use these directly with
ibmos2spark without modification.
Once you know the credentials for your various Object Storage instances, you can configure multiple connections to those instances with the ibmos2spark library.
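For example, here is a minimal sketch, assuming two Swift-based instances (all credential values are placeholders); giving each connection its own configuration name keeps the two Hadoop configurations, and therefore the URLs, from colliding:

```python
import ibmos2spark

# Placeholder credentials for two different OpenStack Object Storage instances.
credentials_a = {'auth_url': '...', 'project_id': '...', 'region': '...',
                 'user_id': '...', 'password': '...'}
credentials_b = {'auth_url': '...', 'project_id': '...', 'region': '...',
                 'user_id': '...', 'password': '...'}

# Distinct configuration names keep the two connections separate.
os_a = ibmos2spark.bluemix(sc, credentials_a, 'instance_a')
os_b = ibmos2spark.bluemix(sc, credentials_b, 'instance_b')

rdd_a = sc.textFile(os_a.url('container_a', 'data.txt'))
rdd_b = sc.textFile(os_b.url('container_b', 'data.txt'))
```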
One more thing
If you now go to your Cloud Object Storage Dashboard to see your new data, you may be surprised by what you find if you're not familiar with saving data from Spark. Your data are broken into multiple files, one for each partition in your RDD. The data are also stored in a directory-like hierarchical structure and include a
_SUCCESS file. All of this is normal, of course. When you load the data later with
sc.textFile, Spark will recreate an RDD with the same number of partitions.
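As a quick illustration (the object name numbers here is hypothetical), saving a four-partition RDD and reloading it:

```python
# Saving writes one part file per partition, plus a _SUCCESS marker.
rdd = sc.parallelize(range(100), 4)
out_path = bmos.url('ibmos2sparkdemo', 'numbers')  # hypothetical object name
rdd.saveAsTextFile(out_path)

# Object Storage now holds numbers/_SUCCESS and numbers/part-00000 through part-00003.
reloaded = sc.textFile(out_path)
print(reloaded.getNumPartitions())  # 4
```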
In the future, IBM plans to simplify the Object Storage experience on IBM Cloud. But for now, hopefully
ibmos2spark and the information here will make navigating these choices a little easier.