IBM Watson Studio Spark Environments Generally Available!

Sumit Goyal
Sep 6, 2018

Today we are excited to announce the general availability of Spark environments in IBM Watson Studio!

Spark environments were available in beta over the past month (read the announcement here) so that you could tell us about your experience. We have now incorporated your feedback and added several new features!

Spark environments define the hardware and software configurations used to start custom Spark clusters on demand. You can quickly scale their resources up or down, which makes them well suited to a variety of use cases, from trying out new machine learning algorithms on a sample data set to running large production workloads on the distributed computation engine. Spark environments can be used with tools like notebooks, model builder, or the flow editor in Watson Studio.

Spark environments are available by default for all Watson Studio users. You don’t have to provision or associate any external Spark service with your Watson Studio project. You simply select the hardware and software configuration of the Spark runtime that your tool needs. When you start the tool with that environment definition, a runtime instance is created based on your configuration. The Spark compute resources are dedicated to your tool alone and are not shared with collaborators.

You can still share data files, libraries, and analysis results with your project collaborators. All you need to do is store them in the Cloud Object Storage associated with your project, and everyone in your project can access them.

To help you get started quickly, every project includes two default Spark environments (one for Python 3.5 and another for Scala 2.11). You can select one of these defaults and get to work without creating a custom environment definition.

Get started with your own Spark environment

If you want to specify your own Spark environment, then begin by creating an environment definition:

  1. From the Environments tab in your project, click New environment definition.
  2. Enter a name and a description.
  3. Select the environment type Spark to see the runtime configuration options, then choose a hardware and software configuration. You can create a Spark environment with 1 driver and up to 10 executors. The driver and executors can each be configured with 1 vCPU and 4 GB RAM or 2 vCPU and 8 GB RAM.
  4. With your environment definition in place, you can now create a notebook, a machine learning model, or a modeler flow and select the Spark environment you just created as its runtime environment.
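To get a feel for how much capacity a given configuration adds up to, here is a minimal Python sketch of the sizing rules listed in step 3. The class and field names are hypothetical and are not part of any Watson Studio API. The minimum of one executor is also an assumption, since the limits above only state an upper bound.

```python
from dataclasses import dataclass

# Hardware sizes named in step 3, as (vCPUs, GB of RAM).
ALLOWED_SIZES = {(1, 4), (2, 8)}
MAX_EXECUTORS = 10
MIN_EXECUTORS = 1  # assumption: the announcement only gives the upper bound


@dataclass(frozen=True)
class SparkEnvironmentSpec:
    """Hypothetical local model of an environment definition (illustration only)."""

    name: str
    driver_size: tuple    # (vCPUs, GB RAM) for the single driver
    executor_size: tuple  # (vCPUs, GB RAM) for each executor
    executors: int

    def __post_init__(self):
        # Reject sizes and executor counts outside the documented limits.
        for size in (self.driver_size, self.executor_size):
            if size not in ALLOWED_SIZES:
                raise ValueError(f"unsupported hardware size: {size}")
        if not MIN_EXECUTORS <= self.executors <= MAX_EXECUTORS:
            raise ValueError(f"executors must be between {MIN_EXECUTORS} and {MAX_EXECUTORS}")

    @property
    def total_vcpus(self) -> int:
        # One driver plus all executors.
        return self.driver_size[0] + self.executors * self.executor_size[0]

    @property
    def total_ram_gb(self) -> int:
        return self.driver_size[1] + self.executors * self.executor_size[1]


if __name__ == "__main__":
    spec = SparkEnvironmentSpec("demo", driver_size=(1, 4), executor_size=(2, 8), executors=10)
    print(spec.total_vcpus, "vCPUs,", spec.total_ram_gb, "GB RAM")
```

With the largest executor size and all 10 executors, the sketch adds the single driver's resources to ten times the executor's, which is a quick way to compare configurations before you pick one in the UI.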

Spark environments are highly customizable in terms of both hardware and software. When creating a Spark environment definition, you can choose from the most popular languages for your Spark cluster, namely Python 2.7, Python 3.5, R 3.4, and Scala 2.11. Depending on your use case, you can also configure the hardware sizes for your Spark driver and executors. Should you need to change the hardware size for performance reasons, you can always edit the environment definition after you have created it, from your project’s Environments page.

Spark environments and Watson Studio tools

You can create a Jupyter notebook directly from the Spark environment summary page if you like taking shortcuts. This way, the Spark environment you just created will be selected for you.

You can now also use Spark environments when you create a model or a Spark modeler flow. The Spark environment you create will appear in the list of Spark runtimes you can select from. Although Spark environments for model builder and modeler are still in beta, we encourage you to use these environments when running models and Spark modeler flows.

Track your usage

Keeping track of the Spark runtime usage you are billed for is straightforward in Watson Studio. Your Spark environment starts consuming capacity unit hours (CUHs) as soon as your cluster starts running and stops consuming them when you stop the runtime. This means that you are charged only for what you use, but you should remember to stop the runtime when your job is over. In case you forget, we stop it for you after some idle time.
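As a rough illustration of this start-to-stop accounting, the Python sketch below multiplies the time a runtime was active by an hourly CUH rate. The function name and the `cuh_per_hour` parameter are hypothetical, and the rate used in the example is a placeholder rather than an actual Watson Studio rate. Refer to the documentation for the real figures.

```python
from datetime import datetime


def estimate_cuh(started: datetime, stopped: datetime, cuh_per_hour: float) -> float:
    """Estimate CUHs consumed between runtime start and stop.

    cuh_per_hour is a placeholder for the rate of your configuration;
    look up the actual value in the Watson Studio documentation.
    """
    if stopped < started:
        raise ValueError("stopped must not be earlier than started")
    hours_running = (stopped - started).total_seconds() / 3600
    return hours_running * cuh_per_hour


if __name__ == "__main__":
    # A runtime that was active for two and a half hours at a made-up rate.
    start = datetime(2018, 9, 6, 9, 0)
    stop = datetime(2018, 9, 6, 11, 30)
    print(estimate_cuh(start, stop, cuh_per_hour=2.0))
```

Because consumption runs until the runtime stops, time spent idle before you (or the automatic idle stop) shut it down counts toward the total in this sketch as well.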

Check out our documentation for more on how to calculate your CUH consumption and where to stop active runtimes.

New to Spark or Jupyter notebooks?

No problem. We have several sample notebooks in the community section that teach you how to do amazing things with Spark in Jupyter notebooks.

Stay tuned for more

We are dedicated to simplifying your work around machine learning and AI, and we will keep adding exciting new features:

  • User software customization for custom package management
  • Access to Spark History Server and the Spark Application UI
  • And much more!

IBM Watson

AI Platform for the Enterprise

Written by Sumit Goyal

Software Engineer, IBM Watson Studio. A data science enthusiast. Meet me on Twitter at @SumitG0yal
