We are very excited to announce that today Apache Spark environments are available in beta for Watson Studio.
Watson Studio has integrated with a number of Spark engines since it was first released as Data Science Experience in 2016. These include IBM Apache Spark as a Service, IBM Analytics Engine, and AWS EMR. These services are still integrated with Watson Studio, but we now support a new Spark engine that is available by default for all Watson Studio users.
Building on lessons learned from other compute engines, Watson Studio Spark environments offer many benefits:
- Spark kernels on demand: save time and energy so you can focus on your analysis; create a Spark environment in Watson Studio and launch directly into a notebook.
- Configurable, elastic compute: configure your Spark environment and choose your kernel hardware configuration from Watson Studio.
- Easily share your environment: Spark environments are project assets, so your collaborators can use them too.
- Multiple language support: choose from the most popular languages for your Spark kernels (Python 2, Python 3, R, Scala).
Apache Spark is one of the most popular distributed computing engines and scales easily to handle the needs of big data processing. Spark includes officially supported libraries, including SparkSQL and SparkML. The former enables writing SQL against Spark DataFrames, and the latter includes a number of algorithms and a pipeline method for reproducible machine learning (see this example notebook).
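To make the two libraries concrete, here is a minimal PySpark sketch that queries a DataFrame with SQL and then fits a small ML pipeline on the same data. The rows, the `events` view, the column names, and the `run_demo` function are all made up for illustration; this is not taken from the example notebook, and it assumes PySpark is available, as it would be in a Spark kernel.

```python
# Toy rows made up for illustration: (id, f1, f2, label).
SAMPLE_ROWS = [
    (1, 0.1, 0.9, 0.0),
    (2, 0.2, 0.8, 0.0),
    (3, 0.9, 0.2, 1.0),
    (4, 0.8, 0.1, 1.0),
]


def run_demo():
    """Query a DataFrame with SQL, then fit a small ML pipeline on it."""
    # PySpark is imported here so the definitions load even without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    # Reuses the active session if one exists, otherwise starts one.
    spark = SparkSession.builder.appName("spark-env-demo").getOrCreate()

    df = spark.createDataFrame(SAMPLE_ROWS, ["id", "f1", "f2", "label"])

    # Spark SQL: register the DataFrame as a view and query it with SQL.
    df.createOrReplaceTempView("events")
    positives = spark.sql(
        "SELECT COUNT(*) AS n FROM events WHERE label = 1.0"
    ).first()["n"]

    # Spark ML: chain feature assembly and a classifier into one pipeline.
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["f1", "f2"], outputCol="features"),
        LogisticRegression(featuresCol="features", labelCol="label",
                           maxIter=10, regParam=0.01),
    ])
    model = pipeline.fit(df)
    predictions = model.transform(df).select("id", "label", "prediction")

    return positives, predictions.collect()
```

Calling `run_demo()` in a notebook cell returns the SQL count together with the predicted rows. Because the pipeline bundles feature preparation and the model into one object, the same steps can be refit or applied to new data without repeating them by hand.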
How to get started
All you need to get started is a Watson Studio account: Sign up here!
1. Once you have an account, create a new project (or use any existing Cloud Object Storage project).
2. Create a new environment definition.
3. Give your environment a name, choose Spark for the Type, choose your configuration, and create the definition.
4. With your environment definition created, you can now easily create a new notebook.
For this beta release, you can create a Spark environment with 1 driver and up to 5 executors. The driver and executors can be configured with 1 vCPU and 4 GB RAM or 2 vCPU and 8 GB RAM.
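As a rough sizing aid, the beta limits above can be written down as a small Python check that also totals the resources of a given configuration. This is only a sketch: the `SparkEnvSpec` class and the size names `small` and `large` are hypothetical, and it assumes that the driver and executors may be sized independently and that at least one executor is required, neither of which is stated here.

```python
from dataclasses import dataclass

# Hardware sizes quoted for the beta: (vCPU, RAM in GB).
# The names "small" and "large" are made up for this sketch.
SIZES = {"small": (1, 4), "large": (2, 8)}
MAX_EXECUTORS = 5  # beta limit; there is always exactly 1 driver


@dataclass(frozen=True)
class SparkEnvSpec:
    driver_size: str
    executor_size: str
    executors: int

    def __post_init__(self):
        for size in (self.driver_size, self.executor_size):
            if size not in SIZES:
                raise ValueError(f"unknown size: {size!r}")
        # Assumption: at least one executor is required.
        if not 1 <= self.executors <= MAX_EXECUTORS:
            raise ValueError(f"executors must be between 1 and {MAX_EXECUTORS}")

    def totals(self):
        """Return (total vCPU, total RAM in GB) for driver plus executors."""
        driver_cpu, driver_ram = SIZES[self.driver_size]
        exec_cpu, exec_ram = SIZES[self.executor_size]
        return (driver_cpu + self.executors * exec_cpu,
                driver_ram + self.executors * exec_ram)
```

Under these assumptions, `SparkEnvSpec("large", "large", 5).totals()` gives 12 vCPU and 48 GB of RAM, the largest footprint the beta limits allow.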
As with other environments in Watson Studio, it is easy to change the environment associated with your notebook. Click the Notebook Info icon and go to the Environment tab, where you can switch to any environment definition in the project.
For additional information on this functionality, please visit our official documentation. You can also get started by trying out two notebooks from our community:
- Use Spark ML and Python to detect network intrusions
- Use Spark ML and Scala to detect network intrusions
If you are a longtime Watson Studio user, you should be familiar with Apache Spark as a Service. Spark environments differ from that service because they are elastic and configurable, which gives you greater control over the resources available to each kernel. Spark environments also provide isolated compute resources rather than sharing a single cluster across many collaborators.
We are very excited to introduce this new functionality, which will make it easier for all Watson Studio users to scale their compute to handle big data workloads.
After trying this feature, please let us know your feedback in a comment below or in a tweet to @IBMDataScience!