Using BigDL in Data Science Experience for Deep Learning on Spark
Huge thanks for the contributions from Yulia Tell and Yuhao Yang from Intel and Roland Weber from IBM in making this integration possible!
Deep Learning has become one of the most popular techniques used in the field of Machine Learning in recent years. The Data Science Experience (DSX) team has been excited about deep learning since before launching last year (we have a couple of blog posts on this topic: DL trends, Using DL in DSX).
As a data science platform, we make it easy to scale your analysis by providing a Spark cluster for all users. Whether you are working in notebooks or RStudio in DSX, you can connect to this cluster to distribute your workloads. Until recently, Spark batch processing was not used for Deep Learning, since it required a lot of effort to optimize Spark's compute engine for training deep neural networks. This is where Intel comes in, with their big data deep learning framework called BigDL. This blog will explain what BigDL is and how it can be used in Data Science Experience.
What is BigDL?
BigDL is a distributed deep learning framework for Apache Spark that was developed by Intel and contributed to the open source community for the purposes of uniting big data processing and deep learning (check out https://github.com/intel-analytics/BigDL). Built on the highly scalable Apache Spark platform, BigDL can be easily scaled out to hundreds or thousands of servers. In addition, BigDL uses Intel® Math Kernel Library (Intel® MKL) and parallel computing techniques to achieve very high performance on Intel® Xeon® processor-based servers (comparable to mainstream GPU performance).
BigDL helps make deep learning more accessible to the big data community by allowing developers to continue using familiar tools and infrastructure to build deep learning applications. BigDL provides support for various deep learning models (for example, object detection, classification, and so on); in addition, it lets you reuse and migrate pre-trained models (in Caffe, Torch*, TensorFlow*, and so on), which were previously tied to specific frameworks and platforms, to the general-purpose big data analytics platform through BigDL. As a result, the entire application pipeline can be fully optimized to deliver significantly accelerated performance.
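For instance, a minimal sketch of model migration in a Python notebook might look like the following; the Caffe file paths are placeholders, and the exact loader methods should be checked against the BigDL API documentation for your release.

from bigdl.util.common import init_engine
from bigdl.nn.layer import Model

init_engine()  # attach BigDL to the notebook's Spark context

# Placeholder paths to a Caffe network definition and its trained weights
caffe_def = "/path/to/deploy.prototxt"
caffe_weights = "/path/to/model.caffemodel"

# Import the pre-trained Caffe model as a BigDL model for use on Spark
model = Model.load_caffe_model(caffe_def, caffe_weights)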
As the following diagram shows, BigDL is implemented as a library on top of Spark, so that users can write their deep learning applications as standard Spark programs. As a result, BigDL can be seamlessly integrated with other libraries on top of Spark — Spark SQL and DataFrames, Spark ML pipelines, Spark Streaming, Structured Streaming, etc. — and can run directly on top of existing Spark or Hadoop clusters.
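To make this concrete, here is a rough, hypothetical sketch of a BigDL training job written as an ordinary Spark program in a Python notebook: the training data is a plain RDD of BigDL Samples, and distributed training is driven by BigDL's Optimizer. The toy network, data, and hyperparameters are invented for illustration, and sc is the SparkContext that the DSX notebook provides; check the BigDL documentation for the exact API in your version.

import numpy as np
from bigdl.util.common import init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

init_engine()  # hook BigDL into the notebook's existing Spark context

# Toy data: an RDD of random feature vectors with 1-based class labels
train_rdd = sc.parallelize(range(1000)).map(
    lambda i: Sample.from_ndarray(np.random.rand(10), np.array([i % 2 + 1])))

# A tiny two-class classifier
model = Sequential().add(Linear(10, 2)).add(LogSoftMax())

# Distributed training runs as just another Spark job;
# batch_size should be a multiple of (number of executors * cores per executor)
optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(2),
                      batch_size=32)
trained_model = optimizer.optimize()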
Highlights of the BigDL v0.3.0 release
Since its initial open source release in December 2016, BigDL has been used to build applications for fraud detection, recommender systems, image recognition, and many other purposes. The recent BigDL v0.3.0 release addresses many user requests, improving usability and adding new features and functionality:
• New layers support
  • RNN encoder-decoder (sequence-to-sequence) architecture
  • Variational auto-encoder
  • 3D de-convolution
  • 1D convolution and pooling
• Model quantization support (see the sketch after this list)
  • Quantize existing BigDL, Caffe, Torch, or TensorFlow models
  • Convert floating-point weights to integers for model inference (for model size reduction and inference speedup)
• Sparse tensors and layers, for efficient support of sparse data
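As a quick illustration of the quantization feature mentioned above, the following hypothetical sketch quantizes a small model before running inference; it assumes the Python API exposes a quantize() method on models, so verify the call against the BigDL release you install.

import numpy as np
from bigdl.util.common import init_engine
from bigdl.nn.layer import Sequential, Linear, ReLU

init_engine()  # attach BigDL to the notebook's Spark context

# A small fully connected model to quantize
model = Sequential().add(Linear(128, 64)).add(ReLU()).add(Linear(64, 10))

# Convert the float weights to integers for a smaller model and faster inference
# (assumes quantize() is available in the Python API of your BigDL release)
quantized_model = model.quantize()

# Local inference on a dummy input vector
output = quantized_model.forward(np.random.rand(128))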
BigDL on DSX: a Perfect Fit
Since notebooks in DSX are already executed on a Spark cluster, it is very easy to get up and running with BigDL. The only tool you need to get started is a Data Science Experience notebook. Follow the steps below to install BigDL and confirm it is working. In future posts, we will show tutorials using BigDL on DSX.
Installation Guide for BigDL within IBM DSX
This section was originally written by Roland Weber in this Stack Overflow post. You can follow along with this notebook to get up and running with BigDL in DSX.
If your notebooks are backed by an Apache Spark as a Service instance in DSX, installing BigDL is simple. But you have to collect some version information first.
- Which Spark version? Currently, 2.1 is the latest supported by DSX. With Python, you can only install BigDL for one Spark version per service.
- Which BigDL version? Currently, 0.3.0 is the latest, and it supports Spark 2.1. If in doubt, check the download page; the Spark fix level does not matter.
With this information, you can determine the URL of the required BigDL JAR file in the Maven repository. For the example versions, BigDL 0.3.0 with Spark 2.1, the download URL is
https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar
For other versions, replace 0.3.0 and 2.1 in that URL as required. Note that both versions appear twice, once in the path and once in the filename.
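If it helps, the same substitution can be spelled out in a couple of lines of Python (this just builds the string; the shell command in the next section downloads it):

sv, bv = "2.1", "0.3.0"  # Spark and BigDL versions
url = ("https://repo1.maven.org/maven2/com/intel/analytics/bigdl/"
       "bigdl-SPARK_{sv}/{bv}/bigdl-SPARK_{sv}-{bv}-jar-with-dependencies.jar").format(sv=sv, bv=bv)
print(url)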
Installing for Python
You need the JAR, and the matching Python package. The Python package depends only on the version of BigDL, not on the Spark version. The installation steps can be executed from a Python notebook:
1. Install the JAR.
!(export sv=2.1 bv=0.3.0 ; cd ~/data/libs/ && wget https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_${sv}/${bv}/bigdl-SPARK_${sv}-${bv}-jar-with-dependencies.jar)
Here, the versions of Spark (sv) and BigDL (bv) are defined as environment variables, so you can easily adjust them without having to change the URL.
2. Install the Python module.
!pip install bigdl==0.3.0 | cat
If you want to switch your notebooks between Python versions, execute this step once with each Python version.
After restarting the notebook kernel, BigDL is ready for use.
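To confirm that everything lines up, you can run a quick smoke test, for example something along these lines (the layer and input are arbitrary; the BigDL engine attaches to the Spark context that DSX provides for the notebook):

import numpy as np
from bigdl.util.common import init_engine
from bigdl.nn.layer import Linear

init_engine()  # initialize the BigDL engine on the notebook's Spark context

# A single linear layer applied to a random vector; if this prints an
# array of length 3 without errors, the JAR and the Python package match.
layer = Linear(4, 3)
print(layer.forward(np.random.rand(4)))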
(Not) Installing for Scala
If you install the JAR as described above for Python, it is also available in Scala kernels.
If you want to use BigDL exclusively with Scala, it is better not to install the JAR at all. Instead, use the %AddJar magic at the beginning of the notebook. It's best to do this in the very first code cell to avoid class loading issues.
%AddJar https://repo1.maven.org/maven2/com/intel/analytics/bigdl/bigdl-SPARK_2.1/0.3.0/bigdl-SPARK_2.1-0.3.0-jar-with-dependencies.jar
By not installing the JAR, you gain the flexibility of using different versions of Spark and BigDL in different Scala notebooks sharing the same service. As soon as you install a JAR, you’re likely to run into conflicts between that one and the one you pull in with %AddJar.
Hopefully, after following these instructions, you are ready to start using BigDL to train deep nets on Spark in DSX! If you prefer a Python notebook with all of these steps, you can copy this notebook written by the DSX development team. You can copy the notebook directly into a DSX project using the copy icon in the top right; this will let you start running the code on your Spark cluster in Data Science Experience.
This notebook also gives you some code to start using the BigDL framework. Stay tuned for a follow-up post showing how to train models with BigDL. If you are interested in seeing examples of training models for fraud detection, sentiment analysis, and more with BigDL, feel free to check out the BigDL model zoo at https://github.com/intel-analytics/analytics-zoo.