Is it possible to leverage your current Spark cluster to build Deep Learning models?
Can terabytes of data stored in HDFS, Hive, HBase be analyzed?
Diving into Intel’s BigDL
Apache Spark has rapidly gotten popular over the past couple of years. This comes from its simplicity, speed and support, also referred as 3 S’s of Spark. Many companies leveraged the Hadoop and Spark environments to build powerful data processing pipelines. These pipelines were built to pre-process huge volumes of data on distributed clusters and draw insights from it for business growth. As Deep Learning gained momentum for its high accuracy predictions and analysis potential, many companies wanted to take advantage of the models that could help improve their businesses further. Intel had such corporate customers with huge data pipelines already built and deployed in their Spark/Hadoop clusters. There were a couple of concerns from these customers on how can deep learning models be applied on these datasets.
1) Can the same pipeline be leveraged for Deep Learning?
2) Can Deep Learning be done efficiently for these large datasets, i.e. Deep Learning at scale?
3) Can the existing Spark/Hadoop cluster be used?
These questions arose because of the obvious need for a consistent environment where moving these large datasets across different clusters can be avoided. Moreover, moving proprietary datasets from an already existing infrastructure can be a security risk. The earlier experiments for answering these questions involved trying to add existing Deep Learning frameworks over Spark, but they weren’t coherent. This led to the development of BigDL platform on top of Spark, and hence naturally inheriting the 3 S’s: simplicity, speed and support with Deep Learning features.
BigDL is a distributed deep learning library that has rapidly started growing in the field of big data analysis. It is natively integrated with Spark and Hadoop ecosystems because of which it supports significant features like incremental scalability, that allows the models to be trained in any number of clusters depending on the size of data. It provides support to build an end-to-end data-analytics, deep learning and AI pipeline. It performs data-parallel distributed training to achieve high scalability. Since it’s an open source release, BigDL users have constructed numerous analytics and deep learning applications such as visual similarity, parameter synchronization, scaling and convergence etc. on Spark and Big Data platforms.
Deep Learning applications can be written as a standard spark program. These libraries that are unified in the Spark framework can read high volumes of data. Moreover, it has support for Python libraries like Numpy, Scipy, NLTK, Pandas, etc. It is integrated with TensorBoard for visualizations and also supports loading of existing Torch models. One of the most important reasons for the enterprise customers wanting to use BigDL and Spark is that in addition to the fact that BigDL is faster than TensorFlow, it also enables them to retrain the models quicker because of parallel computation.
Analytics Zoo Library
Machine Learning in Spark is still in its infancy when compared to numerous standalone libraries available in Python ecosystem. Most of these libraries such as Keras, TensorFlow and PyTorch are not consistent with Spark since they do not support Spark’s underlying core framework that enables distributed computing. Analytics Zoo is a library developed by Intel that is trying to bridge this gap in Spark. It provides a rich set of high-level APIs to seamlessly integrate BigDL, Keras and TensorFlow programs into Spark pipelines. It has several built-in deep learning models for object detection, image classification, text classification, recommendations, etc. The library also provides end-to-end reference use cases such as anomaly detection, fraud detection and image augmentation to apply machine learning on real-world problems.
To put things into more perspective, the following section provides a brief tutorial on BigDLand Analytics Zoo, showing how easily they can be used to implement transfer learning using pre-trained models and train it on a Spark cluster.
Transfer Learning using BigDL and Analytics Zoo
In this tutorial, a small dataset containing images of ants and bees is used to perform transfer learning on ResNet-50. This model is pre-trained to classify a given image into one of its 1000 labels. Transfer learning is used to retrain this model for this specific use case and make predictions on the training set containing ants and bees. BigDL and Analytics Zoo enables this training on Spark’s distributed framework. Note that the original ResNet-50 doesn’t contain ants and bees as one of its labels.
BigDL and Analytics Zoo can be installed using pip as follows:
pip install BigDL
pip install analytics-zoo#for Python3
pip3 install BigDL
pip3 install analytics-zoo
Before beginning, download the ResNet 50 pre-trained model and training and validation datasets from this repository. More models are available on Analytics Zoo website here. You will need to unzip the compressed data.
Let’s begin by defining imports and initializing spark context using init_nncontext function from Analytics Zoo. Paths to pre-trained model, training and validation data are also defined here.
Next, create Spark UDFs to extract filenames. Labels are assigned by checking if the file path contains the keyword ‘ants’ or ‘bees’. Using these two UDFs, construct the training and validation dataframes.
To properly construct the model, all images need to be standardized. Analytics Zoo has APIs to create these transformations and chain them so that they are applied in sequence.
Pre-trained ResNet-50 is loaded using the model file downloaded earlier.
The last 5 layers of ResNet-50 are:
The last layer in the model is changed to output 2 classes (ants and bees) instead of 1000 that ResNet-50 is trained on. It is constructed with input dimension of 1000 and output dimension of 2 and a softmax activation. This layer is then added as the last element in the pipeline. Note that since transfer learning is being performed, the model can be trained for these 2 new classes in just 25 epochs! This shows the practicality of transfer learning.
Now, train the model on training dataframe and get predictions on validation dataframe. Spark allows faster training here across multiple clusters.
Finally, predict the classes on test data and show the image with its class.
An image from the test dataset showing the predicted label on top.
If the dataset was a lot larger, say it happens to be stored in HDFS, the same design pipeline can be used and it would have automatically scaled for a much larger cluster. BigDL makes data analytics of these huge datasets a lot faster and efficient. Moreover, other than strong integration with Spark dataframes it is tightly coupled with Spark SQL and Structured Streaming. For example, streaming data from Kafka can be passed to a similar BigDL UDF to make real-time predictions and classifications. Spark SQL can be added to perform querying and filtering at different stages, and even on the predictions.
A Jupyter notebook with all datasets is provided in this repository.