Making Image Classification Simple With Spark Deep Learning


At Linagora, we believe that the next generation of software will integrate innovative features based on AI and Machine Learning (ML).

Two years ago, Linagora started developing a collaborative open-source platform called OpenPaas. In this context, I began building innovative features based on ML and AI. Apache Spark, Scala, Hadoop, and ML are my favorite terms.

Today we have integrated automatic email classification, and we want to go further.

In this article, we show how to run an image classification example with Spark Deep Learning on Python 2.7. We will try to classify images of two people: Steve Jobs and Mark Zuckerberg.


To start, you will need to download and unzip a recent version of Apache Spark (the commands in this article use Spark 2.1.1 built for Hadoop 2.7). Apache Spark is an open-source cluster-computing framework: a fast, in-memory data processing engine with expressive development APIs that let data workers efficiently execute streaming, machine learning, or SQL workloads requiring fast iterative access to datasets.

curl -O
tar xzf spark-2.1.1-bin-hadoop2.7.tgz

Let's also get some images to work with in this article. We will use pictures of two personalities, Steve Jobs and Mark Zuckerberg. Please download the zip file from the following link and decompress it.

Spark Deep Learning uses TensorFlow to transform images into numeric features. TensorFlow is an open-source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by the Google Brain team to conduct machine learning and deep neural network research. Please install TensorFlow on your machine from the following link.

Finally, you will probably need to install the following Python packages to run spark-deep-learning:

sudo pip install nose
sudo pip install pillow
sudo pip install keras
sudo pip install h5py
sudo pip install py4j
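Before launching pyspark, it can help to confirm that these packages are importable from the Python interpreter Spark will use. The following is a minimal sketch, not part of the library: the helper name missing_packages is hypothetical, and the only subtlety it encodes is that the pillow package is imported under the module name PIL.

```python
# Pip package names paired with the module name each one is imported under.
# Note that "pillow" is imported as "PIL".
REQUIRED = [("nose", "nose"), ("pillow", "PIL"), ("keras", "keras"),
            ("h5py", "h5py"), ("py4j", "py4j"), ("tensorflow", "tensorflow")]


def missing_packages(required=REQUIRED):
    """Return the pip names of the packages whose module cannot be imported."""
    missing = []
    for pip_name, module_name in required:
        try:
            __import__(module_name)
        except ImportError:
            missing.append(pip_name)
    return missing
```

Calling missing_packages() in a plain python shell returns an empty list when everything is installed, and otherwise the names to pass to pip.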

Run pyspark with spark-deep-learning library

The spark-deep-learning library comes from Databricks and leverages Spark for its two strongest facets:

  • In the spirit of Spark and Spark MLlib, it provides easy-to-use APIs that enable deep learning in very few lines of code.
  • It uses Spark’s powerful distributed engine to scale out deep learning on massive datasets.

The library runs on pyspark. Let’s start pyspark:

export SPARK_HOME=PATH/TO/spark-2.1.1-bin-hadoop2.7
export JAVA_OPTS="-Xmx9G -XX:MaxPermSize=2G -XX:+UseCompressedOops -XX:MaxMetaspaceSize=512m"
$SPARK_HOME/bin/pyspark --packages databricks:spark-deep-learning:0.1.0-spark2.1-s_2.11 --driver-memory 5g

Below is a console output example.

Let’s code image classification in the pyspark shell

The first step to applying deep learning on images is the ability to load the images. Deep Learning Pipelines includes utility functions that can load millions of images into a Spark DataFrame and decode them automatically in a distributed fashion, allowing manipulation at scale.
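As a hedged sketch of this loading step, the code below reads each person's images with readImages from sparkdl, attaches the label convention used in this article (0 for Mark Zuckerberg, 1 for Steve Jobs), and splits the result into training and test DataFrames. The sub-directory names "jobs" and "zuckerberg", the 60/40 split, the seed, and the helper names are all assumptions made for illustration; adapt them to the layout of your unzipped archive.

```python
import os
from functools import reduce

# Label convention used in this article: 0 = Mark Zuckerberg, 1 = Steve Jobs.
# The sub-directory names are assumptions about the unzipped archive layout.
LABELS = {"zuckerberg": 0, "jobs": 1}


def labeled_dirs(img_dir, labels=LABELS):
    """Pair each person's image directory with its numeric label."""
    return [(os.path.join(img_dir, name), label)
            for name, label in sorted(labels.items())]


def build_train_test(img_dir, train_fraction=0.6, seed=42):
    """Read the images, attach labels, and split each class into train/test."""
    # These imports only resolve inside the pyspark shell started with the
    # spark-deep-learning package, so they are kept local to the function.
    from pyspark.sql.functions import lit
    from sparkdl import readImages

    train_parts, test_parts = [], []
    for path, label in labeled_dirs(img_dir):
        df = readImages(path).withColumn("label", lit(label))
        train, test = df.randomSplit(
            [train_fraction, 1.0 - train_fraction], seed=seed)
        train_parts.append(train)
        test_parts.append(test)

    # Split per class first, then merge, so each set gets a share of both classes.
    train_df = reduce(lambda a, b: a.union(b), train_parts)
    test_df = reduce(lambda a, b: a.union(b), test_parts)
    return train_df, test_df
```

In the pyspark shell, train_df, test_df = build_train_test("/PATH/TO/personalities") would then produce the two DataFrames, where the path is a placeholder for wherever you unzipped the images.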

For image classification, we produce a model using the transfer learning technique. Transfer learning, or inductive transfer, is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, knowledge gained while learning to recognize cars could apply when trying to recognize trucks.
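To make the idea concrete, here is a toy sketch in plain Python, with no Spark or TensorFlow involved. The function frozen_features is a hypothetical stand-in for a pre-trained network: it is fixed and never updated, and only a small logistic-regression "head" is trained on top of its output. The data and hyperparameters are invented for illustration.

```python
import math


def frozen_features(pixels):
    """Hypothetical stand-in for a pre-trained network; it is never updated."""
    mean = sum(pixels) / float(len(pixels))
    spread = max(pixels) - min(pixels)
    return [mean, spread]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Probability of class 1 under the logistic-regression head."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def train_head(samples, labels, epochs=500, learning_rate=0.5):
    """Train only the logistic-regression head on top of the frozen features."""
    features = [frozen_features(s) for s in samples]
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = score(weights, bias, x) - y
            weights = [w - learning_rate * error * f
                       for w, f in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, pixels):
    return 1 if score(weights, bias, frozen_features(pixels)) >= 0.5 else 0


# Toy "images": four grey-level pixels each; label 0 = dark, 1 = bright.
samples = [[0.10, 0.20, 0.10, 0.15], [0.20, 0.10, 0.05, 0.10],
           [0.90, 0.80, 0.95, 0.85], [0.80, 0.90, 0.90, 0.70]]
labels = [0, 0, 1, 1]
weights, bias = train_head(samples, labels)
print([predict(weights, bias, s) for s in samples])
```

The point of the sketch is the division of labor: the expensive feature extractor is reused as-is, so training reduces to fitting a handful of weights.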

Deep Learning Pipelines on Apache Spark enables fast transfer learning with a Featurizer, which transforms images into numeric features. In Spark, we combine InceptionV3 (a convolutional neural network trained for image classification) with logistic regression (a statistical method used in machine learning to analyze the independent features, or variables, that determine an outcome; in our case, one of two kinds of photos). The DeepImageFeaturizer automatically peels off the last layer of a pre-trained neural network and uses the output from all the previous layers as features for the logistic regression algorithm.
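The combination described above can be sketched as a two-stage Spark ML pipeline. This is a minimal sketch rather than the exact code behind the results in this article: the hyperparameter values and the helper name fit_transfer_model are assumptions, and the input DataFrame is assumed to have an "image" column (as produced by the library's image reader) and a numeric "label" column.

```python
# Illustrative hyperparameters (assumed values, not taken from this article).
LR_PARAMS = {"maxIter": 20, "regParam": 0.05,
             "elasticNetParam": 0.3, "labelCol": "label"}


def fit_transfer_model(train_df, lr_params=LR_PARAMS):
    """Fit a pipeline of InceptionV3 featurization followed by logistic regression."""
    # These imports only resolve inside the pyspark shell started with the
    # spark-deep-learning package, so they are kept local to the function.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from sparkdl import DeepImageFeaturizer

    # Stage 1: turn each image into a feature vector with the pre-trained network.
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    # Stage 2: the only part that is actually trained on our images.
    lr = LogisticRegression(featuresCol="features", **lr_params)
    return Pipeline(stages=[featurizer, lr]).fit(train_df)
```

In the pyspark shell, p_model = fit_transfer_model(train_df) returns a fitted pipeline model whose transform method adds a prediction column.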

Let’s see how to classify images with the model. We have already split our data into two parts, train_df and test_df (we will use test_df to test the classification). We also labeled the images with 0 for Mark Zuckerberg and 1 for Steve Jobs. To show the predicted label for the images in test_df, run the following code:
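A minimal sketch of this step, assuming model is a fitted Spark ML pipeline model and test_df is the held-out DataFrame with a filePath column. The helper name show_predictions is hypothetical.

```python
def show_predictions(model, test_df, n=20):
    """Apply the fitted model and print the path and predicted label of n images."""
    predictions = model.transform(test_df)
    # truncate=False keeps the full file path visible in the console.
    predictions.select("filePath", "prediction").show(n, truncate=False)
    return predictions
```

A prediction of 0.0 then stands for Mark Zuckerberg and 1.0 for Steve Jobs, following the labels above.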

Below is a console output example. The filePath column indicates the path of the image, and the prediction column indicates its predicted label:

It is also possible to compute the accuracy of the model. For this, we use MulticlassClassificationEvaluator, which compares the predictions made on the test_df DataFrame with its labels.
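Accuracy here is simply the fraction of test images whose predicted label equals the true label. The sketch below shows that definition in plain Python, followed by a hedged version of the Spark call; the helper names are hypothetical, and the column names "label" and "prediction" are assumed to match the DataFrame produced by the model.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / float(len(pairs))


def evaluate_accuracy(model, test_df):
    """Compute the same metric on a Spark DataFrame (pyspark shell only)."""
    # Imported locally because pyspark is only available inside the Spark shell.
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    predictions = model.transform(test_df)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(predictions)
```

For example, three correct predictions out of four give accuracy([(0, 0), (1, 1), (1, 0), (0, 0)]) == 0.75.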

Your shell will print the model accuracy on the screen (0.91).

Great! You now know how to quickly use Spark and Deep Learning for image classification. You can now try it with other images. It is very fast: you just need to change the input of train_df and test_df. Let’s take an example…

Running another sample very quickly

Please download and unzip flower photos with the following instructions:

curl -O
tar xzf flower_photos.tgz

Only a small change to the code is needed: replace the program input and run it again.
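As a sketch of that change, only the directories and labels differ from the previous example. The flower_photos archive contains one sub-directory per flower type; the choice of "daisy" and "tulips" below is an assumption made for illustration, not necessarily the pair used for the result reported here, and the helper names are hypothetical.

```python
import os
from functools import reduce


def flower_inputs(flowers_dir, classes=("daisy", "tulips")):
    """Number the chosen flower classes 0, 1, ... and pair each with its directory."""
    return [(os.path.join(flowers_dir, name), label)
            for label, name in enumerate(classes)]


def load_flowers(flowers_dir, classes=("daisy", "tulips")):
    """Read each class directory into one labeled DataFrame (pyspark shell only)."""
    # These imports only resolve inside the pyspark shell started with the
    # spark-deep-learning package, so they are kept local to the function.
    from pyspark.sql.functions import lit
    from sparkdl import readImages

    frames = [readImages(path).withColumn("label", lit(label))
              for path, label in flower_inputs(flowers_dir, classes)]
    return reduce(lambda a, b: a.union(b), frames)
```

In the pyspark shell, train_df, test_df = load_flowers("flower_photos").randomSplit([0.6, 0.4]) would replace the earlier input, and the rest of the code stays the same.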

Your shell will print the model accuracy on the screen (0.97).


Apache Spark is a very powerful platform with elegant and expressive APIs for Big Data processing.

We successfully tried Spark Deep Learning, an API that combines Apache Spark and TensorFlow to train and deploy an image classifier. It is remarkably easy to use (less than 30 lines of code). Our next objective is to test whether we can deploy a facial recognition model with this API.

While this support is currently available only in Python, we hope that it will soon be integrated into other programming languages, especially Scala.