Artificial Neural Network with Spark MLlib

Sushil Kumar
Aug 25, 2017 · 4 min read

For the past few weeks I have been taking an awesome Machine Learning course on Udemy. Today I reached the Deep Learning section, where I learned about Artificial Neural Networks. At first they seemed like the other algorithms I had learned so far (Regression, Classification, Reinforcement Learning etc.), but when I read a few articles on the prowess of ANNs I was taken aback by the power of these innocent-looking networks.

I followed the course and created an ANN using the H2O library in R, getting around 86% accuracy without writing much code. Once I was done with the R code I wanted to take Spark ML for a spin and implement the same model there. I googled ANN support in Spark ML and was glad to learn that it supports Multilayer Perceptron Classification, which is essentially what my sample problem needed: a classification problem solved using a Feed Forward ANN.

In this post I'm going to show you how to create a simple ANN, train it on the given dataset and then check its accuracy on the test set.

Before we get started let me share the dataset with you. I got this dataset from my Machine Learning course on Udemy, the link for which is here. You can find the dataset here. Once you download the dataset you'll find there are 10,000 observations and 14 variables.
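
For reference, the 14 columns in Churn_Modelling.csv are as follows (this matches the standard copy of the dataset; verify against your own download):

RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Exited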

So let's get started with the Spark script. As usual, fire up your favorite IDE and add the following dependencies to your build.sbt file.

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.11" % "2.1.0",
  "org.apache.spark" % "spark-sql_2.11" % "2.1.0",
  "org.apache.spark" % "spark-mllib_2.11" % "2.1.0"
)

Then add an App.scala file where we'll start writing our Spark ML script.

We'll start by loading our dataset. Create the SparkSession and load the file into an RDD.

val session = SparkSession.builder()
  .appName("ANNDemo")
  .master("local[*]")
  .getOrCreate()

// note the escaped backslashes in the Windows path
val lines = session.sparkContext.textFile("C:\\ScalaSpark\\Churn_Modelling.csv")
  .filter(!_.startsWith("RowNumber")) // skip the header row, if your copy of the file has one

import session.implicits._
val rdd = lines.map(parseLine).toDS()

The parseLine function reads each line and converts it into a LabeledPoint object, because that is the input the Spark ML classification algorithms expect.

One thing to note here is that we'll be using the newer ML API, which supports DataFrames, so make sure you import the correct versions of the classes. You have to import the classes from the org.apache.spark.ml package, not the mllib package.
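
For reference, here are the imports the snippets below rely on, all from the DataFrame-based ml package (a sketch; adjust to your own file layout):

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{LabeledPoint, StandardScaler}
import org.apache.spark.ml.linalg.Vectors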

In our dataset the first 3 columns (RowNumber, CustomerId, Surname) are not useful, as they hold personal customer information that isn't going to help us in classification modelling, so we are not going to use them. The last column, Exited, is the dependent variable which we'll be predicting once our model is built.

Also, there are some categorical variables, namely Gender and Geography, so we'll have to encode them to Double values so that our machine learning algorithm can use them.

def parseValues(value: String): Double = {
  value match {
    case "Male" | "France" => 0.0  // encode the categorical values as doubles
    case "Female" | "Spain" => 1.0
    case "Germany" => 2.0
    case _ => value.toDouble       // everything else is already numeric
  }
}

def parseLine(line: String) = {
  val fields = line.split(",")
  // keep columns 4 through 12 (Geography through EstimatedSalary) as the 9 features
  val vector = fields.slice(4, 13).map(parseValues)
  LabeledPoint(parseValues(fields(13)), Vectors.dense(vector))
}

As you can see we have encoded the string variables to Double values. Note that slice(4, 13) keeps columns 4 through 12, so in addition to the first 3 identifier columns it also drops CreditScore, leaving the 9 feature columns we'll feed into the network below.
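
For example, the encoder maps values like this:

parseValues("Male")    // 0.0
parseValues("Germany") // 2.0
parseValues("619")     // 619.0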

Before we move to the training phase there is one more thing we need to do: scale our features to avoid any unnecessary skew in our model. The EstimatedSalary column has huge values compared to the other columns, and this might diminish the effect of the other variables on the dependent variable. We'll use StandardScaler to scale our features, as below.

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)
  .setWithStd(true)

val scalerModel = scaler.fit(rdd)
val data = scalerModel.transform(rdd)

Now we can start the training phase. We'll split the data into training and test sets (75% / 25%).

val splits = data.randomSplit(Array(0.75, 0.25), seed = 1234L)
val train = splits(0)
val test = splits(1)

Now we'll start defining our ANN. We'll use an ANN with the structure [9-5-5-5-5-5-2]. This means our ANN will have a 9-node input layer (because we have 9 independent variables in our dataset), 5 hidden layers with 5 nodes each, and an output layer of 2 nodes (since we have only 2 output classes, 0 or 1).

There is no hard rule for how many hidden nodes your ANN should have. A common rule of thumb is to use roughly the average of the input and output node counts, which in our case is (9 + 2) / 2 = 5.5; hence we are using 5 nodes in each of the 5 hidden layers.

val layers = Array[Int](9, 5, 5, 5, 5, 5, 2)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
  .setFeaturesCol("scaledFeatures") // train on the scaled features, not the raw ones

// train the model
val model = trainer.fit(train)
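
A quick note on these parameters: setBlockSize(128) controls how many input rows are stacked into a single matrix to speed up computation (128 is Spark's default), and setMaxIter(100) caps the number of optimization iterations. Note the setFeaturesCol("scaledFeatures") call: it points the classifier at the scaled column produced by the StandardScaler; without it the classifier would train on the unscaled features column.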

And we have a model. We can now use this model to predict values on our test set and then compare the predictions with the actual values to calculate the accuracy.

// compute accuracy on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")

println("Test set accuracy = " + evaluator.evaluate(predictionAndLabels))

And bingo! I got an accuracy of around 79%, whereas I got around 86% accuracy when I used the same dataset in H2O with R. So why the lower accuracy? Well, it turns out I was using the "Rectifier" activation function in H2O, but Spark ML's MultilayerPerceptronClassifier uses the sigmoid activation function for its hidden layers (with softmax on the output layer) and doesn't let you choose any other function.

Still, for starters this isn't too shabby. For the complete script you can check out my GitHub repo here.

In case you have any questions or doubts feel free to comment below, I’ll be happy to help you out.

Till then Happy Machine Learning 🙂



Originally published at sushilkumar.xyz on August 25, 2017.
