Practical Apache Spark in 10 minutes. Part 4 — MLlib

The vast possibilities of artificial intelligence are of increasing interest in the field of modern information technologies. One of its most promising and evolving directions is machine learning (ML), which becomes the essential part in various aspects of our life. ML has found successful applications in Natural Languages Processing, Face Recognition, Autonomous Vehicles, Fraud detection, Machine vision and many other fields.

Machine learning utilizes the mathematical algorithms that can solve specific tasks in a way analogous to the human brain. Depending on the neural network training method, ML algorithms can be divided into supervised (with labeled data), unsupervised (with unlabeled data), semi-supervised (there are both labeled and unlabeled data in the dataset) and reinforcement (based on reward receiving) learning. Solving the most basic and popular ML tasks, such as classification and regression, is mainly based on supervised learning algorithms. Among the variety of existing ML tools, Spark MLlib is a popular and easy-to-start library which enables training neural networks for solving the problems mentioned above.

In this post, we would like to consider classification task. We will classify Iris plants to the 3 categories according to the size of their sepals and petals. The public dataset with Iris classification is available here. To move forward, download the file bezdekIris.data to the working folder.


Important! Make sure that this file will be saved to the Spark folder. The folder name will be spark-2.3.0-bin-hadoop2.7 (depending on the Spark version you have downloaded).


Different ML algorithms can be implemented for solving this task. But, since we need multiclass classification, we have chosen the following three machine learning algorithms:

  1. Decision tree classifier
  2. Random forest classifier
  3. Naive Bayes

Load Data

Before launching pyspark, you need to install numpy module:

pip install numpy

And now you can launch pyspark by executing:

./bin/pyspark

The main API for MLlib is DataFrame based. So we need to load our data as a Dataframe. It is really a comma-separated file, so we can load it as a regular csv.

df = spark.read.csv("bezdekIris.data", inferSchema=True)\
.toDF("sep_len", "sep_wid", "pet_len", "pet_wid", "label")

The obtained DataFrame will look like the following:

df.show(5)
+-------+-------+-------+-------+-----------+
|sep_len|sep_wid|pet_len|pet_wid| label|
+-------+-------+-------+-------+-----------+
| 5.1| 3.5| 1.4| 0.2|Iris-setosa|
| 4.9| 3.0| 1.4| 0.2|Iris-setosa|
| 4.7| 3.2| 1.3| 0.2|Iris-setosa|
| 4.6| 3.1| 1.5| 0.2|Iris-setosa|
| 5.0| 3.6| 1.4| 0.2|Iris-setosa|
+-------+-------+-------+-------+-----------+
only showing top 5 rows

We have to perform some transformations to join all feature columns into a single column using VectorAssembler. To do this, firstly we should make two more imports:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

To transform 4 features columns into one column with features vector, execute:

vector_assembler = VectorAssembler(\
inputCols=["sep_len", "sep_wid", "pet_len", "pet_wid"],\
outputCol="features")
df_temp = vector_assembler.transform(df)
df_temp.show(3)
+-------+-------+-------+-------+-----------+-----------------+
|sep_len|sep_wid|pet_len|pet_wid| label| features|
+-------+-------+-------+-------+-----------+-----------------+
| 5.1| 3.5| 1.4| 0.2|Iris-setosa|[5.1,3.5,1.4,0.2]|
| 4.9| 3.0| 1.4| 0.2|Iris-setosa|[4.9,3.0,1.4,0.2]|
| 4.7| 3.2| 1.3| 0.2|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-------+-------+-------+-------+-----------+-----------------+
only showing top 3 rows

Let’s remove unnecessary columns:

df = df_temp.drop('sep_len', 'sep_wid', 'pet_len', 'pet_wid')
df.show(3)
+-----------+-----------------+
| label| features|
+-----------+-----------------+
|Iris-setosa|[5.1,3.5,1.4,0.2]|
|Iris-setosa|[4.9,3.0,1.4,0.2]|
|Iris-setosa|[4.7,3.2,1.3,0.2]|
+-----------+-----------------+
only showing top 3 rows

At this point, we have a dataframe with all necessary data in the appropriate form. Now we should index labels, i.e., convert textual representation to a numeric one with the help of StringIndexer. To do this, execute the following commands:

from pyspark.ml.feature import StringIndexer
l_indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
df = l_indexer.fit(df).transform(df)

As a result, we have one more column named labelIndex:

df.show(3)
+-----------+-----------------+----------+
| label| features|labelIndex|
+-----------+-----------------+----------+
|Iris-setosa|[5.1,3.5,1.4,0.2]| 0.0|
|Iris-setosa|[4.9,3.0,1.4,0.2]| 0.0|
|Iris-setosa|[4.7,3.2,1.3,0.2]| 0.0|
+-----------+-----------------+----------+
only showing top 3 rows

After label indexing, we should divide our data into training and test sets (30% held out for testing):

(trainingData, testData) = df.randomSplit([0.7, 0.3])

Now we can apply to our data machine learning algorithms which we have chosen at the beginning of the article. Let’s begin with Decision tree classifier.

Decision tree classifier

Decision tree classifier is one of the widely used and convenient algorithms for classification and regression tasks in machine learning.

To begin working with this algorithm, we need to make some new imports:

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

After importing, we will train a DecisionTree model using following commands:

dt = DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")
model = dt.fit(trainingData)

Finally, we can make some predictions using our trained model:

predictions = model.transform(testData)

To look at the result, execute the following command:

predictions.select("prediction", "labelIndex").show(5)
+----------+----------+
|prediction|labelIndex|
+----------+----------+
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
+----------+----------+
only showing top 5 rows

To estimate the accuracy of the prediction, the test error should be computed :

evaluator = MulticlassClassificationEvaluator(\
labelCol="labelIndex", predictionCol="prediction",\
metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g " % (1.0 — accuracy))
Test Error = 0.0392157

Please, note, that this number may vary, and you’ll see a different value, as data is being split randomly.

Now we can print a small model summary:

print(model)
DecisionTreeClassificationModel (uid=DecisionTreeClassifier_4924b7bce7df8343300e) of depth 5 with 15 nodes

Having considered the Decision tree classification model, one can conclude that this model makes rather good predictions. But there are algorithms based on decision tree ensembles which are more powerful. One of them is Random forest classifier.

Random forest classifier

Decision tree ensembles are among the most popular algorithms for the classification tasks. Due to the combination of many decision trees, Random forest classifier has a lower risk of overfitting. DataFrame-based API for ML supports random forests for both binary and multiclass classification. For this algorithm, we will use the same training and test data as in the previous case.

from pyspark.ml.classification import RandomForestClassifier

To train a RandomForest model, execute next commands:

rf = RandomForestClassifier(labelCol="labelIndex",\
featuresCol="features", numTrees=10)
model = rf.fit(trainingData)

Now we can make predictions:

predictions = model.transform(testData)

And as a next step let’s take a look at the result:

predictions.select("prediction", "labelIndex").show(5)
+----------+----------+
|prediction|labelIndex|
+----------+----------+
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 0.0| 0.0|
+----------+----------+
only showing top 5 rows

Now, let’s evaluate our model by computing the test error:

evaluator =\
MulticlassClassificationEvaluator(labelCol="labelIndex",\
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 — accuracy))
Test Error = 0.0196078

Again, numbers may vary for each other user.

And obtain some information about our model:

print(model)
RandomForestClassificationModel (uid=RandomForestClassifier_4eb29128b895e4d43d40) with 10 trees

As you can see, random forest classifier gives more accurate prediction than Decision tree classifier with 2% less test error.

Naive Bayes classifier

Naive Bayes classifier is one of the most straightforward multiclass classification algorithms, which can be applied to the multiclass classification task, with the assumption of independence between every pair of features. It can be trained very efficiently. For this algorithm, we will use the same previously prepared dataframe as in the previous models, but we will change the division between training and test data in the following way:

splits = df.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

Now we can import NaiveBayes classifier and apply it to our data:

from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes(labelCol="labelIndex",\
featuresCol="features", smoothing=1.0,\
modelType="multinomial")
model = nb.fit(train)

To look at the result type the commands:

predictions = model.transform(test)
predictions.select("label", "labelIndex",
"probability", "prediction").show()

To evaluate this classifier, let’s compute the accuracy on the test set:

evaluator =\
MulticlassClassificationEvaluator(labelCol="labelIndex",\
predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(“Test set accuracy = “ + str(accuracy))
Test set accuracy = 0.9259259259259259

Again, results may vary from time to time.

As we can see, Naive Bayes classifier gives results comparable to other approaches but has a little lower accuracy.

Conclusions

Machine Learning algorithms are widely used in many fields of our life, for example, image recognition, natural language processing, fraud detection, etc. There are many instruments for the ML algorithm implementation, among which Apache Spark is one of the most popular ones. It has a special MLib library, which implements some methods of machine learning. In this post, we have considered the main stages of data preparation for further usage with MLib library. Having applied three different ML algorithms to the public Iris classification dataset, one can conclude that all of them give good accuracy (more than 90%). But it is necessary to say that the Random forest classifier gives the best accuracy which is about 98%. This can be caused by the fact that this algorithm combines many decision trees to decrease the risk of overfitting.

Originally published in datascience-school.com