Spark Machine Learning in Clojure

Although Spark is implemented in Scala, we can use the Spark platform for machine learning tasks thanks to the Sparkling library.

In this worksheet, we'll demonstrate the use of Sparkling on classification tasks in machine learning. Let's load the dependencies first.

(ns itchy-garden
  (:require [sparkling.conf :as conf]
            [sparkling.core :as s]
            [sparkling.ml.core :as m]
            [sparkling.ml.classification :as cl]
            [sparkling.ml.transform :as xf]
            [sparkling.ml.validation :as v])
  (:import [org.apache.spark.api.java JavaSparkContext]
           [org.apache.spark.sql DataFrame SQLContext]
           [org.apache.spark.ml.classification NaiveBayes LogisticRegression
            DecisionTreeClassifier RandomForestClassifier GBTClassifier]
           [java.io File]))


Let's load a libsvm-format dataset. It consists of about 3,000 instances; each instance has a binary target variable (0 or 1) and multiple continuous-valued features.

Note that Spark's implementation of ML libraries only works with non-negative labels. Datasets with labels in the {1, -1} range do not work out of the box; the negative labels need to be remapped to non-negative values first.
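As a minimal sketch of that remapping (plain Clojure string processing, not part of the Sparkling API), we could rewrite the label field of each libsvm-format line before handing the file to Spark:

```clojure
(require '[clojure.string :as str])

(defn remap-label
  "Rewrites a libsvm-format line so that a -1 label becomes 0."
  [line]
  (let [[label & features] (str/split line #" ")]
    (str/join " " (cons (if (= label "-1") "0" label) features))))

(remap-label "-1 1:0.5 2:1.2")
;; => "0 1:0.5 2:1.2"
```

Mapping this over the lines of the input file yields a dataset Spark's classifiers accept.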

(defn download-dataset
  "Downloads the dataset to a temporary file and returns its path."
  []
  (let [svm-dataset-path "" ;; dataset URL missing in the original
        tmpfile (.getPath (File/createTempFile "svmguide" "svm"))
        _ (spit tmpfile (slurp svm-dataset-path))]
    tmpfile))

(def dataset-path (download-dataset))

Cross validation

We'll train a binary classifier on this dataset and evaluate its performance using cross validation, with the areaUnderROC metric.
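Conceptually, k-fold cross validation partitions the data into k folds and, for each fold, trains on the remaining k-1 folds and evaluates on the held-out one. A plain-Clojure sketch of the partitioning (illustrative only; Sparkling's handler does this on Spark DataFrames):

```clojure
(defn k-folds
  "Splits xs into k folds; returns one training/validation pair per fold."
  [k xs]
  (let [fold-size (long (Math/ceil (/ (count xs) (double k))))
        folds     (partition-all fold-size xs)]
    (for [i (range (count folds))]
      {:validation (nth folds i)
       :training   (apply concat (keep-indexed #(when (not= %1 i) %2) folds))})))

(k-folds 3 (range 6))
;; => ({:validation (0 1), :training (2 3 4 5)}
;;     {:validation (2 3), :training (0 1 4 5)}
;;     {:validation (4 5), :training (0 1 2 3)})
```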

This API uses the middleware pattern (inspired by Ring) to specify these artifacts:

  • The handler function does the cross validation
  • The add-dataset function takes a function which returns the dataset as a Spark DataFrame
  • The estimator is the type of classifier or regressor used. In this case, we’ll use a logistic regression classifier.
  • The evaluator in this case is a binary classification evaluator. If we have a multi-class classification problem, a multi-class evaluator applies instead. The default metric for binary classification is area under the ROC curve.

The run-pipeline function will execute the handler and return the cross validated areaUnderROC metric.
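The middleware pattern itself can be sketched in a few lines of plain Clojure (the `handler` and `add-dataset` below are hypothetical stand-ins, not the Sparkling functions): each `add-*` step wraps the handler in a function that adds one more piece of context before calling through, just as Ring middleware wraps a request handler.

```clojure
;; A base handler that consumes the accumulated context.
(defn handler [ctx] (:dataset ctx))

;; Middleware: wraps a handler so the dataset is present in the context.
(defn add-dataset [handler dataset-fn]
  (fn [ctx] (handler (assoc ctx :dataset (dataset-fn)))))

(def pipeline (-> handler (add-dataset (constantly [1 2 3]))))
(pipeline {})
;; => [1 2 3]
```

Threading with `->` is what lets the worksheet's examples stack `add-dataset`, `add-estimator`, and `add-evaluator` in any order.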

(let [cvhandler (-> m/cv-handler
                    (m/add-dataset (partial m/load-libsvm-dataset dataset-path))
                    (m/add-estimator cl/logistic-regression)
                    (m/add-evaluator v/binary-classification-evaluator))]
  (m/run-pipeline cvhandler))
;; output: areaUnderROC score

The previous example used the default options for both the logistic regression classifier and the evaluator. In the next example, we'll specify:

  • A hyperparameter for the classifier (elastic-net-param)
  • A different evaluation metric for the evaluator (area under the Precision-Recall curve)

(let [cvhandler (-> m/cv-handler
                    (m/add-dataset (partial m/load-libsvm-dataset dataset-path))
                    (m/add-estimator cl/logistic-regression {:elastic-net-param 0.01})
                    (m/add-evaluator v/binary-classification-evaluator {:metric-name "areaUnderPR"}))]
  (m/run-pipeline cvhandler))
;; output: area under the Precision-Recall curve

We could also change the parameters of the cross validation handler itself. Let's change the number of folds to 5 (from the default of 3).

(let [cvhandler (-> (partial m/cv-handler {:num-folds 5})
                    (m/add-dataset (partial m/load-libsvm-dataset dataset-path))
                    (m/add-estimator cl/logistic-regression)
                    (m/add-evaluator v/binary-classification-evaluator))]
  (m/run-pipeline cvhandler))

Train-validation split

Instead of using cross validation, we could use a single split, dividing the data into training and validation sets by a percentage.
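The split itself is simple; a plain-Clojure sketch of what a ratio-based train/validation split does (Sparkling performs the equivalent on a Spark DataFrame):

```clojure
(defn train-validation-split
  "Puts the first (ratio * n) elements in training, the rest in validation."
  [ratio xs]
  (let [n (long (* ratio (count xs)))]
    {:training (take n xs) :validation (drop n xs)}))

(train-validation-split 0.75 (range 8))
;; => {:training (0 1 2 3 4 5), :validation (6 7)}
```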

We'll use a train-validation handler instead of a cross validation handler, while the rest of the parameters are unchanged.

(let [tvhandler (-> m/tv-handler
                    (m/add-dataset (partial m/load-libsvm-dataset dataset-path))
                    (m/add-estimator cl/logistic-regression)
                    (m/add-evaluator v/binary-classification-evaluator))]
  (m/run-pipeline tvhandler))

Let's specify a different ratio of training to validation: 60% of the data goes to the training set, and the rest to the validation set.

(let [tvhandler (-> (partial m/tv-handler {:train-ratio 0.6})
                    (m/add-dataset (partial m/load-libsvm-dataset dataset-path))
                    (m/add-estimator cl/logistic-regression)
                    (m/add-evaluator v/binary-classification-evaluator))]
  (m/run-pipeline tvhandler))

Grid search

Classifiers often have multiple hyperparameters, and we would like to find the values that give the best classification scores. Spark ML provides grid search functionality for this, analogous to grid search in scikit-learn.
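What a grid search enumerates is simply the cartesian product of the candidate values for each hyperparameter. A plain-Clojure sketch (the parameter names here are illustrative, not tied to Sparkling's API):

```clojure
;; Every combination of the candidate regularization and elastic-net values.
(def grid
  (for [reg-param   [0.1 0.05 0.01]
        elastic-net [0.0 0.5]]
    {:reg-param reg-param :elastic-net-param elastic-net}))

(count grid)
;; => 6
```

Each point in the grid is trained and scored (with cross validation, in our case), and the best-scoring combination wins.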

We’ll assign a range of regularization values and find the cross validated scores for each value.

(defn add-regularization
  "Sets the regularization parameter values to search over."
  [regparam est]
  (v/param-grid [[(.regParam est) (double-array regparam)]]))

(let [cvhandler (-> m/cv-handler
                    ;; search for the best value of the regularization parameter
                    (m/add-grid-search (partial add-regularization [0.1 0.05 0.01]))
                    (m/add-evaluator v/binary-classification-evaluator)
                    (m/add-estimator cl/logistic-regression)
                    (m/add-dataset (partial m/load-libsvm-dataset dataset-path)))]
  (m/run-pipeline cvhandler))
;; output: (0.9730089312309409 0.9773629797090069 0.9853807421182689)

The three scores correspond to the three regularization values, and we can see that the last value (0.01) returns the best cross validated score on the areaUnderROC metric.
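Reading the winner off by eye works for three values; for a larger grid we can pair each searched value with its score and pick the maximum (plain Clojure, using the scores from the run above):

```clojure
(def scores
  (zipmap [0.1 0.05 0.01]
          [0.9730089312309409 0.9773629797090069 0.9853807421182689]))

;; The regularization value with the highest areaUnderROC score.
(key (apply max-key val scores))
;; => 0.01
```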