Logistic Regression with Apache Spark

Happy ML

(λx.x)eranga
Effectz.AI
3 min read · Jul 1, 2019

This is the second part of my Happy ML blog series. In the previous post I talked about machine learning basics and the K-Means unsupervised machine learning algorithm. In this post I’m gonna discuss the Logistic Regression supervised machine learning algorithm with an example.

About Logistic Regression

Logistic Regression is a popular supervised machine learning algorithm which can be used to predict a categorical response. It is used to solve classification-type machine learning problems. Classification involves looking at data and assigning a class (or a label) to it. Usually there is more than one class; when there are exactly two classes (0 or 1), the problem is known as Binary Classification.
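To make the idea concrete, here is a minimal plain-Python sketch (not Spark code) of how a logistic regression model turns two exam scores into a 0/1 class. It computes a weighted sum z of the inputs, passes it through the logistic (sigmoid) function 1 / (1 + e^(-z)) to get a probability, and compares that probability with a threshold. The weights, bias and the 0.5 threshold below are made-up values for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # Equivalent form that avoids overflow in exp() for very negative z.
    e = math.exp(z)
    return e / (1.0 + e)


def predict(score1, score2, weights, bias, threshold=0.5):
    """Return (predicted class, probability of class 1)."""
    z = weights[0] * score1 + weights[1] * score2 + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability


# Hypothetical parameters, not learned from any real data.
WEIGHTS, BIAS = (0.05, 0.05), -6.0

label, probability = predict(80, 90, WEIGHTS, BIAS)
print(label, round(probability, 3))
```

Training a model means finding the weights and bias that make these probabilities agree with the known labels as closely as possible.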

In this post I’m gonna use the Logistic Regression algorithm to build a machine learning model with Apache Spark (if you are new to Apache Spark, please find more information here). The model is a binary classifier that predicts a student's exam pass/fail result based on past exam scores. All the source code related to this post is available on gitlab. Please clone the repo and follow along with the post.

Data set

The data set that I’m using to build the model has historical data of students. It contains their scores in the first two exams and a label column which shows whether each student was able to pass the 3rd and final exam or not. The exam result data set exists in the gitlab repo as a .CSV file. Each exam record carries the two scores (score1, score2) and the result.

Load data set

To build the Logistic Regression model from this data set, we first need to load it into a Spark DataFrame. The data is loaded into the DataFrame from the .CSV file based on the schema.
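Loading with a schema boils down to parsing each CSV line into typed fields. The sketch below shows that idea in plain Python with the standard csv module; it is not the Spark DataFrame reader. The column types, the ExamRecord name and the four sample rows are assumptions made up for illustration, not rows from the real data set.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ExamRecord:
    """Assumed schema of one exam record."""
    score1: float
    score2: float
    result: int  # 1 = pass, 0 = fail


# Made-up sample rows standing in for the real .CSV file.
SAMPLE_CSV = """score1,score2,result
34.6,78.0,0
60.2,86.3,1
79.0,75.3,1
45.1,56.3,0
"""


def load_records(source):
    """Parse CSV text into typed records, one per data line."""
    reader = csv.DictReader(source)
    return [
        ExamRecord(float(row["score1"]), float(row["score2"]), int(row["result"]))
        for row in reader
    ]


records = load_records(io.StringIO(SAMPLE_CSV))
print(records[0])
```

Declaring the types up front means a malformed value fails loudly at load time instead of surfacing later during training.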

Add feature column

We need to transform the features of the DataFrame records (the score1 and score2 values of each record) into a feature vector. For the features to be usable by a machine learning algorithm, this vector needs to be added to the DataFrame as a feature column. This can be done with VectorAssembler.
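Conceptually, assembling features just copies the chosen columns of every record into a single vector-valued column. Here is a plain-Python sketch of that transformation on a list of dicts. It mimics the idea, not the VectorAssembler API, and the output column name "features" is an assumption.

```python
def assemble_features(rows, input_cols, output_col="features"):
    """Return new rows with the input columns collected into one list."""
    return [
        {**row, output_col: [float(row[col]) for col in input_cols]}
        for row in rows
    ]


# Made-up records for illustration.
rows = [
    {"score1": 34.6, "score2": 78.0, "result": 0},
    {"score1": 60.2, "score2": 86.3, "result": 1},
]

with_features = assemble_features(rows, ["score1", "score2"])
print(with_features[0]["features"])
```

The original columns are kept and the input rows are not modified, which matches the description of a transformation that returns a new data set.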

Add label column

Next, we need to add a label column to the DataFrame with the values of the result column (pass or fail, 1 or 0). StringIndexer can be used for that. It returns a new DataFrame with a label column added, derived from the values of the result column.
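One detail worth knowing: by default StringIndexer orders the distinct values by descending frequency, so the most frequent value gets index 0.0. The plain-Python sketch below imitates that behaviour to show what the label column ends up holding. It is an illustration, not Spark's implementation; breaking ties alphabetically and the "pass"/"fail" sample values are assumptions of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(str(v) for v in values)
    # Ties are broken alphabetically (an assumption of this sketch).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


def add_label(rows, index, input_col="result", output_col="label"):
    """Return new rows with the indexed value added as the label column."""
    return [{**row, output_col: index[str(row[input_col])]} for row in rows]


# Made-up records for illustration.
rows = [{"result": r} for r in ["pass", "fail", "pass", "pass", "fail"]]

index = fit_string_index(row["result"] for row in rows)
labelled = add_label(rows, index)
print(index)
```

In this example "pass" is more frequent, so it is the value that maps to 0.0. It is worth checking the mapping before reading a label of 1.0 as "pass".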

Build Logistic Regression model

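Next we can build the Logistic Regression model by defining maxIter (the maximum number of training iterations), regParam (the regularization strength) and elasticNetParam (the mix between L2 and L1 regularization, where 0 is pure L2 and 1 is pure L1). In order to train and test the model, the data set needs to be split into a training data set and a test data set: 70% of the data is used to train the model and 30% is used for testing.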
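To show what the split and the parameters do, here is a small self-contained sketch in plain Python. It is not how Spark trains the model: it swaps in simple batch gradient descent, supports only an L2 penalty (the elasticNetParam = 0 case), and uses an exact 70/30 cut, whereas Spark's random split is only approximately 70/30. The learning_rate argument, the seed and the ten standardized toy rows are all made up for illustration.

```python
import math
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, max_iter=1000, reg_param=0.0, learning_rate=0.5):
    """Batch gradient descent on the L2-regularized logistic loss."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(max_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(d):
                grad_w[j] += error * x[j]
            grad_b += error
        for j in range(d):
            # The reg_param term shrinks the weights (the bias is not penalized).
            weights[j] -= learning_rate * (grad_w[j] / n + reg_param * weights[j])
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy rows: ([standardized score1, standardized score2], label).
ROWS = [
    ([-1.0, -0.8], 0), ([-1.2, -0.6], 0), ([-0.6, -0.5], 0),
    ([-0.4, -0.9], 0), ([-0.9, -1.1], 0),
    ([0.8, 0.9], 1), ([0.6, 1.1], 1), ([1.2, 0.4], 1),
    ([1.0, 0.8], 1), ([0.5, 1.3], 1),
]

train, test = train_test_split(ROWS)
weights, bias = fit_logistic(
    [x for x, _ in train], [y for _, y in train], max_iter=1000, reg_param=0.01
)


def predict(x):
    """Class 1 when the weighted sum is non-negative (probability >= 0.5)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else 0


test_accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(len(train), len(test), test_accuracy)
```

Holding back the test rows matters because accuracy measured on the training rows says little about how the model behaves on students it has never seen.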

The same model can be built with a Spark Pipeline. A Pipeline consists of a sequence of Transformers and Estimators. A Transformer is an ML Pipeline component that transforms a DataFrame into another DataFrame by using the transform() function. StringIndexer and VectorAssembler are the transformers in our pipeline (strictly speaking, StringIndexer is itself an Estimator whose fitted StringIndexerModel does the transforming). An Estimator is a learning algorithm that is trained on the data. It implements a fit() method, which accepts a DataFrame and produces a machine learning Model. LogisticRegression is the estimator of the pipeline. The same logistic regression model can therefore be built by chaining these stages in a pipeline.
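The transformer/estimator contract is easy to see in a stripped-down plain-Python sketch. This is not Spark's Pipeline class, only a simplified imitation of the pattern: stages with fit() are trained and replaced by the model they return, and every stage's transform() feeds the next one. All class names here are hypothetical, and ThresholdEstimator is a toy stand-in for LogisticRegression, chosen to keep the example short.

```python
class AssembleFeatures:
    """Transformer: collects the input columns into a 'features' list."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [{**r, "features": [float(r[c]) for c in self.input_cols]} for r in rows]


class ThresholdModel:
    """Transformer produced by ThresholdEstimator.fit()."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if sum(r["features"]) / len(r["features"]) >= self.threshold else 0.0}
            for r in rows
        ]


class ThresholdEstimator:
    """Toy estimator: learns the midpoint between the two class means."""

    def fit(self, rows):
        means = {0: [], 1: []}
        for r in rows:
            means[r["label"]].append(sum(r["features"]) / len(r["features"]))
        midpoint = (sum(means[0]) / len(means[0]) + sum(means[1]) / len(means[1])) / 2
        return ThresholdModel(midpoint)


class PipelineModel:
    """Fitted pipeline: every stage is a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator -> model (a transformer)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


# Made-up training rows for illustration.
training = [
    {"score1": 40, "score2": 45, "label": 0},
    {"score1": 50, "score2": 35, "label": 0},
    {"score1": 80, "score2": 85, "label": 1},
    {"score1": 90, "score2": 70, "label": 1},
]

model = Pipeline([AssembleFeatures(["score1", "score2"]), ThresholdEstimator()]).fit(training)
scored = model.transform([{"score1": 75, "score2": 80}, {"score1": 30, "score2": 50}])
print([r["prediction"] for r in scored])
```

The benefit of the pattern is that the fitted pipeline carries every preprocessing step with it, so new data goes through exactly the same transformations as the training data.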

Evaluate model

A common metric used to evaluate the performance of a Logistic Regression model is the Area Under the ROC Curve (AUC). We can use the BinaryClassificationEvaluator to obtain the AUC. It requires two columns to evaluate the model: the label and the prediction (by default it reads the prediction scores from the rawPrediction column).
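AUC has a handy interpretation: it is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. The plain-Python sketch below computes it directly from that definition. It illustrates the metric itself rather than the way the Spark evaluator calculates it, and the labels and scores are made up.

```python
def area_under_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = area_under_roc(labels, scores)
print(auc)
```

A value of 1.0 means every positive is ranked above every negative, while 0.5 is what random scoring would give.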

Save model

The built Logistic Regression model can be persisted to disk. A persisted model can be reloaded and used later in a different Spark application.

Use model

Finally, the Logistic Regression model can be used to classify new data. For example, it can predict the pass/fail status of new students from their two past exam scores.
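The last two steps, persisting a model and using it on new data, can be sketched together in plain Python. Spark models come with their own save and load support; this example only mimics the idea by writing the learned parameters to a JSON file and reading them back. The parameter values, the file name and the new students' scores are all hypothetical.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Write the model parameters to disk as JSON."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Read the model parameters back from disk."""
    with open(path) as f:
        return json.load(f)


def predict(model, score1, score2):
    """Return 1 (pass) or 0 (fail) for one student."""
    z = model["weights"][0] * score1 + model["weights"][1] * score2 + model["bias"]
    # A probability of at least 0.5 is the same as z >= 0.
    return 1 if z >= 0 else 0


# Hypothetical parameters standing in for a trained model.
model = {"weights": [0.05, 0.05], "bias": -6.0}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "exam-model.json")
    save_model(model, path)
    reloaded = load_model(path)

# Made-up (score1, score2) pairs for students not seen before.
new_students = [(85, 90), (35, 40)]
predictions = [predict(reloaded, s1, s2) for s1, s2 in new_students]
print(predictions)
```

Because the reloaded parameters are identical to the saved ones, the application that does the scoring does not need access to the training data at all.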

