Logistic Regression with Apache Spark
Happy ML
This is the second part of my Happy ML blog series. In the previous post I talked about machine learning basics and the K-Means unsupervised machine learning algorithm. In this post I'm going to discuss the Logistic Regression supervised machine learning algorithm with an example.
About Logistic Regression
Logistic Regression is a popular supervised machine learning algorithm which can be used to predict a categorical response. It is used to solve classification type machine learning problems. Classification involves looking at data and assigning a class (or a label) to it. Usually there is more than one class; when there are exactly two classes (0 or 1), the problem is known as Binary Classification.
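To make the idea of a binary classifier concrete, here is a minimal plain-Python sketch of how a logistic regression model turns features into a class. It passes a weighted sum of the features through the sigmoid function to get a probability, then applies a threshold. The coefficient values and the 0.5 threshold are assumptions chosen for illustration, not values learned from the exam data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of class 1 for one record: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict(weights, bias, features, threshold=0.5):
    """Return class 1 if the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0


# Hypothetical coefficients, picked only to make the example concrete.
weights, bias = [0.05, 0.05], -5.0

print(predict(weights, bias, [70.0, 65.0]))  # 1 (pass)
print(predict(weights, bias, [30.0, 40.0]))  # 0 (fail)
```

Training a model means finding the weights and bias that make these probabilities match the known labels as closely as possible.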
In this post I'm going to use the Logistic Regression algorithm to build a machine learning model with Apache Spark (if you are new to Apache Spark, please find more information here). The model is a binary classifier that predicts a student's exam pass/fail result based on past exam scores. All the source code related to this post is available on gitlab. Please clone the repo and continue with the post.
Data set
The data set that I'm using to build the model has historical data about students. It contains their scores in the first two exams and a label column which shows whether each student was able to pass the third and final exam. The exam result data set exists in the gitlab repo as a .CSV file. Following is the structure/schema of a single exam record.
Load data set
To build a Logistic Regression model from this data set, we first need to load it into a Spark DataFrame. Following is the way to do that. It loads the data into a DataFrame from the .CSV file based on the schema.
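As a rough illustration of what loading "based on the schema" means, here is a plain-Python sketch that reads CSV rows and casts every column to a declared type. It is not the Spark reader. The column names score1, score2 and result come from this post, while the column types, the header row and the sample values are assumptions.

```python
import csv
import io

# Column name -> Python type. The names follow the post; the types are assumed.
SCHEMA = {"score1": float, "score2": float, "result": int}


def load_records(csv_file, schema=SCHEMA):
    """Read CSV rows and cast every column to the type the schema declares."""
    reader = csv.DictReader(csv_file)
    return [
        {name: cast(row[name]) for name, cast in schema.items()}
        for row in reader
    ]


# Made-up rows standing in for the real file, which lives in the repo.
sample = io.StringIO("score1,score2,result\n34.6,78.0,0\n60.2,86.3,1\n")
records = load_records(sample)
print(records[0])
```

Declaring the types up front means a malformed value fails at load time instead of surfacing later during training.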
Add feature column
We need to transform the features of the DataFrame records (the score1 and score2 values of each record) into a feature vector. For the features to be used by a machine learning algorithm, this vector needs to be added to the DataFrame as a feature column. Following is the way to do that with VectorAssembler.
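To show what this assembling step produces, here is a minimal plain-Python sketch. It is not the VectorAssembler API. The assemble helper, the features column name and the sample record are made up for illustration.

```python
def assemble(row, input_cols, output_col="features"):
    """Return a copy of the row with the input columns packed into one list."""
    assembled = dict(row)
    assembled[output_col] = [float(row[col]) for col in input_cols]
    return assembled


row = {"score1": 60.2, "score2": 86.3, "result": 1}  # made-up record
print(assemble(row, ["score1", "score2"]))
```

The original columns stay in place. The record simply gains one extra column that holds all the inputs the learning algorithm will read.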
Add label column
Next, we need to add a label column to the DataFrame with the values of the result column (pass or fail, 1 or 0). StringIndexer can be used for that. It returns a new DataFrame with a label column derived from the value of the result column.
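One detail is worth knowing here: by default StringIndexer orders values by descending frequency, so the most frequent value gets index 0.0 and the label is not guaranteed to equal the original result value. The plain-Python sketch below mimics that ordering rule. The fit_string_indexer helper and the sample values are hypothetical, and it is not the Spark implementation.

```python
from collections import Counter


def fit_string_indexer(values):
    """Assign index 0.0 to the most frequent value, 1.0 to the next, and so on."""
    counts = Counter(str(v) for v in values)
    # Sort by descending frequency; break ties alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(index) for index, value in enumerate(ordered)}


results = [1, 1, 0, 1, 0]  # made-up result column: three passes, two fails
mapping = fit_string_indexer(results)
labels = [mapping[str(r)] for r in results]

print(mapping)  # {'1': 0.0, '0': 1.0}
print(labels)   # [0.0, 0.0, 1.0, 0.0, 1.0]
```

In this made-up sample the passes outnumber the fails, so a pass ends up with label 0.0. It is worth checking the mapping on the real data before interpreting predictions.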
Build Logistic Regression model
Next we can build the Logistic Regression model by defining maxIter, regParam and elasticNetParam. In order to train and test the model, the data set needs to be split into a training data set and a test data set. 70% of the data is used to train the model, and 30% is used for testing.
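Here is a small plain-Python sketch of what these settings mean. maxIter caps the number of optimizer iterations. regParam and elasticNetParam together define the regularization penalty added to the loss, which, as I understand the Spark documentation, has the elastic net form shown in the first function. The second function sketches a random 70/30 split. Both helpers, the seed and the sample weights are assumptions for illustration, not Spark code.

```python
import random


def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2), alpha = elastic_net_param."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


def random_split(rows, train_fraction=0.7, seed=42):
    """Send each row to the training set with probability train_fraction."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


weights = [3.0, -4.0]  # hypothetical coefficients
print(elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0))  # pure L2
print(elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0))  # pure L1

train, test = random_split(list(range(100)))
print(len(train), len(test))  # roughly 70 / 30, not exactly
```

With elasticNetParam at 0 the penalty is pure L2 (ridge), at 1 it is pure L1 (lasso), and values in between mix the two. Because each row is assigned at random, the split sizes are only approximately 70% and 30%.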
The same model can be built with a Spark Pipeline. A Pipeline consists of a sequence of Transformers and Estimators. A Transformer is an ML Pipeline component that transforms a DataFrame into another DataFrame by using the transform() function. StringIndexer and VectorAssembler supply the transformation steps of our pipeline (strictly speaking, StringIndexer is itself an Estimator whose fitted model does the transforming). An Estimator is a learning algorithm that is trained on data. It implements a method fit(), which accepts a DataFrame and produces a machine learning Model. LogisticRegression is the estimator of the pipeline. Following is the way to build the same logistic regression model by using the pipeline.
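The contract between transformers and estimators is easier to see in a toy version. The plain-Python sketch below mirrors the fit()/transform() idea with hypothetical classes. The "estimator" learns a trivial cut-off on the summed scores instead of running logistic regression, and none of this is the Spark API.

```python
class Assembler:
    """Transformer: adds a 'features' list built from the input columns."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [{**r, "features": [r[c] for c in self.input_cols]} for r in rows]


class ThresholdModel:
    """Fitted model; it is itself a transformer that adds a 'prediction'."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [
            {**r, "prediction": 1 if sum(r["features"]) >= self.threshold else 0}
            for r in rows
        ]


class ThresholdEstimator:
    """Estimator stand-in: a deliberately trivial rule, not logistic regression."""

    def fit(self, rows):
        passed = [sum(r["features"]) for r in rows if r["label"] == 1]
        failed = [sum(r["features"]) for r in rows if r["label"] == 0]
        # Midpoint between the lowest passing total and the highest failing total.
        return ThresholdModel((min(passed) + max(failed)) / 2.0)


class PipelineModel:
    """Result of fitting a pipeline: a chain of transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; estimators are fitted and replaced by their models."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> fitted model
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


# Made-up training records and new students.
training = [
    {"score1": 80, "score2": 70, "label": 1},
    {"score1": 30, "score2": 40, "label": 0},
    {"score1": 65, "score2": 75, "label": 1},
    {"score1": 50, "score2": 35, "label": 0},
]
new_students = [{"score1": 90, "score2": 60}, {"score1": 20, "score2": 45}]

model = Pipeline([Assembler(["score1", "score2"]), ThresholdEstimator()]).fit(training)
print([r["prediction"] for r in model.transform(new_students)])  # [1, 0]
```

The point of the pattern is that the fitted pipeline replays exactly the same preparation steps on new data that were applied to the training data.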
Evaluate model
A common metric used to evaluate the quality of a Logistic Regression model is the Area Under the ROC Curve (AUC). We can use the BinaryClassificationEvaluator to obtain the AUC. It requires two columns to evaluate the model: the label column and a prediction column (by default it reads the raw prediction from a column named rawPrediction).
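For intuition about what the AUC measures, here is a plain-Python sketch. It uses the rank interpretation of the metric: the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. This is an assumption-level illustration rather than how the evaluator is implemented, and the labels and scores are made up.

```python
def area_under_roc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]              # made-up ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]  # made-up predicted probabilities
hard = [1 if s >= 0.5 else 0 for s in scores]  # thresholded 0/1 predictions

print(area_under_roc(labels, scores))
print(area_under_roc(labels, hard))
```

An AUC of 1.0 means every positive is ranked above every negative, and 0.5 is what random guessing gives. Feeding thresholded 0/1 predictions instead of scores discards the ranking information within each class, so the two numbers generally differ.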
Save model
The built Logistic Regression model can be persisted to disk. A persisted model can be reloaded and used later in a different Spark application.
Use model
Finally, the Logistic Regression model can be used to classify new data. The following example shows how to predict the pass/fail status (classification) of new students from their two past exam scores.