Logistic Regression Model Using PySpark MLlib — Solving a Chronic Kidney Disease Problem

Ayesha Shafique
5 min read · Apr 26, 2019


In this blog, we continue our series on building a machine learning pipeline with PySpark. In the previous post, we performed exploratory data analysis on a supervised dataset using PySpark dataframes; today we will learn how to train a machine learning model with PySpark MLlib to build a more robust and scalable pipeline for distributed systems.

PySpark MLlib is Spark's machine learning library. It is a wrapper over PySpark Core that lets us run machine learning algorithms from Python. This article gives a comprehensive, end-to-end overview of how to feed training data into a machine learning pipeline to predict chronic kidney disease using PySpark MLlib, and then evaluate the model on unseen data to estimate its performance in a testing environment.

Before we start, please note that this article is a continuation of my previous one, in which I did all the exploratory data analysis and the preprocessing needed to clean the data. So, in this article, we will focus on building a PySpark Logistic Regression model to predict chronic kidney disease and on evaluating it with testing data.

Have a look at the table of contents.

Table of Contents

1) Encoding of Categorical Variables

2) Typecasting of Features

3) Assembling of Input Features

4) Normalization of Input Features

5) Distribution of Dataset

6) Configuration of the Logistic Regression Model

7) Definition of the Machine Learning Pipeline

8) Train Logistic Regression Model

9) Prediction via Logistic Regression Model

10) Evaluation of Testing Data

Let's dive deep into this PySpark MLlib blog.

1) Encoding of Categorical Variables

Our dataframe contains both numeric and categorical features. To feed the features into our machine learning model, however, we have to transform every categorical attribute into a numeric one by indexing it. Whether it is an input feature or the label column of the model, we must do this before we can train the model.

For the input features of our model, we name the categorical features and transform them:
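The original code embed did not survive here, so below is a minimal sketch of this step using StringIndexer, assuming df is the cleaned dataframe from the previous article. The column names in categorical_features are assumptions based on the chronic kidney disease dataset, not taken from the original post.

from pyspark.ml.feature import StringIndexer

# Categorical input features (column names are assumed from the CKD dataset)
categorical_features = ['rbc', 'pc', 'pcc', 'ba', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane']

# Create a new "<name>Index" column for each categorical feature,
# leaving the original column intact
for col_name in categorical_features:
    indexer = StringIndexer(inputCol=col_name, outputCol=col_name + 'Index')
    df = indexer.fit(df).transform(df)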

In the above lines of code, we simply name the categorical features and transform them into numeric variables. Remember that we did not overwrite the original features; instead, we created new attributes by concatenating the name of each feature with the string "Index". This way we can feed the model only the features we need while keeping the originals intact.

For the label column of our dataframe, which is 'class':
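The embedded snippet is likewise missing; continuing from the sketch above, the label can be indexed the same way (the output column name 'label' is an assumption):

# Index the 'class' label column into a numeric 'label' column
label_indexer = StringIndexer(inputCol='class', outputCol='label')
df = label_indexer.fit(df).transform(df)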

2) Typecasting of Features

In a PySpark dataframe, we have to declare the data types of the continuous feature attributes. All numeric variables that are not discrete must be typecast so that we can later feed them into the machine learning model.
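A plausible sketch of this step; the list of numeric columns is again an assumption based on the CKD dataset:

from pyspark.sql.types import DoubleType

# Continuous numeric features (column names are assumed from the CKD dataset)
numeric_features = ['age', 'bp', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo']

# Cast each continuous column to DoubleType
for col_name in numeric_features:
    df = df.withColumn(col_name, df[col_name].cast(DoubleType()))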

In the above lines of code, we typecast our numeric features to Double type.

3) Assembling of Input Features

In this step, we assemble all the features that we want to feed into the model. We provide the list of the typecast numeric features together with the transformed categorical attributes and combine them into a single feature vector.
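A minimal sketch of this step with VectorAssembler, reusing the lists defined in the snippets above:

from pyspark.ml.feature import VectorAssembler

# Combine the typecast numeric columns and the indexed categorical columns
# into a single vector column named 'features'
input_cols = numeric_features + [c + 'Index' for c in categorical_features]
assembler = VectorAssembler(inputCols=input_cols, outputCol='features')
df = assembler.transform(df)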

4) Normalization of Input Features

As we can observe, our input features are not all on the same scale, so the recommended approach is to normalize them first and then feed them into the model for better results.
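The post does not show which scaler was used, so the choice below is an assumption; StandardScaler is one reasonable option (MinMaxScaler or Normalizer would work similarly):

from pyspark.ml.feature import StandardScaler

# Standardize the assembled feature vector to zero mean and unit variance
scaler = StandardScaler(inputCol='features', outputCol='scaledFeatures',
                        withMean=True, withStd=True)
df = scaler.fit(df).transform(df)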

5) Distribution of Dataset

Now that our input-feature dataframe is prepared, it is the right time to define our training and testing datasets, so that we can train the model on a sufficiently large training set and later use the unseen test set to evaluate the performance of our Logistic Regression model.
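A typical split looks like the following; the 70/30 ratio and the seed are assumptions, since the original split is not shown:

# Randomly split the data into training and testing sets
df_train, df_test = df.randomSplit([0.7, 0.3], seed=42)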

6) Configuration of the Logistic Regression Model

Before building the machine learning pipeline, we have to configure our model using PySpark MLlib, defining the structure of the Logistic Regression estimator with some initial parameters. This is an important step before establishing the pipeline.
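A sketch of the configuration; the column names follow the snippets above, and the parameter values (maxIter, regParam) are assumptions rather than the post's original settings:

from pyspark.ml.classification import LogisticRegression

# Define the estimator: which column holds the features, which holds the
# label, and some initial model parameters
lr = LogisticRegression(featuresCol='scaledFeatures', labelCol='label',
                        maxIter=10, regParam=0.01)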

In the above lines of code, we have defined the input features column of the PySpark dataframe, the label column of the model, and some model-specific parameters.

7) Definition of the Machine Learning Pipeline

In this step, we roll all the stages we have prepared into a machine learning pipeline. For this purpose, we take the instantiated logistic regression model and place it in our configured pipeline.

from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[lr])
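Note that only the logistic regression estimator is included as a pipeline stage here, since the indexing, assembling, and scaling transformations were already applied to the dataframe in the earlier steps. Alternatively, those transformers could themselves be added as pipeline stages, so that the whole preprocessing chain is refit and reapplied automatically.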

8) Train Logistic Regression Model

To train our model, we call the 'fit' function of the configured pipeline and feed it the training dataset as an argument. Remember that this training dataset is separate from the testing dataset, as we have already split our data.

model = pipeline.fit(df_train)

9) Prediction via Logistic Regression Model

Once our logistic regression model is trained on the training dataset, we can predict on that same dataset to see how well the model performs on data it has seen. For prediction, we call the 'transform' function on the trained model, which returns the model's predictions.

prediction = model.transform(df_train)

10) Evaluation of Testing Data

We have trained our model and obtained simple predictions on the training dataset. Now it is the right time to evaluate the model on the testing data. Testing tells us, in general terms, how well or how poorly the model performs on unseen data.
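The evaluation code is not shown in the post; a minimal sketch using BinaryClassificationEvaluator with area under the ROC curve (the choice of metric is an assumption) could look like this:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Predict on the held-out test set and measure area under the ROC curve
test_prediction = model.transform(df_test)
evaluator = BinaryClassificationEvaluator(labelCol='label',
                                          rawPredictionCol='rawPrediction',
                                          metricName='areaUnderROC')
auc = evaluator.evaluate(test_prediction)
print('Test AUC: {:.3f}'.format(auc))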

So congratulations: we have successfully built a machine learning pipeline for the chronic kidney disease dataset, from exploratory data analysis to model evaluation, using PySpark MLlib and covering all the aspects of a machine learning pipeline. Here is the link to the complete PySpark machine learning GitHub repository.

I tried my best to share all the knowledge I have about implementing a PySpark machine learning model in Python. If you enjoyed this blog (and I hope you did :P), hit the like button ❤.
