Machine Learning with PySpark MLlib and Keras — Solving a Chronic Kidney Disease Problem

Ayesha Shafique
5 min read · Apr 26, 2019


This article is a continuation of my previous articles on PySpark MLlib. Spark, as defined by its creators, is a fast and general engine for large-scale data processing.

The fast part means that it is faster than previous approaches to working with Big Data, such as classical MapReduce. The secret to its speed is that Spark runs in memory (RAM), which makes processing much faster than on disk.

The general part means that it can be used for many things, like running distributed SQL, creating data pipelines, ingesting data into a database, running machine learning algorithms, working with graphs and data streams, and much more.

At this point, after so many blogs related to PySpark MLlib, some of you may ask, quite reasonably, ‘Why do this?’ Well, the simple answer can be demonstrated by a little pop quiz:

You have a nice computing cluster, populated with 256 nodes, 8 CPUs per node, 16 CPU cores per CPU, and 4 hyper-threads per core. What is the maximum possible number of concurrently running parallel threads?

Image retrieved from: https://systemml.apache.org/

The answer, of course, is <insert drum-roll here>: 131,072 simultaneously running parallel threads, each doing part of your work! (= 256 nodes * 8 CPUs per node * 16 CPU cores per CPU * 4 hyper-threads per core). In this manner, Apache Spark provides an open-source, distributed, general-purpose cluster-computing framework which allows you to manipulate your data and perform computations in parallel.

Exploring The Data

We will use the same dataset as when we built a Logistic Regression model with PySpark, and it relates to a chronic kidney disease problem. The classification goal is to predict whether a person has kidney disease or not. I discussed the details of the dataset in my previous blog; the main purpose of this one is to explore Keras alongside PySpark MLlib.
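As a quick starting point, here is a minimal sketch of loading the data into a Spark DataFrame. The file name kidney_disease.csv and the session settings are my assumptions, so adjust them to your own setup:

```python
# Hedged sketch: load the chronic kidney disease CSV into a Spark DataFrame.
# The file name "kidney_disease.csv" is an assumption.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ChronicKidneyDisease").getOrCreate()

# header=True keeps the column names, inferSchema=True guesses the column types
df = spark.read.csv("kidney_disease.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)
```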

1) Indexing of Categorical Features

Our dataframe contains both numeric and categorical features. To feed the features into our machine learning model, we have to transform all categorical attributes into numeric ones by indexing them. Whether it is an input feature or the label column, this has to be done before we can train the model.

For the input features of our model, we name the categorical features and transform them:
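The original snippet is not reproduced here, so below is a hedged sketch of this step using StringIndexer; the categorical column names follow the UCI chronic kidney disease dataset and may differ from the ones in your DataFrame:

```python
# Hedged sketch: index every categorical input column into a new "<name>Index"
# column, leaving the original string columns intact. Column names are assumed
# from the UCI chronic kidney disease dataset.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

categorical_cols = ['rbc', 'pc', 'pcc', 'ba', 'htn',
                    'dm', 'cad', 'appet', 'pe', 'ane']

indexers = [
    StringIndexer(inputCol=c, outputCol=c + "Index", handleInvalid="keep")
    for c in categorical_cols
]

df = Pipeline(stages=indexers).fit(df).transform(df)
```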

In the above lines of code, we simply name the categorical features and transform them into numeric variables. Remember that we did not overwrite the original features; instead, we created new attributes by concatenating the name of each previous feature with the string “Index”, so that we can feed the model only the features it needs while keeping the original ones intact.

For the label column of our data frame, ‘class’:
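Along the same lines, here is a minimal sketch of indexing the ‘class’ column into a numeric label; the output column name ‘label’ is my choice:

```python
# Hedged sketch: turn the string label 'class' (e.g. "ckd" / "notckd")
# into a numeric 'label' column for training.
from pyspark.ml.feature import StringIndexer

label_indexer = StringIndexer(inputCol="class", outputCol="label")
df = label_indexer.fit(df).transform(df)
```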

2) Typecasting of Features

In a PySpark dataframe, we have to specify the data types of the continuous feature attributes. All the numeric variables that are not discrete have to be typecast so that they can later be fed into a machine learning model.
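A hedged sketch of this step; the list of numeric column names is an assumption based on the UCI dataset:

```python
# Hedged sketch: cast every continuous feature to DoubleType.
# The numeric column names are assumed from the UCI dataset.
from pyspark.sql.types import DoubleType

numeric_cols = ['age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu',
                'sc', 'sod', 'pot', 'hemo', 'pcv', 'wc', 'rc']

for c in numeric_cols:
    df = df.withColumn(c, df[c].cast(DoubleType()))
```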

In the above lines of code, we typecast our numeric features to DoubleType.

3) Assembling of Input Features

In this step, we assemble all the features we want to feed into the model. We provide the list of typecast numeric features along with the transformed categorical attributes and combine them into a single feature vector.
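A minimal sketch of this step with VectorAssembler, reusing the column lists from the earlier snippets:

```python
# Hedged sketch: combine the typecast numeric columns and the indexed
# categorical columns into a single 'features' vector column.
from pyspark.ml.feature import VectorAssembler

input_cols = numeric_cols + [c + "Index" for c in categorical_cols]

# handleInvalid="skip" drops rows with missing values (requires Spark 2.4+)
assembler = VectorAssembler(inputCols=input_cols, outputCol="features",
                            handleInvalid="skip")
df = assembler.transform(df)
```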

4) Normalization of Input Features

As we can observe, our input features are not all on the same scale, so the recommended approach is to normalize them first and then feed them into the model for better results.
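The post does not show which scaler was used, so here is one reasonable sketch with StandardScaler:

```python
# Hedged sketch: scale the assembled feature vector to unit standard deviation.
# withMean is left at its default (False) so sparse vectors are handled cheaply.
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
df = scaler.fit(df).transform(df)
```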

5) Distribution of Dataset

Now that we have prepared our input features in a PySpark dataframe, it is the right time to define our training and testing datasets, so that we can train the model on a sufficiently large training set and later use the unseen test set to evaluate its performance.
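A minimal sketch of the split; the 80/20 ratio matches the 320/80 sample counts reported in the training section, and the seed is arbitrary:

```python
# Hedged sketch: split the prepared DataFrame into training and test sets.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

print("Training rows:", train_df.count())
print("Test rows:", test_df.count())
```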

6) Configuration of the Keras Model

Below is a snippet of my Keras code, which shows how the ReLU and sigmoid activation functions are used: sigmoid appears only in the output layer, while ReLU is used in the hidden layers.
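Since the original snippet is not embedded here, the following is a hedged reconstruction of such a model; the number and size of the hidden layers are my assumptions, not the exact architecture from the post:

```python
# Hedged sketch of the Keras model: ReLU in the hidden layers, sigmoid in the
# output layer. Layer sizes are assumptions.
from keras.models import Sequential
from keras.layers import Dense

input_dim = len(input_cols)  # number of assembled input features

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=input_dim))  # hidden layer
model.add(Dense(16, activation='relu'))                       # hidden layer
model.add(Dense(1, activation='sigmoid'))                     # output: P(disease)

model.summary()
```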

Sigmoid Activation Function

The sigmoid function has the following useful properties:

The function is differentiable, which means we can find the slope of the sigmoid curve at any point.

The function is monotonic, but its derivative is not.

Because we want to produce a probability of whether the person has the disease or not (i.e. binary classification), we can use a sigmoid activation in the output layer.

The sigmoid function curve looks like an S-shape.

Fig: Sigmoid Function

ReLU (Rectified Linear Unit) Activation Function

ReLU is the most used activation function in the world right now, since it appears in almost all convolutional neural networks and deep learning models.

Fig: ReLU v/s Logistic Sigmoid

As you can see, ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.

Range: [0, ∞)

Both the function and its derivative are monotonic.
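To make the two activation functions concrete, here is a small NumPy illustration (not from the original post):

```python
# Plain NumPy versions of the two activations discussed above.
import numpy as np

def sigmoid(z):
    # squashes any real z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # passes positive values through, clips negatives to zero
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # approx [0.12, 0.5, 0.88]
print(relu(z))     # [0. 0. 2.]
```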

Further, I will use binary cross-entropy as the loss function and Adam as the optimizer.

7) Training of the Keras Model
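Keras trains on NumPy arrays rather than Spark DataFrames, so the feature vectors have to be collected to the driver first. The exact conversion used in the original post is not shown, so the sketch below (including the epoch count and batch size) is one possible way to do it:

```python
# Hedged sketch of training: pull the scaled features and labels out of the
# Spark DataFrames into NumPy arrays, then compile and fit the Keras model.
import numpy as np

def to_numpy(spark_df):
    rows = spark_df.select("scaledFeatures", "label").collect()
    X = np.array([r["scaledFeatures"].toArray() for r in rows])
    y = np.array([r["label"] for r in rows])
    return X, y

X_train, y_train = to_numpy(train_df)
X_test, y_test = to_numpy(test_df)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=50, batch_size=16)
```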

After training on 320 samples and validating on 80 samples, I got an accuracy of 0.6 😆

So congratulations, we have successfully built a Keras model on the chronic kidney disease dataset, going from exploratory data analysis to the evaluation of the machine learning model with PySpark MLlib and covering all the aspects of a machine learning pipeline. Here is the link to the complete PySpark machine learning GitHub repository.

I tried my best to deliver all the knowledge in my brain regarding the implementation of a PySpark machine learning model in Python. If you enjoyed this blog (and I hope you did :P), hit the like button ❤.

