Gradient Boosting Model Using PySpark MLlib — Solving a Chronic Kidney Disease Problem

Ayesha Shafique
4 min read · Apr 26, 2019


Apache Spark, which began as a component of the Hadoop ecosystem, is now becoming the big-data platform of choice for enterprises. It is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing, all with high speed, ease of use, and a standard interface.

In the industry, there is a big demand for a powerful engine that can do all of the above. Sooner or later, your company or your clients will be using Spark to develop sophisticated models that enable you to discover new opportunities or avoid risk. Spark is not hard to learn; if you already know Python and SQL, it is very easy to get started. Let's give it a try today!

Exploring The Data

We will use the same dataset as when we built a Logistic Regression model with PySpark; it relates to a chronic kidney disease problem. The classification goal is to predict whether a person has kidney disease or not. Before proceeding, please check my previous blog on this dataset, where I give detailed instructions on how to clean the data and impute missing values with the mean and mode.
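For reference, here is a minimal sketch of how the dataset can be loaded (the file name kidney_disease.csv and the inferred schema are assumptions; point it at wherever your copy of the CKD data lives):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ckd-gbt").getOrCreate()

# Load the chronic kidney disease dataset (path and options are assumptions)
df = spark.read.csv("kidney_disease.csv", header=True, inferSchema=True)
df.printSchema()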

1) Encoding of Categorical Variables

Our dataframe contains both numeric and categorical features. Before we can feed the features into a machine learning model, we have to transform every categorical attribute into a numeric one by indexing it. This applies both to the input features and to the label column.

For the input features of our model, we name the categorical features and transform them:
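A minimal sketch of this step using StringIndexer is given below; the exact column names are assumptions based on the UCI chronic kidney disease dataset, so match them to your own dataframe:

from pyspark.ml.feature import StringIndexer

# Categorical columns of the CKD dataset (names are assumptions; adjust to your data)
categorical_cols = ["rbc", "pc", "pcc", "ba", "htn", "dm", "cad", "appet", "pe", "ane"]

# Index each categorical column into a new numeric column named "<col>Index",
# leaving the original column intact
for col_name in categorical_cols:
    indexer = StringIndexer(inputCol=col_name, outputCol=col_name + "Index")
    df = indexer.fit(df).transform(df)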

In the above lines of code, we simply name the features that are categorical and transform them into numeric variables. Note that we didn't overwrite the features; instead, we created new attributes by appending the string "Index" to the name of each original feature. This way we can feed the model only the features we need for training while keeping the originals intact.

For the label column of our data frame, which is 'class':
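A sketch of the same indexing applied to the label, producing the "label" column that the classifier below expects:

# Index the 'class' column into a numeric "label" column
label_indexer = StringIndexer(inputCol="class", outputCol="label")
df = label_indexer.fit(df).transform(df)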

2) Typecasting of Features

In a PySpark dataframe, we have to declare the data types of the continuous feature attributes. All numeric variables that are not discrete must be typecast before they can be fed into a machine learning model.
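A minimal sketch of the typecasting (again, the list of continuous column names is an assumption drawn from the UCI CKD dataset):

from pyspark.sql.types import DoubleType

# Continuous columns of the CKD dataset (names are assumptions; adjust to your data)
numeric_cols = ["age", "bp", "sg", "al", "su", "bgr", "bu", "sc",
                "sod", "pot", "hemo", "pcv", "wc", "rc"]

# Cast each continuous column to Double so MLlib can consume it
for col_name in numeric_cols:
    df = df.withColumn(col_name, df[col_name].cast(DoubleType()))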

In the above lines of code, we typecast our numeric features to the Double type.

3) Assembling of Input Features

In this step, we assemble all the features we want to feed into the model. We provide the list of typecast numeric features together with the transformed categorical attributes and combine them into a single feature vector.
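A sketch using VectorAssembler, reusing the numeric_cols and categorical_cols lists defined above:

from pyspark.ml.feature import VectorAssembler

# Combine the typecast numeric columns and the indexed categorical columns
# into a single "features" vector column
assembler = VectorAssembler(
    inputCols=numeric_cols + [c + "Index" for c in categorical_cols],
    outputCol="features")
df = assembler.transform(df)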

4) Normalization of Input Features

As we can observe, our input features are not all on the same scale, so the recommended approach is to first normalize the input features and then feed them into the model for better results.
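One way to do this is with Normalizer, which rescales each feature vector row-wise to unit norm and matches the "features_norm" column name used by the classifier below; MinMaxScaler would be a per-feature alternative. This is a sketch, not necessarily the exact transformer used originally:

from pyspark.ml.feature import Normalizer

# Normalize each assembled feature vector (p=1.0 means L1 norm)
normalizer = Normalizer(inputCol="features", outputCol="features_norm", p=1.0)
df = normalizer.transform(df)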

5) Distribution of Dataset

With our input features prepared in a PySpark dataframe, now is the right time to split the data into training and testing sets, so we can train the model on a sufficiently large training set and later use the unseen test set to evaluate the performance of our Gradient-Boosted Tree model.
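A minimal sketch of the split (the 80/20 ratio and the seed are assumptions):

# Split into training and testing sets; ratio and seed are assumptions
df_train, df_test = df.randomSplit([0.8, 0.2], seed=42)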

6) Configuration of the Gradient-Boosted Tree Classifier

To understand gradient boosting, you first need a basic knowledge of bagging and boosting. Boosting is done in a sequential manner: each new weak learner is trained to correct the mistakes of the ensemble built so far.

In a nutshell, gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees (Wikipedia definition).

Distributed Learning of Ensembles

The objective of any supervised learning algorithm is to define a loss function and minimize it. GBTs must train one tree at a time, so training is only parallelized at the single-tree level. Decision trees are usually trained by selecting from all features at each decision node, whereas Random Forests often limit the selection to a random subset of features at each node. MLlib's implementation takes advantage of this subsampling to reduce communication: e.g., if only 1/3 of the features are used at each node, communication drops to roughly 1/3 of what it would otherwise be.
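As an aside, newer Spark versions (PySpark 2.4+) expose this subsampling on the GBT classifier itself via featureSubsetStrategy; a hypothetical variant of the classifier configured below would look like this:

from pyspark.ml.classification import GBTClassifier

# Sample roughly a third of the features at each node to cut communication
gbt_subsampled = GBTClassifier(labelCol="label", featuresCol="features_norm",
                               maxIter=10, featureSubsetStrategy="onethird")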

Let's dive into the code 💇

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(labelCol="label", featuresCol="features_norm", maxIter=10)

7) Build a Pipeline on the Classifier

from pyspark.ml import Pipeline

# Wrap the classifier in a single-stage pipeline and fit it on the training data
pipeline = Pipeline(stages=[gbt])
model = pipeline.fit(df_train)

# Score the training data and inspect the output schema
prediction = model.transform(df_train)
prediction.printSchema()

Output:

root
|-- label: double (nullable = false)
|-- features_norm: vector (nullable = true)
|-- rawPrediction: vector (nullable = true)
|-- probability: vector (nullable = true)
|-- prediction: double (nullable = false)

Hurray! As the evaluation below shows, our predictions reach an accuracy of 1.0 on the training data 😊

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

binEval = MulticlassClassificationEvaluator() \
    .setMetricName("accuracy") \
    .setPredictionCol("prediction") \
    .setLabelCol("label")

binEval.evaluate(prediction)

Output: 1.0

Now it's time to run the classifier on the testing data:

# Test on the testing data
prediction = model.transform(df_test)

# Reuse the same accuracy evaluator on the test predictions

binEval.evaluate(prediction)

Output: 0.855

The testing accuracy shows that our model is a little overfitted. This tendency to overfit is a well-known disadvantage of gradient boosting compared to random forests.

I tried my best to deliver all the knowledge in my brain regarding the implementation of a PySpark machine learning model in Python. If you enjoyed this blog (I hope you did :P), hit the like button ❤.
