Gradient-Boosted Tree Classifier Model Using PySpark

Ponshriharini · Published in featurepreneur · Mar 9, 2022 · 2 min read

First, we’ll create a Spark session, read the CSV into a DataFrame, and print its schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('diabetes-model').getOrCreate()
df = spark.read.csv('diabetes.csv', header=True, inferSchema=True)
df.printSchema()

The schema gives details about each column: its name, its data type, and whether it can hold null values.
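Assuming the standard Pima Indians diabetes CSV (the column names used later in this post match it), the printed schema will look something like this, with the exact types depending on what inferSchema detects:

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)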

Now we can view the data as a pandas DataFrame. Note that toPandas() doesn’t modify df; it collects the rows to the driver and returns a new pandas DataFrame:

df.toPandas()
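Since toPandas() returns a new object, you’d typically assign the result if you want to keep working with it; the variable name pdf below is my own:

pdf = df.toPandas()  # pandas DataFrame collected onto the driver
print(pdf.head())    # first five rows
print(pdf.shape)     # (number of rows, number of columns)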

We’ll want a feel for the columns used for prediction, so let’s pull out the integer-typed ones and summarize them:

numeric_features = [t[0] for t in df.dtypes if t[1] == 'int']
df.select(numeric_features).describe().toPandas().transpose()

Note that df itself is unchanged; the expression above selects the integer columns and shows their summary statistics (count, mean, stddev, min, max), transposed so that each data column becomes a row.

Here, we’ll also drop the unwanted columns, i.e. the columns that don’t contribute to the prediction.

# Drop unnecessary columns, starting from the original DataFrame
dataset = df.drop('SkinThickness')
dataset = dataset.drop('Insulin')
dataset = dataset.drop('DiabetesPedigreeFunction')
dataset = dataset.drop('Pregnancies')
dataset.show()

We’ll have to do something about null values: either drop those rows or fill them with the column average. First, let’s count the nulls in each column:

from pyspark.sql.functions import isnull, when, count, col

df.select([count(when(isnull(c), c)).alias(c) for c in df.columns]).show()

If we do find null values, we’ll drop the affected rows:

dataset = dataset.replace('null', None)\
                 .dropna(how='any')
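Alternatively, instead of dropping rows, you could fill the nulls with each column’s average, as mentioned above. A minimal sketch using Spark ML’s Imputer; the cast to double and the '_filled' output names are my own additions:

from pyspark.ml.feature import Imputer
from pyspark.sql.functions import col

num_cols = ['Glucose', 'BloodPressure', 'BMI', 'Age']

# Imputer works on floating-point columns, so cast the integer ones first
for c in num_cols:
    dataset = dataset.withColumn(c, col(c).cast('double'))

# Replace nulls in each column with that column's mean
imputer = Imputer(inputCols=num_cols,
                  outputCols=[c + '_filled' for c in num_cols],
                  strategy='mean')
dataset = imputer.fit(dataset).transform(dataset)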

We’ll now use VectorAssembler. As an overview, what it does is take a list of columns (features) and combine them into a single vector column (the feature vector), which is then used as the input to machine learning models in Spark ML.

from pyspark.ml.feature import VectorAssembler

features = ['Glucose', 'BloodPressure', 'BMI', 'Age']
vector = VectorAssembler(inputCols=features, outputCol='features')
transformed_data = vector.transform(dataset)
transformed_data.show()
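To sanity-check the assembled column, you can peek at just the feature vector and the label; this quick look is my own addition:

transformed_data.select('features', 'Outcome').show(5, truncate=False)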

Create training and testing data

(training_data, test_data) = transformed_data.randomSplit([0.8,0.2])
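Note that randomSplit shuffles randomly, so the split (and the final accuracy) will vary between runs. For a reproducible split, pass a seed; 42 below is an arbitrary choice:

(training_data, test_data) = transformed_data.randomSplit([0.8, 0.2], seed=42)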

Here, we’ll be using the Gradient-boosted Tree classifier model and checking its accuracy. Accuracy is the fraction of predictions our model got right.

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

gb = GBTClassifier(labelCol='Outcome', featuresCol='features')
gbModel = gb.fit(training_data)
gb_predictions = gbModel.transform(test_data)

Here, we first define the GBTClassifier, then use it to train on the training split and generate predictions on the test split. Gradient boosting is a technique that produces an additive predictive model by combining many weak predictors, typically decision trees.
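For reference, GBTClassifier exposes the usual boosting knobs if you want to go beyond the defaults. A minimal sketch; the values below are illustrative, not tuned:

gb = GBTClassifier(
    labelCol='Outcome',
    featuresCol='features',
    maxIter=50,    # number of boosting rounds, i.e. trees in the ensemble
    maxDepth=4,    # depth of each individual tree
    stepSize=0.1   # learning rate that shrinks each tree's contribution
)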

We’ll now get the accuracy of this model.

multi_evaluator = MulticlassClassificationEvaluator(labelCol='Outcome', metricName='accuracy')
print('Accuracy:', multi_evaluator.evaluate(gb_predictions))

Accuracy: 0.7643312101910829
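Accuracy alone can be misleading when the classes are imbalanced, as diabetes outcomes often are, so it’s worth looking at other metrics too. A sketch using the same multiclass evaluator for F1, plus Spark’s binary evaluator, which reads the rawPrediction column that GBTClassifier adds:

from pyspark.ml.evaluation import BinaryClassificationEvaluator

f1_evaluator = MulticlassClassificationEvaluator(labelCol='Outcome', metricName='f1')
print('F1:', f1_evaluator.evaluate(gb_predictions))

auc_evaluator = BinaryClassificationEvaluator(labelCol='Outcome', metricName='areaUnderROC')
print('AUC:', auc_evaluator.evaluate(gb_predictions))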

We’ve now demonstrated the use of the Gradient-boosted Tree classifier and calculated the accuracy of the model.

Happy coding!
