Evaluating Binary Classification Models with PySpark

Davut Ayan
Feb 4, 2024

In the realm of data science, the ability to predict outcomes with precision is paramount. Imagine a scenario where we can predict whether an email is spam or not, whether a transaction is fraudulent, or whether a patient is likely to develop a certain medical condition. This is where binary classification models come into play, offering the power to make informed decisions based on data.

In this journey through the world of data and machine learning, we delve into the evaluation of a binary classification model using PySpark, a powerful framework for big data processing. PySpark provides a robust environment to build, train, and evaluate machine learning models at scale, making it an ideal choice for handling vast datasets.

The Foundation: Training a Logistic Regression Model

Our story begins with the training of a logistic regression model, a popular algorithm for binary classification tasks. We leverage the capabilities of PySpark’s MLlib, which seamlessly integrates with the Spark ecosystem, allowing us to scale our analyses effortlessly.

Link to data in GitHub: https://github.com/davutemrah/spark_repo/tree/master/data

Data
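The snippets below assume an active SparkSession named spark, as you would have in a Databricks or Jupyter notebook. If you run the code as a standalone script instead, a minimal local session can be created first (the application name here is just an illustration):

from pyspark.sql import SparkSession

# Start (or reuse) a local SparkSession; the app name is arbitrary
spark = SparkSession.builder \
    .appName("binary-classification-evaluation") \
    .master("local[*]") \
    .getOrCreate()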

# Read the diabetes CSV file with a header row, letting Spark infer column types
data_path = 'data/diabetes.csv'
df_diabetes = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("sep", ",") \
    .csv(data_path)

print("There are", df_diabetes.count(), "rows and",
      len(df_diabetes.columns), "columns in the data.")

# Separate the target from the independent features
target_feature = 'Outcome'
indep_features = [c for c in df_diabetes.columns if c != target_feature]

There are 768 rows and 9 columns in the data.

Transform Data


from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

# empty stages list for pipeline
stages_list = []

# Assemble features into a vector
assembler = VectorAssembler(inputCols=indep_features, outputCol="features")

# update stages list
stages_list += [assembler]

# Create a pipeline with the assembler stage
pipeline = Pipeline(stages=stages_list)

# Fit the pipeline on the full dataset
df_pipeline = pipeline.fit(df_diabetes)

# Transform the data to add the assembled "features" vector column
df_diabetes_transformed = df_pipeline.transform(df_diabetes)

# Split the transformed data into training and testing sets
train_data, test_data = df_diabetes_transformed.randomSplit([0.8, 0.2], seed=1234)

# Preview the training split (the assembled features column is dropped for readability)
train_data.drop("features").show(n=4)
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
only showing top 4 rows
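Since the Precision-Recall discussion later depends on how balanced the classes are, it is worth a quick look at the class distribution in each split. A simple check, using nothing beyond the columns already in the data:

# Class distribution in the training and test splits (Outcome: 1 = diabetic, 0 = not)
train_data.groupBy(target_feature).count().show()
test_data.groupBy(target_feature).count().show()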

Logistic Regression Model Predictions

from pyspark.ml.classification import LogisticRegression

# Define the logistic regression model and fit it on the training data
lr = LogisticRegression(featuresCol="features", labelCol=target_feature, maxIter=10) \
    .fit(train_data)

# Make predictions on the test set
lr_predictions = lr.transform(test_data)
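Along with the original columns, the transform step appends rawPrediction, probability, and prediction columns. A quick, purely illustrative peek at what the evaluator will consume:

# Inspect the label, predicted probability, and predicted class side by side
lr_predictions.select(target_feature, "probability", "prediction").show(5, truncate=False)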

Unveiling the Metrics: Binary Classification Evaluator

As our model makes predictions, the next step is to assess its performance. PySpark provides a dedicated tool for this purpose: the BinaryClassificationEvaluator. It reports two threshold-free metrics, the area under the Receiver Operating Characteristic (ROC) curve (its default) and the area under the Precision-Recall curve. Threshold-dependent metrics such as accuracy, precision, recall, and the F1 score are computed separately later in this post.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol=target_feature)
area_under_curve = evaluator.evaluate(lr_predictions)
print(f"Area under ROC curve: {area_under_curve}")

Area under ROC curve: 0.8359788359788357
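By default the evaluator returns the area under the ROC curve; the area under the Precision-Recall curve comes from the same class by switching the metricName parameter:

# Area under the Precision-Recall curve on the test predictions
pr_evaluator = BinaryClassificationEvaluator(labelCol=target_feature, metricName="areaUnderPR")
print(f"Area under PR curve: {pr_evaluator.evaluate(lr_predictions)}")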

ROC curve

The ROC curve showcases the trade-off between true positive rate and false positive rate, offering a visual representation of model performance. The area under this curve quantifies the model’s ability to discriminate between positive and negative instances.

import matplotlib.pyplot as plt

# ROC curve from the training summary (computed on the training data)
trainingSummary = lr.summary
lrROC = trainingSummary.roc.toPandas()

plt.plot(lrROC['FPR'], lrROC['TPR'])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

print('Training set areaUnderROC: ' + str(trainingSummary.areaUnderROC))
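Keep in mind that lr.summary describes the training data. The fitted model also exposes an evaluate method that builds the same kind of summary on the held-out test set; a short sketch:

# Build a summary on the test split and read its area under the ROC curve
test_summary = lr.evaluate(test_data)
print('Test set areaUnderROC: ' + str(test_summary.areaUnderROC))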

Precision and recall

The Precision-Recall curve provides insights into precision and recall, crucial for scenarios where class imbalance is prevalent. Precision signifies the accuracy of positive predictions, while recall reflects the model’s ability to capture all positive instances.

pr = trainingSummary.pr.toPandas()
plt.plot(pr['recall'], pr['precision'])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

Unveiling the Metrics Behind Model Performance

In the intricate landscape of machine learning, the success of a model is often measured by its ability to make accurate predictions. Amidst the myriad of evaluation metrics, a set of fundamental indicators plays a pivotal role in deciphering the true performance of a predictive model.

These metrics not only reveal the model’s proficiency in discerning between different classes but also shed light on its robustness in the face of varying scenarios.

True Positives (TP) and True Negatives (TN): The Foundation of Accuracy

At the heart of model evaluation lie two essential components — True Positives (TP) and True Negatives (TN). TP signifies instances where the model correctly predicts the positive class, while TN represents the accurate identification of the negative class. These elements serve as the bedrock for the calculation of accuracy, a metric that quantifies the overall correctness of the model’s predictions.

False Positives (FP) and False Negatives (FN): Navigating the Realm of Errors

Yet, the journey towards a comprehensive understanding of a model’s performance is not without its challenges. False Positives (FP) occur when the model predicts the positive class incorrectly, and False Negatives (FN) emerge when the model fails to identify instances of the positive class. These errors highlight the intricacies involved in striking a balance between sensitivity and specificity.

Accuracy: Gauging Overall Correctness

Accuracy, a metric often in the spotlight, captures the model’s proficiency in making correct predictions across both positive and negative classes. It is the share of correct predictions among all instances: (TP + TN) / (TP + TN + FP + FN).

Recall: Illuminating Sensitivity

In the pursuit of perfection, models are often evaluated for their recall, a metric that gauges sensitivity. Recall measures the share of actual positive instances the model identifies, TP / (TP + FN), providing insight into its comprehensiveness.

Precision: Unveiling Discrimination

Precision, on the other hand, delves into the discriminatory power of a model. It quantifies how often positive predictions are correct, TP / (TP + FP), offering a glimpse into how reliably the model flags the positive class.

As we navigate through the intricacies of these metrics, we embark on a journey to unravel the true essence of model performance, where every TP, TN, FP, FN, accuracy, recall, and precision contributes to the narrative of predictive prowess.

from pyspark.sql.functions import col

# Calculate true positives, true negatives, false positives, false negatives
tp = lr_predictions.filter((col(target_feature) == 1) & (col('prediction') == 1)).count()
tn = lr_predictions.filter((col(target_feature) == 0) & (col('prediction') == 0)).count()
fp = lr_predictions.filter((col(target_feature) == 0) & (col('prediction') == 1)).count()
fn = lr_predictions.filter((col(target_feature) == 1) & (col('prediction') == 0)).count()

# Calculate accuracy
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy}")

# Calculate precision
precision = tp / (tp + fp) if (tp + fp) != 0 else 0.0
print(f"Precision: {precision}")

# Calculate recall
recall = tp / (tp + fn) if (tp + fn) != 0 else 0.0
print(f"Recall: {recall}")

# Calculate F1 measure
f1_measure = 2 * (precision * recall) / (precision + recall) if (precision + recall) != 0 else 0.0
print(f"F1 measure: {f1_measure}")
Accuracy: 0.7393939393939394 
Precision: 0.6923076923076923
Recall: 0.5714285714285714
F1 measure: 0.6260869565217392

Alternative way to get Evaluation Metrics:

from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.functions import col
from pyspark.sql.types import FloatType

# Select (prediction, label) pairs and cast the label to float
prediction_and_label = lr_predictions \
    .select(["prediction", target_feature]) \
    .withColumn(target_feature, col(target_feature).cast(FloatType()))

# Build the RDD-based MulticlassMetrics object from (prediction, label) tuples
metrics = MulticlassMetrics(prediction_and_label.rdd.map(tuple))

# metrics
print("Accuracy:", metrics.accuracy)
print("Precision:", metrics.precision(1.0))
print("Recall:", metrics.recall(1.0))
print("F1 measure:", metrics.fMeasure(1.0))

The Takeaway: Making Informed Decisions

As our evaluation metrics unveil the strengths and weaknesses of our binary classification model, we gain the power to make informed decisions. Whether the goal is to minimize false positives, maximize precision, or strike a balance between precision and recall, these metrics guide our path toward model optimization.

In the ever-evolving landscape of data science, PySpark emerges as a beacon, offering the tools to unravel the complexities of big data and machine learning. The journey through evaluating a binary classification model showcases the significance of robust frameworks and thoughtful metrics in the quest for actionable insights.
