Model Evaluation Techniques in Machine Learning

Fatmanurkutlu · Jan 11, 2024

What is Model Evaluation?

Model evaluation is the process of using different evaluation metrics to understand a machine learning model’s performance, as well as its strengths and weaknesses.

Why is Evaluation necessary for a successful model?

Evaluation is necessary for ensuring that machine learning models are reliable, generalizable, and capable of making accurate predictions on new, unseen data, which is crucial for their successful deployment in real-world applications. Overfitting and underfitting are the two biggest causes of poor performance of machine learning algorithms.

Overfitting: Occurs when the model fits the training data so closely that it does not know how to respond to new data.

Underfitting: Occurs when the model cannot adequately capture the underlying structure of the data.

Right Fit: Occurs when both the training error and the test error are low.

Error risks in the models (figure)
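A practical way to spot these error risks is to compare the model's accuracy on the data it was trained on with its accuracy on held-out test data. Below is a minimal sketch of that idea, assuming scikit-learn is installed; the synthetic dataset and the decision tree classifier are just placeholders for illustration.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Toy dataset (placeholder), split into training and test sets
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# An unconstrained decision tree tends to overfit the training data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

# High training accuracy with much lower test accuracy suggests overfitting;
# low accuracy on both suggests underfitting.
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)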

Evaluation Metrics

There are different metrics for the tasks of classification, regression, ranking, clustering, topic modeling, etc. Some of the metrics are as follows:

  1. Classification Metrics (accuracy, precision, recall, F1-score, ROC, AUC, …)
  2. Regression Metrics (MSE, MAE, R²)
  3. Ranking Metrics (MRR, DCG, NDCG)
  4. Statistical Metrics (Correlation)
  5. Computer Vision Metrics (PSNR, SSIM, IoU)
  6. NLP Metrics (Perplexity, BLEU score)
  7. Deep Learning Related Metrics (Inception Score, Fréchet Inception Distance)

→ Today, we will talk about Classification Metrics.

1. Classification Metrics

When our target is categorical, we are dealing with a classification problem. The choice of the most appropriate metrics depends on different aspects, such as the characteristics of the dataset, whether it’s imbalanced or not, and the goals of the analysis.

Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

We can summarize this as what happens before and after. How?

As you can see, we have two main pieces of information: the predicted value (before) and the actual value (after).

Predicted and Actual Values
  1. Predicted: Negative & Actual Value: Positive → the prediction was wrong (FN, false negative)
  2. Predicted: Negative & Actual Value: Negative → the prediction was right (TN, true negative)
  3. Predicted: Positive & Actual Value: Positive → the prediction was right (TP, true positive)
  4. Predicted: Positive & Actual Value: Negative → the prediction was wrong (FP, false positive)

These four scenarios are illustrated in the following figure.

Four possible combinations of reality and our binary pregnancy test results
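In practice, you do not have to count these four cells by hand. A minimal sketch, assuming scikit-learn is installed and using made-up toy labels, where 1 is the positive class and 0 is the negative class:

from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))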

Example Counts for Accuracy, Precision, and Recall

True Positive (TP) = 10

True Negative (TN) = 12

False Positive (FP) = 1

False Negative (FN) = 2
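Laid out as a confusion matrix, these example counts look like this:

                      Actual: Positive   Actual: Negative
Predicted: Positive        TP = 10            FP = 1
Predicted: Negative        FN = 2             TN = 12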

Accuracy

Accuracy is one metric for evaluating classification models. Formally, accuracy is defined as the ratio of correct predictions to the total number of predictions.
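With the example counts above:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (10 + 12) / (10 + 12 + 1 + 2) = 22 / 25 = 0.88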

Precision

Precision measures how reliable the model's positive predictions are: of all the instances predicted as positive, how many are actually positive?
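With the example counts above:

Precision = TP / (TP + FP) = 10 / (10 + 1) ≈ 0.91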

Recall

Recall, also called the true positive rate, measures how many of the actual positive instances the model correctly identifies.
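With the example counts above:

Recall = TP / (TP + FN) = 10 / (10 + 2) ≈ 0.83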

F1 Score

The F1 score is a machine learning evaluation metric that combines a model's precision and recall scores into a single number: it is their harmonic mean.
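Using the precision and recall values computed above:

F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.91 × 0.83) / (0.91 + 0.83) ≈ 0.87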

By contrast, the accuracy metric simply counts how many times the model made a correct prediction across the entire dataset.

In some scenarios, precision and recall may have varying levels of importance depending on the specific requirements of the application. The F1 score, which weights precision and recall equally, may not perfectly capture their relative importance for a given task. In such cases, examining the precision-recall (PR) curve or the ROC curve alongside the F1 score can help.

ROC

The ROC (Receiver Operating Characteristic) curve provides a comprehensive view of a model's ability to discriminate between classes, especially in binary classification tasks. It helps in understanding the trade-offs between sensitivity and specificity at different decision thresholds, and the AUC (Area Under the Curve) offers a single metric for summarizing the overall performance of the model. The curve plots two quantities against each other as the threshold varies:

  • True Positive Rate (TPR, i.e., recall) on the y-axis
  • False Positive Rate (FPR) on the x-axis
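Computing an ROC curve requires predicted probabilities (or scores) rather than hard class labels. A minimal sketch, assuming scikit-learn is installed and using made-up labels and scores:

from sklearn.metrics import roc_curve, roc_auc_score

# Made-up true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# FPR and TPR at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC summarizes the whole curve in a single number (1.0 = perfect, 0.5 = random guessing)
print("AUC:", roc_auc_score(y_true, y_score))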

Now let's calculate all of these metrics for the example counts in Python:

# Given values
TP = 10
TN = 12
FP = 1
FN = 2

# Calculate metrics
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_score = 2 * (precision * recall) / (precision + recall)
fpr = FP / (FP + TN)

# Print results
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1_score)
print("False Positive Rate (FPR):", fpr)
Output:

Accuracy: 0.88
Precision: 0.9090909090909091
Recall: 0.8333333333333334
F1 Score: 0.8695652173913043
False Positive Rate (FPR): 0.07692307692307693

I hope this article was useful for you. If you stayed with me until the end, thank you for reading! :)
