Is Your Model Reliable?

Güldeniz Bektaş
Published in The Startup
9 min read · Dec 5, 2020


When you build a model, your intention is for it to fit your data well. You need good accuracy to ensure the robustness and reliability of your predictions. The reverse of this situation may have serious consequences in the future.

This is where evaluation metrics come in.

You need some measurement you can apply to your model so you can improve it according to the results. Whether it is a classification or a regression model, we need to be sure about it. With evaluation metrics, we can make the best decision.

In this article, I will cover 11 evaluation metrics. Maybe you won't need them all, and there are certainly more evaluation metrics than the ones I write about here. But learning is never a waste of time, and knowing a few is better than knowing none.

Let’s get started!

1. CONFUSION MATRIX

Source : https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826

The confusion matrix may be the most used evaluation tool in machine learning. It's an NxN matrix, where N is the number of target classes. It shows exactly how right, and how wrong, the model is.

The image above can be confusing at first sight, because it really is! But don't worry, I will keep it as simple as I can.

Source : Data School

The best way to learn a complex topic like this is with an example. The table above shows the predicted and actual values for a group of people who use a social media platform to spend their time. Let's walk through the definitions below using this example.

True Positive (TP) : The person uses social media to spend their time, and our model predicted this correctly. The model labels them as 1.

TP = Predicted : Yes, and Actual : Yes = 100

True Negative (TN) : The person does not use social media to spend their time, and our model predicted this correctly. The model labels them as 0.

TN = Predicted : No, and Actual : No = 50

False Negative (FN) : The person uses social media to spend their time, but our model predicted wrong. The model labels them as 0. (Type II Error)

FN = Predicted : No, and Actual : Yes = 5

False Positive (FP) : The person does not use social media to spend their time, but our model predicted wrong. The model labels them as 1. (Type I Error)

FP = Predicted : Yes, and Actual : No = 10

Accuracy : Shows how often the classifier is correct. It is the number of true predictions divided by the total number of predictions.

(100 + 50) / (100 + 50 + 10 + 5) = 0.91

You have probably used accuracy before. So you might ask: why do we need all the values above when we could just use the accuracy score? Well, accuracy is not always reliable, especially when you have imbalanced data. Say we have 900 users of social media platform A and 100 users of platform B, and we ask our model to predict which platform each person uses. A model that simply labels everyone as an A user scores 90% accuracy, because 900 out of 1,000 people really do use platform A. Great! But that same model never correctly identifies a single B user, so on the B class alone it is useless. The overall score looks good only because one class dominates the data. That's why we should look at the confusion matrix rather than accuracy alone.

Implementation w/ Sklearn

from sklearn import metrics

metrics.confusion_matrix(y_test, y_pred)
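For a binary problem you can also unpack the four cells directly from the matrix. A minimal sketch, assuming y_test and y_pred hold the true and predicted labels from the example above:

from sklearn import metrics

cm = metrics.confusion_matrix(y_test, y_pred)
# for binary labels {0, 1}, sklearn orders the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)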

2. PRECISION

Precision is the ratio of true positives to all predicted positives: TP / (TP + FP).

It answers this question: of all the instances our model labeled as positive, how many were actually positive?

Source : Precision vs. Recall

Let’s calculate it for our data:

100 / (100 + 10) = 0.91
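Scikit-learn has a ready-made function for this too. A minimal sketch, again assuming y_test and y_pred hold the labels from the example:

from sklearn import metrics

# precision = TP / (TP + FP)
metrics.precision_score(y_test, y_pred)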

3. RECALL (SENSITIVITY)

Also known as ‘True Positive Rate’.

The definitions of recall and precision are often confused with each other. My goal is to make you understand them once and for all.

Recall and precision have one small difference: precision is calculated over the predicted positives, while recall is calculated over the actual positives.

Answers this question: What proportion of actual positives was identified correctly?

Source : Precision vs. Recall

100 / (100 +5) = 0.95
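The corresponding scikit-learn call is just as short; a minimal sketch with the same assumed y_test and y_pred:

from sklearn import metrics

# recall = TP / (TP + FN)
metrics.recall_score(y_test, y_pred)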

4. SPECIFICITY

Also known as the ‘True Negative Rate’.

Answers this question: What proportion of actual negatives was identified correctly?

50 / (50 + 10) = 0.83

Source : Analytics Vidhya

You can check out the table above.
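Scikit-learn does not ship a dedicated specificity function, but since specificity is just recall computed on the negative class, a minimal sketch (assuming the labels are 0 and 1) is:

from sklearn import metrics

# specificity = TN / (TN + FP), i.e. recall of the negative class
metrics.recall_score(y_test, y_pred, pos_label=0)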

5. F1 SCORE

Do we need to check recall and precision separately to decide, or can we decide with a single value?

The answer is the F1 score. The F1 score is the harmonic mean of precision and recall. If the F1 score is 1, you have good news. If it is 0, sorry, you should try again.

But, why harmonic mean?

The harmonic mean skews towards the smaller of the two values: it reduces the influence of the larger value and magnifies the influence of the smaller one. So the F1 score is only high when both precision and recall are high.

Source : KDnuggets

2 x ((0.95 x 0.91) / (0.95 + 0.91)) = 0.929

Implementation w/ Sklearn

from sklearn import metrics

metrics.f1_score(y_test, y_pred)
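If you want to see the harmonic mean at work, you can also compute the F1 score by hand from precision and recall; a small sketch:

from sklearn import metrics

precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
# harmonic mean of precision and recall; should match metrics.f1_score
f1_manual = 2 * (precision * recall) / (precision + recall)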

6. ROC (Receiver Operating Characteristic Curve) & AUC (Area Under ROC Curve)

ROC

The ROC curve graphically presents the relationship between the true positive rate (sensitivity) and the false positive rate (1 - specificity). An ROC curve plots TPR vs. FPR at different classification thresholds.

Source : Google

AUC

AUC stands for ‘Area Under the ROC Curve’. As you can tell from the name, AUC measures the area under the ROC curve. The closer the area is to 1, the better.

Source : KDnuggets

Implementation w/ Sklearn

There are two ways of implementing the ROC curve. The first one is the traditional, longer way. The second one is the easy, modern one. I will show you both.

Imagine we have some data and we want to build a model with logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt

# Model building
log = LogisticRegression()
log.fit(X_train, y_train)
# Predicted probability of the positive class only
y_proba_log = log.predict_proba(X_test)[:, 1]
# Calculate the ROC curve
fpr, tpr, threshold = metrics.roc_curve(y_test, y_proba_log)
# Plotting
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Curve')
plt.legend();

Output will be like this:

Source : Kaggle
To get the AUC value:

metrics.roc_auc_score(y_test, y_proba_log)

You can find the whole code in here : https://www.kaggle.com/denizbektas/churn-prediction#Logistic-Regression

The second, easier one is:

metrics.plot_roc_curve(log, X_test, y_test)

Output will be the same as above.
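One caveat: plot_roc_curve was removed in scikit-learn 1.2. If you are on a newer version, the equivalent call is:

from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(log, X_test, y_test)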

7. LOG LOSS

It measures the performance of a classification model whose predictions are probability values between 0 and 1. The goal of any machine learning model is to minimize this value. Smaller log loss is better, with a perfect model having a log loss of 0.

Log loss is the negative average of the log of the corrected predicted probabilities for each instance, i.e. the probability the model assigned to the instance's actual class:

Log Loss = -(1/N) * Σ [ yi * log(p(yi)) + (1 - yi) * log(1 - p(yi)) ]

Where:

  • p(yi) is predicted probability of positive class
  • 1-p(yi) is predicted probability of negative class
  • yi = 1 for positive class and 0 for negative class (actual values).

An example:

log_loss(1, 0.9) = 0.105

Implementation w/ Sklearn

from sklearn import metrics

# log loss needs predicted probabilities (e.g. from predict_proba), not hard class labels
metrics.log_loss(y_test, y_proba)
Source : KDnuggets
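To see the formula in action, here is a tiny sketch with made-up labels and probabilities; the manual average should match what metrics.log_loss returns:

import numpy as np
from sklearn import metrics

y_true = np.array([1, 0, 1])
y_proba = np.array([0.9, 0.2, 0.7])
# negative average log of the probability assigned to the actual class
manual = -np.mean(y_true * np.log(y_proba) + (1 - y_true) * np.log(1 - y_proba))
print(manual, metrics.log_loss(y_true, y_proba))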

8. GAIN AND LIFT CHARTS

Gain and lift charts are a good way to visualize how successful your predictions are. Instead of looking at the results as a whole, like a confusion matrix does, they focus on how well the model ranks its predictions. The higher the lift, the better the model.

Source : Data Science Central

One of their most common uses is in marketing, to decide whether a prospective client is worth calling.

Lift charts are often shown as a cumulative lift chart, which is also known as a gains chart. Therefore, gains charts are sometimes (perhaps confusingly) called “lift charts”, but they are more accurately cumulative lift charts.
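Scikit-learn has no built-in gains chart, but the underlying calculation is short: sort the actual labels by predicted probability and track the cumulative share of positives captured. A minimal sketch, assuming y_test and y_proba_log from the ROC section:

import numpy as np

# rank instances from highest to lowest predicted probability
order = np.argsort(y_proba_log)[::-1]
y_sorted = np.asarray(y_test)[order]
# cumulative share of all positives captured as we move down the ranking
cum_gains = np.cumsum(y_sorted) / y_sorted.sum()
# e.g. gain within the top 10% of the population
print(cum_gains[int(0.1 * len(y_sorted)) - 1])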

9. GINI COEFFICIENT

The Gini coefficient measures the inequality among values of a frequency distribution. It takes values between 0 and 1. 0 indicates a perfect equality and 1 indicates a perfect inequality.

The Gini coefficient can be computed from the AUC of the ROC curve:

Gini Coefficient = (2 * AUC) - 1
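In code this is a one-line transformation of the AUC score; a small sketch reusing y_test and y_proba_log from the ROC section:

from sklearn import metrics

auc = metrics.roc_auc_score(y_test, y_proba_log)
gini = 2 * auc - 1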

10. KOLMOGOROV-SMIRNOV CHART

It measures the performance of a classification model. K-S is a measure of the degree of separation between the positive and negative distributions.

The K-S test compares your data's distribution with a known probability distribution, such as the normal distribution.

If the model divides the data into two groups, one containing all the positives and the other all the negatives, the K-S value is 100. If it cannot separate positives from negatives at all, the model is essentially making a random selection, and the K-S value is 0.

In most classification models, the K-S value falls between 0 and 100, and the closer it is to 100, the better the model.
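For a classifier, the K-S value is the maximum gap between the cumulative distributions of positives and negatives, which you can read straight off the ROC curve. A minimal sketch, scaled to 0-100 and again assuming y_test and y_proba_log:

import numpy as np
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_proba_log)
# maximum separation between the cumulative positive and negative rates
ks_value = 100 * np.max(tpr - fpr)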

11. ROOT MEAN SQUARE ERROR (RMSE)

RMSE is the most popular evaluation metric used in regression problems. RMSE is the square root of the average of the square of the difference between the actual values and the predicted values. It is a metric that measures the size of the error.

Source : James Moody
  • The more data we have, the more reliable the RMSE is.
  • RMSE is very sensitive to outliers. Remove outliers in your data before calculating this value.

Implementation w/ Sklearn

from sklearn.metrics import mean_squared_error

rms = mean_squared_error(y_actual, y_predicted, squared=False)
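The same value can be computed by hand, which makes the definition explicit; a short sketch assuming y_actual and y_predicted are numpy arrays:

import numpy as np

# square root of the mean of the squared differences
rmse = np.sqrt(np.mean((y_actual - y_predicted) ** 2))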
