Is Your Model Reliable?

Güldeniz Bektaş
Published in The Startup
9 min read · Dec 5, 2020


When you build a model, your intention is for it to fit your data well. You need good accuracy to ensure the robustness and reliability of your predictions. The reverse of this situation may have serious consequences in the future.

This is where evaluation metrics come in.

You need some measurement you can apply to your model so you can improve it according to the results. Whether it is a classification or a regression model, we need to be sure about it. With evaluation metrics, we can make the best decision.

In this article, I will cover 11 evaluation metrics. Maybe you won't need them all, and there are certainly more evaluation metrics than the ones I write about here. But learning is never a waste of time, and knowing a few is better than knowing none.

Let’s get started!

1. CONFUSION MATRIX

Source : https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826

The confusion matrix may be the most used evaluation tool in machine learning. It's an NxN matrix, where N is the number of target classes. It shows exactly how right, and how wrong, the model is.

The image above can be confusing at first sight, because it really is! But don't worry, I will keep it as simple as I can.

Source : Data School

The best way to learn a complex topic like this is with an example. The table above shows the predicted and actual values for a group of people who use a social media platform to spend their time. Let's walk through the definitions below using this example.

True Positive (TP) : The person uses social media to spend their time, and our model predicted this correctly. The model labels them as 1.

TP = Predicted : Yes, and Actual : Yes = 100

True Negative (TN) : The person does not use social media to spend their time, and our model predicted this correctly. The model labels them as 0.

TN = Predicted : No, and Actual : No = 50

False Negative (FN) : The person uses social media to spend their time, but our model predicted wrong. The model labels them as 0. (Type II Error)

FN = Predicted : No, and Actual : Yes = 5

False Positive (FP) : The person does not use social media to spend their time, but our model predicted wrong. The model labels them as 1. (Type I Error)

FP = Predicted : Yes, and Actual : No = 10

Accuracy : Shows how often the classifier is correct. It is the number of true predictions divided by the total number of predictions.

(100 + 50) / (100 + 50 + 10 + 5) = 0.91

You have probably used accuracy before. So you might ask: why do we need all the values above when we could just use the accuracy score? Well, accuracy is not always reliable, especially when you have imbalanced data. Say we have 900 users of social media platform A and 100 users of platform B, and we ask our model to predict which platform each person uses. A model that simply labels everyone as an A user scores 90% accuracy, because 900 out of 1,000 people really do use platform A. Great! But that same model never correctly identifies a single B user, so on the B class alone it is useless. The overall score looks good only because one class dominates the data. That's why we should look at the confusion matrix rather than accuracy alone.

Implementation w/ Sklearn

from sklearn import metrics

metrics.confusion_matrix(y_test, y_pred)
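For a binary problem you can also unpack the four cells directly from the matrix. A minimal sketch, assuming y_test and y_pred hold the true and predicted labels from the example above:

from sklearn import metrics

cm = metrics.confusion_matrix(y_test, y_pred)
# for binary labels {0, 1}, sklearn orders the cells as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)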

2. PRECISION

Precision is the ratio of true positives to all predicted positives: TP / (TP + FP).

It answers this question: of all the instances our model labeled as positive, how many were actually positive?

Source : Precision vs. Recall

Let’s calculate it for our data:

100 / (100 + 10) = 0.91
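Scikit-learn has a ready-made function for this too. A minimal sketch, again assuming y_test and y_pred hold the labels from the example:

from sklearn import metrics

# precision = TP / (TP + FP)
metrics.precision_score(y_test, y_pred)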

3. RECALL (SENSITIVITY)

Also known as ‘True Positive Rate’.

The definitions of recall and precision are often confused with each other. My goal is to make you understand them once and for all.

Recall and precision have one small difference: precision is calculated over the predicted positives, while recall is calculated over the actual positives.

Answers this question: What proportion of actual positives was identified correctly?

Source : Precision vs. Recall

100 / (100 +5) = 0.95
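The corresponding scikit-learn call is just as short; a minimal sketch with the same assumed y_test and y_pred:

from sklearn import metrics

# recall = TP / (TP + FN)
metrics.recall_score(y_test, y_pred)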

4. SPECIFICITY

Also known as the ‘True Negative Rate’.

Answers this question: What proportion of actual negatives was identified correctly?

50 / (50 + 10) = 0.83

Source : Analytics Vidhya

You can check out the table above.
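Scikit-learn does not ship a dedicated specificity function, but since specificity is just recall computed on the negative class, a minimal sketch (assuming the labels are 0 and 1) is:

from sklearn import metrics

# specificity = TN / (TN + FP), i.e. recall of the negative class
metrics.recall_score(y_test, y_pred, pos_label=0)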

5. F1 SCORE

Do we need to check recall and precision separately to decide, or can we decide with a single value?

The answer is the F1 score. The F1 score is the harmonic mean of precision and recall. If the F1 score is 1, you have good news. If it is 0, sorry, you should try again.

But, why harmonic mean?

The harmonic mean skews towards the smaller of the two values: it reduces the influence of the larger value and magnifies the influence of the smaller one. So the F1 score is only high when both precision and recall are high.

Source : KDnuggets

2 x ((0.95 x 0.91) / (0.95 + 0.91)) = 0.929

Implementation w/ Sklearn

from sklearn import metrics

metrics.f1_score(y_test, y_pred)
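If you want to see the harmonic mean at work, you can also compute the F1 score by hand from precision and recall; a small sketch:

from sklearn import metrics

precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
# harmonic mean of precision and recall; should match metrics.f1_score
f1_manual = 2 * (precision * recall) / (precision + recall)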

6. ROC (Receiver Operating Characteristic Curve) & AUC (Area Under ROC Curve)

ROC

The ROC curve graphically presents the relationship between the true positive rate (sensitivity) and the false positive rate (1 - specificity). An ROC curve plots TPR vs. FPR at different classification thresholds.

Source : Google

AUC

AUC stands for ‘Area Under the ROC Curve’. As you can tell from the name, AUC measures the area under the ROC curve. The closer the area is to 1, the better.

Source : KDnuggets

Implementation w/ Sklearn

There are two ways of implementing the ROC curve. The first one is the traditional, longer way. The second one is the easy, modern one. I will show you both.

Imagine we have some data and we want to build a model with logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import matplotlib.pyplot as plt

# Model building
log = LogisticRegression()
log.fit(X_train, y_train)
# Predicted probability of the positive class only
y_proba_log = log.predict_proba(X_test)[:, 1]
# Calculate the ROC curve
fpr, tpr, threshold = metrics.roc_curve(y_test, y_proba_log)
# Plotting
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Logistic Regression')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC Curve')
plt.legend();

Output will be like this:

Source : Kaggle
To get the AUC value:

metrics.roc_auc_score(y_test, y_proba_log)

You can find the whole code in here : https://www.kaggle.com/denizbektas/churn-prediction#Logistic-Regression

The second, easier one is:

metrics.plot_roc_curve(log, X_test, y_test)

Output will be the same as above.
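One caveat: plot_roc_curve was removed in scikit-learn 1.2. If you are on a newer version, the equivalent call is:

from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(log, X_test, y_test)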

7. LOG LOSS

It measures the performance of a classification model whose predictions are probability values between 0 and 1. The goal of any machine learning model is to minimize this value. Smaller log loss is better, with a perfect model having a log loss of 0.

Log loss is the negative average of the log of the corrected predicted probabilities for each instance, i.e. the probability the model assigned to the instance's actual class:

Log Loss = -(1/N) * Σ [ yi * log(p(yi)) + (1 - yi) * log(1 - p(yi)) ]

Where:

  • p(yi) is predicted probability of positive class
  • 1-p(yi) is predicted probability of negative class
  • yi = 1 for positive class and 0 for negative class (actual values).

An example:

log_loss(1, 0.9) = 0.105

Implementation w/ Sklearn

from sklearn import metrics

# log loss needs predicted probabilities (e.g. from predict_proba), not hard class labels
metrics.log_loss(y_test, y_proba)
Source : KDnuggets
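To see the formula in action, here is a tiny sketch with made-up labels and probabilities; the manual average should match what metrics.log_loss returns:

import numpy as np
from sklearn import metrics

y_true = np.array([1, 0, 1])
y_proba = np.array([0.9, 0.2, 0.7])
# negative average log of the probability assigned to the actual class
manual = -np.mean(y_true * np.log(y_proba) + (1 - y_true) * np.log(1 - y_proba))
print(manual, metrics.log_loss(y_true, y_proba))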

8. GAIN AND LIFT CHARTS

Gain and lift charts are a good way to visualize how successful your predictions are. Instead of looking at the results as a whole, like a confusion matrix does, they focus on how well the model ranks its predictions. The higher the lift, the better the model.

Source : Data Science Central

One of their most common uses is in marketing, to decide whether a prospective client is worth calling.

Lift charts are often shown as a cumulative lift chart, which is also known as a gains chart. Therefore, gains charts are sometimes (perhaps confusingly) called “lift charts”, but they are more accurately cumulative lift charts.
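Scikit-learn has no built-in gains chart, but the underlying calculation is short: sort the actual labels by predicted probability and track the cumulative share of positives captured. A minimal sketch, assuming y_test and y_proba_log from the ROC section:

import numpy as np

# rank instances from highest to lowest predicted probability
order = np.argsort(y_proba_log)[::-1]
y_sorted = np.asarray(y_test)[order]
# cumulative share of all positives captured as we move down the ranking
cum_gains = np.cumsum(y_sorted) / y_sorted.sum()
# e.g. gain within the top 10% of the population
print(cum_gains[int(0.1 * len(y_sorted)) - 1])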

9. GINI COEFFICIENT

The Gini coefficient measures the inequality among values of a frequency distribution. It takes values between 0 and 1. 0 indicates a perfect equality and 1 indicates a perfect inequality.

The Gini coefficient can be computed from the AUC of the ROC curve:

Gini Coefficient = (2 * AUC) - 1
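In code this is a one-line transformation of the AUC score; a small sketch reusing y_test and y_proba_log from the ROC section:

from sklearn import metrics

auc = metrics.roc_auc_score(y_test, y_proba_log)
gini = 2 * auc - 1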

10. KOLMOGOROV-SMIRNOV CHART

It measures the performance of a classification model. K-S is a measure of the degree of separation between the positive and negative distributions.

The K-S test compares your data's distribution with a known probability distribution, such as the normal distribution.

If the model divides the data into two groups, one containing all the positives and the other all the negatives, the K-S value is 100. If it cannot separate positives from negatives at all, the model is essentially making a random selection, and the K-S value is 0.

In most classification models, the K-S value falls between 0 and 100, and the closer it is to 100, the better the model.
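For a classifier, the K-S value is the maximum gap between the cumulative distributions of positives and negatives, which you can read straight off the ROC curve. A minimal sketch, scaled to 0-100 and again assuming y_test and y_proba_log:

import numpy as np
from sklearn import metrics

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_proba_log)
# maximum separation between the cumulative positive and negative rates
ks_value = 100 * np.max(tpr - fpr)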

11. ROOT MEAN SQUARE ERROR (RMSE)

RMSE is the most popular evaluation metric used in regression problems. RMSE is the square root of the average of the square of the difference between the actual values and the predicted values. It is a metric that measures the size of the error.

Source : James Moody
  • The more data we have, the more reliable the RMSE is.
  • RMSE is very sensitive to outliers. Remove outliers in your data before calculating this value.

Implementation w/ Sklearn

from sklearn.metrics import mean_squared_error

rms = mean_squared_error(y_actual, y_predicted, squared=False)
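The same value can be computed by hand, which makes the definition explicit; a short sketch assuming y_actual and y_predicted are numpy arrays:

import numpy as np

# square root of the mean of the squared differences
rmse = np.sqrt(np.mean((y_actual - y_predicted) ** 2))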
