Are we choosing the right evaluation metrics? How do we choose them?

Most popular Evaluation metrics for Regression and Classification

Dheeraj Kumar K
ODSCJournal
7 min read · Jun 1, 2020

Hello World! Welcome to my blog for the Data Science Community. Most of the time we work on problem statements from various domains with complex data, and even after building a model, choosing the right metric is challenging. In this article, we are going to discuss the right metrics for regression and classification. If you are confused about which metric to use, then this article is for you.

Regression Metrics

1. RMSE

RMSE (Root Mean Squared Error) is the sample standard deviation of the differences between predicted values and observed values (called residuals or errors). The smaller the RMSE, the better the model. Mathematically, it is calculated using this formula:
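
RMSE = sqrt( (1/n) * Σ (yᵢ − ŷᵢ)² ), where yᵢ is the observed value, ŷᵢ is the predicted value, and n is the number of observations.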

import numpy as np
from sklearn import metrics

RMSE = np.sqrt(metrics.mean_squared_error(y_test, LR_Pred))
RMSE

2. MAE

MAE (Mean Absolute Error) is the average of the absolute differences between the predicted values and the observed values. It does not give any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data. The smaller the MAE, the better the model. Mathematically, it is calculated using this formula:
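
MAE = (1/n) * Σ |yᵢ − ŷᵢ|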

from sklearn import metrics
MAE = metrics.mean_absolute_error(y_test, LR_Pred)
MAE

3. MSE

MSE (Mean Squared Error) is the average of the squared differences between the original values and the predicted values. Because we square the error, the effect of larger errors (often outliers) becomes more pronounced than that of smaller errors. The model is penalized more for predictions that differ greatly from the corresponding actual values. Mathematically, it is calculated using this formula:
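
MSE = (1/n) * Σ (yᵢ − ŷᵢ)²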

from sklearn import metrics
MSE = metrics.mean_squared_error(y_test, LR_Pred)
MSE

4. MAPE

MAPE (Mean Absolute Percentage Error) is the absolute error normalized by the actual value, computed for every data point and then averaged. Because the error is expressed as a percentage of the actual value, it can be compared across data with different scales. Mathematically, it is calculated using this formula:
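
MAPE = (100/n) * Σ |(yᵢ − ŷᵢ) / yᵢ|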

import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

So what to choose and why?

[1] An easy way to think about it: MAE takes a plain average of the absolute errors, while RMSE penalizes larger differences more heavily than MAE.

Let's walk through an example.

Consider two problems, each with actual and predicted values:

Problem 1 : Actual Value = [2,3,4,5,6] , Predicted Values = [1,5,7,4,8]

Problem 2 : Actual Value = [5,1,3,9,6] , Predicted Values = [9,7,2,1,8]

MAE for Problem 1 = 1.8 and RMSE for Problem 1 ≈ 1.95

MAE for Problem 2 = 4.2 and RMSE for Problem 2 ≈ 4.92

So here we can observe that RMSE penalizes large errors more than MAE does. RMSE will always be greater than or equal to MAE; the two are equal only when every error has the same magnitude (for example, when all the errors are zero).
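
As a quick check, here is a small sketch that reproduces these numbers with NumPy and scikit-learn (the arrays are just the toy values from the example above):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

problems = {
    "Problem 1": ([2, 3, 4, 5, 6], [1, 5, 7, 4, 8]),
    "Problem 2": ([5, 1, 3, 9, 6], [9, 7, 2, 1, 8]),
}

for name, (actual, predicted) in problems.items():
    mae = mean_absolute_error(actual, predicted)
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    print(name, "MAE =", round(mae, 2), "RMSE =", round(rmse, 2))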

[2] When the model is more complex and we care more about larger deviations, RMSE is usually the default metric, because a loss function defined in terms of RMSE (or MSE) is smoothly differentiable, whereas MAE is not differentiable at zero and needs more complicated tools, such as linear programming, to compute gradients.

MSE & RMSE are really useful when you want to see if the outliers are messing with your predictions. It’s possible that you might decide to investigate those outliers and remove them altogether from your dataset.

R Squared (R²)

R Squared is used for explanatory purposes: it tells you how well your chosen independent variable(s) explain the variability in your dependent variable. R Squared is often misunderstood, so it is worth clarifying it before going through its pros and cons.

Mathematically, R-Squared is given by:
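
R² = 1 − Σ (yᵢ − ŷᵢ)² / Σ (yᵢ − ȳ)², where ȳ is the mean of the observed values.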

from sklearn.metrics import r2_score

R2 = r2_score(y_test, LR_Pred)  # avoid overwriting the imported r2_score function
R2

Classification Metrics

1. Confusion Matrix

The Confusion Matrix is one of the most widely used ways of checking the correctness of a classification model. For a binary problem it is a 2 x 2 matrix; more generally, for a problem with N classes it is an N x N matrix.

Before stepping into the terms of the confusion matrix, let's set up an example.

Let's say we are solving a classification problem where we are predicting whether a person has cancer or not:

1: When a person is having cancer, 0: When a person is NOT having cancer.

CONFUSION MATRIX

Let's understand the terms in the Confusion Matrix.

True Positive

True Positives are the cases when the actual class of the data point was 1 and the predicted class is also 1.

Ex: The case where a person actually has cancer (1) and the model classifies the case as cancer (1) comes under True Positives.

True Negative

True Negatives are the cases when the actual class of the data point was 0 and the predicted class is also 0.

Ex: The case where a person does NOT have cancer and the model classifies the case as not cancer comes under True Negatives.

False Positive

False Positives are the cases when the actual class of the data point was 0 but the predicted class is 1.

Ex: A person NOT having cancer but the model classifying the case as cancer comes under False Positives.

False Negative

False Negatives are the cases when the actual class of the data point was 1 but the predicted class is 0.

Ex: A person having cancer but the model classifying the case as not cancer comes under False Negatives.

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)  # scikit-learn expects (y_true, y_pred)

PRECISION

Precision (also called the positive predictive value, not to be confused with specificity) is a measure that tells us what proportion of the patients we diagnosed as having cancer actually had cancer. The predicted positives (people predicted as cancerous) are TP + FP, and of those, the people who actually have cancer are the TP, so Precision = TP / (TP + FP).

Ex: In our cancer example with 100 people, only 5 people have cancer. Let's say our model is very bad and predicts every case as cancer. Since we are predicting everyone as having cancer, our denominator (True Positives + False Positives) is 100, and the numerator (people who have cancer and are predicted as cancer, i.e. True Positives) is 5. So in this example, the Precision of such a model is 5%.

RECALL

Recall (also called sensitivity) is a measure that tells us what proportion of the patients that actually had cancer were diagnosed by the algorithm as having cancer. The actual positives (people having cancer) are TP + FN, and of those, the people the model diagnosed as having cancer are the TP, so Recall = TP / (TP + FN).

Ex: In our cancer example with 100 people, 5 people actually have cancer. Let’s say that the model predicts every case as cancer.

So our denominator (True Positives + False Negatives) is 5, and the numerator (people who have cancer and are predicted as cancer) is also 5, since we predicted all 5 cancer cases correctly. So in this example, the Recall of such a model is 100%, while the Precision (as we saw above) is 5%.
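
To make this concrete, here is a minimal sketch of the same toy setup in scikit-learn, assuming 100 people of whom 5 actually have cancer and a model that predicts cancer for everyone (the arrays are illustrative, not from a real dataset):

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1] * 5 + [0] * 95)  # 5 people actually have cancer
y_pred = np.ones(100, dtype=int)       # the model predicts cancer for everyone

print("Precision:", precision_score(y_true, y_pred))  # 0.05 -> 5%
print("Recall:", recall_score(y_true, y_pred))        # 1.0  -> 100%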

When should we use Precision and Recall?

Precision is about being precise. So even if the model flags only one cancer case, and that prediction is correct, the model is 100% precise.

Recall is not so much about labelling cases correctly as it is about capturing all the cases that actually have “cancer”. So if we simply label every case as “cancer”, we have 100% recall.

F1 SCORE

F1 SCORE is the harmonic mean of Precision and Recall: F1 = 2 * (P * R) / (P + R). We don't really want to carry both Precision and Recall in our pockets every time we build a classification model, so it helps to have a single score that represents both Precision (P) and Recall (R).
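
Continuing the toy example above (using the same y_true and y_pred arrays, where precision is 0.05 and recall is 1.0), a quick sketch with scikit-learn:

from sklearn.metrics import f1_score

# Harmonic mean of precision (0.05) and recall (1.0):
# 2 * (0.05 * 1.0) / (0.05 + 1.0) ≈ 0.095
print("F1 score:", f1_score(y_true, y_pred))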

Accuracy

Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions. On imbalanced data, accuracy can be misleading and may lead to the wrong decision.
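
For example, on an imbalanced dataset like the toy cancer example (95 negatives, 5 positives), a model that simply predicts “no cancer” for everyone scores 95% accuracy while catching zero cancer cases. A small sketch, assuming the same illustrative arrays:

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 5 + [0] * 95)
y_pred_all_negative = np.zeros(100, dtype=int)  # predicts "no cancer" for everyone

print("Accuracy:", accuracy_score(y_true, y_pred_all_negative))  # 0.95
print("Recall:", recall_score(y_true, y_pred_all_negative))      # 0.0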

ROC & AUC CURVE

What is ROC?

The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two classes (e.g. whether a patient has cancer or not). A better model can accurately distinguish between the two, whereas a poor model will have difficulty doing so.

What is AUC?

The AUC (Area Under the Curve) is the area under the ROC curve. It measures the entire two-dimensional area underneath the ROC curve and gives us a good single-number summary of how well the model separates the classes.
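
Here is a minimal sketch of how to compute the ROC curve and the AUC with scikit-learn, assuming an illustrative set of true labels and predicted probabilities for the positive class (in practice the probabilities would come from something like model.predict_proba(X_test)[:, 1]):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative labels and predicted probabilities (not from a real model)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("AUC:", roc_auc_score(y_true, y_prob))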

Adjusted R-squared

Similar to R-squared, the Adjusted R-squared measures the variation in the dependent variable (or target), explained by only the features which are helpful in making predictions. Unlike R-squared, the Adjusted R-squared would penalize you for adding features that are not useful for predicting the target.

Let us understand mathematically how this penalty is accommodated in Adjusted R-Squared. Here is the formula for Adjusted R-Squared:
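
Adjusted R² = 1 − (1 − R²) * (n − 1) / (n − k − 1), where n is the number of observations and k is the number of features.

scikit-learn does not provide Adjusted R-Squared directly, so here is a small sketch that computes it from r2_score; X_test, y_test and LR_Pred are assumed to be the test features, test targets and model predictions used elsewhere in this article:

from sklearn.metrics import r2_score

def adjusted_r2(r2, n_samples, n_features):
    # Penalizes R-squared for the number of features used
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

R2 = r2_score(y_test, LR_Pred)
adjusted_r2(R2, n_samples=X_test.shape[0], n_features=X_test.shape[1])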

Conclusion

In this blog, we have discussed various evaluation metrics. That's it for this blog. The complete code can be found in this GitHub repo. I hope this helps as a starting point for exploring other tests and methods.

If you liked the article, feel free to give me claps and help others find it.

Also, let me know if I have missed out on anything in this concept.

Connect with me on LinkedIn

Connect with me on GitHub
