Performance Metrics for Machine Learning Models

Sachin D N · Published in Analytics Vidhya · Aug 13, 2020

There are various metrics we can use to evaluate the performance of ML algorithms, for both classification and regression. We must choose these metrics carefully because:

  • How the performance of ML algorithms is measured and compared depends entirely on the metric we choose.
  • How we weight the importance of various characteristics in the result is influenced completely by the metric we choose.

In short, the metrics you choose to evaluate your machine learning model matter a great deal: they determine how its performance is measured, compared, and interpreted.

Contents

1. Performance Metrics for Classification Problems

2. Performance Metrics for Regression Problems

3. Distribution of Errors

Performance Metrics for Classification Problems

1. Accuracy

Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total number of observations.

As a heuristic, or rule of thumb, accuracy can tell us immediately whether a model is being trained correctly and how it may perform generally. However, it does not give detailed information regarding its application to the problem.

Does high accuracy mean our model is the best? Accuracy is a great measure, but only when we have symmetric (balanced) datasets where the counts of the positive and negative classes are almost the same.

When the data is imbalanced, accuracy is not the best measure. Accuracy also cannot make use of the model's probability scores; it only looks at the predicted class labels.

Ex: In our Amazon food review sentiment analysis example with 100 reviews, only 10 people have said the review is positive. Let's assume our model is very bad and predicts every review as negative. It misclassifies the 10 positive reviews as negative and gets all 90 negative reviews right. Even though the model never identifies a single positive review, its accuracy is still 90%.
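
A minimal sketch of this pitfall, using made-up labels rather than the actual review data:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 90 negative (0) and 10 positive (1) reviews
y_true = [0] * 90 + [1] * 10

# A useless model that predicts "negative" for every review
y_pred = [0] * 100

# Accuracy looks great even though no positive review is ever found
print(accuracy_score(y_true, y_pred))  # 0.9
```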

2. Confusion Matrix

The confusion matrix is one of the most intuitive and easiest tools for assessing the correctness and accuracy of a model. It is used for classification problems where the output can be of two or more classes.

The confusion matrix works on hard class labels, so it cannot directly use probability scores.

A confusion matrix is an N x N matrix, where N is the number of classes being predicted. For the problem at hand we have N = 2, and hence we get a 2 x 2 matrix.

Let’s assume we label the target variable in our Amazon food reviews as follows:

1: The person says the review is positive.

0: The person says the review is negative.

The confusion matrix is a table with two dimensions ("Actual" and "Predicted") and a set of classes in each dimension. The actual classifications are the rows and the predicted ones are the columns.

The Confusion matrix in itself is not a performance measure as such, but almost all of the performance metrics are based on the Confusion Matrix and the numbers inside it.

The terms associated with the confusion matrix are explained as follows.

True Negatives (TN) − The case when both the actual class and the predicted class of a data point are 0.

Ex: A review that is actually negative (0) and that the model classifies as negative (0) is a True Negative.

False Positives (FP) − The case when the actual class of a data point is 0 and the predicted class is 1.

Ex: A review that is actually negative (0) but that the model classifies as positive (1) is a False Positive.

False Negatives (FN) − The case when the actual class of a data point is 1 and the predicted class is 0.

Ex: A review that is actually positive (1) but that the model classifies as negative (0) is a False Negative.

True Positives (TP) − The case when both the actual class and the predicted class of a data point are 1.

Ex: A review that is actually positive (1) and that the model classifies as positive (1) is a True Positive.

N is the total number of negatives in our data (N = TN + FP) and P is the total number of positives (P = TP + FN).

In terms of the confusion matrix, accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made: Accuracy = (TP + TN) / (TP + TN + FP + FN).

We can use the accuracy_score function of sklearn.metrics to compute the accuracy of our classification model.
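
As a minimal sketch (with made-up labels and predictions rather than the actual review data), confusion_matrix and accuracy_score from sklearn.metrics can be combined like this:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical ground truth and predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes (labels ordered 0, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_true, y_pred))
print((tp + tn) / (tp + tn + fp + fn))  # same value
```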

For a good model, the True Positive Rate and True Negative Rate should be high, and the False Positive Rate and False Negative Rate should be low.

Some examples

  • A False Positive (FP) in an anti-spam engine moves a trusted email to junk.
  • A False Negative (FN) in medical screening can incorrectly indicate the absence of a disease when it is actually present.

Precision, Recall (or Sensitivity), Specificity, F1-Score

Precision and Recall are extensively used in information retrieval problems when we have a large corpus of text data.

Precision: Of all the points the model predicted to be positive, what percentage are actually positive? Precision = TP / (TP + FP).

Precision is about being precise. So even if we managed to capture only one cancer case, and we captured it correctly, then we are 100% precise.

Recall (or Sensitivity, or True Positive Rate): Of all the points that actually belong to the positive class, what fraction did the model predict as positive? Recall = TP / (TP + FN).

Recall is not so much about capturing cases correctly as about capturing all cases that have "cancer". So if we simply label every case as "cancer", we have 100% recall.

So, if we want to focus on minimizing false negatives, we should push recall as close to 100% as possible without letting precision get too bad; if we want to focus on minimizing false positives, we should push precision as close to 100% as possible.

It is clear that recall gives us information about a classifier's performance with respect to false negatives (how many positives we missed), while precision gives us information about its performance with respect to false positives (how many of our positive predictions were wrong).

Specificity (or True Negative Rate): Specificity, in contrast to recall, is the fraction of actual negatives that the model correctly identifies as negative. It can be computed from the confusion matrix as Specificity = TN / (TN + FP).

F1-Score: We don’t really want to carry both Precision and Recall in our pockets every time we make a model for solving a classification problem. So it’s best if we can get a single score that kind of represents both Precision(P) and Recall(R).

This score is the harmonic mean of precision and recall: F1 = 2 · Precision · Recall / (Precision + Recall). The best value of the F1-Score is 1 and the worst is 0, and precision and recall make an equal relative contribution to it.
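
A minimal sketch of these quantities with scikit-learn, on made-up labels; specificity has no dedicated scikit-learn function, so it is derived from the confusion matrix:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

# Hypothetical binary labels (1 = positive review, 0 = negative review)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # 2 * P * R / (P + R)

# Specificity = TN / (TN + FP), computed from the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))
```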

For multi-class classification, we use averaged variants of the F1-Score:

Micro F1-Score: Micro F1-score (short for micro-averaged F1 score) is used to assess the quality of multi-label binary problems. It measures the F1-score of the aggregated contributions of all classes.

If you are looking to select a model based on a balance between precision and recall, don’t miss out on assessing your F1-scores.

A micro F1-score of 1 is the best value (perfect micro-precision and micro-recall), and the worst value is 0. Note that precision and recall have the same relative contribution to the F1-score.

Micro F1-score is defined as the harmonic mean of the micro-precision and micro-recall. With C the set of classes and k ∈ C:

Micro-Precision = Σₖ TPₖ / Σₖ (TPₖ + FPₖ)
Micro-Recall = Σₖ TPₖ / Σₖ (TPₖ + FNₖ)
Micro F1 = 2 · Micro-Precision · Micro-Recall / (Micro-Precision + Micro-Recall)

Micro-averaging is performed by first summing all true positives, false positives, and false negatives over all the labels. We then compute the micro-precision and micro-recall from these sums, and finally take their harmonic mean to get the micro F1-score.

Micro-averaging will put more emphasis on the common labels in the data set since it gives each sample the same importance. This may be the preferred behavior for multi-label classification problems.

Macro F1-Score: Macro F1-score (short for macro-averaged F1 score) is used to assess the quality of problems with multiple binary labels or multiple classes.

Macro F1-score is defined as the average of the per-class F1-scores. With C the set of classes and k ∈ C:

Macro F1 = (1 / |C|) · Σₖ F1ₖ

where F1ₖ is the harmonic mean of the precision and recall of class k.

Macro F1-score will give the same importance to each label/class. It will be low for models that only perform well on the common classes while performing poorly on the rare classes.

Hamming Loss: Hamming loss is the fraction of wrong labels to the total number of labels. In multi-class classification, the Hamming loss is calculated from the Hamming distance between the actual and predicted labels.

This is a loss function, so the optimal value is zero.
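
A minimal sketch of the micro and macro averages and of Hamming loss, using a made-up three-class example:

```python
from sklearn.metrics import f1_score, hamming_loss

# Hypothetical 3-class problem
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 1, 1, 0, 2, 1]

# Micro F1: aggregate TP/FP/FN over all classes, then compute one F1
print(f1_score(y_true, y_pred, average="micro"))

# Macro F1: compute F1 per class, then take the unweighted mean
print(f1_score(y_true, y_pred, average="macro"))

# Hamming loss: fraction of wrong labels (0 is the optimal value)
print(hamming_loss(y_true, y_pred))
```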

3. Receiver Operating Characteristics Curve

Receiver-operating characteristic (ROC) analysis was originally developed during World War II to analyze classification accuracy in differentiating signals from noise in radar detection. Recently, the methodology has been adapted to several clinical areas heavily dependent on screening and diagnostic tests, in particular, laboratory testing, epidemiology, radiology, and bioinformatics.

A Receiver Operating Characteristic (ROC) Curve is a way to compare diagnostic tests. It is a plot of the True Positive Rate against the False Positive Rate.

AUC-ROC (Area Under the Receiver Operating Characteristic Curve) is a performance metric for classification problems, evaluated over varying threshold values. The ROC is a curve obtained from the predicted probabilities, and the AUC measures the degree of separability between the classes.

ROC is used in binary classification. To compute the ROC curve, we do the following (a code sketch follows at the end of this section):

  • Get the classification model's probability predictions, which usually range between 0 and 1.
  • Note that AUC does not depend on the absolute probability values; it only depends on how the points are ranked by their scores.
  • Sort the data by the predicted scores.
  • Choose a set of thresholds at which to classify the probabilities.
  • For each threshold, calculate the TPR and FPR using a confusion matrix.
  • Plot each (FPR, TPR) pair with FPR on the x-axis and TPR on the y-axis, then join the dots with a line.

In simple words, the AUC-ROC metric tells us how capable the model is of distinguishing between the classes. The higher the AUC, the better the model.

The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and FPR is on the x-axis.

An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability; in fact, it is reciprocating the result, predicting 0s as 1s and 1s as 0s. And when the AUC is 0.5, the model has no class separation capacity whatsoever.
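
A minimal sketch with scikit-learn's roc_curve and roc_auc_score on made-up scores; plotting fpr against tpr would draw the curve itself:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3, 0.7, 0.55])

# FPR and TPR at every threshold implied by the sorted scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr, thresholds)

# Area under the ROC curve: 1.0 is perfect, 0.5 is random guessing
print(roc_auc_score(y_true, y_score))
```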

4. Log Loss (Logarithmic Loss)

Log Loss is the most important classification metric based on probabilities.

If the model gives us a probability score, log-loss is the best performance measure for both binary and multi-class classification.

The goal of our machine learning models is to minimize this value. A perfect model would have a log loss of 0.

It’s hard to interpret raw log-loss values, but log-loss is still a good metric for comparing models. For any given problem, a lower log-loss value means better predictions.

Log loss quantifies the average difference between predicted and expected probability distributions.
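
A minimal sketch with scikit-learn's log_loss on made-up probabilities:

```python
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 1, 1, 0, 1]
y_prob = [0.1, 0.9, 0.8, 0.3, 0.6]

# Log loss = -(1/N) * sum(y*log(p) + (1-y)*log(1-p)); lower is better, 0 is perfect
print(log_loss(y_true, y_prob))
```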

Performance Metrics for Regression Problems

1. R² or Coefficient of Determination

R² is known as the coefficient of determination. It is a statistical measure of how close the data are to the fitted regression line, or equivalently, how well a set of predictions fits the actual values. The value of R² typically lies between 0 and 1, where 0 means no fit and 1 means a perfect fit.

R-squared is calculated by dividing the sum of squares of residuals (SSres) from the regression model by the total sum of squares (SStot) of errors from the simple mean model, and then subtracting this ratio from 1: R² = 1 − SSres / SStot.

Here SSres is the sum of squared residuals, where a residual (also called an error) is the difference between the predicted value and the actual value.

And SStot is the total sum of squared errors of a simple mean model (one that always predicts the average of the target).

An R-squared value of 0.81 tells us that the input variables explain 81% of the variation in the output variable. The higher the R-squared, the more variation is explained by the input variables and the better the model.
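
A minimal sketch with scikit-learn's r2_score on made-up values:

```python
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values from a regression model
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# R^2 = 1 - SSres / SStot
print(r2_score(y_true, y_pred))
```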

2. Adjusted R²

The limitation of R-squared is that it will either stay the same or increases with the addition of more variables, even if they do not have any relationship with the output variables.

To overcome this limitation, Adjusted R-square comes into the picture as it penalizes you for adding the variables which do not improve your existing model.

Adjusted R² conveys the same meaning as R² but improves on it. R² suffers from the problem that its score improves as more terms are added even when the model is not actually improving, which may mislead the researcher. Adjusted R² is always lower than (or equal to) R² because it adjusts for the number of predictors, and it only increases when there is a real improvement.

Hence, if you are building Linear regression on multiple variables, it is always suggested that you use Adjusted R-squared to judge the goodness of the model.
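
scikit-learn has no built-in adjusted R-squared, so a small hypothetical helper (the adjusted_r2 function below is my own, not a library API) can be built around r2_score:

```python
from sklearn.metrics import r2_score

def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of samples and p the number of features."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

# Hypothetical values for a model trained on 2 input variables
y_true = [3.0, -0.5, 2.0, 7.0, 4.5, 1.0]
y_pred = [2.5, 0.0, 2.0, 8.0, 4.0, 1.5]
print(adjusted_r2(y_true, y_pred, n_features=2))
```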

3. MEAN SQUARE ERROR (MSE)

MSE or Mean Squared Error is one of the most preferred metrics for regression tasks. It is simply the average of the squared difference between the target value and the value predicted by the regression model.

As it squares the differences, it penalizes even small errors, which can lead to over-estimating how bad the model is. It is preferred over many other metrics because it is differentiable and hence can be optimized more easily.

Here, the error term is squared and thus more sensitive to outliers.

4. ROOT MEAN SQUARE ERROR (RMSE)

RMSE is the most widely used metric for regression tasks and is the square root of the averaged squared difference between the target value and the value predicted by the model.

Because MSE contains squared error terms (and hence squared units), we take the square root of the MSE, which gives us the Root Mean Squared Error (RMSE).

RMSE is highly affected by outlier values. Hence, make sure you’ve removed outliers from your data set prior to using this metric.
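
A minimal sketch of MSE and RMSE on made-up values (RMSE is taken as the square root of the MSE, which works across scikit-learn versions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_true, y_pred)   # average of squared errors
rmse = np.sqrt(mse)                        # back in the units of the target
print(mse, rmse)
```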

5. Mean Absolute Error (MAE)

It is the simplest error metric used in regression problems. It is simply the average of the absolute differences between the predicted and actual values.

In simple words, with MAE we can get an idea of how wrong the predictions were, in the same units as the target.
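
A minimal sketch with scikit-learn's mean_absolute_error on made-up values:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Average of |actual - predicted|
print(mean_absolute_error(y_true, y_pred))  # 0.5
```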

6. Median Absolute Deviation Error (MADE)

Median Absolute Deviation: The median absolute deviation (MAD) is a robust measure of how spread out a set of data is. The variance and standard deviation are also measures of spread, but they are more affected by extremely high or extremely low values and by non-normality.

To compute it, first find the median of the errors; then subtract this median from each error; take the absolute value of these differences; and finally find the median of these absolute differences.
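
A minimal sketch of this procedure with NumPy, on made-up errors that include one outlier:

```python
import numpy as np

# Hypothetical errors (actual - predicted) for each point
errors = np.array([0.5, -0.5, 0.0, -1.0, 0.2, 10.0])  # one large outlier

median_error = np.median(errors)
# Median of the absolute deviations from the median error
mad = np.median(np.abs(errors - median_error))
print(median_error, mad)  # the outlier barely moves the MAD
```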

Distribution of Errors

To understand the errors, compute the error for every point and look at the distribution of these errors using a PDF and a CDF.

The probability distribution of a random error that is equally likely to move the value in either direction is often modeled as a Gaussian distribution.

In a typical error PDF, most of the errors are small and very few are large; smaller errors are better for regression.

In the corresponding error CDF, for example, 99% of the errors are < 0.1 and 1% of the errors are ≥ 0.1.

If we compare the error CDFs of two models, where model M1 (the red curve) has 95% of its errors below 0.1 and model M2 (the blue curve) has 80% of its errors below 0.1, we conclude that M1 is better than M2.
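
A minimal sketch of that kind of comparison on synthetic errors (the error distributions here are made up; in practice you would use each model's actual per-point errors and plot the full CDFs):

```python
import numpy as np

# Hypothetical absolute errors from two regression models
errors_m1 = np.abs(np.random.default_rng(0).normal(0.0, 0.05, 1000))
errors_m2 = np.abs(np.random.default_rng(1).normal(0.0, 0.12, 1000))

# Fraction of errors below 0.1 (one point on the empirical CDF)
print(np.mean(errors_m1 < 0.1))  # close to 1.0 for the tighter model
print(np.mean(errors_m2 < 0.1))  # noticeably lower for the looser model
```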

I also implemented some of these classification and regression metrics from scratch.

For the complete code, visit my GitHub link.

For more details, visit here.

Conclusion

In this post, we covered the various metrics used to evaluate classification and regression models in machine learning.


Thanks for reading and your patience. I hope you liked the post, let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add…

Happy Learning!!
