Performance Metrics and Their Types

Sandeep Painuly
Dec 28, 2021


Hi folks! Great to see you again. This is one of the most important topics which, in my opinion, every aspiring ML engineer should know. This article is about performance metrics, or in other words, how you can evaluate the performance of your baby ML model. But you might be thinking, why this topic, right? Let me provide some insight. When we are new to this machine learning universe, we are always in a hurry while building the model, but we forget the most important thing: the evaluation of that model.

“HOW GOOD IS YOUR MODEL??”

Having once been part of the same group, I have seen a lot of analysts and aspiring data scientists not caring to check how robust their model is. Once they finish building a model, they hurriedly map predicted values onto unseen data. In my opinion, this is the wrong approach. Simply building a predictive model should not be your motive. It is about creating and selecting a model which gives high accuracy on out-of-sample data. Hence, it is crucial to check the accuracy of your model prior to computing predicted values.

WHY PERFORMANCE METRICS ?

From childhood we are taught that whatever we do in life, whether in studies, sports or any other activity, how well we did is judged from our performance/results. Similarly, in machine learning, how good our model is gets evaluated with performance metrics.

They are just like a report card for our model: we can see what we did wrong and how we can change the model to get better results. Performance metrics give us a clear picture of the model we have built.

WHAT ARE PERFORMANCE METRICS ?

Machine learning metrics are used to understand how well the model performed on the input data supplied to it. This way, the performance of the model can be improved by tuning the hyperparameters or tweaking features of the input data set. The main goal of a learning model is to generalize well on never-before-seen data, and performance metrics help in determining how well the model generalizes to new data.

There is no set rule for choosing performance metrics, nor can we use every metric on a single model. By using different metrics for performance evaluation, we should be in a position to improve the overall predictive power of our model before we roll it out for production on unseen data.

Without a proper evaluation of the ML model using different metrics, and relying only on accuracy, we can run into trouble when the model is deployed on unseen data, ending up with poor predictions and a HELL of a lot of re-work!

This happens because, in cases like these, our models don’t learn but instead memorize; hence, they cannot generalize well on unseen data.

BUILDING UP THE BASE

Talking in terms of predictive models, we are mostly talking about either a regression model (continuous output) or a classification model (nominal or binary output). So we use different metrics for different types of models.

In classification problems, we use two types of algorithms (depending on the kind of output they create):

  1. Class output: Algorithms like SVM and KNN create a class output. For instance, in a binary classification problem, the outputs will be either 0 or 1. However, today we have algorithms which can convert these class outputs to probability.
  2. Probability output: Algorithms like Logistic Regression, Random Forest, Gradient Boosting, AdaBoost etc. give probability outputs. Converting a probability output to a class output is just a matter of choosing a threshold probability, as in the sketch below.
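To make the difference between class output and probability output concrete, here is a minimal sketch. It assumes scikit-learn; the dataset, the logistic regression model, and the 0.5 threshold are all made up purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy binary-classification data, purely for illustration
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)

# Probability output: P(class = 1) for each test point
probs = model.predict_proba(X_test)[:, 1]

# Class output: apply a threshold (0.5 here, chosen only as an example)
preds = (probs >= 0.5).astype(int)
print(probs[:5], preds[:5])
```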

In regression problems, we do not have such inconsistencies in output. The output is always continuous in nature and requires no further treatment.

DIFFERENT TYPES OF PERFORMANCE METRICS

1. Confusion Matrix

A confusion matrix is a matrix representation of the prediction results of a binary test. It is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.

Each prediction can be one of the four outcomes, based on how it matches up to the actual value:

True Positive (TP): Predicted True and True in reality.

True Negative (TN): Predicted False and False in reality.

False Positive (FP): Predicted True and False in reality.

False Negative (FN): Predicted False and True in reality.

a.) Precision :- Out of all the data points (predictions) the model labelled as positive, it tells us how many were actually positive/correct:

Precision = (True Positive) / (True Positive + False Positive)

b.) Recall

Out of all the data points that are actually positive in the data set, it tells us how many the model correctly identified:

Recall = (True Positive) / (True Positive + False Negative)

Sample Model with Code example

Constructing the confusion matrix as below :-
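Here is a minimal sketch of the idea, assuming scikit-learn; the dataset and the random forest model are made up purely for illustration and are not the article's original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy data and model, purely for illustration
X, y = make_classification(n_samples=1000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
```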

2. Accuracy

Overall, how often is the classifier correct?

Accuracy = (TP + TN) / (TP + TN + FP + FN)

When our classes are roughly equal in size, we can use accuracy, which gives us the proportion of correctly classified values.

Accuracy is a common evaluation metric for classification problems. It’s the number of correct predictions made as a ratio of all predictions made.
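As a quick, hand-made illustration (the labels below are invented, not from a real model), accuracy is simply the fraction of predictions that match the true labels:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]   # actual labels (made up)
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]   # model predictions (made up)

# 6 of the 8 predictions match the true labels -> accuracy = 0.75
print(accuracy_score(y_true, y_pred))
```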

3. F1-Score

In practice, when we try to increase the precision of our model, the recall goes down, and vice-versa. The F1-score captures both trends in a single value:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

F1-score is a harmonic mean of Precision and Recall, and so it gives a combined idea about these two metrics. It is maximum when Precision is equal to Recall. But there is a catch here. The interpretability of the F1-score is poor. This means that we don’t know what our classifier is maximizing — precision or recall? So, we use it in combination with other evaluation metrics which gives us a complete picture of the result.
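Using the same made-up labels as in the accuracy example, a minimal sketch with scikit-learn shows how F1 combines precision and recall:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]   # actual labels (made up)
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]   # model predictions (made up)

p = precision_score(y_true, y_pred)   # TP / (TP + FP) = 3 / 4 = 0.75
r = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))       # harmonic mean: 2*p*r / (p + r) = 0.75
```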

4. Receiver Operating Characteristics (ROC) Curve

Measuring the area under the ROC curve is also a very useful method for evaluating a model. By plotting the true positive rate (sensitivity) versus the false-positive rate (1 — specificity), we get the Receiver Operating Characteristic (ROC) curve. This curve allows us to visualize the trade-off between the true positive rate and the false positive rate.

For a good classifier, the area under the curve (AUC) is close to 1 and well above 0.5, which is what a random classifier would score. A perfect classifier's ROC curve shoots straight up the Y-axis to the top-left corner and then runs along the top of the plot.
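A minimal sketch of drawing such a curve, assuming scikit-learn and matplotlib (the dataset and model are made up for illustration), could look like this:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Toy data and model, purely for illustration
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probs = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)

plt.plot(fpr, tpr, label=f"model (AUC = {roc_auc_score(y_test, probs):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity)")
plt.legend()
plt.show()
```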

5. Log Loss

Log Loss is the most important classification metric based on probabilities.

It measures the performance of a classification model where the prediction input is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label. The goal of any machine learning model is to minimize this value. As such, smaller log loss is better, with a perfect model having a log loss of 0.
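A minimal sketch of computing log loss with scikit-learn (the labels and probabilities below are invented for illustration):

```python
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]              # actual labels (made up)
y_prob = [0.9, 0.1, 0.8, 0.35, 0.2]   # predicted P(class = 1) (made up)

# Average of -[y*log(p) + (1-y)*log(1-p)] over all samples;
# confident wrong predictions are punished heavily
print(log_loss(y_true, y_prob))
```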

6. Root Mean Squared Error (RMSE)

RMSE is the most popular evaluation metric used in regression problems. It follows the assumption that errors are unbiased and follow a normal distribution. Compared to mean absolute error, RMSE gives higher weightage to, and punishes, large errors.

The RMSE metric is given by:

RMSE = √[ (1/N) × Σ (Predictedᵢ − Actualᵢ)² ]

where N is the total number of observations.
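As a minimal sketch (the values are invented for illustration), RMSE can be computed directly with NumPy:

```python
import numpy as np

actual    = np.array([3.0, 5.0, 2.5, 7.0])   # true values (made up)
predicted = np.array([2.5, 5.0, 4.0, 8.0])   # model predictions (made up)

# Square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
print(rmse)
```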

7. R-Squared/Adjusted R-Squared

We learned that when the RMSE decreases, the model’s performance will improve. But these values alone are not intuitive.

In the case of a classification problem, if the model has an accuracy of 0.8, we could gauge how good our model is against a random model, which has an accuracy of 0.5. So the random model can be treated as a benchmark. But when we talk about the RMSE metrics, we do not have a benchmark to compare.

This is where we can use the R-Squared metric. The formula for R-Squared is as follows:

R-Squared = 1 − (Sum of Squared Errors of the model) / (Sum of Squared Errors of a baseline model that always predicts the mean of the actual values)

So a model that does no better than predicting the mean gets an R-Squared of 0, while a perfect model gets 1.
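Using the same made-up values as in the RMSE sketch, R-Squared can be obtained from scikit-learn:

```python
from sklearn.metrics import r2_score

actual    = [3.0, 5.0, 2.5, 7.0]   # true values (made up)
predicted = [2.5, 5.0, 4.0, 8.0]   # model predictions (made up)

# 1 - (residual sum of squares) / (total sum of squares around the mean)
print(r2_score(actual, predicted))
```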

Conclusion

The ultimate purpose behind working with these evaluation metrics is understanding how well a machine learning model is going to perform on unseen data. Metrics like accuracy, precision, and recall are good ways to evaluate classification models on balanced datasets, but if the data is imbalanced and there is a class disparity, then other methods like ROC/AUC do a better job of evaluating model performance.

Well, this concludes the article. If you felt that it helped clear your doubts, do comment and give it a thumbs up. In the end, I hope you guys liked it. You can also find me on LinkedIn and Twitter.

Lastly, do not hold any queries back and ask them in the comments :)

Thanks for reading !!! 🎉
