Performance Metrics for Various Supervised Machine Learning Algorithms


Measuring the performance of your machine learning algorithm is one of the most important steps in building a robust machine learning model. After a long stretch of exploratory data analysis, feature engineering, and cross-validation, the next step is to measure your model’s performance on the test set. Many performance metrics are available, and knowing which metric to use when is as important as knowing the metrics themselves.

In this blog I will walk through the following.

  1. Different types of performance metrics like Accuracy, Log Loss, Precision, Recall, and F1-Score.

1. Accuracy

Let’s start with a metric most of us have been familiar with since childhood. Accuracy is easy to understand and grasp, even for a non-technical person. Let’s take an example to make it concrete.

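Suppose we are training a machine learning model to classify reviews as ‘Positive’ or ‘Negative’ given the review text. Our model makes predictions on the test set, whose true labels are hidden from the model, and we then compare those predictions with the true labels. The formula for accuracy is:

Accuracy = (Number of correctly classified points) / (Total number of points in the test set)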
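
The same idea in Python, as a minimal sketch: count the predictions that match the true label and divide by the total number of predictions. The function name `accuracy` and the six review labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for six reviews (1 = 'Positive', 0 = 'Negative').
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Five of the six predictions match, so accuracy is 5 / 6.
review_accuracy = accuracy(y_true, y_pred)
print(f"Accuracy: {review_accuracy:.3f}")
```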


As we can see, accuracy is simple to compute and easy to interpret. Now let’s talk about some of the loopholes in this metric.

Limitations of Accuracy:

  1. Biased towards the majority class on imbalanced datasets: Suppose you are designing a credit card fraud detection system and you want to predict whether a transaction is fraud (labelled 1) or not fraud (labelled 0). The training data given to you might be heavily imbalanced: 99% of transactions will be labelled 0 and only 1% or so will be labelled 1, due to the inherent nature of credit card transactions. Even a dumb model that is biased towards the majority class, for example one that always predicts 0, can then reach an accuracy of around 99%. Yet the 1% it gets wrong is exactly the fraudulent transactions, which might add up to millions of dollars in fraud and deal a heavy blow to the company.
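
To make this limitation concrete, here is a small sketch with made-up data: 1,000 transactions of which 10 are fraud, scored against a "model" that always predicts the majority class. The numbers are purely illustrative and do not come from a real dataset.

```python
# Hypothetical data: 1,000 transactions, of which only 10 (1%) are fraud.
y_true = [1] * 10 + [0] * 990

# A "dumb" model that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
dumb_accuracy = correct / len(y_true)

# How many of the actual fraud cases did the model flag?
frauds_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"Accuracy: {dumb_accuracy:.2%}")          # Accuracy: 99.00%
print(f"Frauds caught: {frauds_caught} of 10")   # Frauds caught: 0 of 10
```

The accuracy looks excellent even though the model never catches a single fraudulent transaction.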

2. Confusion Matrix

To understand the confusion matrix, let’s say we are solving a binary classification problem where the class labels belong to {0, 1}: 0 indicates the negative label and 1 indicates the positive label. As the name suggests, we draw a matrix, and since this is a binary classification problem it is a 2*2 matrix.

Confusion Matrix

At first this matrix looks daunting, but it is simple to understand when we think about it logically and break it down into pieces.

Each entry of the matrix stores a count.

  1. True Negatives: the count of data points that were actually negative and that our machine learning algorithm also correctly classified as negative.

As we can see, by plotting the confusion matrix we get an idea of how well our model predicts class labels compared with the original labels. This was not possible with accuracy, which relies on just a single number.

Now let us define some other terms.
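
Here is a minimal sketch of how the four entries of a binary confusion matrix can be counted by hand. The labels are made up, the helper name `confusion_counts` is my own, and putting actual labels in rows and predicted labels in columns is just one common layout that I am assuming here.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP), assuming every label is either 0 or 1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # actually negative, predicted negative
        elif actual == 0 and predicted == 1:
            fp += 1  # actually negative, predicted positive
        elif actual == 1 and predicted == 0:
            fn += 1  # actually positive, predicted negative
        else:
            tp += 1  # actually positive, predicted positive
    return tn, fp, fn, tp


# Hypothetical true and predicted labels for ten data points.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Assumed layout: rows are actual labels, columns are predicted labels.
print("           predicted 0  predicted 1")
print(f"actual 0   {tn:^11}  {fp:^11}")
print(f"actual 1   {fn:^11}  {tp:^11}")
```

Unlike a single accuracy number, the four counts show which kind of mistake the model is making.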

Total Number of Positives(P) = False Negatives(FN) + True Positives(TP)

Total Number of Negatives(N) = True Negatives (TN) + False Positives (FP)

  1. True Positive Rate(TPR) = TP / P

For an ideal machine learning model, True Positives and True Negatives should be as high as possible. In real life every model makes some errors, so we aim to keep both TP and TN as high as we can, and consequently both TPR and TNR as high as we can. How much each kind of error matters, however, is very domain and application specific.
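
As a rough sketch, both rates can be computed from the four confusion matrix counts. TPR = TP / P follows the definition above. For the True Negative Rate I am assuming the standard definition TNR = TN / N. The counts are hypothetical and the function name `rates` is my own.

```python
def rates(tn, fp, fn, tp):
    """Return (TPR, TNR) computed from confusion matrix counts."""
    p = fn + tp  # total number of actual positives
    n = tn + fp  # total number of actual negatives
    if p == 0 or n == 0:
        raise ValueError("need at least one positive and one negative point")
    tpr = tp / p  # True Positive Rate
    tnr = tn / n  # True Negative Rate (assumed standard definition)
    return tpr, tnr


# Hypothetical counts: 20 actual positives and 100 actual negatives.
tpr, tnr = rates(tn=90, fp=10, fn=5, tp=15)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")  # TPR = 0.75, TNR = 0.90
```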

When to Plot a Confusion Matrix?

In classification settings it is always worth plotting the confusion matrix, as it gives us more insight into how good our model is at predicting each class label.

3. Precision and Recall

Precision and Recall are mostly used when we are more concerned about predicting one specific class.

Example: Suppose we want to predict whether a credit card transaction is fraudulent or not. We might be more concerned about the fraudulent transactions, as they are the ones affecting our business.

Let’s label fraudulent transactions as 1 and non-fraudulent transactions as 0.



Precision is given as

Precision = (TP)/(TP+FP)

Precision tells us, of all the points that our model predicted to be positive, what percentage were actually positive. For our credit card use case this can be read as “Of all the transactions that our model labelled as fraudulent, how many were actually fraudulent?”


Recall is simply the True Positive Rate (TPR):

Recall = (TP) / (TP+FN)

Recall tells us, of all the points that are actually positive, what percentage were correctly predicted as positive. For our credit card use case this can be read as “Of all the fraudulent transactions that happened, what percentage were correctly predicted as fraud by our machine learning model?”

“For a machine learning classification model, we want both Precision and Recall to be as high as possible.”
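
The two formulas can be sketched in a few lines of Python. The counts below are hypothetical numbers for the fraud example, and returning 0.0 when a denominator is zero is a convention I am assuming here, not something the formulas themselves dictate.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that is actually positive."""
    predicted_positive = tp + fp
    # Assumed convention: report 0.0 if the model predicted no positives at all.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Of everything actually positive, the fraction that was predicted positive."""
    actual_positive = tp + fn
    # Assumed convention: report 0.0 if there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical fraud model: it flags 100 transactions, 80 of them truly fraud,
# and it misses 40 fraudulent transactions.
tp, fp, fn = 80, 20, 40

fraud_precision = precision(tp, fp)  # 80 / 100
fraud_recall = recall(tp, fn)        # 80 / 120

print(f"Precision: {fraud_precision:.2f}")  # Precision: 0.80
print(f"Recall:    {fraud_recall:.2f}")     # Recall:    0.67
```

In this example most flagged transactions really are fraud, but a third of the actual fraud still slips through.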

When to use Precision?

Precision = (True Positives)/(True Positives + False Positives)

Precision is a good measure to use when the cost of a False Positive is high. For example, in spam detection a False Positive means that an email is labelled as spam when it is actually not spam, so the user might lose useful information in that email. (Source :

When to use Recall?

Recall = (True Positives)/(True Positives + False Negatives)

By a similar argument, Recall is a good measure to use when the cost of a False Negative is high. For example, in credit card fraud detection a False Negative means that a transaction is labelled as non-fraudulent (0) when it is actually fraudulent. This could lead to millions of dollars in losses for the company.


Most of the time we want to represent Precision and Recall with a single metric, and that is the F1-Score. The F1 Score is the harmonic mean of Precision and Recall.

F1 Score = (2 * Precision * Recall) / (Precision + Recall)
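
A short sketch of the formula, with made-up Precision and Recall values, shows why the harmonic mean is used: it stays low unless both values are high, whereas a plain average can hide a very poor Recall. Returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical model A: precision and recall are both 0.8.
balanced = f1_score(0.8, 0.8)

# Hypothetical model B: perfect precision but very poor recall.
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(round(balanced, 3))         # 0.8
print(round(lopsided, 3))         # 0.182
print(round(arithmetic_mean, 3))  # 0.55
```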

When to use F1 Score?

One problem with the F1 Score is that it is not as interpretable as Precision and Recall. Still, it is better practice to use the F1-Score on imbalanced datasets, and typically when predicting one class is more important than the other. For example, when classifying a patient as cancerous or non-cancerous, it is more important to get the cancerous class right: predicting cancer for a patient who does not have it is a mistake we can tolerate, whereas what we want above all is not to misclassify a cancerous patient as non-cancerous. In such cases it is better to use the F1 Score.

Where to go from here?

Although I have covered some basic evaluation metrics, there are plenty more out there that you can explore when you encounter a problem that requires them.

For example:

  1. All Kaggle competitions provide an evaluation metric on which the test data is evaluated.

You can find a more extensive list of metrics at the link below.


  1. Applied AI Course.


I am a Machine Learning and Deep Learning enthusiast who routinely reads self-help books. I would like to share my knowledge by writing blogs. Sky is the limit!