Machine Learning Metrics - Binary Classification (Accuracy, Confusion Matrix, Precision, Recall, F1-Score, ROC and AUC)

MAITRIK DAS
8 min read · Dec 5, 2019

After building a Machine Learning model, it is important to check its performance to understand whether the model we chose actually fits the needs of the problem, or was just picked at random.

First comes Accuracy.

For a Machine Learning model, Accuracy is defined as:

Accuracy = Number of correctly predicted points / Total number of predictions made.

Machine learning lets us predict labels for new points based on the points the model was trained on, and accuracy tells us how good those predictions are when compared against the actual labels. As an example, if a model makes predictions for 100 data points and 90 of them are correctly classified, the accuracy is 90/100 = 0.9, so this model has good accuracy.
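As a minimal sketch of that arithmetic (the labels and predictions below are made up purely for illustration):

```python
# Hypothetical example: 10 true labels and 10 predictions (illustrative only).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Accuracy = correctly predicted points / total predictions made.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
print(accuracy)  # 0.8 here; the 90-out-of-100 example above would give 0.9
```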

Accuracy does have some cons. The basic problem with this tool is that it can look good even on an imbalanced dataset: if the test set is heavily skewed towards one class, a model that simply predicts the majority class will still show very good accuracy, which is obviously a bad signal. Secondly, for binary classification, when we interpret results with probabilities to build better intuition, this metric does not take the probabilities into account. To get around these problems, researchers looked into some other tools, which I am going to cover below.
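Here is a small, hypothetical sketch of the imbalance problem: a "model" that blindly predicts the majority class still scores very high accuracy.

```python
# Illustrative imbalanced test set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

# A "model" that blindly predicts the majority class for every point.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- looks great, yet the model never finds a single positive
```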

The confusion matrix, a popular metric for evaluating binary classification results, causes quite a bit of confusion at first, just as the name suggests, but it is actually a very cool and interesting way to evaluate your results. So let's dive into it…

The confusion matrix comes in a few equivalent forms: the matrix changes depending on whether the predicted positives and negatives are placed in the rows or in the columns, giving four possible layouts in total, but in all of them the underlying values stay the same. Here I am describing the layout where the predictions sit in the rows and the actual values in the columns.

Well, the two columns are the actual negatives (0) and the actual positives (1): in binary classification we label positive data points as 1 and negative data points as 0, so the two columns cover all of the actual negative and positive points in the dataset. The first row holds the predicted negatives, labelled 0 in the matrix, and the second row holds the predicted positives, labelled 1, just like the binary class labels. Now let's define each cell:

  1. Cell (0,0), TN, True Negative: the point is actually negative and it is predicted as negative as well.
  2. Cell (0,1), FN, False Negative: the point is actually positive but it is predicted as negative, a misclassification, rather like a type-II error in hypothesis testing.
  3. Cell (1,0), FP, False Positive: the point is actually negative but it is predicted as positive, another form of misclassification, roughly the type-I error as far as hypothesis testing is concerned.
  4. Cell (1,1), TP, True Positive: the point is actually positive and the prediction says so as well.
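Assuming scikit-learn is available, a minimal sketch of reading these four cells off a confusion matrix might look like this (the labels are made up; note that scikit-learn's own layout puts the actual values in the rows and the predictions in the columns, i.e. one of the other equivalent forms mentioned above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only; any binary predictions would do.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)

# Accuracy read off the confusion matrix, matching the formula below.
print("Accuracy:", (tp + tn) / (tp + fp + tn + fn))
```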

Intuitively, accuracy expressed through the confusion matrix is (TP + TN) / (TP + FP + TN + FN). Since true positives and true negatives are the counts that must dominate for good accuracy, we want to keep these two as high as possible.

Now there are some other metrics that are derived directly from the confusion matrix, namely TPR, FPR, TNR and FNR.

TPR = TP / (TP + FN). TPR, the True Positive Rate, is the number of points that are actually positive and also predicted positive, divided by the total number of actual positives. For example, if there are 165 data points in total, of which 100 are true positives and 110 are actually positive, the TPR is 100/110 ≈ 0.91, which is high because TP is high. TPR is also known as Sensitivity, and as Recall. Recall is a very important metric for certain business problems that demand a high recall, which is discussed later in this blog.

FPR = FP / (FP + TN). FPR, the False Positive Rate, is the number of false positives divided by the total number of actual negatives; it tells us, when a point is actually negative, how often the model predicts it incorrectly. Conversely, the TPR tells us, when a point is actually positive, how often the model predicts it correctly.

TNR = TN / (TN + FP). TNR, the True Negative Rate, is the number of points that are both actually and predicted negative divided by the total number of actual negatives; it is also known as Specificity.

Lastly, there is another metric that is always business-intuitive: Precision. Precision = TP / (TP + FP), the number of points that are both actually and predicted positive divided by the total number of predicted positives. It can be interpreted as: when a point is predicted positive, how often is that prediction correct?
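Putting these four rates together, here is a minimal sketch using the TP = 100 and FN = 10 counts quoted above; the TN and FP values are assumed purely for illustration so that the totals add up to 165:

```python
# Counts from the example above (TP = 100, FN = 10); TN and FP are assumed
# here just to complete the picture, so that the total comes to 165 points.
tp, fn, tn, fp = 100, 10, 45, 10

tpr = tp / (tp + fn)        # True Positive Rate (Sensitivity / Recall)
fpr = fp / (fp + tn)        # False Positive Rate
tnr = tn / (tn + fp)        # True Negative Rate (Specificity)
precision = tp / (tp + fp)  # Precision

print(f"TPR={tpr:.2f} FPR={fpr:.2f} TNR={tnr:.2f} Precision={precision:.2f}")
```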

Well, in binary classification, if we try to optimize precision the model predicts fewer points as positive, which causes a reduction in recall, and the same happens in reverse.

F1 Score: this metric is needed when we want both precision and recall to stay high.

F1 score = 2 * ((Precision * Recall) / (Precision + Recall)); this formula is the harmonic mean of precision and recall. The F1 score lies between 0 and 1, and while it may not be as directly interpretable as the other metrics, it still has its own value in evaluating models.
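A minimal sketch with scikit-learn (assumed to be installed), using made-up labels, showing that f1_score matches the harmonic-mean formula:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels only.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall.
print(f1_score(y_true, y_pred))
print(2 * p * r / (p + r))  # same value, computed from the formula above
```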

Since probability plays a pivotal role in binary classification, I am now going to dive into metrics based on probabilities. ROC and AUC, together often called AUROC, form another important metric for evaluating a newly built model.

ROC stands for Receiver Operating Characteristic curve. It has an interesting history rooted in World War II, where radar receiver operators had to detect enemy signals, and the term comes from the electronics and communications world.

We get a probability for each data point of being positive or negative, and when the probability is more than 0.5 we label the point as positive, otherwise negative. This 0.5 acts as a threshold that sets the boundary between the two classes. For a standard problem it is fine to take 0.5 as the threshold, but the right value changes with different business needs, and we shouldn't commit to a threshold before looking at the resulting confusion matrices. So we can take each data point's predicted probability as a candidate threshold and see which one satisfies our requirements. For every threshold we get a different set of predicted labels, and hence a different confusion matrix, but computing and inspecting that many confusion matrices quickly becomes confusing and frustrating.
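For example, with some made-up probabilities and scikit-learn assumed available, sweeping just a few thresholds by hand already produces a different confusion matrix each time:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predicted probabilities, purely for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.35, 0.62, 0.80, 0.45, 0.55, 0.90, 0.20, 0.40, 0.70])

# Every distinct threshold gives a different set of predicted labels,
# and therefore a different confusion matrix.
for threshold in [0.2, 0.5, 0.8]:
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}")
    print(confusion_matrix(y_true, y_pred))
```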

To get rid of this confusion, researchers came up with the idea of the ROC curve: for every threshold we choose, we compute a separate TPR and FPR, plot the FPRs along the X-axis and the TPRs along the Y-axis, and the resulting curve is the ROC curve. The area under the ROC curve is known as the AUC.
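A minimal sketch of plotting the ROC curve and computing the AUC with scikit-learn and matplotlib (both assumed to be installed), reusing the made-up labels and probabilities from the previous sketch:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Reusing the illustrative labels and probabilities from the sketch above.
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.35, 0.62, 0.80, 0.45, 0.55, 0.90, 0.20, 0.40, 0.70]

# roc_curve sweeps the thresholds for us and returns one (FPR, TPR) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="random model (diagonal)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```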

From this curve we can choose the right threshold for our problem, and overall the AUROC gives us an intuition of how good the model is, i.e. how well it can separate the two classes with its decision boundary. The diagonal line in the plot corresponds to a random model: a random model has an AUROC of 0.5 and lies along the diagonal, meaning it cannot distinguish between the two classes. The best possible value is 1 and the worst is 0; an AUROC close to 1 means the model is good, while a value close to 0 means the performance is terrible and not acceptable.

Now, whether we need precision or recall to be high depends largely on which of the two misclassification cells we most need to keep small, and the two cases are described below…

  1. Suppose we analyse a cancer-detection dataset, predicting whether patients have cancer or not. Then, besides keeping TN and TP high, our objective in the confusion matrix is to keep FN as small as possible. FN stands for false negatives: patients we predict as having no cancer who actually do have cancer, which is like committing a blunder. In such cases we may let FP grow a little larger; that misclassification matters far less, because predicting that patients have cancer when they don't only makes them more careful about themselves, whereas a large FN creates a real mess. So here we need a minimal FN, which ends up increasing recall, and to stay on the safe side we choose a very low threshold (cutoff), maybe around 0.1 or 0.2, by analogy with hypothesis testing, which consequently increases the number of predicted 1's and decreases the number of predicted 0's.
  2. Another problem: if we classify bank data on whether clients will subscribe to a bank scheme or not, then besides keeping TN and TP high, our basic objective in the confusion matrix is to reduce FP, the false positives: points we predict as positive that are actually negative. That is a terrible outcome here, because we have predicted that clients will subscribe to the scheme when in reality they will not, which in the long run has a real business impact. So to stay safe, we instead raise the threshold quite high before saying a client will subscribe; consequently we get fewer 1's and more 0's, which intuitively increases precision. A small sketch of this trade-off follows the list.
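As a minimal sketch of this trade-off, with made-up labels and probabilities and scikit-learn assumed available, sweeping the threshold shows recall falling and precision rising as the cutoff goes up:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities, just to show the trade-off.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.85, 0.40, 0.30, 0.65, 0.20, 0.55,
                   0.15, 0.10, 0.75, 0.35, 0.50, 0.60])

for threshold in [0.2, 0.5, 0.8]:
    y_pred = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")

# A low threshold (the cancer-style case) predicts more 1's and pushes recall up;
# a high threshold (the subscription-style case) predicts fewer 1's and pushes precision up.
```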
