Metrics For Logistic Regression

ARIJIT BARAT · Published in CodeX · 11 min read · Mar 29, 2022

Deploying a model without measuring it with suitable metrics is, simply put, a sin.

For a machine learning professional, being able to evaluate the performance of different models is as important as being able to train them. The simplest way to measure the performance of a classification model is to calculate its accuracy, but accuracy does not paint the whole picture: some models show great accuracy yet are not good models. So in this article we discuss the following metrics-:

  • Accuracy
  • Recall
  • Precision
  • Sensitivity
  • Specificity
  • F-Score
  • ROC curve
  • AUC

Confusion matrix overview-:

Fig-1: Confusion Matrix
  • TP → True Positive.
  • TN → True Negative.
  • FP → False Positive (Type-1 error).
  • FN → False Negative (Type-2 error).
  • P → Sum of the actual positive class.
  • N → Sum of the actual negative class.
  • P’ → Sum of the predicted positive class.
  • N’ → Sum of the predicted negative class.

Note -: P’ + N’ = P + N
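
If you prefer to see these counts in code, here is a minimal sketch (assuming scikit-learn is available; the labels and variable names are made up purely for illustration):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # classes predicted by the model

# With labels=[0, 1], the returned matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

p, n = tp + fn, tn + fp            # P, N   -> actual positives / negatives
p_dash, n_dash = tp + fp, tn + fn  # P', N' -> predicted positives / negatives

print(tp, tn, fp, fn)              # 3 3 1 1
print(p_dash + n_dash == p + n)    # True, i.e. P' + N' = P + N
```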

Let’s try to understand the above abbreviations with examples from two different scenarios:

Scenario-1: Classifier detecting whether a person is sick or healthy.

True positive: A sick person who is diagnosed as sick.

True negative: A healthy person who is diagnosed as healthy.

False positive: A healthy person who is incorrectly diagnosed as sick.

False negative: A sick person who is incorrectly diagnosed as healthy.

Scenario-2: Classifier detecting whether an email I received is spam or ham.

True positive: A spam email that is correctly classified as spam.

True negative: A ham email that is correctly classified as ham.

False positive: A ham email that is incorrectly classified as spam.

False negative: A spam email that is incorrectly classified as ham.

Note -: In scenario-1, a False negative costs more, because it is clearly dangerous: if we predict a sick person as healthy and advise them not to take any medicines, it could cost someone’s life.

While in scenario-2, a False positive costs more: just think, if you get an offer letter and your classifier predicts it as spam, it could cost you a job.

So prioritizing False positives or False negatives totally depends on the story behind your data.

Accuracy -:

Accuracy is the number of correctly predicted data points out of all the data points.

Fig-2: Formula for accuracy → Accuracy = (TP + TN) / (P + N)

Similarly, we also use the term “error rate” (or misclassification rate) to keep track of the number of incorrectly predicted data points out of all the data points, which is (1 - Accuracy).

Note -: If we have an imbalanced data set, then we should never rely on Accuracy alone as a metric.

Now the question is: why can’t we use Accuracy when we have an imbalanced data set (e.g. a fraud-detection or breast-cancer data set)?

Here is the explanation for the above question → In an imbalanced data set the main class of interest is rare. That is, the data set distribution shows a significant majority of the negative class and a small minority of the positive class.

To understand the above statement, let’s take an example. Say in fraud-detection applications, the class of interest (or positive class) is “fraud,” which occurs much less frequently than the negative “non-fraudulent” class. In medical data, there may be a rare class, such as “cancer.” Suppose that you have trained a classifier to classify medical data, where the class label attribute is “cancer” and the possible class values are “yes” and “no.” An accuracy rate of, say, 97% may make the classifier seem quite accurate, but what if only 3% of the training tuples are actually cancer? Clearly, an accuracy rate of 97% may not be acceptable: the classifier could be correctly labeling only the noncancer tuples and misclassifying all the cancer tuples. Instead, we need other measures, which assess how well the classifier can recognize the positive tuples (cancer = yes) and how well it can recognize the negative tuples (cancer = no).
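
Here is a hedged sketch of that 97%/3% situation (assuming scikit-learn; the labels are made up, and the “model” simply predicts the majority class every time):

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 3 + [0] * 97   # 3% positive (cancer = yes), 97% negative
y_pred = [0] * 100            # a lazy classifier that always predicts "no"

print(accuracy_score(y_true, y_pred))  # 0.97 -> looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -> it misses every cancer case
```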

The Sensitivity and Specificity measures can be used, respectively, for this purpose. So next let’s discuss sensitivity & specificity.

Sensitivity & Specificity -:

Sensitivity → It determines the proportion of positive tuples that are correctly identified by the classifier.

Fig-3: Formula for Sensitivity → Sensitivity = TP / P = TP / (TP + FN)

Specificity → It determines the proportion of negative tuples that are correctly identified by the classifier.

Fig-4: Formula for Specificity → Specificity = TN / N = TN / (TN + FP)

Now let’s take an example and understand how Sensitivity differs from Specificity.

Example -:

Fig-5: Confusion matrix for the classes cancer = yes and cancer = no:

                          Predicted: yes   Predicted: no    Total
Actual: yes (cancer)                  90             210      300
Actual: no (no cancer)               140            9560     9700
Total                                230            9770    10000

The accuracy of the classifier is 9650/10,000 = 96.50%

The sensitivity of the classifier is 90/300 = 30.00%

The specificity of the classifier is 9560/9700 = 98.56%
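
As a quick sanity check, the same three numbers can be recomputed directly from the counts implied by fig-5 (TP = 90, FN = 210, FP = 140, TN = 9560):

```python
TP, FN, FP, TN = 90, 210, 140, 9560

accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 9650 / 10000
sensitivity = TP / (TP + FN)                   #   90 /   300
specificity = TN / (TN + FP)                   # 9560 /  9700

print(f"{accuracy:.2%}  {sensitivity:.2%}  {specificity:.2%}")  # 96.50%  30.00%  98.56%
```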

Note that although the classifier has high accuracy, its ability to correctly label the positive (rare) class is poor, given its low sensitivity. It has high specificity, meaning that it can accurately recognize negative tuples but not the positive tuples.

Thus this classifier is NOT GOOD 😦, as it cannot detect the patients who actually have cancer (i.e. it has very low sensitivity). This means the model has too many False Negatives, and that is dangerous if we suggest a person should not undergo any treatment when they actually have cancer.

Now think of another situation, where a classifier detects spam mail. Here class = ‘Yes’ means it is a spam mail, which is the positive class, and class = ‘No’ means it is a ham mail, which is the negative class.

Let’s assume that we got the same accuracy, sensitivity and specificity.

This time we may say that the classifier is good because it has high specificity, meaning none of our important mails will be sent to spam by default, so we will not miss any important mail in the inbox. But since the sensitivity is low, we may get some spam mails in our inbox classified as ham.

Note -: On the cancer data set the classifier is bad if it has too many ‘False Negatives’, because then we predict a cancer patient to be non-cancerous. While on the spam mail data set the classifier is bad if it has too many ‘False Positives’, because then we predict ham mail as spam.

So the metric to be used always depends on the data set and the story behind the data. 😃

Precision & Recall -:

Precision → It measures what percentage of the data points predicted as positive by the classifier are actually positive.

Fig-6: Formula for Precision → Precision = TP / (TP + FP)

Let’s take an example and understand Precision → Refer to fig-5, where a total of 230 data points in the first column are predicted as +ve by the classifier. To find how many of them actually belong to the +ve class, we use precision; as per the formula, precision = 90/230 ≈ 39.13%.

Recall(same as sensitivity) → It determines the proportion of positive tuples that are correctly identified by the classifier.

Fig-7: Formula for Recall → Recall = TP / (TP + FN) = TP / P

Similarly, refer to fig-5 for recall. Suppose we know that a total of 300 data points actually belong to the +ve class. Then Recall (or Sensitivity) tells us what percentage of those +ve tuples are correctly labeled (or predicted) as such by the classifier.

That means in fig-5, 90 +ve data points are predicted correctly by the classifier out of a total of 300 actual +ve data points. Hence, as per the recall formula, recall = 90/300 = 30%.
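
Again, as a small check, precision and recall can be computed from the same fig-5 counts:

```python
TP, FN, FP = 90, 210, 140

precision = TP / (TP + FP)   # 90 / 230
recall    = TP / (TP + FN)   # 90 / 300

print(f"precision = {precision:.2%}, recall = {recall:.2%}")  # 39.13%, 30.00%
```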

Important Note on Recall & Precision WRT False -ve & False +ve with an example -:

Recall → A metric that measures how well our model did with false negatives (Ex -: In a coronavirus test, we are more concerned about false -ve than false +ve, because we do not want to tell a person they are -ve when they are actually +ve).

This model can’t afford to have too many false -ve.

Example -:

Fig-8: Confusion matrix on coronavirus test.
Fig-9: Model-1
Fig-10: Model-2

Note -: From the above example we conclude that the recall value goes down if we have too many false -ve; in other words, recall can’t afford too many false -ve.

Precision → A metric that measures how well our model did with false positives (Ex -: A spam email classifier, where a spam mail is the positive scenario, so a false positive means the classifier classifies an important mail as spam, which costs me more than a false -ve, where I would just read one extra spam mail).

This model can’t afford to have too many false +ve.

Example -:

Fig-11: Classifier Model 1 & Model 2

Note -: From the above example we conclude that the precision value goes down if we have too many false +ve; in other words, precision can’t afford too many false +ve.

F-Score -:

It is the harmonic mean of “precision” and “recall”, taking both metrics into account.

Fig-12: Formula for F-Score → F = 2 * (Precision * Recall) / (Precision + Recall)

Here we take the harmonic mean instead of the simple mean because the simple mean would not punish extreme values, while the harmonic mean does.

Let’s take an example →

Say a model gives precision = 1.0 & recall = 0.0.

Simple mean = (1 + 0) / 2 = 0.5

Harmonic mean = [2 * (1 * 0)] / (1 + 0) = 0

Thus we conclude that the harmonic mean brings down the overall F-score whenever either value (precision or recall) is low. The F-score punishes a model with a low recall or precision value.
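
The same comparison, written out as a tiny sketch (the f_score helper below is just an illustrative implementation of the harmonic-mean formula above, not a library function):

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print((1.0 + 0.0) / 2)    # simple mean   -> 0.5
print(f_score(1.0, 0.0))  # harmonic mean -> 0.0, punishing the zero recall
print(f_score(0.8, 0.7))  # -> ~0.747 when both values are reasonably high
```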

Fβ Score -:

The Fβ score uses a parameter β (the Greek letter beta), which can take any positive value. The point of β is to act as a dial that we turn to emphasize precision or recall. More specifically, if we slide the β dial toward zero, we get full precision; if we slide it toward infinity, we get full recall. In general, the lower the value of β, the more we emphasize precision, and the higher the value of β, the more we emphasize recall.

So you can decide whether to punish your model for low precision or for low recall by turning the β knob.

The Fβ score is defined as follows (where precision is P and recall is R)-:

Fig-13: Fβ-score formula → Fβ = (1 + β²) * (P * R) / (β² * P + R)
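
As a sketch of how the β dial behaves, we can use scikit-learn’s fbeta_score on some made-up labels (the labels below are purely illustrative):

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]

print(precision_score(y_true, y_pred))        # 0.40
print(recall_score(y_true, y_pred))           # 0.50
print(fbeta_score(y_true, y_pred, beta=0.5))  # ~0.417 -> pulled toward precision
print(fbeta_score(y_true, y_pred, beta=1.0))  # ~0.444 -> the usual F1 score
print(fbeta_score(y_true, y_pred, beta=2.0))  # ~0.476 -> pulled toward recall
```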

ROC curve -:

The Receiver Operating Characteristic (ROC) curve is drawn from the various thresholds between 0 and 1; by varying this threshold we can increase or decrease the specificity and sensitivity.

In logistic regression, where we predict a binary class, we choose a threshold value: if the probability predicted by the classifier is more than the threshold we choose one class, and if it is less than the threshold we choose the other class. This threshold is what regulates the specificity and sensitivity along the ROC curve.

As we move the threshold from low to high(i.e. 0 to 1), the sensitivity decreases, and the specificity increases.

This means that in the coronavirus test, if we want to decrease the false -ve, we have to choose a lower threshold, while in the spam mail classifier, if we want to decrease the false +ve, we need to choose a higher threshold value.

The below diagram shows how sensitivity decreases and specificity increases as the threshold value increases from 0 to 1.

Fig-14: Specificity and sensitivity for different thresholds

The below diagram represents the ROC curve as a graph with sensitivity and specificity on the X and Y axes respectively, along with the threshold moving from 0 to 1 and each dot labeled by its timestep. (Note that the conventional ROC plot puts the false positive rate, i.e. 1 - specificity, on the X axis and sensitivity on the Y axis; the sensitivity-vs-specificity view used here carries the same information.)

Fig-15: Here we can see the ROC curve corresponding to our ongoing example, which gives us a great deal of information on our model. The highlighted dots correspond to the timesteps obtained by moving our threshold from 0 to 1, and each dot is labeled by the timestep. On the horizontal axis we record the sensitivity of the model at each timestep, and on the vertical axis we record the specificity

In short, the ROC curve helps us find the best threshold value for our model.
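
A minimal sketch of this threshold sweep, assuming scikit-learn and some made-up predicted probabilities (roc_curve reports the false positive rate, so specificity is obtained as 1 - FPR):

```python
from sklearn.metrics import roc_curve

y_true   = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.85, 0.9]  # predicted P(class = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for t, sens, f in zip(thresholds, tpr, fpr):
    print(f"threshold={t:.2f}  sensitivity={sens:.2f}  specificity={1 - f:.2f}")
```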

AUC (Area Under The ROC Curve) -:

AUC measures the capability of a classifier to distinguish between classes. It is the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1).

Below in fig-16, we can see three models, in which the prediction is given by the horizontal axis (from 0 to 1). At the bottom, you can see the three corresponding ROC curves. Each square has size 0.2 by 0.2. The numbers of squares under the curves are 13, 18, and 25, which amount to areas under the curve of 0.52, 0.72, and 1. Note that the best a model can do is an AUC of 1, which corresponds to the model on the right. The worst a useful model can do is an AUC of 0.5, because this means the model is as good as random guessing; this corresponds to the model on the left. (An AUC below 0.5 would mean the model does worse than random guessing, in which case flipping its predictions gives an AUC above 0.5.) The model in the middle is our original model, with an AUC of 0.72.

Fig-16: In this figure, we can see that AUC, or area under the curve, is a good metric to determine how good a model is. The higher the AUC, the better the model. On the left, we have a bad model with an AUC of 0.52. In the middle, we have a good model with an AUC of 0.72. On the right, we have a great model with an AUC of 1.
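
For the same made-up scores as in the ROC sketch above, the AUC comes out of scikit-learn in one line (the value is only meaningful for those toy numbers):

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.85, 0.9]

print(roc_auc_score(y_true, y_scores))  # ~0.83 for these toy scores; closer to 1 is better
```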

Similarly, ROC and AUC go hand in hand. In the diagram below (fig-17), we can see that if we want our model to have high sensitivity, we just push the threshold to the left (i.e., decrease it) until we get to a point on the curve that has as much sensitivity as we want. Note that the model may lose some specificity, and that’s the price we pay. In contrast, if we want higher specificity, we push the threshold to the right (i.e., increase it) until we get to a point on the curve that has as much specificity as we want. Again, we lose some sensitivity in the process. The curve tells us exactly how much of one we gain and lose, so as data scientists, this is a great tool to help us decide the best threshold for our model.

Fig-17: The parallel between the threshold of the model and its ROC. The model on the left has a high threshold, low sensitivity, and high specificity. The model in the middle has medium values for threshold, sensitivity, and specificity. The model on the right has a low threshold, high sensitivity, and low specificity.

Refer to the diagram below (fig-18), which contrasts high specificity with high sensitivity.

Fig-18: In this more general scenario, we can see an ROC curve and three points on it corresponding to three different thresholds. If we want to pick a threshold that gives us high specificity, we pick the one on the left. For a model with high sensitivity, we pick the one on the right. If we want a model that has a good amount of both sensitivity and specificity, we pick the one in the middle.

If we need a high sensitivity model, such as the coronavirus model, we would pick the point on the right. If we need a high specificity model, such as the spam-detection model, we may pick the point on the left. However, if we want relatively high sensitivity and specificity, we may go for the point in the middle. It’s our responsibility as data scientists to know the problem well enough to make this decision properly.
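
One hedged way to turn this decision into code is to sweep the ROC output for the highest threshold that still reaches a target sensitivity, accepting whatever specificity comes with it (the target value, labels and variable names below are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true   = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.85, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
target_sensitivity = 0.9

# thresholds are returned in decreasing order and tpr is non-decreasing, so the
# first index meeting the target is the highest threshold that is sensitive enough
idx = int(np.argmax(tpr >= target_sensitivity))
print(f"threshold={thresholds[idx]:.2f}  "
      f"sensitivity={tpr[idx]:.2f}  specificity={1 - fpr[idx]:.2f}")
```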

So finally we are done; let’s wind up with the summary for this article.

Summary -:

  • Confusion matrix at a bird’s-eye view →
Fig-19: The top row of the confusion matrix gives us recall and sensitivity: the ratio between the number of true positives and the sum of true positives and false negatives. The leftmost column gives us precision: the ratio between the number of true positives and the sum of true positives and false positives. The bottom row gives us specificity: the ratio between the number of true negatives and the sum of true negatives and false positives.
  • Formulas →
Fig-20: Evaluation measures. Note that some measures are known by more than one name. TP, TN, FP, P, and N refer to the number of true positive, true negative, false positive, positive, and negative samples, respectively.

We are done. Thanks for your time. For any suggestions on this article, please reach out to me at ‘arijitbarat11m@gmail.com’ and please follow me on Medium.

Happy learning. 😃.
