Confusion Matrix: Let's clear this confusion.

Aatish Kayyath
4 min read · Oct 28, 2021


In the field of Machine Learning, a confusion matrix (also called an error matrix) is used to visualize the performance of a supervised ML algorithm, typically a classification algorithm. Let's understand more about this below.

A confusion matrix has four cells (a small code sketch follows the list below):
  • TP (True Positive): the observation is positive and is correctly predicted as positive
  • FN (False Negative): the observation is positive but is incorrectly predicted as negative
  • FP (False Positive): the observation is negative but is incorrectly predicted as positive
  • TN (True Negative): the observation is negative and is correctly predicted as negative
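
Here is a minimal sketch of how these four counts are obtained, assuming scikit-learn is available and using made-up labels purely for illustration:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# confusion_matrix returns rows = actual class, columns = predicted class
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=1, TN=3
```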

Classification Rate / Accuracy

Accuracy is the number of correctly classified points divided by the total number of points: (TP + TN) / (TP + TN + FP + FN). However, accuracy has problems: it assumes equal costs for both kinds of errors, so a 99% accuracy can be excellent, good, mediocre, poor or terrible depending on the problem. For example, on a dataset where only 1% of the examples are positive, a model that always predicts negative is 99% accurate yet never detects a single positive case.
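
A small sketch of this, reusing the hypothetical counts from above plus a made-up imbalanced example:

```python
# Accuracy from the four confusion-matrix counts
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=3, fn=1, fp=1, tn=3))      # 0.75 on the balanced toy data above

# Pitfall: 1% positive class, model always predicts negative
print(accuracy(tp=0, fn=10, fp=0, tn=990))   # 0.99, yet it never finds a positive
```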

Recall

Recall (also called sensitivity) is the number of correctly classified positive examples divided by the total number of positive examples: TP / (TP + FN). High recall indicates that the positive class is correctly recognized (a small number of FN). Recall tells you how many of the actual positive data points the model manages to find; for example, if recall is 75%, the model detects only 75% of the actual positive cases.
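
In code, recall is simply TP / (TP + FN); a tiny sketch with the hypothetical counts from earlier:

```python
def recall(tp, fn):
    # Fraction of actual positives that the model found
    return tp / (tp + fn)

print(recall(tp=3, fn=1))  # 0.75 -> detects 75% of the actual positive cases
```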

Precision

To get precision, we divide the number of correctly classified positive examples by the total number of predicted positive examples: TP / (TP + FP). High precision indicates that an example labelled as positive is indeed positive (a small number of FP). Precision tells you what fraction of the model's positive predictions are actually correct; for example, if precision is 72%, only 72% of its positive predictions are right.
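
Similarly, precision is TP / (TP + FP); another small sketch with hypothetical counts:

```python
def precision(tp, fp):
    # Fraction of positive predictions that are actually positive
    return tp / (tp + fp)

print(precision(tp=18, fp=7))  # 0.72 -> 72% of positive predictions are correct
```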

High recall, low precision: most of the positive examples are correctly recognized (low FN), but there are a lot of false positives, meaning many negative examples are also being labelled as positive.
Low recall, high precision: we miss a lot of positive examples (high FN), but those we do predict as positive are indeed positive (low FP).

Precision/Recall Trade-Off:

For a particular data point, the classifier computes a score based on its decision function; if that score is greater than a given threshold, the point is assigned to the positive class, otherwise to the negative class. When you increase the threshold, precision generally increases but recall decreases, because you start missing some positive examples.

When you decrease the threshold, you will cover nearly all the true positives, but you will also classify some negative examples as positive, decreasing precision and increasing recall. (In rare edge cases, precision can briefly dip even as the threshold increases.)
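
A minimal sketch of this trade-off, assuming a scikit-learn classifier with a decision_function; the data and model here are hypothetical and only meant to show the threshold sweep:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Hypothetical data and model purely for illustration
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=42)
clf = LogisticRegression().fit(X, y)
scores = clf.decision_function(X)

# Sweep the decision threshold: higher threshold -> higher precision, lower recall
for threshold in (-1.0, 0.0, 1.0):
    y_pred = (scores > threshold).astype(int)
    p = precision_score(y, y_pred)
    r = recall_score(y, y_pred)
    print(f"threshold={threshold:+.1f}  precision={p:.2f}  recall={r:.2f}")
```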

Precision/Recall Curve:
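
The precision/recall curve plots precision against recall for every possible threshold. A minimal sketch of plotting it, reusing the hypothetical y and scores from the threshold snippet above and assuming matplotlib is available:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# y and scores come from the threshold sketch above
precisions, recalls, thresholds = precision_recall_curve(y, scores)

plt.plot(recalls, precisions)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision/Recall curve")
plt.show()
```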

F-Measure

The F-score, also called the F1-score, is a measure of a model's accuracy on a dataset. It is used to evaluate binary classification systems, which classify examples as 'positive' or 'negative'.
The F-score combines the precision and recall of the model, and is defined as the harmonic mean of the model's precision and recall.
The F-score is commonly used for evaluating information retrieval systems such as search engines, and also for many kinds of machine learning models, in particular in natural language processing.
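
As a formula, F1 = 2 · precision · recall / (precision + recall). A small sketch using the hypothetical precision and recall values from earlier:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1_score(precision=0.72, recall=0.75))  # ~0.735
```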

ROC Curve

The receiver operating characteristic (ROC) curve is another way of evaluating a classifier.
The ROC curve plots recall (the true positive rate) against the false positive rate (the fraction of negative instances that are incorrectly classified as positive).

A good ROC curve bends towards the top-left corner.

One way to compare classifiers is to measure the AUC (area under the curve). A perfect classifier has an AUC of 1, whereas a random classifier has an AUC of 0.5; if a model's AUC is below 0.5, it is doing worse than random guessing.
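
A minimal sketch of plotting the ROC curve and computing AUC, again reusing the hypothetical y and scores from the threshold example and assuming scikit-learn and matplotlib:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# y and scores come from the threshold sketch above
fpr, tpr, _ = roc_curve(y, scores)          # false positive rate vs recall (TPR)
print("AUC:", roc_auc_score(y, scores))     # 1.0 = perfect, 0.5 = random

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], "--", label="random classifier")  # AUC = 0.5 baseline
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```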

Since the ROC curve is similar to the PR curve, which one should you use? Use the PR curve whenever the positive class is rare or you care more about false positives than false negatives; otherwise, the ROC curve is a reasonable default.
