
Why Accuracy is Troublesome, Even With Balanced Classes

Accuracy is overrated


Accuracy is a go-to metric because it’s highly interpretable and cheap to evaluate. For this reason accuracy, perhaps the simplest of machine learning metrics, is (rightfully) commonplace. However, many people are too comfortable with it.

Being aware of the limitations of accuracy is essential.

It is well known that accuracy is misleading on unbalanced datasets. Consider, for instance, a medical dataset in which the majority of people (let’s say 95%) do not have condition x and the remaining 5% do.

Since machine learning models are always looking for the easy way out, especially when an L2 penalty is used (which penalizes small errors proportionately less than large ones), the model can comfortably reach 95% accuracy simply by predicting that no input has condition x.
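A minimal sketch of this failure mode, assuming a hypothetical dataset of 1,000 patients with 5% prevalence (the labels and the constant predictor are illustrative and do not come from any real dataset):

```python
# Hypothetical dataset: 1,000 patients, 50 of whom (5%) have condition x (label 1).
labels = [1] * 50 + [0] * 950

# A degenerate "model" that predicts "no condition" for every patient.
predictions = [0] * len(labels)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

majority_accuracy = accuracy(labels, predictions)
print(f"accuracy = {majority_accuracy:.2f}")  # accuracy = 0.95
```

The predictor never identifies a single sick patient, yet it scores 95%.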

The usual remedy is a metric that accounts for the class imbalance and compensates for the minority class’s lack of quantity with a boost in importance, such as the F1 score or balanced accuracy.
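As a rough illustration, the sketch below scores an always-negative predictor on a hypothetical 95/5 dataset with both metrics. The helper names are made up for this example, and the convention of returning 0 when a denominator is zero is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn)."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical data: 5% positives, and a predictor that always says "negative".
labels = [1] * 50 + [0] * 950
all_negative = [0] * len(labels)

bal_acc = balanced_accuracy(labels, all_negative)
f1 = f1_score(labels, all_negative)
print(bal_acc, f1)  # 0.5 0.0
```

Both metrics expose the predictor that plain accuracy rewarded with 95%.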

This common critique, however, doesn’t address the other limitations of accuracy: problems that persist even when the classes are balanced.

It is widely accepted that training/testing and deployment of a model should be kept separate. More specifically, the former should be statistical, and the latter should be decision-based. However, there is nothing statistical about turning the outputs of machine learning models, which are (almost) always probabilistic, into decisions and then evaluating the model’s statistical quality on that converted output.

Take a look at the outputs of two machine learning models: should they really receive the same score? Moreover, the same problem remains even if one tries to remedy accuracy with other decision-based metrics, such as the commonly prescribed specificity/sensitivity or F1 score.

[Figure: outputs of Model 1 and Model 2. Image created by author.]

Model 2 is far less confident in its results than Model 1 is, but both receive the same accuracy. Accuracy is not a proper scoring rule, and hence it is deceptive in an inherently probabilistic environment.
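To make the contrast concrete, here is a small sketch with invented probabilities for two models (these numbers are hypothetical and are not the values from the figure). Both models land on the correct side of 0.5 for every sample, so accuracy cannot tell them apart, while a probability-based score such as the Brier score can.

```python
# Hypothetical predicted probabilities of class 1 for six samples.
labels = [1, 1, 1, 0, 0, 0]
model_1_probs = [0.95, 0.90, 0.92, 0.05, 0.10, 0.08]  # confident
model_2_probs = [0.55, 0.51, 0.60, 0.45, 0.49, 0.40]  # barely past 0.5

def accuracy(y_true, probs, threshold=0.5):
    """Threshold the probabilities, then count matching decisions."""
    decisions = [1 if p >= threshold else 0 for p in probs]
    return sum(1 for d, y in zip(decisions, y_true) if d == y) / len(y_true)

def brier_score(y_true, probs):
    """Mean squared difference between probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

acc_1, acc_2 = accuracy(labels, model_1_probs), accuracy(labels, model_2_probs)
brier_1, brier_2 = brier_score(labels, model_1_probs), brier_score(labels, model_2_probs)

print(f"Model 1: accuracy={acc_1:.2f}, Brier={brier_1:.4f}")
print(f"Model 2: accuracy={acc_2:.2f}, Brier={brier_2:.4f}")
```

With these made-up numbers the accuracies are identical, while Model 2’s Brier score is far worse (higher) than Model 1’s.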

While accuracy can be used in the final presentation of a model, it says nothing about the model’s confidence: whether it actually knew the class for most of the training samples or was merely lucky to land on the right side of the 0.5 threshold.

The threshold itself is also problematic. How can a reliable loss function, the guiding light that shows the model what is right and what is wrong, flip its verdict completely when the output probability shifts by a tiny amount? If a training sample with label ‘1’ received predictions of 0.51 and 0.49 from Model 1 and Model 2, respectively, is it fair that Model 2 receives the full penalty while Model 1 receives none? Thresholds, while necessary for decision-making in a physically deterministic world, are too sensitive and hence inappropriate for training and testing.
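The 0.51 versus 0.49 example can be checked directly. The sketch below compares a thresholded zero-one loss with the log loss for a sample whose true label is 1; the function names are chosen for this illustration only.

```python
import math

def zero_one_loss(y, p, threshold=0.5):
    """1 if the thresholded decision is wrong, else 0."""
    decision = 1 if p >= threshold else 0
    return 0 if decision == y else 1

def log_loss(y, p):
    """Negative log-probability assigned to the true label (0 < p < 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y = 1
for p in (0.51, 0.49):
    print(f"p={p}: zero-one={zero_one_loss(y, p)}, log loss={log_loss(y, p):.4f}")

zero_one_gap = zero_one_loss(y, 0.49) - zero_one_loss(y, 0.51)
log_loss_gap = log_loss(y, 0.49) - log_loss(y, 0.51)
```

The zero-one loss jumps from 0 to its maximum of 1 across the threshold, whereas the log loss of the two predictions differs by only about 0.04.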

Speaking of thresholds, consider this. You are creating a machine learning model to decide if a patient should receive a very invasive and painful surgical treatment. Where do you set the threshold for recommending it? Instinctively, most likely not at a default 0.5, but at some higher probability: the patient is subjected to this treatment if, and only if, the model is absolutely sure. On the other hand, if the treatment is something less serious, like an aspirin, a lower threshold is acceptable.
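One common way to formalize this intuition, offered here as an assumption rather than as the article’s own method, is to pick the threshold that minimizes expected cost. If wrongly treating costs `cost_fp` and wrongly withholding treatment costs `cost_fn`, treating is the cheaper choice whenever the predicted probability exceeds `cost_fp / (cost_fp + cost_fn)`. The cost values below are invented for illustration.

```python
def cost_based_threshold(cost_fp, cost_fn):
    """Probability above which treating has lower expected cost than not treating.

    Expected cost of treating:     (1 - p) * cost_fp
    Expected cost of not treating: p * cost_fn
    Treat when (1 - p) * cost_fp < p * cost_fn, i.e. p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)

def recommend(p, threshold):
    """Recommend the treatment only if the probability exceeds the threshold."""
    return p > threshold

# Hypothetical costs: an unnecessary surgery is far worse than a missed one,
# while an unnecessary aspirin is cheap compared with going untreated.
surgery_threshold = cost_based_threshold(cost_fp=9, cost_fn=1)
aspirin_threshold = cost_based_threshold(cost_fp=1, cost_fn=4)

print(surgery_threshold, aspirin_threshold)  # 0.9 0.2
print(recommend(0.7, surgery_threshold), recommend(0.7, aspirin_threshold))  # False True
```

The same predicted probability of 0.7 leads to opposite recommendations once the consequences differ, which is why the threshold belongs to the decision rather than to the model’s evaluation.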

The consequences of a decision dictate the threshold for making it. This idea, hard-coding morality and human feeling into a machine learning model, is difficult to think about. One may be inclined to argue that, over time and under suitably balanced circumstances, the model will automatically shift its output probability distribution to suit a 0.5 threshold, and that manually adding a different threshold tampers with the model’s learning.

The rebuttal is to avoid decision-based scoring functions in the first place and to hard-code no threshold at all, 0.5 included. This way, the model learns not to cheat by taking the easy way out through artificially constructed continuous-to-discrete conversions, but to maximize the probability it assigns to the correct answers.

Whenever a threshold is introduced into the naturally probabilistic and fluid world of machine learning algorithms, it causes more problems than it fixes.

Loss functions that treat probability as the continuous quantity it is, rather than as discrete buckets, are the way to go.

What are some better, probability-based, and more informative metrics to use for honestly evaluating a model’s performance?

  • Brier score
  • Log score (the logarithmic scoring rule)
  • Cross-entropy (log loss; for class labels, the negative of the average log score)
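The metrics listed above can be sketched in a few lines for the binary case. This is a minimal illustration, not a reference implementation: the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the sample data is invented. For binary labels, the cross-entropy is the negative of the average log score, so the two carry the same information.

```python
import math

def brier_score(y_true, probs):
    """Mean squared error between predicted probability and outcome (0 is perfect)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

def cross_entropy(y_true, probs, eps=1e-15):
    """Average negative log-probability of the true label (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so that log(0) never occurs
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def log_score(y_true, probs):
    """Average log-probability of the true label (higher is better)."""
    return -cross_entropy(y_true, probs)

# Hypothetical predictions; the last one is confidently wrong.
labels = [1, 0, 1, 1]
probs = [0.9, 0.2, 0.6, 0.01]

print(f"Brier score:   {brier_score(labels, probs):.4f}")
print(f"Cross-entropy: {cross_entropy(labels, probs):.4f}")
print(f"Log score:     {log_score(labels, probs):.4f}")
```

Unlike accuracy, all three respond to how far each probability sits from the true outcome, and the log-based scores punish the confidently wrong last prediction especially hard.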

In the end, accuracy is an important and permanent part of the metrics family. But for those who decide to use it: understand that accuracy’s interpretability and simplicity come at a heavy cost.

