Evaluation metrics and error analysis in ML projects

Overfitted Cat
4 min read · Jan 6, 2022


In this blog post, we continue our journey through machine learning projects. Having understood the data and the problem from the last blog, our unfortunate hero John can now do what is right. We are going to explore evaluation metrics and basic error analysis.

Set an evaluation metric

What does it mean to have 99% accuracy? Most people don't have an intuitive grasp of probabilities. We, as experts, need to step up as authorities and earn stakeholders' trust by communicating what is possible and what is not. Most people aren't interested in your algorithm choice. Did you use the latest neural network or plain linear regression to achieve the goal? Customers probably won't care. They want something that works, at least most of the time :)

Is accuracy the right choice as the goal metric? It might be, it might not, but one thing is sure: it is one of the easiest metrics to understand. Sometimes it can be misleading, and some other metric would be a better fit. Whatever you choose as your evaluation metric, make sure everyone understands what it means.
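To make the "accuracy can be misleading" point concrete, here is a small illustrative sketch (mine, not from the original post, and the numbers are invented): on a heavily imbalanced dataset, a model that always predicts the majority class scores 99% accuracy while catching nothing.

```python
# Illustrative sketch: accuracy on a heavily imbalanced dataset.
# A "model" that always predicts the majority class looks great on
# accuracy while missing every positive case.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [1] * 10 + [0] * 990   # only 1% of examples are positive
y_pred = [0] * 1000             # always predict the majority class

print(accuracy_score(y_true, y_pred))                 # 0.99 -> looks impressive
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0  -> misses every positive
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```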

Note: Always make algorithm decisions based on the dev set and report metrics from the test set. You want to reduce the chance of overfitting your test set.

It is worth having a single, straightforward evaluation metric. It is easier to reason about and to compare between approaches. You can always combine several metrics into one, but be prepared to explain the combination to someone else; it can cause a lot of confusion and headaches. Sometimes an extra constraint can help you choose between similarly performing algorithms. For example, you might want accuracy as high as possible within a 100 ms inference time. A single clear target will speed up progress for sure.
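Here is a minimal sketch of what that rule looks like in practice (the candidate models, their numbers, and the 100 ms budget are invented for illustration): treat latency as a hard constraint, optimize accuracy among the models that meet it, make the choice on the dev set, and save the test set for the final report.

```python
# Hypothetical candidates with their dev-set accuracy and measured latency.
candidates = [
    {"name": "logreg",    "dev_accuracy": 0.88, "latency_ms": 5},
    {"name": "small_cnn", "dev_accuracy": 0.93, "latency_ms": 60},
    {"name": "big_cnn",   "dev_accuracy": 0.95, "latency_ms": 240},  # over budget
]

LATENCY_BUDGET_MS = 100  # constraint that must be satisfied, not maximized

# Keep only models that meet the constraint, then optimize accuracy among them.
eligible = [m for m in candidates if m["latency_ms"] <= LATENCY_BUDGET_MS]
best = max(eligible, key=lambda m: m["dev_accuracy"])

print(best["name"])  # small_cnn: big_cnn is more accurate but too slow
# Only the winner would now be evaluated once on the test set and reported.
```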

There will probably be times when the chosen metric no longer points us in the right direction. It is no sin to change it.

Error analysis

Remember these lines from the first blog?

“I thought you fixed this case…”
“It doesn’t work here…”

These reflect a misunderstanding of how machine learning works. Sometimes you must live with uncertainty. You can't go around fixing cases one by one as they come in; you'll waste everyone's time.
With most machine learning algorithms, finding and reasoning about a single instance that doesn't work can be tedious. What matters more is how often it happens, and how tolerable that frequency is. The world of machine learning (like computer science, and everything else in life) is a world of trade-offs, which ties back to the previous section about choosing your metric. If you start fixing every single issue, you'll end up wandering without a target. Exploring every idea and possibility would take a lot of time, and, more importantly, you wouldn't know whether you are actually improving anything; you might just make things worse for other instances. This is where error analysis comes to the rescue.

In simple terms, error analysis is the process of looking at the misclassified examples in your dev set. Take a sample of misclassified examples from the dev set and manually go through it. The point is to find recurring errors and group them by cause. The percentage of errors in each group tells us how much the overall system can improve if we fix that cause.
For example, suppose an image classifier has an accuracy of 90%, so 10% of examples are errors. After performing an error analysis, dark images contribute 40% of all errors in the sample, and mislabeled images account for 6%. By fixing the mislabeled examples, we can gain at most 0.6 percentage points, for a total accuracy of 90.6%; not a huge impact in most cases. Improving performance on darker images would buy us more: at most 4 percentage points, for a ceiling of 94%.
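As a sketch of how that tally might look in code (the causes and counts below are the ones from the hypothetical example above, assigned by hand while reviewing the sample), count how often each cause appears and translate its share into a ceiling on accuracy improvement:

```python
from collections import Counter

# One hand-assigned tag per misclassified example in the reviewed sample.
tagged_errors = (
    ["dark image"] * 40 + ["mislabeled"] * 6 + ["blurry"] * 24 + ["other"] * 30
)

overall_error_rate = 0.10  # the classifier is 90% accurate, so 10% errors
counts = Counter(tagged_errors)

for cause, n in counts.most_common():
    share = n / len(tagged_errors)          # fraction of errors with this cause
    ceiling = share * overall_error_rate    # best-case accuracy gain if fully fixed
    print(f"{cause:12s} {share:4.0%} of errors -> at most +{ceiling:.1%} accuracy")
```

Sorting causes by their ceiling tells you where effort can actually pay off before you invest in a fix.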

Error analysis is a powerful tool that forces us to understand the data and the model. It can point us toward the right next idea for improvement.

Endnote

This blog post explored setting an evaluation metric and performing error analysis. The evaluation metric gives us a better way of selecting a model, and error analysis points us down the right path during research. However, one of the most effective validations of your model is an A/B test. Our model can be 99% accurate on the test set and still fail on live data due to distribution shift, simple overfitting, the angle of the sun's rays, and so on. An A/B test provides a business-metric evaluation that tells us the real impact of our ML model. Let's talk about A/B tests some time in the future.

You can find more details in the book Machine Learning Yearning by Andrew Ng.
