Evaluating models using the Top N accuracy metrics

Rushabh Nagda · Published in NanoNets · Nov 8, 2019 · 2 min read

Often, while building machine learning models, we focus on accuracy metrics, trying to get the right class for an image or the right category for a paragraph of text. But if these tasks are measured only on the accuracy of the single highest-probability prediction, that limits our understanding of the network and the range of problems it can be applied to.

Before getting into the details, let’s define the two terms.

Top-1 accuracy — As the name suggests, in an image classification problem, you take the class with the maximum value in your final softmax output — the model’s single most confident prediction — and count the prediction as correct only if that class matches the true label.

Top-N accuracy — Top-N accuracy measures how often the true class falls among the N highest values of your softmax distribution.
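In code, the two metrics differ only in how many of the top-scoring classes you check. Here is a minimal NumPy sketch; the function name and array shapes are my own, not from the original article:

```python
import numpy as np

def top_n_accuracy(probs, true_labels, n):
    """probs: (num_samples, num_classes) softmax outputs.
    true_labels: (num_samples,) integer class ids.
    Returns the fraction of samples whose true label is among
    the n highest-scoring classes."""
    # Indices of the n largest scores in each row; their internal
    # order doesn't matter for the metric.
    top_n = np.argsort(probs, axis=1)[:, -n:]
    hits = np.any(top_n == true_labels[:, None], axis=1)
    return hits.mean()
```

Top-1 accuracy is then just `top_n_accuracy(probs, labels, 1)`.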

Let’s take an example:

Say you have an image classification model with 5 classes — dog, cat, giraffe, mouse and bug. You test it on 5 images, recording the true label and the model’s top 3 predictions for each.

Here, the top-1 prediction gets it right 3 out of 5 times, but the true label turns up in the top 3 predictions 4 out of 5 times.

Therefore,

Top-1 accuracy — 60%

Top-3 accuracy — 80%
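The arithmetic is easy to check with a toy script. The per-image predictions below are invented for illustration, chosen only so that they reproduce the 60% and 80% figures above:

```python
# Hypothetical per-image results for the 5-class model: invented for
# illustration, chosen to match the accuracies stated above.
true_labels = ["dog", "cat", "giraffe", "mouse", "bug"]
top3_preds = [
    ["dog", "cat", "bug"],        # top-1 hit
    ["cat", "mouse", "dog"],      # top-1 hit
    ["mouse", "giraffe", "dog"],  # top-1 miss, top-3 hit
    ["mouse", "bug", "cat"],      # top-1 hit
    ["dog", "mouse", "giraffe"],  # miss in both
]

top1 = sum(t == p[0] for t, p in zip(true_labels, top3_preds)) / len(true_labels)
top3 = sum(t in p for t, p in zip(true_labels, top3_preds)) / len(true_labels)
print(f"Top-1 accuracy: {top1:.0%}")  # Top-1 accuracy: 60%
print(f"Top-3 accuracy: {top3:.0%}")  # Top-3 accuracy: 80%
```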

Let’s walk through a use case to understand why you might need top-N accuracy instead of just top-1 accuracy.

Imagine you are trying to build a recommender system for your e-commerce app. Earlier, recommendations would mostly be based on whichever products had been popular over the past years, months, or days. As businesses gather more data, there is a growing need to put that data to use and generate value from it. Your engine can now surface products to buy based on many factors: your past behaviour, the categories of products you have bought before, their price range, their vendors, and so on.

More often than not, people end up buying not your top recommendation but something that turns up later in the list, and in some cases something that doesn’t appear on the list at all. The accuracy paradox shows up here: a model with lower headline accuracy can have greater predictive power than a model with higher accuracy. A person browsing an e-commerce platform is not just looking for the single best bet; they are looking for diversity, novelty, coverage, serendipity and relevance.

In such a situation, besides tweaking algorithms to take into account the different ways a customer experiences and interacts with the platform, a safer metric to evaluate the performance of your model would be a top-N accuracy.
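In the recommendation setting, the same idea is often called the hit rate at N: did the item the customer actually bought appear anywhere in the first N recommendations? A minimal sketch with made-up users and items follows; the helper name and data are illustrative, not from any particular library:

```python
def hit_rate_at_n(recommendations, purchases, n):
    """recommendations: user -> ranked list of recommended items.
    purchases: user -> the item that user actually bought.
    Returns the fraction of users whose purchase appears in the
    first n recommendations (top-N accuracy for a recommender)."""
    hits = sum(purchases[u] in recommendations[u][:n] for u in purchases)
    return hits / len(purchases)

# Made-up data: two users, four ranked recommendations each.
recs = {"u1": ["shoes", "socks", "belt", "hat"],
        "u2": ["phone", "case", "charger", "cable"]}
bought = {"u1": "belt", "u2": "phone"}

print(hit_rate_at_n(recs, bought, 1))  # 0.5 (only u2's top pick matched)
print(hit_rate_at_n(recs, bought, 3))  # 1.0 (both purchases are in the top 3)
```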
