The ranking of ML model performance changed after removing 14,000 label errors from ImageNet

The importance of managing dataset quality

Kenichi Higuchi
5 min read · Aug 23, 2022

Hello, I’m Kenichi, an engineer at Adansons Inc.

At Adansons, we are developing a product that simplifies the management of unstructured data and metadata and enables more detailed interpretation and evaluation of the performance and characteristics of trained AI models.

In this article, I would like to show why managing dataset quality matters. You may remember that label errors in ImageNet were reported a while ago and became a hot topic.

Since I sometimes use ImageNet as a benchmark and rely on models pre-trained on it, I was concerned that the results I had obtained so far could change depending on data quality.

Here, I want to share my attempt to exclude from ImageNet the error data reported in this paper [1] and to re-evaluate the models published in torchvision.

Removing error data from ImageNet and re-evaluating the models

There are three kinds of label errors:

(1) mislabeled data
(2) data that corresponds to multiple labels
(3) data that does not belong to any label

14,000 errors is a lot! Considering that the evaluation set contains 50,000 images, a high percentage of the data is erroneous. Some examples of the actual error data are shown below.

Method

In this verification, we re-evaluated model accuracy under two conditions: excluding only the mislabeled data (1), and excluding all error data (1)–(3), from the evaluation set. (We did not re-train the models.)

To remove the error data, we use a metadata file that describes the label-error information. In this file, if an image contains any of the errors (1)–(3), that information is recorded in its “correction” attribute.

We use a tool we developed in-house called Adansons Base, which filters datasets by linking them to metadata. For details, please see the NOTEBOOK below, which summarizes this verification.
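For readers who want to follow the idea without Adansons Base, here is a minimal sketch of the filtering step in plain Python. The metadata file name, its JSON structure, and the values of the “correction” field are assumptions for illustration only; the actual metadata from [1] is distributed in a different format.

```python
import json
from pathlib import Path

# Hypothetical metadata file mapping each validation image file name to its
# label-error information, e.g.
#   {"ILSVRC2012_val_00000001.JPEG": {"correction": "mislabeled"}, ...}
# The file name and schema are assumptions for illustration.
with open("imagenet_val_label_errors.json") as f:
    corrections = json.load(f)

def keep_for_eval(filename: str, exclude_all_errors: bool) -> bool:
    """Return True if the image should stay in the evaluation set."""
    info = corrections.get(filename)
    if info is None or "correction" not in info:
        return True                                # no reported error -> keep
    if exclude_all_errors:
        return False                               # drop errors (1)-(3)
    return info["correction"] != "mislabeled"      # drop only error (1)

val_dir = Path("imagenet/val")
all_images = sorted(val_dir.rglob("*.JPEG"))

except_mislabeled = [p for p in all_images if keep_for_eval(p.name, False)]
except_all_errors = [p for p in all_images if keep_for_eval(p.name, True)]
print(len(all_images), len(except_mislabeled), len(except_all_errors))
```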

I tested the following 10 models.

10 image classification models for test
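As a rough sketch of the re-evaluation itself (not the article’s actual code), the snippet below measures top-1 accuracy of pretrained torchvision classifiers on a filtered subset of the validation set. The directory path and batch size are placeholders, only three of the ten models are listed for brevity, and `keep_indices` stands in for the result of the metadata filtering shown earlier.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import models, transforms
from torchvision.datasets import ImageFolder

# Standard ImageNet preprocessing for the pretrained torchvision weights
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Validation images arranged in class subfolders (path is a placeholder)
val_set = ImageFolder("imagenet/val", transform=preprocess)

# Indices surviving the metadata filtering; here just a placeholder
keep_indices = list(range(len(val_set)))
loader = DataLoader(Subset(val_set, keep_indices), batch_size=64, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def top1_accuracy(model):
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total

# Three of torchvision's pretrained classifiers (the article tests ten)
for name, builder in [("resnet50", models.resnet50),
                      ("vgg16", models.vgg16),
                      ("densenet121", models.densenet121)]:
    print(f"{name}: {top1_accuracy(builder(pretrained=True)):.3f}%")
```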

Result

The results are summarized in the table below. (Values are top-1 accuracy in %; ranks in parentheses.)

Results of 10 classification models

Using the full evaluation data (“All Eval data”) as a baseline, accuracy improved by an average of 3.122 points for “Except mislabeled data” (excluding error type (1)) and by an average of 11.743 points for “Except all error data” (excluding error types (1)–(3)).

Not surprisingly, accuracy improved across the board once the error data, which models are more likely to get wrong than clean data, was excluded.

Notably, the accuracy ranking of the models changes between evaluation on the full data and evaluation with all error types (1)–(3) excluded.

In fact, the 3,670 mislabeled images of type (1) represent 7.34% of the 50,000 evaluation images, yet the average accuracy gain from removing them is only about 3.1 points. If the models had gotten every mislabeled image wrong, removing those images would have raised accuracy considerably more, so the smaller gain implies that the models often reproduced the erroneous labels. Although the accuracies cannot be compared directly because the population size changes, we can estimate that, on average, the models predicted the wrong (given) label for roughly 1,550 of the 3,670 mislabeled images, nearly half of them, suggesting they have to some extent learned to reproduce the label errors.
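As a back-of-the-envelope check of that figure, the following calculation recovers the rough number of mislabeled images whose wrong labels the models matched. The average baseline accuracy of about 81.7% is my assumption for illustration, not a number stated in the text; the per-model values are in the table above.

```python
# Back-of-the-envelope check of the "roughly 1,550 images" estimate.
# ASSUMPTION (not from the article): average baseline top-1 accuracy ~81.7%.
n_total      = 50_000   # full evaluation set
n_mislabeled = 3_670    # error type (1)
baseline_acc = 0.817    # assumed average accuracy on the full set
gain         = 0.03122  # average improvement after removing type (1)

n_remaining  = n_total - n_mislabeled
correct_full = baseline_acc * n_total
# Accuracy on the reduced set = (correct_full - wrong_label_hits) / n_remaining,
# so solve for wrong_label_hits:
wrong_label_hits = correct_full - (baseline_acc + gain) * n_remaining
print(round(wrong_label_hits))  # -> 1552, i.e. roughly 1,550 images
```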

Conclusion

Although we did not verify this here, it goes without saying that it is important to use accurately labeled data not only for evaluation but also for training.

It is possible that previous studies drew incorrect conclusions when comparing accuracies among models. The data is supposed to be evaluation data, but can it really be used to evaluate model performance?

It seems to me that much deep learning work neglects careful examination of the data and instead tries to improve accuracy and other evaluation metrics through the expressiveness of the model. However, there is no point in a model accurately reproducing the erroneous labels contained in the evaluation data.

Proper evaluation of the model requires the use of accurate, high-quality datasets.

Especially when we build our own datasets, such as when applying AI in business, creating a high-quality dataset directly improves the accuracy and reliability of the AI. The results of this verification show that simply improving data quality raised accuracy by roughly 10 percentage points, which indicates that when developing an AI system it is important to improve not only the model but also the dataset.

However, maintaining dataset quality is not easy. While it is important to enrich metadata in order to properly assess the quality of AI models and data, metadata can be cumbersome to manage, especially for unstructured data.

Therefore, I am working on developing tools that make this easier.

Check out GitHub if you are interested.

References

[1] Beyer, L., Hénaff, O. J., Kolesnikov, A., Zhai, X., & van den Oord, A. (2020). Are we done with ImageNet? arXiv preprint arXiv:2006.07159.


Kenichi Higuchi

PdM & Engineer & Director at Adansons Inc. / Medical Student at Tohoku Univ. / pursuit next gen. of AI / 1st product → https://adansons.wraptas.site