The ranking of ML model performance changed after removing 14,000 label errors from ImageNet

The importance of managing dataset quality

Kenichi Higuchi
5 min read · Aug 23, 2022

Hello, I’m Kenichi, an engineer at Adansons Inc.

At Adansons, we are developing a product that simplifies the management of unstructured data and metadata and enables more detailed interpretation and evaluation of the performance and characteristics of trained AI models.

In this article, I would like to show why managing dataset quality matters. You may remember that label errors in ImageNet were reported a while ago and became a hot topic.

Since I sometimes use ImageNet as a benchmark and rely on models pre-trained on it, I was concerned that the results I had obtained so far could change depending on data quality.

Here, I want to share my attempt to exclude from ImageNet the error data reported in this paper [1] and to re-evaluate the models published in torchvision.

Removing error data from ImageNet and re-evaluating the models

There are three kinds of label errors:

(1) mislabeled data
(2) data that corresponds to multiple labels
(3) data that does not belong to any label

14,000 errors is a lot! Considering that the evaluation set contains 50,000 images, a high percentage of the data is erroneous. Some examples of the actual error data are shown below.

Method

In this verification, we re-evaluated model accuracy under two conditions: excluding only the mislabeled data (1), and excluding all error data (1)–(3), from the evaluation set. (We did not re-train the models.)

To remove the error data, we use a metadata file that describes the label-error information. In this file, if an image contains any of the errors (1)–(3), that information is recorded in its “correction” attribute.

We use a tool we developed in-house called Adansons Base, which filters datasets by linking them to metadata. For details, please see the NOTEBOOK below, which summarizes this verification.
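For readers who want to follow the idea without Adansons Base, here is a minimal sketch of the filtering step in plain Python. The metadata file name, its JSON structure, and the values of the “correction” field are assumptions for illustration only; the actual metadata from [1] is distributed in a different format.

```python
import json
from pathlib import Path

# Hypothetical metadata file mapping each validation image file name to its
# label-error information, e.g.
#   {"ILSVRC2012_val_00000001.JPEG": {"correction": "mislabeled"}, ...}
# The file name and schema are assumptions for illustration.
with open("imagenet_val_label_errors.json") as f:
    corrections = json.load(f)

def keep_for_eval(filename: str, exclude_all_errors: bool) -> bool:
    """Return True if the image should stay in the evaluation set."""
    info = corrections.get(filename)
    if info is None or "correction" not in info:
        return True                                # no reported error -> keep
    if exclude_all_errors:
        return False                               # drop errors (1)-(3)
    return info["correction"] != "mislabeled"      # drop only error (1)

val_dir = Path("imagenet/val")
all_images = sorted(val_dir.rglob("*.JPEG"))

except_mislabeled = [p for p in all_images if keep_for_eval(p.name, False)]
except_all_errors = [p for p in all_images if keep_for_eval(p.name, True)]
print(len(all_images), len(except_mislabeled), len(except_all_errors))
```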

I tested the following 10 models.

10 image classification models for test
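As a rough sketch of the re-evaluation itself (not the article’s actual code), the snippet below measures top-1 accuracy of pretrained torchvision classifiers on a filtered subset of the validation set. The directory path and batch size are placeholders, only three of the ten models are listed for brevity, and `keep_indices` stands in for the result of the metadata filtering shown earlier.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import models, transforms
from torchvision.datasets import ImageFolder

# Standard ImageNet preprocessing for the pretrained torchvision weights
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Validation images arranged in class subfolders (path is a placeholder)
val_set = ImageFolder("imagenet/val", transform=preprocess)

# Indices surviving the metadata filtering; here just a placeholder
keep_indices = list(range(len(val_set)))
loader = DataLoader(Subset(val_set, keep_indices), batch_size=64, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"

@torch.no_grad()
def top1_accuracy(model):
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.size(0)
    return 100.0 * correct / total

# Three of torchvision's pretrained classifiers (the article tests ten)
for name, builder in [("resnet50", models.resnet50),
                      ("vgg16", models.vgg16),
                      ("densenet121", models.densenet121)]:
    print(f"{name}: {top1_accuracy(builder(pretrained=True)):.3f}%")
```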

Result

The results are summarized in the table below. (Values are top-1 accuracy in %; ranks in parentheses.)

Results of 10 classification models

Using the full evaluation data (“All Eval data”) as a baseline, accuracy improved by an average of 3.122 points for “Except mislabeled data” (excluding error type (1)) and by an average of 11.743 points for “Except all error data” (excluding error types (1)–(3)).

Not surprisingly, accuracy improved across the board once the error data, which models are more likely to get wrong than clean data, was excluded.

Notably, the accuracy ranking of the models changes between evaluation on the full data and evaluation with all error types (1)–(3) excluded.

In fact, the 3,670 mislabeled images of type (1) represent 7.34% of the 50,000 evaluation images, yet the average accuracy gain from removing them is only about 3.1 points. If the models had gotten every mislabeled image wrong, removing those images would have raised accuracy considerably more, so the smaller gain implies that the models often reproduced the erroneous labels. Although the accuracies cannot be compared directly because the population size changes, we can estimate that, on average, the models predicted the wrong (given) label for roughly 1,550 of the 3,670 mislabeled images, nearly half of them, suggesting they have to some extent learned to reproduce the label errors.
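As a back-of-the-envelope check of that figure, the following calculation recovers the rough number of mislabeled images whose wrong labels the models matched. The average baseline accuracy of about 81.7% is my assumption for illustration, not a number stated in the text; the per-model values are in the table above.

```python
# Back-of-the-envelope check of the "roughly 1,550 images" estimate.
# ASSUMPTION (not from the article): average baseline top-1 accuracy ~81.7%.
n_total      = 50_000   # full evaluation set
n_mislabeled = 3_670    # error type (1)
baseline_acc = 0.817    # assumed average accuracy on the full set
gain         = 0.03122  # average improvement after removing type (1)

n_remaining  = n_total - n_mislabeled
correct_full = baseline_acc * n_total
# Accuracy on the reduced set = (correct_full - wrong_label_hits) / n_remaining,
# so solve for wrong_label_hits:
wrong_label_hits = correct_full - (baseline_acc + gain) * n_remaining
print(round(wrong_label_hits))  # -> 1552, i.e. roughly 1,550 images
```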

Conclusion

Although we did not verify this here, it goes without saying that it is important to use accurately labeled data not only for evaluation but also for training.

It is possible that previous studies drew incorrect conclusions when comparing accuracies among models. The data is supposed to be evaluation data, but can it really be used to evaluate model performance?

It seems to me that much deep learning work neglects careful examination of the data and instead tries to improve accuracy and other evaluation metrics through the expressiveness of the model. However, there is no point in a model accurately reproducing the erroneous labels contained in the evaluation data.

Proper evaluation of the model requires the use of accurate, high-quality datasets.

Especially when we build our own datasets, such as when applying AI in business, creating a high-quality dataset directly improves the accuracy and reliability of the AI. The results of this verification show that simply improving data quality raised accuracy by roughly 10 percentage points, which indicates that when developing an AI system it is important to improve not only the model but also the dataset.

However, maintaining dataset quality is not easy. While it is important to enrich metadata in order to properly assess the quality of AI models and data, metadata can be cumbersome to manage, especially for unstructured data.

Therefore, I am working on developing tools that make this easier.

Check out GitHub if you are interested.

References

[1] Beyer, L., Hénaff, O. J., Kolesnikov, A., Zhai, X., & van den Oord, A. (2020). Are we done with ImageNet? arXiv preprint arXiv:2006.07159.


Kenichi Higuchi

PdM & Engineer & Director at Adansons Inc. / Medical Student at Tohoku Univ. / pursuit next gen. of AI / 1st product → https://adansons.wraptas.site