We Need to Change How Image Datasets are Curated

Why many gold-standard computer vision datasets, such as ImageNet, are flawed

Catherine Yeo
Fair Bytes

ImageNet

Even though it was created in 2009, ImageNet remains the most influential dataset in computer vision and AI today. With more than 14 million human-annotated images, it has become the reference point for large-scale datasets in AI. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition built on the dataset, has served as a benchmark for progress in the field.

There’s no denying ImageNet’s influence and importance in computer vision. However, given the growing evidence of bias in AI models and datasets, future datasets must be curated with greater awareness of ethics and social context.

A recent paper by Vinay Prabhu and Abeba Birhane identifies serious concerns with such large-scale datasets (primarily ImageNet, but also others, including 80 Million Tiny Images and CelebA). The authors also outline ways to mitigate these concerns and call for mandatory Institutional Review Boards for large-scale dataset curation.

This article summarizes their findings below. You can read their full preprint on arXiv here.

One Line Summary

Catherine Yeo
Fair Bytes

Harvard | Book Author | AI/ML writing in @fairbytes @towardsdatascience