We Need to Change How Image Datasets are Curated

Why many gold-standard computer vision datasets, such as ImageNet, are flawed

Catherine Yeo
Jul 2, 2020 · 4 min read
ImageNet

Even though it was created back in 2009, ImageNet remains the most impactful dataset in computer vision and AI today. Consisting of more than 14 million human-annotated images, it set the standard for large-scale datasets in the field. For years, ImageNet also hosted an annual competition (ILSVRC) to benchmark progress in image recognition.

There’s no denying ImageNet’s influence and importance in computer vision. However, given the growing evidence of bias in AI models and datasets, future datasets must be curated with ethics and social context in mind.

A recent paper by Vinay Prabhu and Abeba Birhane examines the harms embedded in such large-scale datasets (primarily ImageNet, but also others, including 80 Million Tiny Images and CelebA). The authors also outline solutions to mitigate these harms and call for mandatory Institutional Review Board (IRB) oversight of large-scale dataset curation.

This article summarizes their findings below; the full preprint is available on arXiv (see the reference at the end of this post).

One Line Summary

Large-scale image datasets such as ImageNet were built by scraping people’s photos without their consent, expose those people to privacy risks, and encode harmful stereotypes, so their curation needs ethical oversight.

Harms and Threats

1) Lack of Consent

Photographers generally don’t publish your photos for the whole world to see without asking you first, so why should image datasets be exempt from consent? Yet the people depicted in datasets like ImageNet, which were scraped from the web, never agreed to be included.


2) Loss of Privacy

With a simple reverse image search on a face from one of these datasets, we could potentially find that person’s full name, social media accounts, occupation, home address, and many other data points they never agreed to give away. (And they may not have agreed to have their face in the dataset in the first place.)
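To make the risk concrete, here is a minimal sketch of how a face-based reverse lookup works: embed faces as vectors and find the closest match in an index of already-identified photos. The embedding model is stubbed out with random vectors, and the identities are made up; in practice an off-the-shelf face-embedding network would supply the vectors.

```python
import numpy as np

# Hypothetical setup: in reality these vectors would come from a face-embedding
# model run over scraped, already-identified photos (e.g. public profiles).
rng = np.random.default_rng(0)
EMBED_DIM = 128

indexed_people = ["person_a", "person_b", "person_c"]  # made-up identities
index_vectors = rng.normal(size=(len(indexed_people), EMBED_DIM))
index_vectors /= np.linalg.norm(index_vectors, axis=1, keepdims=True)

def reverse_lookup(query_vec):
    """Return the closest indexed identity to a query embedding (cosine similarity)."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    sims = index_vectors @ query_vec
    best = int(np.argmax(sims))
    return indexed_people[best], float(sims[best])

# A face cropped from a public dataset, embedded with the same (stubbed) model:
query = index_vectors[1] + 0.05 * rng.normal(size=EMBED_DIM)  # noisy copy of person_b
print(reverse_lookup(query))  # -> ('person_b', ~0.99)
```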

3) Perpetuation of Harmful Stereotypes

For example, MIT’s 80 Million Tiny Images dataset drew its class labels from WordNet nouns and, as a result, contains harmful slurs: images of women labeled as “whores” or “bitches,” and images of people from minority racial groups labeled with offensive terms.

Once trained on such biased data, machine learning models can not only normalize these stereotypes but amplify them.

Solutions

1) Remove and Replace

Audit the dataset, remove images that are offensive or were collected without consent, and, where faces are still needed for training, replace real ones with synthetically generated images.
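As a rough illustration of the removal step, the sketch below filters a toy annotation list against a blocklist of flagged label terms. The annotation format, the blocklist, and the helper function are hypothetical, not tooling from the paper.

```python
# Hypothetical annotation records: (image_path, label) pairs.
annotations = [
    ("images/0001.jpg", "lab coat"),
    ("images/0002.jpg", "ballplayer"),
    ("images/0003.jpg", "<offensive slur>"),  # placeholder for a harmful label
]

# A curated blocklist of label terms flagged during an audit (illustrative only).
BLOCKLIST = {"<offensive slur>"}

def remove_flagged(records):
    """Drop records whose label appears on the blocklist; keep track of removals."""
    kept, removed = [], []
    for path, label in records:
        (removed if label in BLOCKLIST else kept).append((path, label))
    return kept, removed

clean, flagged = remove_flagged(annotations)
print(f"kept {len(clean)} images, removed {len(flagged)} flagged images")
```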

2) Differential Privacy

For images that remain in a dataset, the authors point to differentially private obfuscation of faces: blurring or adding carefully calibrated noise so that individual identities cannot be recovered, while the images stay useful for training.
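As a minimal sketch of the idea, assuming a face bounding box is already known, the snippet below pixelates the face region and adds Laplace noise to it. The noise scale and box are arbitrary illustrations, not the calibrated mechanism a real differentially private pipeline would use.

```python
import numpy as np

def obfuscate_face(image, box, block=8, noise_scale=20.0):
    """Pixelate a face region and add Laplace noise to it (illustrative only)."""
    top, left, height, width = box
    out = image.astype(float).copy()
    face = out[top:top + height, left:left + width]

    # Pixelate: replace each block x block tile with its mean value.
    for i in range(0, height, block):
        for j in range(0, width, block):
            tile = face[i:i + block, j:j + block]
            tile[...] = tile.mean()

    # Add Laplace noise, the mechanism commonly used in differential privacy.
    face += np.random.default_rng(0).laplace(scale=noise_scale, size=face.shape)
    out[top:top + height, left:left + width] = np.clip(face, 0, 255)
    return out.astype(np.uint8)

# Toy grayscale "photo" with a hypothetical face box at (top=16, left=16, 32x32):
photo = np.random.default_rng(1).integers(0, 256, size=(64, 64), dtype=np.uint8)
obfuscated = obfuscate_face(photo, box=(16, 16, 32, 32))
```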

3) Dataset Audit Cards

Analogous to model cards, a dataset audit card would be published alongside a dataset to document how it was collected, what an audit of it revealed (e.g., problematic labels or non-consensual images), and what uses are and are not appropriate.
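One way to picture such a card is as a small structured record released with the data. The fields below are my own guess at what it might contain rather than a format defined in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetAuditCard:
    """Illustrative structure for a dataset audit card (fields are hypothetical)."""
    name: str
    curation_method: str             # how images and labels were collected
    consent_status: str              # whether depicted people consented
    known_issues: list = field(default_factory=list)
    recommended_uses: list = field(default_factory=list)
    prohibited_uses: list = field(default_factory=list)

card = DatasetAuditCard(
    name="ExampleImageDataset",
    curation_method="Scraped from web image search using WordNet nouns as queries",
    consent_status="No consent obtained from depicted individuals",
    known_issues=["Offensive labels in several synsets", "Identifiable faces"],
    recommended_uses=["Research on dataset auditing"],
    prohibited_uses=["Face recognition", "Demographic inference"],
)
print(card)
```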

Final Thoughts

ImageNet and the datasets modeled on it have powered enormous progress, but that progress has come at a cost to consent, privacy, and fairness. Future large-scale curation efforts should treat the people in the photos as human subjects, with ethical review (such as the mandatory IRBs the authors call for) built into the process.

For more information, check out the original paper on arXiv:

Vinay Uday Prabhu and Abeba Birhane. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” arXiv preprint, 2020.
