We Need to Change How Image Datasets are Curated
Why many gold-standard computer vision datasets, such as ImageNet, are flawed
Even though it was created back in 2009, ImageNet remains the most impactful dataset in computer vision and AI today. Consisting of more than 14 million human-annotated images, ImageNet set the standard for large-scale datasets in AI. For years, it also hosted an annual competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), to benchmark progress in the field.
There’s no denying ImageNet’s influence and importance in computer vision. However, with growing evidence of biases in AI models and datasets, we must bring awareness of ethics and social context into the curation process of future datasets.
A recent paper by Vinay Prabhu and Abeba Birhane identifies issues of concern in such large-scale datasets (primarily ImageNet, but also others, including 80 Million Tiny Images and CelebA). The authors also outline solutions to mitigate these concerns and call for mandatory Institutional Review Board approval for large-scale dataset curation.
This article summarizes their findings below. You can read their full preprint on arXiv here.
One Line Summary
Large-scale image datasets have issues we must aim to mitigate and address in future dataset curation processes.
Harms and Threats
1) Lack of Consent
Many of these large-scale datasets freely gather photos, including photos of real people, without consideration of consent. In the Open Images V4–5–6 dataset, Prabhu and Birhane found “verifiably non-consensual images” of children taken from the photo-sharing community Flickr.
Photographers don’t publish your photos for the whole world to see without your consent, so why should image datasets be any different?
2) Loss of Privacy
When ImageNet was published, effective reverse image search was not widely available. Now, image-scraping tools are widespread, and powerful reverse image search engines (e.g., Google Image Search, PimEyes) allow anyone to uncover the real identities of people whose faces appear in a large image dataset.
With a simple reverse lookup, someone could potentially find your full name, social media accounts, occupation, home address, and many other data points you never agreed to give away. (You may not have agreed to give your face to the dataset in the first place.)
3) Perpetuation of Harmful Stereotypes
How a dataset is labeled and curated can perpetuate notions of what (and who) is perceived as “desirable”, “normal”, and “acceptable”, casting individuals and groups on the margins as “outliers”.
For example, MIT’s 80 Million Tiny Images dataset contains harmful slurs, labeling some images of women as “whores” or “bitches” and applying offensive language to minority racial groups.
Once trained on biased data, machine learning algorithms can not only normalize but amplify stereotypes.
Proposed Solutions
1) Remove and Replace
There is precedent here: ImageNet removed photos within its “person” subtree when they were found to have “potentially offensive labels”. Similar action could be taken for other datasets with offensive labels or photos captured in non-consensual settings: remove them, then replace them (where possible) with consensually shot, financially compensated images.
2) Differential Privacy
Another solution is to blur or obfuscate individuals’ identities using differential privacy, a framework with quantifiable privacy guarantees in which aggregate information about a dataset can be publicly shared while information about any single individual is withheld.
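The paper’s suggestion concerns obfuscating faces, but the core guarantee of differential privacy is easiest to see with a numeric release. Below is a minimal sketch (not from the paper) of the standard Laplace mechanism, which adds noise calibrated to a query’s sensitivity so that no single individual’s presence in the dataset can be confidently inferred from the published number:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return an (epsilon)-differentially-private estimate of true_value.

    Adds Laplace noise with scale = sensitivity / epsilon, the standard
    mechanism for releasing numeric queries under differential privacy.
    """
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Hypothetical example: release how many images in a dataset contain
# faces without revealing whether any one person's photo is included.
# (Counting queries have sensitivity 1: adding or removing one image
# changes the count by at most 1.)
true_count = 1204  # made-up number, for illustration only
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

Smaller values of epsilon give stronger privacy at the cost of noisier answers; the function name and example figures here are illustrative, not drawn from the paper.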
3) Dataset Audit Cards
If the publication of datasets were accompanied by audit cards, everyone using a dataset would be aware of its goals, curation process, limitations, and so on. This is similar to the concept of model cards accompanying machine learning models.
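To make the idea concrete, here is a hypothetical sketch of the kind of structured fields an audit card might carry, loosely modeled on model cards. The class and field names are my own illustration, not a format proposed in the paper:

```python
from dataclasses import dataclass, field

@dataclass
class DatasetAuditCard:
    """Illustrative audit-card fields for a published image dataset."""
    name: str
    goals: str                      # intended uses of the dataset
    curation_process: str           # how the images were gathered/labeled
    consent_obtained: bool          # were subjects' consents collected?
    known_limitations: list = field(default_factory=list)
    offensive_labels_audited: bool = False

# A made-up example dataset, for illustration only.
card = DatasetAuditCard(
    name="ExampleImages-1M",
    goals="Object recognition benchmark",
    curation_process="Scraped from photo-sharing sites; labels human-verified",
    consent_obtained=False,
    known_limitations=["Geographic skew toward North America and Europe"],
)
```

Publishing such a card alongside the dataset would let downstream users weigh these disclosures before training on the data.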
This was a fascinating and timely read; it made me question my own decision-making process when gathering and working with data. I sincerely hope that this paper (and other similar work) motivates researchers to rethink the process for curating large-scale datasets so as to minimize and avoid these harms and threats.
For more information, check out the original paper on arXiv here.
Vinay Uday Prabhu and Abeba Birhane. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” (2020).
Thank you for reading! Subscribe to read more about research, resources, and issues related to fair and ethical AI.
Catherine Yeo is a CS undergraduate at Harvard interested in AI/ML/NLP, fairness and ethics, and everything related. Feel free to suggest ideas or say hi to her on Twitter.