Is data fixable? On the need for socially-informed practices in ML research and education (part 1)
Part 1: Deployment failures and approaches to data
When we approach a Machine Learning (ML) project in which we want to solve a specific task with an existing or to-be-designed ML model, we immediately think about data: their availability always, their quantity often, their quality and mode of creation less frequently. If at all, it is only at the testing stage that we may attempt to better understand the dataset used to train the model, to identify the causes of possible model failures and the improvements that can be made to the architecture or the training process. In doing so, we may come across biases in data representation, incorrect labeling, or uneven model performance. We may question whether the inductive biases the model exploits are indeed representative of the generalization capabilities we claim the model has. We may think of failure, of data, of ethics. Let us unpack this and take a step back. To do so, in this series of three posts, we analyze and reflect on articles recently published in the ML research community.
This first post discusses some deployment failures of AI systems, and how these failures call into question the way we approach data when designing ML systems. These failures have been analyzed in the article by Raji et al., The Fallacy of AI Functionality.
The second post will take a more holistic perspective on data creation and on the expectations we may place on ML approaches, discussing the article by Paullada et al., Data and its (dis)contents: A survey of dataset development and use in machine learning research.
The third post will reflect on how we should question our ML education practices to help alleviate the current AI ethics crisis, as analyzed in the article by Raji et al., You Can’t Sit With Us: Exclusionary Pedagogy in AI Ethics Education.
1. On some deployment failures of ML systems in high-stakes scenarios
AI systems that have been deployed frequently do not work, and this is too often overlooked in discussions about the risks these systems entail, how to certify them, and how to regulate their deployment. The validity of AI model evaluation must also be critically examined. These two issues have been analyzed in two recent articles: Raji et al., The Fallacy of AI Functionality [1], and Liao et al., Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning [2].
AI systems have already been deployed in high-stakes scenarios and have failed with dire consequences. In [1], Raji et al. draw up a non-exhaustive list. AI-based detectors of unemployment benefit fraud left innocent people without income [3]. An AI-based resource allocation system cut almost in half the hours of home assistance granted to a wheelchair-using patient with cerebral palsy [4]. These failures do not impact all sociodemographic groups evenly, but disproportionately harm underprivileged communities. The African-American community has been disproportionately targeted by failures of systems used to identify criminal suspects and predict recidivism risk [5, 6]. Low-income people have been falsely flagged as less in need of medical assistance [7], or as more likely to commit child abuse [8, p. 20]. Women have been ranked as less worthy of being recruited [9].
Let us be clear: these systems failed. For child abuse prevention, “the system produced biased outcomes because it significantly oversampled poor children from working class communities, especially communities of color, in effect subjecting poor parents and children to more frequent investigation” [10, 8]. The US health care system has used an algorithm to identify which patients should receive additional care. This algorithm was designed under the assumption that health care needs can be quantified by health care costs. But less money is spent on Black patients who are equally sick as White patients [7]. This is illustrated in Fig. 1. Women were graded as less worthy of being recruited by Amazon’s AI system because it was built on the company’s historical hiring data, which was itself biased. The system was shown to reject resumes that mentioned women’s colleges or even the word “women” [9]. So: what data should we use, and what assumptions should we make to model the problem? Should we try to correct the data? Should these high-stakes decisions in society be automated at all? These are questions asked by the articles reviewed in this series of posts.
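To make this proxy-label failure mode concrete, here is a minimal synthetic sketch in Python (made-up numbers, not the actual algorithm audited in [7]): if patients are selected for extra care according to predicted cost, a group on which less money is historically spent ends up under-selected even when its distribution of true health needs is identical.

```python
# Minimal synthetic illustration of cost used as a proxy for health need.
# All numbers are made up for illustration; this is NOT the audited algorithm.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

group = rng.integers(0, 2, size=n)               # 0 = group A, 1 = group B
need = rng.gamma(shape=2.0, scale=1.0, size=n)   # true health need, identical for both groups

# Assumption: for equal need, historically less is spent on group B.
spending_factor = np.where(group == 1, 0.7, 1.0)
cost = need * spending_factor + rng.normal(0, 0.1, size=n)   # observed cost

# Proxy target: enroll the 20% of patients with the highest (predicted) cost.
threshold = np.quantile(cost, 0.80)
selected = cost >= threshold

# Despite identical need distributions, group B is enrolled far less often.
for g in (0, 1):
    in_group = group == g
    print(f"group {g}: selection rate = {selected[in_group].mean():.3f}, "
          f"mean need of selected = {need[selected & in_group].mean():.2f}")
```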
In the EU, the 2020 special report on Algorithmic discrimination in Europe: Challenges and opportunities for gender equality and non-discrimination law examines the current EU gender equality and non-discrimination legislative framework in light of algorithmic discrimination [11]. France is mentioned when the report discusses algorithmic discrimination in education, where it underlines the concerns voiced when the Parcoursup system was introduced to allocate high-school graduates to higher-education institutions, and in policing and fraud detection, where the algorithm used to detect social benefits fraud has been criticized for targeting people born outside of Europe [12, p. 20]. French municipalities are also under heavy pressure to deploy face recognition technologies [13], but public and citizen organizations are trying to raise the alarm about the failure rates, as well as the ethical and social risks entailed by such technologies [14, 15]. In 2022, the European Data Protection Board published guidelines on the use of face recognition technologies in law enforcement, reiterating its call for a ban on facial recognition technology in certain cases, including remote identification in publicly accessible spaces, classification of individuals into groups based on ethnicity, gender, political or sexual orientation, emotion recognition, and law enforcement cross-referencing of camera data with facial images available online [16].
Raji and Fried [17] surveyed the evaluation of facial recognition systems, whose failures and structural weaknesses were first exposed by the foundational work of Buolamwini and Gebru [18]. The latter showed that, in 2018, commercial systems performing gender classification on face images were about 30% less accurate for darker-skinned women than for lighter-skinned men. Other audits confirmed systematic failures of facial recognition and verification systems, including the Amazon Rekognition system, then deployed by several local police departments in the US, whose discriminatory consequences had already been spotted by the ACLU (see [8, 17] and references therein). This system was shown to falsely match members of Congress and athletes to recorded faces of criminals, and was eventually withdrawn. There are further cases of deployed face processing systems with failure rates above 80%. As Raji and Fried put it, “Yet despite the growing public awareness of the practical failure of these systems once released in the real world, academic studies continue to report near perfect performance of facial recognition systems on benchmark datasets.” We therefore need to understand the apparent divergence between the performance reported in academic settings and that actually obtained when systems are deployed in the real world.
In their article Are we learning yet? A meta-review of evaluation failures across ML [2], Liao et al. survey 107 papers in computer vision, natural language processing, recommender systems and more, and propose a taxonomy of observed failure modes. They distinguish between internal issues, related to a model’s failures on the very dataset it has been trained on, and external issues, related to additional failures when the model is used in another setting involving a different dataset; the latter are often due to misalignment between the distributions of the two datasets. This misalignment often manifests itself in the under-representation of specific regions of the input space. As summarized by Paullada et al. [19], important examples are the over-representation of lighter-skinned faces in face analysis datasets, of objects from Western countries in datasets used for object recognition, and of male pronouns and male-coded names in datasets used for named entity recognition.
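As a concrete illustration of such external issues, the following minimal sketch (hypothetical attribute values, not code from [2]) compares the marginal distribution of a sensitive attribute between a training set and a deployment set using the total variation distance; a large distance flags under-represented regions of the input space that benchmark accuracy on the training distribution will not reveal.

```python
# Sketch: quantify the misalignment of a categorical attribute between
# a training set and a deployment set (hypothetical values).
from collections import Counter

def marginal(values):
    """Empirical marginal distribution of a categorical attribute."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Toy example: skin-tone annotations in the training vs. deployment data.
train_attr = ["lighter"] * 800 + ["darker"] * 200
deploy_attr = ["lighter"] * 500 + ["darker"] * 500

p, q = marginal(train_attr), marginal(deploy_attr)
print(f"train: {p}\ndeploy: {q}\nTV distance = {total_variation(p, q):.2f}")
```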
2. Is augmenting data possible and sufficient?
The first and obvious proposition to counterbalance such problematic and failure-causing under-representation of marginalized groups has been to augment the data and make it more inclusive. An early attempt in this direction was made by IBM with its “Diversity in Faces” (DiF) dataset, claimed to ensure fairness and accuracy in face recognition. As explained by Crawford and Paglen [20], this dataset comprises close to a million images collected from the Yahoo! Flickr Creative Commons dataset so as to achieve an even representation across categories of gender, skin tone, age, and so-called “facial structure”. Indeed, the IBM DiF team considered that the first three categories were not sufficient to capture the diversity of human faces, and set out to include categories of facial symmetry and skull shape. These categories are highly questionable, recalling similar characterizations used in the fallacious field of craniometry, developed in the nineteenth century in an attempt to establish a biological basis for intelligence and for the superiority of certain groups, split along racial and gender lines.
According to Crawford and Paglen [20], this example reveals the political acts often left implicit in making a dataset when we:
- choose the few categories into which we divide a continuous world,
- decide who is in charge of sorting every data sample into a category, and who supervises the annotation processes,
- attempt to quantify diversity and choose a certain formula for fairness.
Such political acts are pervasive and can again be exemplified with two major datasets. In 2018¹, Crawford and Paglen [20] performed a manual inspection of the “person” category of ImageNet, the reference computer vision dataset used in particular to train object recognition models. They uncovered alarming sub-categories with racist, sexist and otherwise problematic labels such as “slut, boozer, soaker, ball-breaker, mulatto, redneck”. As recalled by Raji and Fried [17], the CelebA dataset contains 40 binary attributes annotated by clickworkers from Amazon Mechanical Turk, including problematic ones such as “double chin”, “pointy nose”, “narrow eyes”, “big lips”, or “attractive”. These labels are problematic because, in addition to being inherently subjective and having been historically employed for racist, antisemitic or sexist classification, they are implicitly defined in reference to a certain norm, likely white and slim. This norm is embedded into the dataset creation process, which thereby reflects the power relationships at play. Similarly, Denton et al. analyze the creation process of ImageNet [21], and underline that some labels reflect “a view that associates ’bikinis’ with women, ’sports’ with men”, but also “’trout’ with fishing trophies, and ’lobsters’ with dinner”. This reflects a “white western male gaze” [21]; this attire, these activities and these animals could be described differently from other social standpoints.
In their article investigating the ethical concerns of facial recognition auditing, Raji, Gebru, Mitchell, Buolamwini, Lee and Denton describe two types of ethical tensions that arise from the very intention to audit and possibly improve face recognition systems [22].
The first ethical tension is that defining multiple group or label categories to better analyze and improve fairness can fail to capture intersectionality and hurt fairness for other groups. The concept of intersectionality was introduced by Kimberlé Crenshaw [23] as a “framework for understanding how interlocking systems of power and oppression give rise to qualitatively different experiences for individuals holding multiply marginalized identities” [22]. For example, attempting to define labels for racial categories, which are nebulous social constructs, or binary gender labels, can be exclusionary and problematic when these labels are determined by a person’s appearance. Moreover, improving fairness across one set of groups (for example, gender groups) may hurt fairness across others (e.g., age groups) [24].
The second ethical tension is between privacy and representation. Major massive datasets have relied on predatory data collection practices, in which images were collected without the consent of the depicted individuals. Prominent examples include the above-mentioned ImageNet [20, 21] and CelebA datasets, or the Microsoft MS-CELEB dataset of 10 million photos of about 100,000 so-called celebrities scraped from the Internet, representing actors and politicians, but also journalists and rights activists opposing surveillance and facial recognition; this dataset was taken down in 2019. Recently in Europe, data protection authorities fined Clearview AI for such predatory data collection practices, which are unlawful under the GDPR (France [25], Italy [26], the UK [27]). Privacy and data protection rights are therefore major challenges, and Raji et al. [22] show how seeking to increase the number of data samples of under-represented groups in turn disproportionately increases their privacy risk: since members of these groups are photographed less than others [28, 29], any given image of them has a higher probability of being included in a dataset than an image of an individual from an over-represented group. Let us remember that groups under-represented in datasets significantly overlap with socially underprivileged groups.
The above discussion on under-representation is theorized through the concept of bias, which we briefly introduce next to expose the structural unfairness of human-generated data, an unfairness that calls for a broader perspective in ML practices.
3. Data biases and human biases
In AI ethics, bias is defined as “the prejudice of an automated decision system towards individuals or groups of people on the basis of protected attributes like gender, race, age, and more” [29]. From a statistical point of view, biases can correspond to data features having different marginal (presence) or conditional (in connection with other data features) distributions. In [29], Fabbrizzi et al. categorize biases in visual datasets into selection bias (which subjects are included in a dataset), framing bias (how the subjects are represented) and label bias (errors or disparities in the labels). For example, for the Open Images dataset, which contains approximately 9 million images, Schumann et al. [30] show how societal norms affect whether the different persons appearing in the same image get annotated at all, depending on the context of the image (for instance, women may often be considered not to be the focus, depending on the context). To facilitate the inspection of biases in visual datasets, Wang et al. [28] present the REVISE tool, which enables the analysis of how people are portrayed. For example, in the MS-COCO dataset, they show that women and persons with darker skin tones are less likely to occupy a large area of the image. Persons with darker skin tones tend to appear more in outdoor transportation scenes; women tend to appear more in indoor scenes such as shopping and dining, while men appear more in outdoor scenes related to sports and vehicle categories. As illustrated in Fig. 2, they even show that the less people appear in gender-traditional roles, the less often their gender is correctly classified. As aforementioned, similar biases are present in NLP datasets.
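To make the selection and framing bias categories concrete, here is a minimal sketch inspired by the kind of measurements REVISE reports [28], but written from scratch on hypothetical annotations (it does not use the tool’s actual API): for each group, it computes how often the group appears and how much relative image area it occupies.

```python
# Sketch of two simple audit measurements on hypothetical person annotations:
# selection bias (how often a group appears) and framing bias (relative area).
from statistics import mean

# Hypothetical annotations: one record per person detected in an image.
annotations = [
    {"group": "woman", "bbox_area": 0.05, "image_area": 1.0},
    {"group": "man",   "bbox_area": 0.30, "image_area": 1.0},
    {"group": "woman", "bbox_area": 0.08, "image_area": 1.0},
    {"group": "man",   "bbox_area": 0.25, "image_area": 1.0},
]

for g in sorted({a["group"] for a in annotations}):
    records = [a for a in annotations if a["group"] == g]
    presence = len(records) / len(annotations)                          # selection
    rel_area = mean(a["bbox_area"] / a["image_area"] for a in records)  # framing
    print(f"{g}: share of annotations = {presence:.2f}, "
          f"mean relative area = {rel_area:.2f}")
```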
Biases in datasets therefore reproduce socially-rooted prejudice faced by dominated groups in society, based in particular (but not only) on the psycho-sociological constructs of gender and race. The connections between data, politics and power relationships have been exposed in, e.g., [20, 21, 31]. The notion of bias has been formalized in cognitive and social psychology [32]. In the human brain, our semantic memory network functions through associations between concepts, and the term bias denotes cases where such associations are problematic: we are faster to think of savanna when asked about lions, but we are also generally faster to think of men when asked about science. These biases are created by recurrent exposure to situations associating these concepts, and in particular, in today’s world, by multimedia content. The Implicit Association Test presented by Greenwald et al. [32], which is widely recognized in social psychology, makes it possible to quantify these differences in implicit associations by measuring reaction times. Several associations can be probed, such as between age and pleasantness, sexuality (gay or straight) and pleasantness, Arab-Muslim and pleasantness, gender and science, or gender and career, yielding scores of stereotypicality in populations [33]. Strikingly, it has been shown that language and vision ML models trained on large-scale data learn similarly biased associations between concepts: the distances between the vectorized latent representations of words [34] or images [35] related to the same pairs of concepts reflect the IAT scores of tested populations [33].
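To make this last point concrete, the following minimal sketch computes a WEAT-style effect size in the spirit of Caliskan et al. [34]: the differential association of two sets of target words with two sets of attribute words, measured with cosine similarities between their embeddings. The `embed` dictionary mapping words to vectors is an assumption (e.g., loaded from pretrained word embeddings), and the word lists are purely illustrative.

```python
# Sketch of a WEAT-style association test on word embeddings.
# `embed` (word -> vector) is assumed to come from a pretrained model.
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B, embed):
    """s(w, A, B): mean similarity of w to attribute set A minus to set B."""
    return (np.mean([cos(embed[w], embed[a]) for a in A])
            - np.mean([cos(embed[w], embed[b]) for b in B]))

def weat_effect_size(X, Y, A, B, embed):
    """Effect size of the differential association of targets X, Y with A, B."""
    sX = [association(x, A, B, embed) for x in X]
    sY = [association(y, A, B, embed) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

# Illustrative call (requires a real `embed` dictionary):
# d = weat_effect_size(["science", "physics"], ["poetry", "art"],
#                      ["man", "male"], ["woman", "female"], embed)
```

A large positive effect size indicates that the first target set (e.g., science terms) is more strongly associated with the first attribute set (e.g., male terms) in the embedding space, mirroring the stereotypical associations measured by the IAT.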
4. Conclusion
To design better ML approaches and systems, the sole direction of augmenting and correcting data to attempt to free it from biases therefore seems structurally limited. In the next post, we will discuss this question, which has been the subject of heated debates in the ML community.
by Lucile Sassatelli, Full Professor in Computer Science at Université Côte d’Azur, Junior fellow of Institut Universitaire de France, Scientific Director of EFELIA Côte d’Azur
References
[1] Inioluwa Deborah Raji, I. Elizabeth Kumar, Aaron Horowitz, and Andrew Selbst, “The Fallacy of AI Functionality,” in 2022 ACM Conference on Fairness, Accountability, and Transparency, Seoul Republic of Korea, June 2022, pp. 959–972, ACM.
[2] Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt, “Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[3] Robert N. Charette, “Robo-adjudication and fake fraud reports [spectral lines],” IEEE Spectrum, vol. 55, no. 3, p. 6, 2018.
[4] Colin Lecher, “A healthcare algorithm started cutting care, and no one knew why,” https://www.theverge.com/2018/3/21/17144260/healthcare-medicaid-algorithm-arkansas-cerebral-palsy, Mar. 2018.
[5] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.,” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.
[6] J. Larson, S. Mattu, L. Kirchner, and J. Angwin, “How we analyzed the compas recidivism algorithm,” https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm, 2016.
[7] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan, “Dissecting racial bias in an algorithm used to manage the health of populations,” Science, vol. 366, no. 6464, pp. 447-453, 2019.
[8] Meredith Whittaker, Kate Crawford, Roel Dobbe, Genevieve Fried, Elizabeth Kaziunas, Varoon Mathur, Sarah Myers West, Rashida Richardson, Jason Schultz, and Oscar Schwartz, “AI Now Report 2018,” https://ainowinstitute.org/AI_Now_2018_Report.pdf, Dec. 2018.
[9] Jeffrey Dastin, “Amazon scraps secret ai recruiting tool that showed bias against women,” https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G, Oct. 2018.
[10] Virginia Eubanks, Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor, St. Martin’s Press, New York, 2018.
[11] European Commission, Directorate-General for Justice and Consumers, and European network of legal experts in gender equality and non-discrimination, Algorithmic Discrimination in Europe: Challenges and Opportunities for Gender Equality and Non-Discrimination Law, Publications Office of the European Union, Luxembourg, 2021.
[12] Defender of Rights (Défenseur des droits), “Lutte contre la fraude aux prestations sociales : à quel prix pour les droits des usagers ? (Fighting social benefits fraud: at what price for users’ rights?),” 2017.
[13] Clément Le Foll and Clément Pouré, “Des algorithmes au coin de la rue, ou le nouveau business de la vidéosurveillance automatisée,” https://www.mediapart.fr/journal/france/080522/des-algorithmes-au-coin-de-la-rue-ou-le-nouveau-business-de-la-videosurveillance-automatisee, May 2022.
[14] La Quadrature du Net, “Le vrai visage de la reconnaissance faciale,” https://www.laquadrature.net/2019/06/21/le-vrai-visage-de-la-reconnaissance-faciale/, June 2019.
[15] “Reconnaissance faciale : pour un débat à la hauteur des enjeux | CNIL,” https://www.cnil.fr/fr/reconnaissance-faciale-pour-un-debat-la-hauteur-des-enjeux, Nov. 2019.
[16] European Data Protection Board, “Guidelines 05/2022 on the use of facial recognition technology in the area of law enforcement,” https://edpb.europa.eu/our-work-tools/documents/public-consultations/2022/guidelines-052022-use-facial-recognition_en, June 2022.
[17] Inioluwa Deborah Raji and Genevieve Fried, “About Face: A Survey of Facial Recognition Evaluation,” https://arxiv.org/abs/2102.00813, 2021.
[18] Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” in Proceedings of the 1st Conference on Fairness, Accountability and Transparency, Sorelle A. Friedler and Christo Wilson, Eds. Feb. 2018, vol. 81 of Proceedings of Machine Learning Research, pp. 77–91, PMLR.
[19] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna, “Data and its (dis)contents: A survey of dataset development and use in machine learning research,” Patterns, vol. 2, no. 11, pp. 100336, Nov. 2021.
[20] Kate Crawford and Trevor Paglen, “Excavating AI: The politics of images in machine learning training sets,” AI & Society, June 2021.
[21] Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole, “On the genealogy of machine learning datasets: A critical history of ImageNet,” Big Data & Society, vol. 8, no. 2, pp. 205395172110359, July 2021.
[22] Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton, “Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing,” in Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), New York, 2020.
[23] Kimberle Crenshaw, “Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics,” University of Chicago Legal Forum, vol. 1989, no. 8, 1989.
[24] Michael Kearns, Seth Neel, Aaron Roth, and Zhiwei Steven Wu, “Preventing fairness gerrymandering: Auditing and learning for subgroup fairness,” in Proceedings of the 35th International Conference on Machine Learning, Jennifer Dy and Andreas Krause, Eds. 10–15 Jul 2018, vol. 80 of Proceedings of Machine Learning Research, pp. 2564–2572, PMLR.
[25] “Facial recognition: 20 million euros penalty against CLEARVIEW AI | CNIL.” https://www.cnil.fr/en/facial-recognition-20-million-euros-penalty-against-clearview-ai (accessed Nov. 23, 2022).
[26] “Facial recognition: Italian SA fines Clearview AI EUR 20 million | European Data Protection Board.” https://edpb.europa.eu/news/national-news/2022/facial-recognition-italian-sa-fines-clearview-ai-eur-20-million_en (accessed Nov. 23, 2022).
[27] “The walls are closing in on Clearview AI,” MIT Technology Review. https://www.technologyreview.com/2022/05/24/1052653/clearview-ai-data-privacy-uk/ (accessed Nov. 23, 2022).
[28] Angelina Wang, Alexander Liu, Ryan Zhang, Anat Kleiman, Leslie Kim, Dora Zhao, Iroha Shirai, Arvind Narayanan, and Olga Russakovsky, “REVISE: A tool for measuring and mitigating bias in visual datasets,” International Journal of Computer Vision, vol. 130, no. 7, pp. 1790–1810, Jul 2022.
[29] S. Fabbrizzi, S. Papadopoulos, E. Ntoutsi, and I. Kompatsiaris, “A survey on bias in visual datasets,” Computer Vision and Image Understanding, vol. 223, p. 103552, 2022, doi: https://doi.org/10.1016/j.cviu.2022.103552.
[30] Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru, “A Step Toward More Inclusive People Annotations for Fairness,” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, July 2021, pp. 916–925, arXiv:2105.02317 [cs].
[31] Ruha Benjamin, Race After Technology: Abolitionist Tools for the New Jim Code, Wiley, 2019.
[32] A. G. Greenwald, D. E. McGhee, and J. L. Schwartz, “Measuring individual differences in implicit cognition: the implicit association test,” Journal of Personality and Social Psychology, vol. 74, no. 6, pp. 1464–1480, June 1998.
[33] Brian A. Nosek, Frederick L. Smyth, N. Sriram, Nicole M. Lindner, Thierry Devos, Alfonso Ayala, Yoav Bar-Anan, Robin Bergh, Huajian Cai, Karen Gonsalkorale, Selin Kesebir, Norbert Maliszewski, Félix Neto, Eero Olli, Jaihyun Park, Konrad Schnabel, Kimihiro Shiomura, Bogdan Tudor Tulbure, Reinout W. Wiers, Mónika Somogyi, Nazar Akrami, Bo Ekehammar, Michelangelo Vianello, Mahzarin R. Banaji, and Anthony G. Greenwald, “National differences in gender–science stereotypes predict national sex differences in science and math achievement,” Proceedings of the National Academy of Sciences, vol. 106, no. 26, pp. 10593–10597, 2009.
[34] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan, “Semantics derived automatically from language corpora contain human-like biases,” Science, vol. 356, no. 6334, pp. 183–186, 2017.
[35] Ryan Steed and Aylin Caliskan, “Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event Canada, Mar. 2021, pp. 701–713, ACM.