Is data fixable? On the need of socially-informed practices in research and education (part 2)
Part 2: A more holistic perspective on data creation and expectations
As discussed in the previous post, augmenting and correcting data in an attempt to avoid biases is, on its own, a structurally limited way to design better ML approaches and systems. This question has been the subject of heated debates in the ML community, such as the Twitter feud¹ between Timnit Gebru (then technical co-lead of the ethical AI team at Google) and Yann LeCun (then chief AI scientist for Facebook AI Research and co-recipient of the Turing award). LeCun reduced the problem of bias in a computer vision system to a purely technical issue of “data poverty,” while Gebru advocated a broader perspective, arguing that diversifying the datasets was not sufficient and that we ML researchers “can’t ignore social and structural problems,” referring to work in critical race and technology theory [1, 2, 3].
Every dataset involves humans: those who decide the target task, those who decide how to collect data samples, those who decide the annotation guidelines, those who decide who annotates, those who are assigned the annotation work, and those whose personal data is used. This fact alone introduces limitations and biases into every dataset, however massive it may be. As Paullada et al. put it in their article Data and its (dis)contents: A survey of dataset development and use in machine learning research [4], “prevailing data practices tend to abstract away the human labor, subjective judgments and biases, and contingent contexts involved in dataset production. However, such details are important for assessing whether and how a dataset might be useful for a particular application, for enabling more rigorous error analysis, and for acknowledging the significant difficulty required in constructing useful datasets.”
To understand the connection between our ML research practices and the social and structural problems pervading datasets, let us follow the guidelines of Paullada et al. for ML practitioners at three stages: (1) defining a problem to be tackled with ML, (2) creating or choosing existing data to use, and (3) analyzing model performance and envisioning real-world deployment.
¹https://syncedreview.com/2020/06/30/yann-lecun-quits-twitter-amid-acrimonious-exchanges-on-ai-bias/
1. Define a problem to be tackled with ML
The first choice made by the human practitioner is the task. According to Schlangen [5], tasks can be defined abstractly (“intensionally”) as a problem statement (e.g., object recognition, speech-to-text transcription) or “extensionally,” that is, instantiated as a learning problem consisting of a dataset of (input, output) pairs and an evaluation metric (e.g., top-1 accuracy) [6]. This formal distinction between an abstract task definition and a dataset-dependent one is key to adopting a critical perspective on the soundness of the defined task and, later, to analyzing model generalization.
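To make this distinction concrete, here is a minimal sketch in Python (with illustrative names, not taken from [5] or [6]) of what an extensional task definition amounts to: a concrete dataset of (input, output) pairs bundled with an evaluation metric such as top-1 accuracy, standing in for an abstract, intensional problem statement such as “object recognition.”

# A minimal sketch of an "extensional" task definition: a concrete dataset of
# (input, output) pairs plus an evaluation metric, here top-1 accuracy.
# All names below are illustrative, not taken from the cited works.
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class ExtensionalTask:
    name: str                                   # intensional label, e.g. "object recognition"
    samples: List[Tuple[Any, Any]]              # (input, expected output) pairs
    metric: Callable[[List[Any], List[Any]], float]


def top1_accuracy(predictions: List[Any], targets: List[Any]) -> float:
    """Fraction of predictions that exactly match the expected output."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets) if targets else 0.0


# The same intensional task ("object recognition") can be instantiated by many
# different extensional tasks, each tied to one specific dataset and metric.
task = ExtensionalTask(
    name="object recognition",
    samples=[("img_001.jpg", "cat"), ("img_002.jpg", "dog")],
    metric=top1_accuracy,
)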
One must first analyze the intensional definition of the task and the mapping we can foresee between input and output. This mapping is not always meaningful, and we should reflect on what it could rest on before any data collection, model training, or performance analysis. Numerous ML approaches have been published that claim to predict attributes such as sexuality or hireability from people’s faces. What assumptions underlie this kind of input-output mapping? The assumption that a biometric feature such as facial appearance is connected to sexual behavior or professional competence.
These are problematic pseudoscientific tasks resting on claims of “essentialism of human traits” [4]. Such reflection must take place beforehand and involves human values. Crucially, even though ML models have been trained on such tasks and their performance may suggest some apparent learnability, we must analyze what elements the models rely on for prediction. They may, unsurprisingly, rely on stereotypical artifacts such as face grooming or home decor [7, 8]. Then what use are such models? The ML practitioner must critically examine their own intentionality through reflective thinking. Another example is GermEval2020 Task 1, where the input was students’ short answer texts and the output their IQ score. The responsibility of the organizers in setting such a problematic task lies in (i) legitimizing the IQ score as a sensible quantity, (ii) predicting IQ with an ML approach, and (iii) assuming IQ can be predicted from short text answers. As ill-defined datasets can enable ML models to find shortcuts that make ethically dubious questions appear answerable, Jacobsen et al. emphasize that “When assessing whether a task is solvable, we first need to ask: should it be solved? And if so, should it be solved by AI?” [9, 4].
One must consciously decide whether to simply stop pursuing a direction when the intensional description of the task yields a learning problem that is solvable only by exploiting problematic biases in the dataset. Predicting recidivism must serve as a cautionary tale. In 2016, ProPublica showed that the private and closed COMPAS system deployed within the US justice system was biased against Black defendants: “Black defendants were also twice as likely as white defendants to be misclassified as being a higher risk of violent recidivism. And white violent recidivists were 63 percent more likely to have been misclassified as a low risk of violent recidivism, compared with black violent recidivists.” Race was not an input to the system, which relied on a set of 137 questions including asking defendants: “Was one of your parents ever sent to jail or prison?” “How many of your friends/acquaintances are taking drugs illegally?” and “How often did you get in fights while at school?” [10]. As scientists, we must dare to ask: Should recidivism be predicted? Should recidivism be predicted to inform justice decisions towards individuals? This calls for legal domain expertise. If we aim at giving a person equal chances regardless of their social origin, are there acceptable features on which to base recidivism prediction?
Pre-employment assessment must also serve as a cautionary tale. Raghavan et al. have reviewed the claims of objectivity and “fixed bias” made by the numerous products used to screen job candidates [11] (see Fig. 1). These systems take as input answers to questions, game sessions played by the candidate, or a short video of the candidate. Beyond the severely questionable connection, discussed above and summarized by Arvind Narayanan in his talk [12], between visual features and future job performance, a crucial concern is equity between candidates. Indeed, as pointed out in [11]:
“Cognitive assessments have imposed adverse impacts on minority populations since their introduction into mainstream use. Critics have long contended that observed group differences in test outcomes indicated flaws in the tests themselves, and a growing consensus has formed around the idea that while assessments do have some predictive validity, they often disadvantage minorities despite the fact that minority candidates have similar real-world job performance to their white counterparts. The American Psychological Association (APA) recognizes these concerns as examples of “predictive bias” (when an assessment systematically over- or under-predicts scores for a particular group) […] Disparities in assessment outcomes for minority populations are not limited to pre-employment assessments. In the education literature, the adverse impact of assessments on minorities is well-documented. This has led to a decades-long line of literature seeking to measure and mitigate the observed disparities.”
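The notion of “predictive bias” quoted above can be made concrete with a minimal sketch in Python, on entirely hypothetical scores not taken from [11]: for each group it computes the gap between the mean predicted assessment score and the mean observed outcome, and a gap that is systematically non-zero for one group signals over- or under-prediction.

# A minimal sketch of the "predictive bias" notion quoted above: an assessment
# systematically over- or under-predicts scores for a particular group.
# The data and group labels below are entirely hypothetical.
from collections import defaultdict


def predictive_bias_by_group(groups, predicted, observed):
    """Return mean(predicted) - mean(observed) per group.

    A value far from zero for one group while close to zero for another
    indicates systematic over- (positive) or under-prediction (negative).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # group -> [sum_pred, sum_obs, count]
    for g, p, o in zip(groups, predicted, observed):
        sums[g][0] += p
        sums[g][1] += o
        sums[g][2] += 1
    return {g: round((sp / n) - (so / n), 6) for g, (sp, so, n) in sums.items()}


# Hypothetical assessment scores vs. later observed job performance.
print(predictive_bias_by_group(
    groups=["A", "A", "B", "B"],
    predicted=[0.8, 0.7, 0.5, 0.4],
    observed=[0.8, 0.7, 0.7, 0.6],
))  # {'A': 0.0, 'B': -0.2}: group B is systematically under-predicted.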
While [11] analyzes whether these products account for such structural disparities in the efficacy of the assessments across sociodemographic groups, Narayanan summarizes the issue as follows: “These systems claim to work by analyzing not even what the candidate said, but rather body language, speech patterns, etc. Common sense tells you this isn’t possible, and AI experts would agree. This product is essentially an elaborate random number generator.” [12].
As ML practitioners, we must therefore exercise critical judgement when faced with similar questions. For example, the task of predicting individual student success is problematic on several accounts. If we aim at giving a person equal chances regardless of their social origin, is it desirable to predict individual student success? Can the benefit exceed the negative impacts? Is it possible, and on what input features? How can we account for the structural adverse impact affecting minorities in education? Attempting to define and tackle such a task can reinforce social determinism, hide the human agency of ML practitioners behind a deceptive appearance of objectivity, affect the mindset of students in uncontrollable ways, and ultimately be used to allocate students to places in the higher-education system. For these reasons, AI systems used in the administration of justice and to control access to education and employment are now classified as high-risk systems in the EU AI Act [13, Annex III] (see Fig. 2).
2. Dataset creation
As emphasized by Paullada et al., the involvement of every human in the data life cycle must be acknowledged.
Data collection and compensation. Starting with those whose personal data may be collected: their right to privacy and data protection must be respected. In the previous post, we mentioned the fines imposed on Clearview AI for predatory data collection practices in the EU. In the EU, the GDPR requires the informed consent of individuals for their data to be collected and used for a specific purpose. In their article “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI [14], Sambasivan et al. surveyed 53 AI practitioners involved in high-stakes application deployment (including the detection of cancer, suicide risk, or landslides, in the US, West and East Africa, and India). Analyzing the data practices of the participants, Sambasivan et al. identified specific events that caused chains of negative effects, which they named Data Cascades. A prominent trigger of Data Cascades is uncompensated data work. As Sambasivan et al. detail, “high-stakes domains lacked pre-existing datasets, so practitioners were necessitated to collect data from scratch. ML data collection practices were reported to conflict with existing workflows and practices of domain experts and data collectors. Limited budgets for data collection often meant that data creation was added as extraneous work to on-the-ground partners (e.g., nurses, patrollers, farmers) who already had several responsibilities, and were not adequately compensated for these new tasks.” The practices used to create instrumental datasets such as ImageNet likewise failed to recognize the data work performed by click-workers, rendering their labor invisible and overlooking their interpretive work, as they were wrongly treated as a homogeneous pool hidden behind the interface of platforms such as Amazon Mechanical Turk (AMT) [15].
Human annotation. Annotation is an interpretive process that depends on the individual’s judgment, context, and past experience. This yields inherent variability in the annotations of the same data, which must be properly recorded. One example of this variability is given by Sambasivan et al. [14], who detail how AI practitioners lacking proper domain expertise had to deal with subjectivity in the ground-truth data: in the case of credit assessment, “ground truth was inaccurate but deeply embedded into systems”, but the limited data history did not allow the data to be corrected, hence introducing errors into models. Another example comes from ImageNet: Recht et al. [16] show that, because only a subset of objects can be annotated in each image, different thresholds on inter-annotator agreement yield significant differences in the object distributions of the dataset, even calling into question the capability of classifiers trained on one version of the dataset to generalize to another version. Acknowledging the inherent variability in hate speech labeling, Goyal et al. [17] show that different pools of raters annotate speech toxicity differently depending on their multiple identities (self-declared racial group and sexual orientation). Rather than resorting to click-workers to annotate data with high-level concepts, some have decided to ask experts to produce the labels. This is the case for propaganda annotation now serving as a gold standard [18], or sexism annotation [19]. Medical datasets are also affected: differences in the time available to and the training of the doctors annotating imagery can help explain differences in annotation that would otherwise be overlooked or leveled out by the models. For example, recent studies such as [20, 21] showed that about 60% of cases of pancreatic cancer are missed on imagery interpretation, while most could have been diagnosed with proper guidelines. Such knowledge of how labels are produced by humans must be documented and considered when designing and assessing an ML model.
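The effect of agreement thresholds observed by Recht et al. [16] can be illustrated with a minimal sketch on hypothetical annotations (this is not the ImageNet pipeline): changing the threshold changes which labels are kept, and therefore the class distribution of the resulting dataset.

# A minimal sketch of how an inter-annotator agreement threshold reshapes a
# dataset, in the spirit of the observation attributed to Recht et al. [16].
# The annotations below are hypothetical.
from collections import Counter


def filter_by_agreement(annotations, threshold):
    """Keep an (image, label) pair only if the majority label reaches `threshold` agreement."""
    kept = {}
    for image_id, votes in annotations.items():
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= threshold:
            kept[image_id] = label
    return kept


annotations = {
    "img_1": ["cat", "cat", "cat"],          # unanimous
    "img_2": ["dog", "dog", "wolf"],         # 2/3 agreement
    "img_3": ["terrier", "beagle", "dog"],   # no label above 1/3 agreement
}

for threshold in (0.4, 0.7, 1.0):
    kept = filter_by_agreement(annotations, threshold)
    print(threshold, Counter(kept.values()))
# The class distribution of the resulting dataset shifts with the threshold.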
Documenting the data lifecycle. Documenting the data creation pipeline is therefore crucial and must bring together AI and target-domain expertise. Another cause of Data Cascades identified by Sambasivan et al. was that AI practitioners also had to define and find representative data [14]: “Cascades occurred because of a default assumption that datasets were reliable and representative, and application-domain experts were mostly approached only when models were not working as intended. Cascades from non-representative data from poor application-domain expertise manifested as model performance issues, resulting in re-doing data collection and labelling upon long-winded diagnoses.”
To prevent such detrimental, complex, and lasting issues, several works have proposed formalizations of the documentation of dataset development [22, 23, 24]. These can include details on the research purposes for which the dataset was created, the mode of collection and composition of the data, the exact annotation guidelines (in order to spot any under-specification, later key to analyzing model performance), the compensation and demographics of the annotators, and the testing requirements together with the tests performed or not performed. For example, Hutchinson et al. [23] build on software development practices to propose documentation templates for every stage of the dataset lifecycle, including requirements analysis, design, implementation, testing, and maintenance. Another template, called Data Cards, was proposed by Pushkarna et al. [24]. For the Open Images Dataset improved by Schumann et al. [25] and discussed in the previous post, the Data Cards template yields a five-page document.
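As a rough illustration of the kind of information such templates call for, here is a minimal, machine-readable sketch loosely inspired by datasheets [22] and Data Cards [24]; the fields and example values are a simplification of mine, not the official templates.

# A minimal, illustrative sketch of machine-readable dataset documentation,
# loosely inspired by datasheets [22] and Data Cards [24]. The fields below
# are a simplification, not the official templates.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetDocumentation:
    name: str
    intended_purpose: str                   # research question the data was created for
    collection_method: str                  # how and from whom samples were obtained
    consent_and_licensing: str              # legal basis, e.g. informed consent under GDPR
    annotation_guidelines: str              # exact instructions given to annotators
    annotator_compensation: Optional[str]   # how data workers were paid
    annotator_demographics: Optional[str]   # aggregate, non-identifying description
    known_limitations: List[str] = field(default_factory=list)
    tests_performed: List[str] = field(default_factory=list)
    tests_not_performed: List[str] = field(default_factory=list)


# Hypothetical example entry.
doc = DatasetDocumentation(
    name="example-image-dataset",
    intended_purpose="benchmark fine-grained object recognition",
    collection_method="web scraping of openly licensed images",
    consent_and_licensing="CC-BY images only; no personal data",
    annotation_guidelines="label the single most salient object per image",
    annotator_compensation="hourly rate above local minimum wage",
    annotator_demographics="not collected",
    known_limitations=["geographic skew towards North America and Europe"],
    tests_not_performed=["audit of label quality across image provenance"],
)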
3. Benchmark practices
In current ML practices, benchmarks are learning tasks defined by a dataset and evaluation metrics, and conceived to measure the progress of the field towards general-purpose capabilities such as object recognition, language understanding, etc. As critically examined by Raji et al. in AI and the Everything in the Whole Wide World Benchmark [26], such research claims of progress depend on how well the experimental setting (the extensional definition of the task) represents the task (in its intensional/data-independent definition). Several recent articles have shown how claims of general human-like capabilities of ML models were erroneous, exposing how the ML models instead converged to exploit shortcuts in datasets, e.g., in pneumonia detection from chest scans [9], in reading comprehension tasks, in Visual Question Answering (VQA) [27]. Two types of approaches have been considered by the ML research community to tackle these issues: (i) making datasets more difficult and representative of the intensional task, and (ii) making models more robust so that they do not rely on distribution biases present in the training set but not in the testing set.
Making datasets more difficult. For example, the datasets for VQA were shown to be particularly prone to biases, making several questions answerable with a high success probability without even looking at the image. To prevent ML models from using these shortcuts and to improve their generalization capabilities, a host of methods have considered dataset perturbation and the addition of counterfactual examples. However, spurious cues can still be present in perturbed versions of datasets. For example, on the VQA-CP dataset, designed to test models out of distribution (i.e., the test set has a different distribution of the patterns connecting the input image and question to the desired output answer), Teney et al. [28] exhibit major conceptual problems in several published models that improve over one another while relying on inappropriate assumptions, invalidating their claims of generalization. They issue recommendations for future models on how to build on the VQA-CP dataset, and for future datasets meant to better test model generalization, notably by including several metrics and multidimensional performance reports. Subsequent perturbations of datasets keep trying to set stronger benchmarks (e.g., [27, 29]). The wider framework is depicted in Fig. 3, taken from Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning by Liao et al. [6].
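The kind of shortcut mentioned at the start of this paragraph (answering without looking at the image) can be illustrated with a “blind” question-prior baseline, sketched below on hypothetical toy data rather than the actual VQA or VQA-CP pipelines: it never sees the image and simply returns the most frequent training answer for each question type, which already scores well whenever the test answers follow the same skewed distribution as the training answers.

# A minimal sketch of a "blind" question-prior baseline for VQA: it ignores the
# image entirely and answers with the most frequent training answer for each
# question prefix. The toy data below is hypothetical; it only illustrates why
# skewed answer distributions make such shortcuts effective.
from collections import Counter, defaultdict


def question_type(question: str) -> str:
    return " ".join(question.lower().split()[:2])   # crude type, e.g. "what color"


def fit_prior(train_pairs):
    answers_by_type = defaultdict(Counter)
    for question, answer in train_pairs:
        answers_by_type[question_type(question)][answer] += 1
    return {qt: counts.most_common(1)[0][0] for qt, counts in answers_by_type.items()}


train = [("What color is the banana?", "yellow"),
         ("What color is the ball?", "yellow"),
         ("Is there a dog?", "yes"),
         ("Is there a cat?", "yes")]
prior = fit_prior(train)

# The baseline never sees an image, yet scores well when the test answers
# follow the same skewed distribution as the training set.
test = [("What color is the lemon?", "yellow"), ("Is there a horse?", "yes")]
correct = sum(prior.get(question_type(q)) == a for q, a in test)
print(correct / len(test))  # 1.0 on this toy split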
Robustifying models. Formal approaches to bias also include bias mitigation methods, which can be divided into explicit methods, where the protected variables are known, and implicit methods, where the bias variables are not known at training time. These methods are briefly reviewed in [29]. In that article, Shrestha et al. assess the robustness of bias mitigation methods for vision models, showing their inability to maintain performance when multiple forms of bias are present and known at training time, and to transfer to unknown forms of bias at test time. The metrics used to enforce or assess resilience to bias rely on mathematical definitions of fairness. These quantitative definitions of fairness are widely diverse [30] and often mutually exclusive, and their choice in system deployment can be questioned. For example, AI products for pre-employment assessment enforce fairness with the 4/5 rule of thumb, whereby a group is considered not discriminated against if its selection rate (the ratio of selected candidates to applicants) is at least 80% of that of the other groups. This, however, does not necessarily translate into equivalent accuracy for all groups, nor into actual fairness when, e.g., the unprivileged group contains more qualified individuals [11]. Shrestha et al. conclude: “we implore the community to adopt more rigorous assessment of future bias mitigation methods.” [29].
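To see why such a threshold can be misleading, here is a minimal sketch of the 4/5 rule on hypothetical numbers; as discussed in [11], passing the rule says nothing about per-group accuracy or about the underlying qualification of candidates.

# A minimal sketch of the 4/5 ("80%") rule mentioned above: a group is flagged
# if its selection rate falls below 80% of the highest group selection rate.
# The numbers are hypothetical; passing this check says nothing about
# per-group accuracy, as noted in [11].
def selection_rates(selected_by_group, candidates_by_group):
    return {g: selected_by_group[g] / candidates_by_group[g] for g in candidates_by_group}


def passes_four_fifths_rule(rates):
    best = max(rates.values())
    return all(rate >= 0.8 * best for rate in rates.values())


rates = selection_rates(selected_by_group={"group_a": 30, "group_b": 25},
                        candidates_by_group={"group_a": 100, "group_b": 100})
print(rates)                           # {'group_a': 0.3, 'group_b': 0.25}
print(passes_four_fifths_rule(rates))  # True: 0.25 >= 0.8 * 0.30

# Hypothetical per-group accuracy of the same screening model: the 4/5 rule
# can hold even when the model is far less accurate for one group.
accuracy = {"group_a": 0.85, "group_b": 0.65}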
4. Conclusion
Despite data being the bedrock of ML advances, a host of recent articles in ML research warn that data challenges are still insufficiently considered [14, 4]. Dataset-related tasks must be prioritized and ML practices approached with a more cautious and complete view to “arrive at datasets that faithfully embody tasks targeting realistic capabilities and that acknowledge the humanity of those represented within the data, in addition to those participating in the process of its creation” [4]. Taking a step back, Raji et al. [26] argue that general-purpose capabilities “cannot be adequately embodied in data-defined benchmarks”. This is echoed by Yehuda et al. in their article It’s Not What Machines Can Learn, It’s What We Cannot Teach [31], where they investigate the conditions under which datasets can represent NP-hard problems so that ML models can be trained to solve them. They prove that any polynomial-time sample generator can only generate a biased dataset corresponding to easier sub-problems. This shows theoretically, in a specific context, the impossibility that can arise when representing an abstractly defined task (intensional definition) with a learning problem instance (extensional definition).
Therefore, the mathematical approaches presented above are instrumental but cannot relieve us of the need for a broader view connecting modeling and formalization to real-world impact, justice, and “infrastructural thinking,” as developed, e.g., by Whittaker et al. [32]:
“When framed as technical “fixes,” debiasing solutions rarely allow for questions about the appropriateness or efficacy of an AI system altogether, or for an interrogation of the institutional context into which the “fixed” AI system will ultimately be applied. For example, a “debiased” predictive algorithm that accurately forecasts where crime will occur, but that is being used by law enforcement to harass and oppress communities of color, is still an essentially unfair system. To this end, our definitions of “fairness” must expand to encompass the structural, historical, and political contexts in which an algorithmic system is deployed.”
In the next post, we will review challenges and possible avenues for AI education integrating these aspects of data ethics.
by Lucile Sassatelli, Full Professor in Computer Science at Université Côte d’Azur, Junior fellow of Institut Universitaire de France, Scientific Director of EFELIA Côte d’Azur
References
[1] Timnit Gebru and Emily Denton, “Tutorial on Fairness Accountability Transparency and Ethics in Computer Vision at CVPR 2020,” https://sites.google.com/view/fatecv-tutorial/home, 2020.
[2] Timnit Gebru and Emily Denton, “Beyond Fairness in Machine Learning at NeurIPS 2021,” https://neurips.cc/virtual/2021/tutorial/21889, 2021.
[3] Ruha Benjamin, Race After Technology: Abolitionist Tools for the New Jim Code, Wiley, 2019.
[4] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna, “Data and its (dis)contents: A survey of dataset development and use in machine learning research,” Patterns, vol. 2, no. 11, pp. 100336, Nov. 2021.
[5] David Schlangen, “Targeting the benchmark: On methodology in current natural language processing research,” in Proceedings of the 2021 ACL International Joint Conference on Natural Language Processing. Aug. 2021, pp. 670-674, ACL.
[6] Thomas Liao, Rohan Taori, Inioluwa Deborah Raji, and Ludwig Schmidt, “Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
[7] Blaise Aguera y Arcas, “Do algorithms reveal sexual orientation or just expose our stereotypes?,” https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477, Jan. 2018.
[8] Sarah Knapton, “Why you shouldn’t wear glasses to an interview with a robot,” https://www.telegraph.co.uk/news/2022/10/10/why-shouldnt-wear-glasses-interview-robot/, Oct. 2022.
[9] Jörn-Henrik Jacobsen, Robert Geirhos, and Claudio Michaelis, “Shortcuts: Neural networks love to cheat,” The Gradient, 2020.
[10] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine Bias: There’s software used across the country to predict future criminals. And it’s biased against blacks.,” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing, 2016.
[11] Manish Raghavan, Solon Barocas, Jon Kleinberg, and Karen Levy, “Mitigating bias in algorithmic hiring: Evaluating claims and practices,” in Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 2020, FAT* ’20, p. 469-481, Association for Computing Machinery.
[12] Arvind Narayanan, “How to recognize AI snake oil,” https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf, 2019.
[13] European Commission, Laying down harmonised rules on Artificial Intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, Publications Office, LU, 2021.
[14] Nithya Sambasivan, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M Aroyo, ““Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, May 2021, pp. 1-15, ACM.
[15] Emily Denton, Alex Hanna, Razvan Amironesei, Andrew Smart, and Hilary Nicole, “On the genealogy of machine learning datasets: A critical history of ImageNet,” Big Data & Society, vol. 8, no. 2, pp. 205395172110359, July 2021.
[16] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar, “Do ImageNet classifiers generalize to ImageNet?,” in International Conference on Machine Learning, 2019, pp. 5389-5400.
[17] Nitesh Goyal, Ian Kivlichan, Rachel Rosen, and Lucy Vasserman, “Is Your Toxicity My Toxicity? Exploring the Impact of Rater Identity on Toxicity Annotation,” arXiv:2205.00501 [cs], May 2022.
[18] Giovanni Da San Martino, Seunghak Yu, Alberto Barrón-Cedeño, Rostislav Petrov, and Preslav Nakov, “Fine-Grained Analysis of Propaganda in News Articles,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 5635-5645, Association for Computational Linguistics.
[19] Mattia Samory, Indira Sen, Julian Kohne, Fabian Flöck, and Claudia Wagner, ““Call me sexist, but…”: Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples,” in International AAAI Conference on Web and Social Media (ICWSM), 2021.
[20] Jessie D Kang, Sharon E Clarke, and Andreu F Costa, “Factors associated with missed and misinterpreted cases of pancreatic ductal adenocarcinoma,” Eur Radiol, vol. 31, no. 4, pp. 2422-2432, Sept. 2020.
[21] N. Umar, “How often is pancreatic cancer missed on CT or MRI imaging? A novel root cause analysis system to establish the most plausible explanation for post imaging pancreatic cancer,” presented at United European Gastroenterology Week, https://ueg.eu/a/307, 2022.
[22] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford, “Datasheets for datasets,” Commun. ACM, vol. 64, no. 12, pp. 86-92, Nov. 2021.
[23] Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, and Margaret Mitchell, “Towards accountability for machine learning datasets: Practices from software engineering and infrastructure,” in Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 2021, FAccT ’21, p. 560-575, Association for Computing Machinery.
[24] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson, “Data Cards: Purposeful and transparent dataset documentation for responsible AI,” in 2022 ACM Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 2022, FAccT ’22, p. 1776-1826, Association for Computing Machinery.
[25] Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru, “A Step Toward More Inclusive People Annotations for Fairness,” in Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, July 2021, pp. 916-925, arXiv:2105.02317 [cs].
[26] Inioluwa Deborah Raji, Emily M. Bender, Amandalynne Paullada, Emily Denton, and Alex Hanna, “AI and the Everything in the Whole Wide World Benchmark,” arXiv:2111.15366 [cs], Nov. 2021.
[27] Corentin Kervadec, Grigory Antipov, Moez Baccouche, and Christian Wolf, “Roses are Red, Violets are Blue… But Should VQA expect Them To?,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, June 2021, pp. 2775-2784, IEEE.
[28] Damien Teney, Ehsan Abbasnejad, Kushal Kafle, Robik Shrestha, Christopher Kanan, and Anton van den Hengel, “On the value of out-of-distribution testing: An example of Goodhart’s law,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, Eds. 2020, vol. 33, pp. 407-417, Curran Associates, Inc.
[29] Robik Shrestha, Kushal Kafle, and Christopher Kanan, “An investigation of critical issues in bias mitigation techniques,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2022, pp. 1943-1954.
[30] Ben Hutchinson and Margaret Mitchell, “50 years of test (un)fairness: Lessons for machine learning,” in Proceedings of the Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 2019, FAT* ’19, p. 49-58, Association for Computing Machinery.
[31] Gal Yehuda, Moshe Gabel, and Assaf Schuster, “It’s not what machines can learn, it’s what we cannot teach,” in Proceedings of the 37th International Conference on Machine Learning. 2020, ICML’20, JMLR.org.
[32] Meredith Whittaker, Kate Crawford, Roel Dobbe, Genevieve Fried, Elizabeth Kaziunas, Varoon Mathur, Sarah Myers West, Rashida Richardson, Jason Schultz, and Oscar Schwartz, “AI Now Report 2018,” Dec. 2018.