Academic freedom, academic integrity, and ethical review in NLP

Emily M. Bender
Jun 4, 2021


This blog post is written in response to a certain kind of pushback inspired by the recent development of ethical review processes at ACL conferences (as well as at other related conferences such as NeurIPS). The particular kind of pushback I’m addressing here imposes a very narrow frame on the processes and goals of ethical review and speaks in terms of ‘censorship’ or ‘academic freedom’. In this blog post, I explore the ways in which this frame is limiting and, ultimately, looks like power seeking to protect power.

In brief, academic freedom is the freedom to pursue research in ways that challenge power. And academic freedom comes with the responsibility of academic integrity. When a professional organization (such as the ACL) institutes practices of ethics review, this is an instance of the organization exercising its academic freedom to raise the standards of academic integrity within the field it represents.

Background

In this post, I take as a case study a series of papers in ACL venues over the past few years. The first is a paper from EMNLP 2019 proposing a specific task and approach to that task. This paper prompted a reflection/position paper at ACL 2020, which in turn prompted a response (from yet a third set of authors) to appear in Findings of ACL 2021. It is in this Findings paper that the ‘academic freedom’ frame is invoked. I ask readers to bear with me through this background, as I find it helps make the discussion concrete.

Chen et al published ‘Charge-Based Prison Term Prediction with Deep Gating Network’ at EMNLP 2019. They present a dataset for the task of ‘charge-based prison term prediction’ and an initial system for approaching the task. In their task, the input is a case description, specifically an “accusation by the procuratorate” (p.6363) and a corresponding set of charges extracted from that case description via regular expressions; the output is the prison terms, in months, associated with the charges in the input. The data are drawn from published records of the Supreme People’s Court of China (and thus presumably are in Mandarin, though this isn’t stated in the paper). The abstract of the paper begins “Judgment prediction for legal cases has attracted much research efforts for its practice use, of which the ultimate goal is prison term prediction,” (p.6362) though the only discussion of any use case for prison term prediction is in Section 5 (“ethical considerations”) which also begins with the assertion (p.6366):

Although the research on prison term prediction has considerable potential to improve efficiency and fairness in criminal justice, there are certain ethical concerns worth discussions.

Specifically, the use case appears to be providing an independent check in a phase of the proceedings where judgments are reviewed (p.6366):¹

In practice, we recommend deploying our system in the “Review Phase”, where other judges check the judgment result by a presiding judge. Our system can serve as one anonymous checker.
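To make the task setup concrete, here is a minimal, hypothetical sketch of what a single record in such a dataset might look like; the field names and values are my own illustration and are not drawn from Chen et al’s released data format.

```python
# Hypothetical illustration (not Chen et al's actual schema) of one record in a
# charge-based prison term prediction dataset: the input is the case description
# plus the charges extracted from it via regular expressions; the output is a
# prison term, in months, for each charge.
example_record = {
    "case_description": "...accusation text from the procuratorate...",
    "charges": ["theft", "intentional injury"],  # extracted with regular expressions
    "prison_terms_months": [12, 36],             # gold labels, one per charge
}
```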

Chen et al are not the first authors to work in this domain. Other relevant work includes: Xiao et al 2018 ‘CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction’, Zhong et al 2018 ‘Legal Judgment Prediction via Topological Learning’, and Hu et al 2018 ‘Few-Shot Charge Prediction with Discriminative Legal Attributes’.

Leins et al published ‘Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis?’, a reflection piece about ethics and NLP that uses Chen et al’s paper as a case study, at ACL 2020. They frame their paper as follows (p.2908):

The primary question we attempt to address in this paper is on what basis a given paper satisfies basic ethical requirements for publication, in addition to examining the related question of who should make this judgement.

They then go on to raise concerns about Chen et al 2019, specifically around data ethics and the use case of the technology. The data ethics concerns include privacy concerns for the data subjects (people whose cases are included in the dataset) and in particular the issue of what happens if the underlying data is updated (say a conviction is voided). In such cases, harm could be done to exonerated individuals who are still identified as having been convicted in the secondary dataset. Their use case concerns come under the heading “dual use”, but they write (p.2910):

All of that said, for the case under consideration it is not primarily a question of dual use or misuse, but rather its primary use: if the model were used to inform the Supreme Court, rather than automate decision-making, what weight should judges give the system? And what biases has the model learned which could lead to inequities in sentencing? It is arguable that decisions regarding human freedom, and even potentially life and death, require greater consideration than that afforded by an algorithm, that is, that they should not be used at all.

Despite the very provocative title they give to their paper, they don’t provide a specific recommendation regarding Chen et al or similar papers, but rather end with the recommendation that further questions be considered (p.2912):

What could an ethics assessment for ACL look like? Would an ethics statement for ACL be enough to address all concerns? As argued above, it is not clear that ACL should attempt to position itself as ethical gatekeeper, or has the resources to do so. And even if ACL could do so, and wanted to do so, the efficacy of ethics to answer complex political and societal challenges needs to be questioned (Mittelstadt, 2019).

… and that the ACL consider requiring documentation along the lines of data statements (Bender & Friedman 2018) or datasheets (Gebru et al 2018, updated 2020).

As Leins et al discuss, EMNLP 2020 was the first ACL conference to publish its call for papers after the ACL adopted the ACM’s Code of Ethics (on March 5, 2020). This is reflected in the EMNLP call for papers as follows:

NEW: Ethics Policy

Authors are required to honour the ethical code set out in the ACM Code of Ethics. The consideration of the ethical impact of our research, use of data, and potential applications of our work has always been an important consideration, and as artificial intelligence is becoming more mainstream, these issues are increasingly pertinent. We ask that all authors read the code, and ensure that their work is conformant to this code. Where a paper may raise ethical issues, we ask that you include in the paper an explicit discussion of these issues, which will be taken into account in the review process. We reserve the right to reject papers on ethical grounds, where the authors are judged to have operated counter to the code of ethics, or have inadequately addressed legitimate ethical concerns with their work

And EMNLP was the first ACL conference to have an ethics committee, led by Dirk Hovy and Karën Fort. (I served as a member of that committee.) NAACL 2021 (ethics chairs: Karën Fort and myself), ACL 2021 (ethics chairs: Malvina Nissim, Min-Yen Kan and Xanda Schofield), and EMNLP 2021 (ethics chairs: Margot Mieskes and Christopher Potts) followed suit. (EACL 2021 didn’t include anything about the code of ethics in their call for papers, nor, to my knowledge, convene an ethics committee.) For NAACL 2021, we had enough lead time to provide guidance to authors ahead of time, and ACL 2021 has done similarly. Some reflections on the process for NAACL can be found here.

Tsarapatsanis & Aletras 2021 (to appear in Findings of ACL 2021) ‘On the Ethical Limits of Natural Language Processing on Legal Text’ is a response to Leins et al 2020. Tsarapatsanis & Aletras argue that ‘academic freedom’ should be considered as a value in any decisions, and balanced against e.g. privacy considerations of data subjects. They further argue that the diversity of value systems represented within the global NLP community means that for any particular issue (though they use privacy as an example), the community should default to the most permissive position. Finally, they describe what they call the “threat of moralism” in NLP. They define “moralism” as follows (preprint p.7–8):

Moralism can be intuitively understood as ‘the vice of overdoing morality’ (Coady, 2015). In the context of qualitative research, Hammersley and Traianou (2011) has contended that moralism can take two different forms. First, it might involve the belief that substantive ethical values, other than the disinterested pursuit of knowledge for its own sake, should be integral goals of research. Second, it might involve the requirement that researchers adhere to ‘high’ or even the ‘highest possible’ ethical standards (Hammersley and Traianou, 2011).

Regarding Leins et al’s specific case study of Chen et al 2019, they set the issue up as “decisions about the ethics of legal NLP research” (p.5) or “the ethical acceptability of the research practices” (p.7), and argue that in the specific case of Chen et al 2019, the particular concerns (data ethics, primary use case/dual use) that Leins et al bring up are not significant enough to outweigh the value of ‘academic freedom’. In particular, regarding privacy, Tsarapatsanis & Aletras argue that because the Supreme Court of China has already published the data, any ethical breach regarding privacy accrues to the Court and not to the researchers who collect and distribute the dataset. Furthermore, they argue that transparency of judicial decisions can be seen as a virtue, “so as to control through public scrutiny the exercise of state coercion on individuals” (preprint p.7). But this is a naïve, binary view of privacy, on which the data is either public or it isn’t; it doesn’t take into consideration nuances such as ease of search (should someone turn up court records when just searching for a person’s name in another context?) and, very importantly, changes to the dataset over time. As noted above, what happens if the Supreme Court updates its records, but that information isn’t propagated to the NLP dataset? (See also Peng 2020.)

Tsarapatsanis & Aletras also dismiss the likelihood of harm to defendants from the creation and distribution of this dataset: “the probability of these cases reopening together with that of some judge being specifically influenced by the dataset if such reopening occurs is practically non-existent” (preprint, p.4). This strikes me as an extremely narrow understanding of the possible harms. Surely risks to an individual aside from risks of legal action exist! (For example, take the case of someone who has been convicted, served their sentence, and has reentered society. Should web searches on their name perpetually turn up links to their prior conviction? Should NLP researchers be the ones to determine that? My answers are, respectively: probably not and definitely not.)

Regarding the dual use/primary use case concerns, Tsarapatsanis & Aletras contend that (preprint p.4):

Leins et al. (2020) overestimate the dangers of an algorithm designed by academics being used to decide real cases with adverse consequences for real people. In particular, Leins et al. (2020) provide no reason to worry that any such use might happen anytime soon, nor evidence that there is, for example, a serious standing intention on the part of Chinese authorities to implement what would amount to a radical reform of the judicial system.

Claiming that no one is likely to actually deploy the results of a study in the real world strikes me as a flimsy basis on which to defend ethically problematic work. Furthermore, given all of the marketing around ‘AI’ ‘solutions’ we are presently seeing, all of the venture capital (and other funding) being thrown at them, not to mention OpenAI’s licensing GPT-3 to Microsoft and Google’s announced intention to replace search with conversational systems, I’d say it would behoove us not to underestimate the likelihood of any given ‘AI’ ‘solution’ being put into practice.

Having thus minimized any possible harms coming from the Chen et al study, Tsarapatsanis & Aletras hold them up against the value of ‘academic freedom’ and find them wanting. They recommend (preprint p.8):

The primary moral duty of legal NLP researchers, like all researchers, is to the disinterested pursuit of truth as they understand it, and not to substantive ends which are extrinsic to that pursuit.

Tsarapatsanis & Aletras are not alone in taking this defensive view of discussions around ethical considerations in NLP (for legal applications or otherwise). As just one additional example, there is the case of the 2020 GermEval Shared Task 1, initially presented as “Prediction of Intellectual Ability and Personality Traits from Text” but reframed (without substantive change to the task) as “Classification and Regression of Cognitive and Motivational style from Text” after criticism from me and others.

The SwissText/KONVENS conference (host of the GermEval shared tasks) organized a panel on the topic of the ethics of this task, kicked off with a presentation by Michele Loi entitled “How can knowledge be harmful?” This presentation also took the view that asking questions like “Can we predict intellectual ability (i.e. IQ scores) from short texts?” is merely the pursuit of knowledge, and thus the only harms we need to consider are the potential harms following from the possession of such knowledge. Taking that view — that any question a researcher might ask is necessarily a well-formed pursuit of knowledge — naturally leads to framings in terms of academic freedom.

Academic freedom

The American Association of University Professors, founded in 1915 with the mission inter alia of advancing academic freedom, provides the following definition of academic freedom:

1. Teachers are entitled to full freedom in research and in the publication of the results, subject to the adequate performance of their other academic duties; but research for pecuniary return should be based upon an understanding with the authorities of the institution.

2. Teachers are entitled to freedom in the classroom in discussing their subject, but they should be careful not to introduce into their teaching controversial matter which has no relation to their subject. Limitations of academic freedom because of religious or other aims of the institution should be clearly stated in writing at the time of the appointment.

3. College and university teachers are citizens, members of a learned profession, and officers of an educational institution. When they speak or write as citizens, they should be free from institutional censorship or discipline, but their special position in the community imposes special obligations. As scholars and educational officers, they should remember that the public may judge their profession and their institution by their utterances. Hence they should at all times be accurate, should exercise appropriate restraint, should show respect for the opinions of others, and should make every effort to indicate that they are not speaking for the institution.

Notably, this definition takes the point of view of academic freedom being located in the relationship between scholars and their employers. One recent high-profile case of lack of academic freedom is the politically appointed Board of Trustees of the University of North Carolina refusing to grant a tenured appointment to Pulitzer Prize and MacArthur Genius Award recipient Nikole Hannah-Jones, despite the recommendations of faculty at the university and the fact that previous professors hired into the same position (Knight Chair of Journalism) were hired with tenure. Another concerns the attempts by Google to suppress the publication of the paper ‘On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜’ (disclosure: I’m a co-author) and its subsequent firing of two of the paper’s authors (Dr. Timnit Gebru and Dr. Margaret Mitchell). Corporate researchers don’t have the same expectations of academic freedom as higher education faculty, but the latter case points up the threats to academic freedom that arise when research resources (funds, people, data) are concentrated away from the academy. Notable in all of these cases is that the scholars who were denied academic freedom were engaged in speaking out (through their scholarship and otherwise) against injustice.

I take academic freedom to concern the relationship between scholars and their employers (and governments): scholars should be able to pursue scholarship without fear of reprisal from their employer or their government. This does not entail that any given venue is obligated to publish whatever is submitted to it. On the contrary: publishing venues make editorial decisions all the time, and furthermore their prerogative in doing so is, in itself, an exercise of academic freedom.

In addition, being a scholar doesn’t mean that academic freedom entails the freedom to do whatever one likes, with no regard for the rights of others. Inherent in the AAUP definition above is a callout to the responsibilities that come with the role of scholar.

In this connection, I find Dr. Andrew Perfors’ recent blog post on academic freedom and academic integrity to be a very clear statement. Perfors takes up the case study of fellow U Melbourne faculty member Dr. Holly Lawford-Smith running a website eliciting anti-trans narratives in the form of personal stories of “negative interactions with trans women in women’s spaces”. These crowd-sourced stories are then cherry-picked with the goal (per Lawford-Smith’s remarks) of influencing policy to limit the rights of trans people. Lawford-Smith goes out of her way to say that this isn’t research (while leveraging the University’s name in the website), and Perfors points out that therefore it is not protected by academic freedom. Perfors also provides a detailed analysis of the flaws in the project (were it actually research) and the harms that it can do.

Those flaws and harms are the reflection of a lack of academic integrity. Perfors writes:

With freedom comes responsibility, as they say, and as an academic I take my responsibilities seriously. Those responsibilities include things like: (a) doing my best to tell the truth, the whole truth, and nothing but the truth; (b) approaching my research with an open mind and designing studies in such a way that the data could actually change my mind; and (c) taking seriously the duty of care I have as a teacher and mentor to my students and mentees.

In a field presently awash in corporate money (and the national research funding that seems to follow corporate money), in the public eye, and doing research that has the potential to be quickly commercialized, what responsibilities do our professional organizations have around academic integrity? If we were a field concerned solely with academic pursuits, where academic here means ‘without societal impact’, it might be enough to say that our professional organizations (and publishing venues) should only be concerned with scientific rigour, or, as Tsarapatsanis & Aletras put it, “the disinterested pursuit of knowledge for its own sake” (preprint p.8). But if computational linguistics/NLP (let alone legal NLP) were ever a solely academic pursuit, it surely isn’t now. I believe that it is for this reason that the ACL adopted a code of ethics and that the ACL conferences have instituted ethics review. In what follows, I’ll outline my take on what those review processes are for.

Taxonomies

A key point of ethics review, and one that I think is lost in Tsarapatsanis & Aletras 2021 (and to a lesser extent in Leins et al 2020), is that it is not a binary judgment of ethical/unethical, but rather a much more nuanced discussion, along at least two dimensions.

First, a piece of research might raise ethical concerns in several ways:

In some cases, the research question being pursued is ill-formed, and posing it (and especially appearing to have answered it) can do harm. Clear examples here include the recent spate of work that amounts to digital physiognomy (see Aguera y Arcas et al 2017), claiming to be able to predict such things as sexual orientation, criminality, or truthfulness from photos or videos of people. The same can be said for modern ‘race science’, in which researchers attempt to correlate IQ scores with population groups (e.g. Clark et al 2020 [now retracted], Lynn and Vanhanen 2002; for critique see Barnett and Williams 2004). Claiming to be able to classify people’s gender based on their appearance is similarly problematic (Scheuerman et al 2019). As Scheuerman et al write (p.19, emphasis added):

Gender can be understood through a multitude of perspectives: a subjectively held self-identity [103], a self-presentation to others [41], a social construct defined and maintained through performative acts [21], and a demographic imposed by society [40, 95]. In the context of computer vision, we have shown how the design and use of facial analysis and image labeling systems collapse these perspectives into a singular worldview: presentation equals gender. […] Self-identity is not used by computer vision systems. After all, it cannot be seen.

Some might say: we’re just pursuing questions. What’s the harm in asking questions? The harm lies in the fact that the questions presuppose harmful assertions and thus lend legitimacy to ideas that are used to oppress people.

While the cases listed here seem pretty straightforward now, I don’t believe I or anyone can identify all such problems a priori. That is, I can imagine framing research questions that involve such harmful presuppositions.² In that case, I’d very much prefer to be alerted to it, and this is indeed one of the functions of ethical review. Sometimes, the best course of action is to abandon the ill-formed research question. Other times, the heart of the question can be maintained while reframing it: for example work looking at underrepresentation of some genders in NLP publishing can be reframed away from binarizing ‘women v. men’ to a more inclusive approach.

In other cases, independently of whether the research questions being pursued are legitimate, the methods being used to pursue them do harm: clear cases here include the atrocities committed in the name of medical science that led to the Nuremberg Code (and eventually the Belmont Report; see Metcalf 2014) as well as non-medical experimentation that violates individuals’ privacy and autonomy. Also in this category falls exploitation of crowd workers (see e.g. Fort et al 2011). Again, however, there are plenty of cases which are less clear: for example, under what conditions is it just or appropriate to have annotators work on hate speech, especially given that the people best positioned to understand what speech is hateful are the very targets of it? (In this connection, I found particularly enlightening the exchange between Jonsen and Charo in The Birth of the Belmont Report about medical research involving children, who by definition cannot consent, and the importance of continuing to grapple with principles of ethical research.)

Finally, there are cases where the research question is well-formed and the research methods are ethical, but systems built as a result of the research can either be used for harmful ends (dual use problems as discussed in Hovy and Spruit 2016, or simply problematic primary use cases) or learn, transmit, and amplify biases, leading to allocational and/or representational harms (Crawford 2017). For just a short selection of examples, consider the ad delivery system that associates ad copy suggesting criminal records preferentially with Black names, documented by Sweeney 2013, the hypersexualized search results for “Black girls” documented by Noble 2018,³ and the poorer performance of speech recognition systems on African American Language, as documented by Wassink 2021 and Koenecke et al 2020.

The second dimension to consider is the set of possible remedies we have available. That is, what can NLP as a field do to better understand and mitigate the potential harms that might come from pursuing research in our field? Fortunately, there is a lot we can do:

  • We can engage in research exploring the potential for harm and documenting the problems that have already arisen or might arise. A lot of the work cited above (e.g. Sweeney 2013, Noble 2018) falls into this category. See also Benjamin 2019, Blodgett et al 2020 (and the extensive literature on bias in NLP reviewed therein), Bender, Gebru et al 2021, and much else. This literature is large and growing!
  • We can also develop field-wide standards around data collection practices, fair payment of crowd-workers, etc. These typically begin as proposals from specific groups (e.g. Fort et al 2011, Shmueli et al 2021) but may eventually get codified into reviewing standards.
  • We can hone the practice of writing limitations/broader impacts/ethical considerations sections (see, e.g. the guidance for NeurIPS 2021 as well as Sim et al 2021 and Nanayakkara et al 2021). This practice can help with dual use issues, through informing policy, and also can mitigate the tendency to overclaim (which leads to harmful applications of ‘AI snake oil’).
  • Another key step is the development of standards and expectations for dataset and model documentation, as in the proposals from Bender and Friedman 2018, Mitchell et al 2019, and Gebru et al 2020.
  • Finally, review processes have a place in all of these interventions: Work on exploring the potential harms of NLP gets reviewed just like all other research; standards around data collection practices etc. become criteria for review; ethical considerations sections are reviewed; and reviewing standards might also come to require some degree of documentation of datasets presented or used in research. If, through review, a paper is rejected because the questions it asks are ill-formed and thus unethical, that is in fact an instance of it getting rejected on scientific grounds (as well). With robust review practices in place, papers may be rejected for unethical research methods or unethical primary use case or dual use issues. This is likely a rare occurrence and really the smallest impact of ethical review practices. It is also, as discussed above, the prerogative of publishing venues as part of their academic freedom.

Goals and outcomes of ethics review

ACL ethics review, as apparently imagined in Tsarapatsanis & Aletras 2021, is only an exercise in filtering out papers deemed ‘unethical’. Furthermore, Tsarapatsanis & Aletras point to differences across cultures and communities in legal and ethical norms (and argue on that basis for always picking the most permissive stance among those represented), implying that the ACL’s ethics review process is imposing the norms from one culture on others.

On both counts, this is a misrepresentation of what has been happening in ACL ethics review. As Karën Fort and I write in our ethics review process report back for NAACL 2021:

In recruiting researchers to join the NAACL 2021 Ethics Committee, we put a particular emphasis on diversity. On the one hand, we wanted to ensure that potential societal impacts of work published at NAACL were considered from multiple different cultural perspectives. On the other hand, we also wanted to make sure that we weren’t treating this additional service work as solely the job of minoritized people in our field. […] We began with people we already knew directly and asked for further recommendations, especially from world regions from which we had not yet managed to recruit. (Everyone is busy; in many cases our original contacts couldn’t join the committee but were able to send us additional names.) Because ethics/societal impact is a relatively new area within NLP but also growing in adjacent fields, we called on researchers who look at ethics/societal impact and AI a bit more broadly. Our final committee included 38 reviewers (plus the two chairs), representing 22 countries, with 15 members with affiliations in Europe, 10 in US+Canada, 6 in Asia, 5 in Latin America, and 4 in Africa.

Furthermore, we understood the point of convening an ethics review committee to be about raising the overall quality of practice around ethics in our field, both in terms of the papers accepted to NAACL and in terms of raising awareness of known issues and how to engage with these kinds of questions. To that end, we provided guidance to authors ahead of the submission deadline, provided guidance to reviewers, and ensured that all papers flagged for ethics review received feedback from the ethics committee. Finally, we strove for transparency, publishing the full list of committee members, providing guidance and information to authors beforehand, and a detailed reflection on the process afterwards.

Thus, far from using the norms of one culture to block or filter papers (as one might imagine, reading Tsarapatsanis & Aletras 2021), we instead saw our task as one of building towards norms of the ACL community, with input from diverse points of view.

Power protecting power

Finally, I’d like to return to the point, previewed at the start of this post, about power protecting power. Tsarapatsanis & Aletras provide this definition of ‘moralism’, which they present as a bad thing (preprint p.8):

First, it might involve the belief that substantive ethical values, other than the disinterested pursuit of knowledge for its own sake, should be integral goals of research

and they urge us to weigh harm to researchers (curtailing their ‘academic freedom’) against harms to data subjects and other individuals. I would argue that the position that scholarship is best understood as solely the “disinterested pursuit of knowledge for its own sake”, without any grounding in community, in duties of care, in struggle or liberation, is an attempt to restrict the academy (and whatever power it wields) to those who have the privilege of positioning their viewpoint as ‘objective’, ‘scientific’ and ‘apolitical’ (Stitzlein 2004, Gebru 2020, Charity Hudley et al 2020, Birhane 2021). Academic freedom is the freedom to engage in scholarship which challenges power structures, both institutional and political/cultural. When the right to pursue whatever questions catch one’s fancy is held up as a core value as weighty as (or weightier than) the rights of data subjects and others whom research might harm, this is not a defense of academic freedom; it’s power protecting power.

The opposite of power protecting power is decentralization of power — democratization in the sense that Divya Siddarth articulates (in this recent Radical AI podcast episode, and surely also elsewhere), that is, a sharing of power coupled with the production of systems of governance and accountability. As Karën Fort pointed out (in response to a draft of this post), academic freedom is not a pre-existing status quo (putatively disrupted by the institution of ethics review processes), but rather something that we are building towards, within existing systems of oppression as well as existing attempts towards more equitable power sharing. What that process of striving for academic freedom looks like is surely dependent on the different academic, political, and economic circumstances we find ourselves in (both individually and in the aggregate across an institution like the ACL). This post has been written from my perspective as tenured faculty at a US institution, and I am eager to learn more about this space from others situated differently, with the goal of striving towards better and better processes within the ACL (and other international organizations).

Endnotes

Sorry, I just can’t write without footnotes/endnotes! I tried to keep things to a minimum, but here they are:

[fn1] If by ‘anonymous’ they intend that the judges involved in checking the rendered judgments are unaware that the automated checks are automated, this strikes me as particularly vexed. Automated systems mimicking humans is a ‘bright line’ in AI ethics (see also Bender, Gebru et al 2021, section 8). At the same time, any introduction of an automated system needs to contend with automation bias (Skitka et al 2000), i.e. the tendency of humans to perceive machine outputs as more ‘objective’ and therefore more ‘correct’. My own guess is that framing the task as prediction rather than comparison is probably a mistake. It may be helpful to judges to have a system which can provide information along the lines of “the sentences rendered here are shorter/longer than the average over previous cases where defendants have been convicted of similar charges.” But even then, the effect of such an intervention on the treatment (and human rights) of defendants should be studied carefully. Kristian Lum’s recent tweets are particularly apropos here.
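To make the ‘comparison rather than prediction’ framing concrete, here is a minimal, hypothetical sketch of the kind of information such a tool might surface; the data and function here are invented purely for illustration and don’t correspond to any existing system.

```python
from statistics import mean

# Invented historical data: observed prison terms (in months) for prior cases
# involving a given charge. For illustration only.
historical_terms = {"theft": [6, 10, 12, 18, 24]}

def compare_to_history(charge: str, proposed_term_months: int) -> str:
    """Report how a proposed sentence compares to the historical average for the
    same charge, rather than 'predicting' what the sentence should be."""
    avg = mean(historical_terms[charge])
    if proposed_term_months > avg:
        relation = "longer than"
    elif proposed_term_months < avg:
        relation = "shorter than"
    else:
        relation = "equal to"
    return (f"The proposed term of {proposed_term_months} months for '{charge}' is "
            f"{relation} the average of {avg:.1f} months over prior cases with this charge.")

print(compare_to_history("theft", 20))
```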

[fn2] In this connection, it’s tempting to forgive certain questions as ‘a product of their time’ and certain researchers for not knowing that e.g. race is a social construct and not a biological fact. However, that move quickly erases the fact that the social construct of race (as well as race science) itself was built to support chattel slavery, colonialism and other forms of oppression — and that there were plenty of people all along who understood the full humanity of racialized people.

[fn3] We have seen over the past few years a pattern where Google fixes certain search results in the wake of public outcry, but the underlying issues are not addressed. Even if the English query “Black girls” no longer returns hypersexualized results, new issues keep coming up, like this one, documented by Hank Green in Jan 2021, and subsequently fixed:

[Tweet of screencaps showing differential answer box results for the queries “when did people come to america” (a snippet about the Jamestown colony from the webpage www.americaslibrary.gov) and “when did humans come to america” (33,000 years ago, plus a snippet from the English Wikipedia page on Settlement of the Americas)]

Playing around with the suggested searches for strings like “Why are Black women”, “Why are Black men”, “Why are women”, “Why are men”, etc. shows that the suppression of the suggested search feature in cases where it’s found to be harmful is very much ad hoc.

Acknowledgments

In writing this post, I have benefitted from thoughtful input from Chris Potts, Karën Fort, Amandalynne Paullada, Meg Mitchell, Timnit Gebru, and Andrew Perfors.


Emily M. Bender

Professor, Linguistics, University of Washington / Faculty Director, Professional MS Program in Computational Linguistics (CLMS) / faculty.washington.edu/ebender