On Academic Freedom and Ethics Review: Continuing the Conversation

Emily M. Bender
Jun 24, 2021


This blog post is a response to a response to a paper. In particular, Dimitrios Tsarapatsanis and Nikolaos Aletras have published this response to my blog post reacting to their paper (to appear in Findings of ACL 2021). In the spirit of continuing the discussion, I want to reply to some points in their blog post. In the interests of getting this out quickly, this will primarily be a point by point response. Readers entering the discussion from this post are encouraged to read the two previous posts in full first.

In brief, my main points here are:

  1. The ‘legal decision’ metaphor prominent in Tsarapatsanis and Aletras (henceforth DT&NA)’s writing (both the paper and the blog post) is an overly restrictive and unhelpful lens with which to consider ethics review. By ‘legal decision metaphor’ I mean such concepts as “allocation of burden of proof”, “decisions”, “deliberation” and even the notion that what ethics committees are primarily doing is “ethically evaluating research”.
  2. Even though “burden of proof” is an unhelpful metaphor, the question of who should be doing the work of exploring the potential societal consequences of work is important. I argue that it belongs with researchers proposing and carrying out work, at least in the first instance, and that part of the purpose of ethical review is to check for (and provide feedback on) this aspect of research. This represents a culture shift in our field, which is why the NAACL 2021 ethics chairs (Karën Fort and myself) took a pedagogical approach.
  3. Ethics review isn’t primarily about determining whether a particular piece of research is ethical (enough) or not. (It is also emphatically not about judging whether researchers are “ethical”.) Bringing ethical considerations into research review (at the publication step or earlier) is not about achieving or requiring some imaginary ideal, but rather about an on-going conversation, in the review process and the published literature, regarding how our research impacts the world.

With that overview, here are a few pointwise responses:

[DT&NA] Second, none of the parameters we zone in should be understood as ‘calling the shots’ in any absolute kind of way. On the contrary, they are merely pro tanto reasons, i.e., factors that should be considered in ethical evaluation along with all other applicable reasons and always within specific contexts. So, contrary to what Prof. Bender claims, we do not hold ethical judgments to be ‘binary’. Indeed, ethical judgments are much more complex than that.

I do not believe that DT&NA hold ethical judgments to be simple. Rather, my point is that the ethics review process, as practiced so far by ACL (and certainly for NAACL 2021, which is the instantiation I was most closely involved with) is not only or even primarily about the (necessarily) binary judgment of whether or not papers should be accepted. For details, see my previous post.

[DT&NA] Moreover, ‘mad science’ cases can get trivially caught at the academic institutional level by using Institutional Review Boards.

This is frequently asserted, and yet, what we see, not only at ACL venues but also across the venues where machine learning research is published, is that even among the authors at institutions with IRBs or equivalent, very few engage them. For details, see this paper by Santy et al, to appear in Findings of ACL 2021. Furthermore, even when IRBs are both available and consulted, many are primarily focused on protecting human subjects, relatively narrowly defined. For example, the Human Subjects Division at my institution (the University of Washington) uses this definition:

Human subject: A living individual about whom a researcher obtains (1) data through intervention or interaction with the individual or (2) identifiable private information. See the WORKSHEET Human Subjects Research Determination for definitions of the bolded words.

By the above definition, much NLP research (including work developing new datasets via web-scraping or using existing datasets) is probably not understood as involving human subjects. This doesn’t mean it is without risk, just that it’s not in the purview of the IRB.

[DT&NA] Second, and more substantively, our recommendation sets out a rule about the allocation of the burden of proof. According to that rule, when someone claims that some research project could, for example, have harmful effects, the burden of proof lies with the person making the claim. Crucially, successfully meeting the burden requires more than just armchair speculation about potential harmful effects. The evidence that such harms could occur must be real and specific. Thus, we believe: (a) that it should be the burden of critics of a research project to present always specific as opposed to generic evidence about potential risks; (b) that at least risks that are known to be speculative should not be considered.

Given that in most cases we are talking about research towards technology that has not yet been deployed, I don’t see how condition (b) could ever actually be met. But more generally, as I noted above, I think that “burden of proof” is not a good metaphor here. A bit further down, DT&NA write:

What is needed to meet our proposed burden-of-proof rule in such cases is detailed and serious engagement with the pertinent technological and social context.

I argue, on the contrary, that ethics review isn’t about reviewers providing that detailed and serious engagement but rather about checking whether the authors have done so. The expected result of such detailed and serious engagement isn’t a “proof” of anything (let alone a proof of lack of risk), but rather a thoughtful exploration of the limitations of the technology and factors that anyone developing it further, procuring it, or deciding whether and how to regulate it should be aware of.

In this context, I note that DT&NA pose a question that they say Leins et al should have answered (as part of their critique of Chen et al) which was actually already answered in Chen et al 2019 (emphasis added):

[DT&NA] In the specific example under discussion, this involves answering questions such as the following: what is the real possibility that convicted people could be identified by a dataset that does not conform to (mostly Western European) data privacy norms? Is the dataset generally available to the public? If not, is it available specifically to Chinese authorities? Under which conditions? If one’s goal is the identification of a specific individual, is it not simpler to just retrieve the cases from the public domain (i.e., from the Chinese Supreme Court itself)? If so, does the dataset exacerbate that risk? Given our recommendation about the allocation of the burden of proof, these are the real questions to answer.

Yes, the dataset is generally available to the public, on GitHub, per footnote 1 of Chen et al: “The dataset can be found at https://github.com/huajiechen/CPTP”. This obviates the following two questions. As to the other questions: Whether or not it is simpler to retrieve the cases from the Chinese Supreme Court itself, the existence of this separate copy of the data raises problems. If there aren’t any mechanisms for keeping it up to date with the underlying dataset (should a case be voided, for example, or some or all of the dataset re-sealed), then the existence of the separate copy opens up the risks of mis/disinformation and illicit access.

DT&NA end with an interesting discussion of competing notions of academic freedom. However, this discussion ends up misrepresenting my position:

[DT&NA] Now, a conception of academic freedom such as the one put forth by Prof. Bender in her post (‘the freedom to pursue research in ways that challenge power’) is perfectionist. It stipulates an ideal (challenging power) and measures the ethical appropriateness of research against that ideal.

Crucially, I’m not saying that research has to challenge power in order to be ethical. Rather, I’m saying that the purpose of academic freedom is to protect research that challenges power. Likewise, there is no notion of ‘ideally ethical’ research and that is not what is being asked for.

[DT&NA] Researchers must of course abide by scientific norms, including methodological norms; their work is constantly assessed by their peers. An important feature of scientific norms, though, is that they are content-independent: they are not to do with choosing any specific research topic nor, a fortiori, a specific interpretation of the value of scientific research.

The norms that I and others are proposing, through the ethics review processes at ACL and other venues, are content-independent. They include things such as the notion spelled out above, that papers proposing new technology should include thoughtful and thorough discussion of the potential effects of that technology when deployed in the world and that research questions should not presuppose unproven (or false) propositions known to be harmful. These hardly seem any more content-dependent than norms such as statistical models should be appropriate to the kind of experiment being run or relevant prior work should be acknowledged, both of which require engaging with the specific content of a paper in order to evaluate.

DT&NA locate their approach as deontological and explain:

[DT&NA] One important aspect of deontological approaches is thus that they are anti-perfectionist: they leave it to free persons to choose their ends and preclude interference by other persons or institutions if certain constraints are satisfied, irrespective of the ethical optimality of actions.

But, again, I don’t think that anyone is asking for “ethical optimality”, just that the review criteria should also include questions such as those spelled out just above. In fact, I think that this is yet another example of how the term ‘ethics’ actually muddies the waters here. The point isn’t to engage in interminable debates about how to judge research (let alone researchers) but rather to connect the research we do, so frequently motivated in terms of real-world applications, to considerations of what will happen if it is so deployed — and take some responsibility for those potential outcomes, both as individual researchers and as a research community.

With thanks to Leon Derczynski for helpful comments as I drafted this.



Emily M. Bender

Professor, Linguistics, University of Washington // Faculty Director, Professional MS Program in Computational Linguistics (CLMS) faculty.washington.edu/ebender