Is there research that shouldn’t be done? Is there research that shouldn’t be encouraged?
[Disclaimer: This is not a scholarly piece of writing, but rather a series of thoughts set down in a hurry in the context of an ongoing discussion, as a Medium post rather than a tweet thread. A scholarly version of this would include far more citations, both to previous authors who have considered similar questions and to studies supporting the factual assertions.]
On Dec 4, 2019, the organizers of GERMEVAL 2020 Task 1 announced their task on the corpora mailing list (full post here). The task is entitled “Prediction of Intellectual Ability and Personality Traits from Text” and consists of two subtasks, based on data from college applications to NORDAKADEMIE, a private university in Germany. The first subtask involves ranking individuals according to “Intellectual Ability” based on their textual answers to a test administered as part of the college application process. Systems’ rankings are compared to rankings derived from IQ test scores and high school grades. The second subtask involves reproducing “operant motive” labels provided by psychologists based on applicant responses to a test. These motives are said to “allow psychologists to predict behavior, long-term development, and subsequent success.”
Jacob Eisenstein replied to this announcement cautioning, “As a community, we should think carefully about whether it is appropriate to work with IQ test results as data, and what the applications of this research might be.” He continues:
In the United States, there is considerable evidence that IQ tests are racially biased. In the past, courts have excluded IQ tests from educational placement in California for precisely this reason. I wonder if there is research on this topic in the German context.
It is not difficult to imagine that the outcome of this shared task would be a set of technologies that encode spurious correlations between estimates of intelligence and the linguistic features of specific racial groups. If such a system were trained on data that already contains biases, there is a risk that this bias would be not only entrenched but amplified. And even if the IQ test statistics are not themselves biased, an NLP system that predicts IQ from text could introduce bias, if there is an unmeasured confound that is statistically associated with both IQ and race.
I hope that these issues will receive serious consideration from the organizers and participants in the task.
At this point, I joined the conversation with the following post:
Thank you, Jacob, for this reply. This task seems irresponsible/poorly conceived to me. Before designing such a task, I think it is imperative to consider its use cases: When and why would we want to predict IQ scores or high school grades from text? Given the high potential for any such system to learn preexisting biases (themselves the result of structural discrimination in society), what are the likely impacts, especially on already marginalized populations?
I also took my concerns to Twitter:
(The discussion on the corpora mailing list and on Twitter has continued, and can largely be accessed via the links quoted above and the #GermEval2020 hashtag. I particularly encourage readers to follow the corpora mailing list responses by people who can speak to what it’s like to live in European countries as members of racialized groups.)
My tweet thread above ends with a call to rescind the shared task. I recognize that doing so carries a cost to the organizers, who have by this point put in considerable time and effort and, should the task be pulled, won’t see the academic credit that would otherwise follow. Still, I believe that promoting this activity as a shared task does significant harm in the world, to the extent that it clearly outweighs the cost to the organizers of pulling the task.
Task Organizers’ Response
The shared task organizers listened to the range of criticism and produced a thoughtful response on Dec 5, acknowledging that IQ tests and psychometric tests are “questionable instruments” and that “ethics, possible biases, the consequences and the purposes of this task” should be addressed when working on such things.
They go on to justify the shared task on the basis that such work is being done by private entities and so it is better done in the light of day:
As a matter of fact, NLP is already used in assessment centers like the one we obtained the data from, but it is used in a crude and unscientific way.
We as a community should be investigating how to use our NLP expertise to understand the assessment problem and improve the current assessment models, which would include both technical and ethical work. The scientific value of this line of research would lead to avoiding more harm than to cause harm. Just not doing stuff because it could potentially be misused is a killer argument against research as a whole.
They state their plans to revise the workshop page and call for papers to disavow the problematic aspects of IQ testing:
We will take this as an opportunity to clarify both the background and the research questions for our task, revising the workshop page and the call for papers to avoid being misunderstood like this: this is in no way meant to promote the instrument of IQ testing, nor automating it based on text. This is in no way an attempt to label people as stupid, this is in no way presupposing that there is one inherent intellectual ability, etc.
And also propose to make space in the workshop to address ethical issues:
We are also considering a call for papers on the critical reception of psychometrics for this event, or a panel discussion on the topic during the workshop, to create awareness for potential issues with these metrics.
I appreciate the thoughtful response of the shared task organizers, and I am heartened that they are attending to community discussion of their proposal. If this workshop/shared task is to go on, I think including a session consisting of a panel and/or contributed talks about the problems with psychometrics would be extremely important.
As things stand, however, I do not believe that any of the responses offered can legitimize the task. In more detail:
1. The purpose of the task
I asked in my initial corpora mailing list posting: “When and why would we want to predict IQ scores or high school grades from text? Given the high potential for any such system to learn preexisting biases (themselves the result of structural discrimination in society), what are the likely impacts, especially on already marginalized populations?”
This question has not been answered. The task organizers’ response includes the following:
Regarding our overall aims: we are interested in how textual descriptions of images are related to IQ test outcomes and high school grades (resp. analyzed operant motives for subtask 2). We are interested in what models are most suitable, and what aspects of the text are most informative.
However, a correlation analysis would not make for a good shared task. Thus, we decided to propose a ranking task (resp. classification task) based on the texts.
This doesn’t answer the question of when and why we would want to either know the correlation or do the prediction. Without a clear answer as to the use case of this information and careful consideration of possible impacts on stakeholders, and in light of the problematic nature of the underlying data, it is not prudent to put such tasks out into the world.
Furthermore, I have concerns about the casting of scientific questions about correlation as prediction for the purposes of creating a shared task out of the data. If the true goal is to understand whether there is any correlation (and if so, why), how does creating a leaderboard for the NLP/ML SOTA-chasers to climb help towards that understanding?
2. Racial bias
The organizers write:
Regarding racial bias: unfortunately, the German education system is known to have a socio-economic bias, which leads to a vast under-representation of people with a migration background in higher education, which is absolutely worth fighting against.
This, paradoxically, leads to our data being less prone to the influence of such biases.
On the contrary, a system trained on a narrow subset of the population, should it ever be deployed more broadly, will surely suffer more from bias: Imagine that a system manages to do well on this leaderboard, exploiting whatever quirks of lexicon or syntax that happen to correlate with the target scores. Now imagine that someone attempts to apply that system to a broader sample of the population, who undoubtedly represent a broader diversity of ways of speaking (and writing). At best, the system will have unpredictable behavior for populations not well represented in its training data. At worst, it will systematically predict “low intellectual ability” for anyone who happens not to have the features it associated with “high intellectual ability” in its training data.
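To make this failure mode concrete, here is a minimal Python sketch. It is a deliberately crude toy, not the organizers' data or any real system: the four training texts, their labels, and the function names `fit_marker_words` and `predict` are all hypothetical, invented only for illustration. The "model" simply memorizes words that appear only in texts labelled "high" in a narrow training pool.

```python
from collections import Counter

# Hypothetical toy data: (text, label) pairs from a narrow applicant pool.
# Texts and labels are invented for illustration only.
TRAIN = [
    ("the figure depicts a contemplative individual", "high"),
    ("the image depicts a solitary individual", "high"),
    ("a guy sits there", "low"),
    ("a guy looks sad", "low"),
]


def fit_marker_words(examples):
    """Collect words that occur only in 'high'-labelled training texts."""
    high, low = Counter(), Counter()
    for text, label in examples:
        (high if label == "high" else low).update(text.split())
    return {word for word in high if word not in low}


def predict(markers, text):
    """Predict 'high' only if the text shares a word with the learned markers."""
    return "high" if markers & set(text.split()) else "low"


markers = fit_marker_words(TRAIN)

# A writer whose register resembles the training pool.
in_sample = predict(markers, "the picture depicts a thoughtful individual")

# A writer who expresses a similar idea with different vocabulary.
out_of_sample = predict(markers, "this person seems deep in thought")

print(in_sample, out_of_sample)
```

In this toy, the second writer is scored "low" purely because none of the memorized marker words appear in the text, not because of anything about the writer. Real systems are far more complex, but the concern is the same: features that merely correlate with the target in a narrow sample need not transfer to other ways of writing.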
3. Informed consent
There is no indication that the students whose data is being used in this way consented to having it used for research purposes in general, nor this kind of research purpose in particular.
4. “It’s happening anyway, better it should be out in the open/more scientific.”
A workshop dedicated to papers explaining why and how IQ tests are biased, and why the conceptualization of IQ as an inherent and defining property of a person is flawed, would be an excellent response to the fact that companies are attempting to predict IQ qua “intellectual ability” from writing samples. A setup designed to make attempts at such prediction more “scientific” does the opposite: it reifies the dependent variable as something in the world that can be predicted (in general, and from the writing samples specifically).
There are research questions that are better not asked because the answers are dangerous (arguably, humanity would be better off if we had managed to collectively refrain from developing nuclear weapons, for example) and then there are research questions that are better not asked because, by lending an aura of “science” to their subjects, they serve to reinforce rather than investigate and problematize ideologies that do harm in the world (here, in the form of inequities in educational opportunity).
Finally, I remain deeply concerned about the overall framing of the task. The very term “intellectual ability” carries with it all of the problematic presuppositions of the IQ test. Further, the Twitter account the task organizers created in order to join the discussion has the handle @psychopred, which I take to evoke “psychological prediction”, again claiming interest in the ability to predict things about the psychology of individuals on the basis of machine processing of text they produce.
To answer the questions raised in the title of this post: Clearly, there are bad/unethical applications of technology, where the bias learned by “AI” systems can do significant harm in the world (for example, the extremely biased systems for recidivism prediction revealed by ProPublica’s investigations). Does that mean that the same technology shouldn’t be explored as research questions? In many cases, I would say yes, or at least that the framing of the questions needs to be very carefully considered. If the research question at hand is “Can we predict intellectual ability from writing samples given in an MIX test?” I’d say that question is ill-formed, because the notion “intellectual ability” is unfounded. If instead the question is “Can we predict IQ scores and high-school grades from writing samples given in an MIX test?” then I’d say (as I have), when and why would you want to do that and what are the possible negative consequences of deploying a system that attempts to do so?
Setting up a shared task is effectively doubling down, by encouraging other people, likely including many who are relatively inexperienced researchers, to put their effort into these research questions. Thus shared task organizers bear an additional responsibility for doing due diligence to design research questions that (a) bring valuable knowledge to the scientific community and (b) avoid harm to relevant stakeholder groups.
Epilogue: The “ethics police”
Finally, I’d like to address the question of whose responsibility it is to raise concerns like this, and in particular the accusation that I have somehow appointed myself the “ethics police”.
Outside of the TALN community in France, which started thinking about these issues a few years earlier, most of the NLP community didn’t spend much time worrying about ethics prior to about 2016. Since then, it has become a major topic, with workshops at EACL 2017 and NAACL 2018, ethics tracks at NAACL 2019 and ACL 2020, courses at many institutions, and of course connections to the larger discourse around ethics and AI, including the FATML/FAT* conference series.
Despite this increase in attention, we have not yet developed any field-specific best practices for ethics disclosure and review. This means that, on the one hand, it is more difficult for groups of researchers to know what to take into consideration when planning a research project or shared task, and, on the other hand, that it is particularly important to hold open discussions like the one going on around this task. I applaud the task organizers and others in the NLP community for engaging in this discussion. Given the widespread use of NLP technology, we have a lot of work to do as a community towards understanding how to integrate ethical considerations at every step: in NLP education, in research (including shared task design), in technology development and deployment, and in advocating for and advising on government regulations. If this conversation moves the field forward along that path (or even just increases awareness of and interest in the problems), I see that as a positive outcome.
Some links regarding problems with IQ tests: