Towards Guidelines for Evaluating NLP Shared Tasks

Emily M. Bender
Dec 20, 2019


[This is a consolidation of a Twitter thread. For some context, see this previous post.]

The GermEval2020 shared task organizers have updated their call, and it is clear that they are taking community feedback seriously.

Perhaps the most significant change since the last version is in the framing. The task is no longer framed in terms of predicting intellectual ability. This is an important change, especially given the current climate of #AIHype. We should all be careful not to overclaim what our systems are doing. The organizers also now provide pointers to the literature on the misuse of IQ tests, which is good. (Though they seem to suggest that there are legitimate uses for IQ tests, a claim with which I strongly disagree.) There is also now more information about how the data was collected and its context, as well as assurances about informed consent.

The underlying tasks, however, have not changed. Even without the framing of equating IQ scores with intellectual ability, the task of predicting IQ scores from text is problematic, for at least two reasons:

  1. The underlying problems with IQ scores. Why should we want to predict labels that we know to be too problematic to apply in real-world contexts? I guess one possible answer is to show what kinds of external factors these labels correlate with. But the task does not appear to be set up to investigate such things.
  2. There is no reason to believe that the input data contain sufficient signal to pick up what IQ purports to measure. Of course, there’s likely to be *something* that correlates, at least weakly, with the target output, and surely deep learning systems will be able to model that correlation.
    But without a theory of why those things should correlate, we have no way to make sense of the results. (A plausible theory is suggested by @jacobeisenstein: societal privilege correlates both with IQ scores and with the specific linguistic features anointed as the prestige variety.) See also @zacharylipton’s talk from #NeurIPS2019:

https://twitter.com/amuellerml/status/1205558654012973056

(Aside: There’s also a second task, about predicting “operant motives” from text. Frankly, I’m skeptical here, too, about the underlying data for that task. I’m not familiar with that literature, though, so I’m setting it aside for now.)

I ended my previous Medium post with this call to action:

Given the widespread use of NLP technology, we have a lot of work to do as a community towards understanding how to integrate ethical considerations at every step: in NLP education, in research (including shared task design), in technology development and deployment and in advocating for and advising on government regulations. If this conversation moves the field forward along that path (or even just increases awareness and interest in the problems), I see that as a positive outcome.

In that light, I think this whole conversation has been very constructive. I think it is high time we develop a set of guidelines to use in the evaluation of shared tasks. This conversation should take place in fora other than Twitter, but I think some Twitter discussion will also be appropriate. Here’s a pointer to a thread for that:

[That second thread reads as follows:]

New thread for ideas on guidelines for evaluating work, especially work like proposed shared tasks, which attempt to direct the research effort of larger groups of people and are thus especially important.

First, following @timnitGebru et al., I think it’s really important that these guidelines be open-ended questions. Checklists invite ethics-washing.

So, in that light, I propose these questions, which should be answered by shared task proposers and considered by shared task evaluators (e.g., conference hosts):

  1. How does the output of the ML task relate to the information it’s framed as predicting?
  2. Does the input to the ML task plausibly contain enough information to predict the output?
  3. What are the intended uses of this technology?
  4. What are possible misuses of this technology and how can they be minimized?
  5. If the technology is working as intended, who might be harmed and how?
  6. If the technology is not working as intended, who might be harmed and how?
  7. How were the data collected and from whom? (Here, documentation like a data statement or datasheet would be appropriate. See the link above for datasheets and here for data statements.)

What else should we be asking? What further advice can we provide on how to go about answering these questions?

[Some responses to the second thread added important additional questions.]


Emily M. Bender

Professor, Linguistics, University of Washington // Faculty Director, Professional MS Program in Computational Linguistics (CLMS). faculty.washington.edu/ebender