Linguistics, NLP, and Interdisciplinarity
Or: Look at Your Data
This is meant to be a brief blog post reflecting on a couple of twitter megathreads I’ve been involved in over the past week. This is my attempt to summarize my side of the story, while hopefully not misrepresenting anyone else. Some of this concerns specific papers, in one case anonymous, in another case not. Let me say at the outset that Wang & Eisner acknowledge an important error. In the post, I’m going to talk about that error from my point of view before I get to their acknowledgment, but the point is only to relay the story of what unfolded in that megathread and not to suggest that they haven’t addressed the issue.
It starts with this tweet of mine, on Thursday Nov 2, born of frustration while reviewing:
In particular, I was reading a paper that was making claims of cross-linguistic applicability (being able to parse text in a new language, without labeled training data in that language) and testing them with the Universal Dependencies (UD) treebanks. I happen to know that the UD treebanks — being collected more or less opportunistically, rather than deliberately sampling the way a typologist would — include sets of quite closely related languages. So I was of course curious to see whether the evaluation respected this fact, or whether the Scandinavian languages in the set (Danish, Swedish, Norwegian) say, or the Romance languages (French, Italian, Portuguese, Spanish, Romanian, Latin) or the Slavic languages (Church Slavic, Croatian, Czech, Slovenian) were split across training and test. This, in my opinion, would significantly weaken any claim of language independence.
The paper doesn’t actually list the languages. Rather, the results tables give the identifiers of each treebank. Squinting at those, trying to work out where the Romance languages etc. were, I noticed something worse: Finnish and Latin are represented by two treebanks each in the UD 1.2 release and, in each case, one of the treebanks was in train and one was in test.
Looking a bit further, I found that this paper was basing its train/test split on the one used in previous work (Wang & Eisner 2016, 2017), from which it was also borrowing the Galactic Dependencies dataset (dependency treebanks for synthetic languages).
How could this have come about? My guess was that it connects to two trends I have noticed in NLP: avoiding looking at the data, and treating all language(s) as interchangeable. (The latter is what leads people to fail to name the language they are working on when that language is English, as if English were a suitable stand-in for all languages.) It is only because this is accepted that it is seen as acceptable practice to not name the languages in a study like this one, but only put the treebank identifiers in the table. And because that is acceptable, it’s possible to miss that you’ve got the same language in both training & test, even if you’re trying to show that a method can generalize to new (i.e. unseen) languages.
I was frustrated that Wang & Eisner had put out a dataset with this kind of train/test split and also that two sets of TACL reviewers (Wang & Eisner 2016 and 2017 both appeared in TACL) missed it. Based on the discussion I’ve had with Wang & Eisner, on twitter, github and over email, it’s clear that they had thought about the question. I think the reviewers should have asked, which I think would have prompted Wang & Eisner to include the discussion there.
Wang & Eisner point out that in the work where they set that train/test split they aren’t claiming language independence (especially in the 2016 paper) and so repeating languages between train/dev and test isn’t as problematic as I said. (For some of that, see the exchange on the GD github page.) They agree that it is misleading to confuse “language” and “treebank” in their paper and talk about “37 languages” when the 37 UD 1.2 treebanks represent 33 languages. They have tweeted that they are working on an update to clarify this — I look forward to that!
I was most frustrated that the authors of the original paper I was reading picked up Wang & Eisner’s train/test split and apparently used it without considering whether it was appropriate to test their hypothesis with.
From the massive twitter megathread that followed, I think the following are the most important points:
1. Language independence
There is ongoing disagreement as to what this term means. When I encounter it, I assume it means “works reasonably similarly well across all human languages”, unless it’s modified to suggest a narrower set (widely spoken languages, Indo-European languages, etc.). Yoav Goldberg in this thread (and others before) has asserted that it means something more like “can be run on any human language”. This strikes me as a singularly uninteresting thing to claim and also an unlikely interpretation when the claim comes together with e.g. fluff about how humans learn language. Even without that fluff, just the context of “can work without labeled training data” suggests something more like my interpretation. Because if what you’re saying is “will run without labeled training data, but will give back utter garbage for all I know”, how is that an interesting claim?
More generally: I don’t actually care whether or not people develop language independent NLP systems (in my sense). (Okay, maybe I care a little — the Grammar Matrix project is after all pushing in that direction and we do test each addition with held-out languages from held-out language families.)
But in the broader NLP context, what I care about is that people match their experiments to their claims. So if you want to claim language independence, then you’ll need to spend some time thinking about how you’ll establish it. Training something on English and testing on French just isn’t going to cut it. (For more on this, see my 2011 paper in LiLT.)
Now, the experiments we can run are limited by the resources we have available. The UD project is making things much better than they have been, but the UD languages (as noted above) aren’t chosen to be a legitimate typological sample. However, we can still pay attention to how to make the most of the resources we have. And given that closely related languages are structurally very similar to each other, a good first step would be to keep the subfamilies of Indo-European together, i.e. all in train or all in test. I’ve emailed with Joakim Nivre about making language family information more immediately obvious on the UD webpage and he says he’ll work towards that in an upcoming release. Yay!
[Update 11/17/17: UD 2.1 was released two days ago and now the Universal Dependencies web page foregrounds language family and subfamily information.]
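To make the "keep subfamilies together" suggestion concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the treebank identifiers, the small metadata table, and the helper names `leaked` and `split_by_subfamily` are hypothetical, and are not taken from the UD release or from any of the papers discussed. The idea is simply to record language and subfamily for each treebank, check whether a proposed split shares any of them across train and test, and build a split that holds out whole subfamilies.

```python
# Hypothetical metadata: treebank id -> (language, subfamily).
# Illustrative only; not checked against any particular UD release.
TREEBANKS = {
    "da": ("Danish", "North Germanic"),
    "sv": ("Swedish", "North Germanic"),
    "no": ("Norwegian", "North Germanic"),
    "fr": ("French", "Romance"),
    "it": ("Italian", "Romance"),
    "es": ("Spanish", "Romance"),
    "cs": ("Czech", "Slavic"),
    "hr": ("Croatian", "Slavic"),
    "fi": ("Finnish", "Finnic"),
    "fi_ftb": ("Finnish", "Finnic"),
}

LANGUAGE, SUBFAMILY = 0, 1  # positions in the metadata tuples


def leaked(train, test, metadata, level):
    """Return the languages or subfamilies that occur in both splits."""
    train_groups = {metadata[tb][level] for tb in train}
    test_groups = {metadata[tb][level] for tb in test}
    return train_groups & test_groups


def split_by_subfamily(metadata, test_subfamilies):
    """Put every treebank of a held-out subfamily in test, the rest in train."""
    train, test = [], []
    for tb, (_, subfamily) in sorted(metadata.items()):
        (test if subfamily in test_subfamilies else train).append(tb)
    return train, test


# A split made by treebank id alone can put one language on both sides.
naive_train, naive_test = ["da", "fr", "fi"], ["sv", "fi_ftb"]
print(leaked(naive_train, naive_test, TREEBANKS, LANGUAGE))
print(leaked(naive_train, naive_test, TREEBANKS, SUBFAMILY))

# Holding out a whole subfamily avoids that by construction.
train, test = split_by_subfamily(TREEBANKS, {"Romance"})
print(train, test)
```

This only checks what the metadata says, so the split is only as good as the language and family labels you give it. Someone still has to know which languages are in the data.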
2. Look at your data
I contend that we can’t do good science if we don’t understand the data we are working with. Say technique X works better than technique Y on some task. That information isn’t all that useful without some sense of why X should work better than Y. What kind of information is it capturing that Y isn’t? Those kinds of questions can’t be answered without getting our hands dirty with data — i.e. with careful error analysis. X and Y might be the published system and the previous state of the art. Or they might be two iterations of a system under development. Careful error analysis can help point to how a system can be improved, too.
In this connection, Hal Daume III tweeted:
I can understand that there are good reasons for keeping test data blind. But then error analysis should be done on dev data. And *someone* has to have responsibility for making sure the test data is appropriate for testing the claims at hand. That means knowing something about the data, even if you don’t examine it in detail. And knowing what language(s) the data is drawn from is the very minimum.
3. Cite the linguistics literature
One of my other frustrations with the way NLP and linguistics do(n’t) interact is the lack of citations from NLP into the linguistics literature. There are two ways this happens: First, NLP authors might not know enough linguistics to be able to do the literature search to find the linguistic work that could actually inform their NLP. Second, NLP authors might know the relevant linguistic points themselves, but then just write about them as if they were common knowledge (e.g. there are typological tendencies by which languages with OV order tend to have NP-P order within adpositional phrases…). This obscures the fact that these are actually research results which (a) reflect people’s work & ingenuity and (b) should be traceable and checkable.
4. Talk to experts!
Part way through the thread, Brendan O’Connor tweeted:
My first response to this is to say that the solution is obvious: Collaborate with experts in X! And I think that’s the solution. I think the tweet points to a deeper problem though. It seems to suggest (note the hedge: this is not what Brendan said) that some people in CS think that it’s okay to publish work on X that you wouldn’t be comfortable having critically reviewed by experts in X. If that is a belief that’s out there, it strikes me as terribly arrogant.
Even more importantly, though: If the point of using NLP or any other area as an application area for ML is to show what ML can do, then I don’t think this can be done effectively without having domain experts in the mix. If an ML person who doesn’t know anything about X creates a dataset related to X and “solves” the task the dataset represents … that tells us precisely nothing. And even if the ML person didn’t create the dataset, without serious discussion of what the dataset represents and how it relates to some larger problem of interest, we’ve learned nothing.
The pithiest way of saying this that I’ve come up with is:
5. This is reviewers’ responsibility too
I’m asking for culture change here and that has to come in part through our reviewing. A lot of this connects to Bonnie Webber’s points in this comment on the COLING 2018 PC blog post about review forms. Specifically she says that reviewers should ask the following questions:
– Is it clear what the authors’ hypothesis is? What is it?
– Is it clear how the authors have tested their hypothesis?
– Is it clear how the results confirm/refute the hypothesis, or are the results
– Do the authors explain how the results FOLLOW from their hypothesis (as opposed to say, other possible confounding factor)?
My version of those includes:
- Is it clear what data was used to test the hypothesis?
- Is that data appropriate for testing the hypothesis?
Beyond that, I’d add:
- Is the work grounded appropriately in the relevant literature, not just from NLP/ML but also from linguistics?
The first megathread seemed to wrap up on Sunday night, but then a new one was launched, by Kyunghyun Cho:
This one turned into another iteration of the debate about the relationship between linguistics and NLP.
Ryan Cotterell (@_shrdlu_) put a lot of energy into trying to convince everyone that NLP isn’t “interdisciplinary” based on an extremely stringent definition of “interdisciplinary” (must build on current work from both disciplines) and current practice in NLP (most of it doesn’t).
To this, I’d like to make the following replies: An area of research is interdisciplinary in principle if the questions it asks require expertise from multiple areas to effectively approach. It is interdisciplinary in practice if that expertise is actually applied. A field that is interdisciplinary in principle but not in practice is clearly in trouble.
By my definition, NLP is interdisciplinary in principle and I agree with Ryan that it is not sufficiently interdisciplinary in practice, though I don’t think it needs to hit his high bar to succeed. (And he doesn’t think that’s required either.) Similarly, I don’t think that all subfields of linguistics are equally relevant to NLP — and any given subfield of linguistics will be potentially relevant to only certain parts of NLP.
My message is: Learn something about how language works and/or collaborate with people who have that expertise; it’ll lead to better NLP.
The argument in the second megathread struck me as terribly counter-productive, especially since most of the people tweeting were people who usually support the notion of more linguistically-informed NLP. The high-bar definition of “interdisciplinary” is unhelpful: I don’t want people walking away thinking, “If I can’t understand a grad-level class in linguistics, then I can’t do interdisciplinary work.” Nor do I want people taking away the impression “linguistics is irrelevant.”
Perhaps the most frustrating part of this thread was the way in which it erased both the areas of linguistics and the areas of CL/NLP that I personally work in. For much of it, it seemed that “linguistics” was equated with “current Chomskyan syntax” (there was even a subthread about whether or not sociolinguistics counts as linguistics (!)). On the other hand, the bald assertions that NLP in general (as extrapolated from “most work in NLP”) doesn’t use linguistics suggest that the work that does (which includes mine) doesn’t “count”.
So hey world, there’s more to linguistics than Chomsky’s latest. And we should expect more of NLP than just numbers on some black-box dataset that’s trendy right now on arXiv.