Evaluation in Natural Language Processing (and tennis rackets in a world with no gravity)
Yoav’s recent blog post sparked a lot of interest across different communities, and many have chipped in since with opinions about ArXiv and what happens when scientific communities meet. Yoav’s post, however, is also about poor evaluation, and a good excuse for reminding everyone about what, generally, we expect of scientific papers in natural language processing. I want to take the opportunity to briefly touch on the importance of baselines, reporting comparable scores, not messing with test data, the importance of multiple test datasets, the importance of introducing new datasets, and the importance of error analysis.
Most papers in natural language processing present new models or architectures, and right now a lot of them are neural networks. Models are generally compared to baseline models; some of these are random or majority-class predictors, some are state-of-the-art systems from previous work. It is possible to get papers accepted at top conferences without beating the hell out of everyone else. Baselines are sanity checks. Random baselines, majority baselines, and state-of-the-art baselines all check that your scores are reasonably good and meaningful, and that you are not getting seemingly good scores simply because you have learned a few simple associations that were already known. However, baselines are also important for motivating your model, as a form of ablation test. Good, natural baselines are therefore as similar to your system as possible, so that we can quantify the exact contribution of the novelties you propose. In general, I tell my students to spend as much time on their baselines as they do on their systems; and I believe this ought to hold for established researchers, as well.
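To make the sanity-check role concrete, here is a minimal sketch of random and majority-class baselines on an entirely made-up, skewed label distribution (the labels and numbers are hypothetical, not from any paper discussed here):

```python
import random
from collections import Counter

def majority_baseline(train_labels, test_size):
    """Predict the most frequent training label for every test item."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * test_size

def random_baseline(train_labels, test_size, seed=0):
    """Predict labels drawn uniformly from the set of training labels."""
    rng = random.Random(seed)
    labels = sorted(set(train_labels))
    return [rng.choice(labels) for _ in range(test_size)]

def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# A skewed toy distribution: 80% "neg", 20% "pos".
train = ["neg"] * 80 + ["pos"] * 20
gold = ["neg"] * 8 + ["pos"] * 2
print(accuracy(gold, majority_baseline(train, len(gold))))  # 0.8
```

On data this skewed, a system reporting 80% accuracy has learned nothing the majority baseline did not already know, which is exactly why such baselines belong in every paper.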
One of my favourite examples of poor evaluation in deep learning approaches to natural language processing is a paper on sentiment analysis, from one of the big groups. The authors evaluated their model on a single, established, but, in my opinion, rather poor dataset, but that was not the biggest problem. The authors compared their system A with a baseline B to another system C with a baseline D, by comparing the error reduction of A over B with the error reduction of C over D. At no point in the paper did the authors mention any absolute numbers. While error reductions are generally more informative than absolute scores, an error reduction of 10% over a baseline with an error rate of 80% is not the same as an error reduction of 10% over a baseline with an error rate of 20%. It is clearly easier, in natural language processing, to achieve the first. At the same time, it is also easier to obtain an error reduction of 10% over a baseline with an error rate of 1% (which will probably not be significant, though). The point is that numbers should be comparable, including with future work. Human judgments are extremely important, but you cannot compare judgments by two different sets of judges. Also, you cannot compare numbers across different datasets, you cannot compare systems across system-dependent measures, and you can — for fuck’s sake, as Yoav would put it — not compare error reductions over different baselines or across different datasets.
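The arithmetic behind this point is worth spelling out. The following sketch (with the same illustrative error rates as above) shows how identical relative error reductions hide very different absolute gains:

```python
def error_reduction(baseline_error, system_error):
    """Relative error reduction of a system over a baseline."""
    return (baseline_error - system_error) / baseline_error

# The same 10% relative error reduction, with very different absolute gains:
# an 80% error baseline improved to 72% error (8 points absolute), versus
# a 20% error baseline improved to 18% error (2 points absolute).
assert abs(error_reduction(0.80, 0.72) - 0.10) < 1e-9
assert abs(error_reduction(0.20, 0.18) - 0.10) < 1e-9
```

Cutting 8 points of error near ceiling-less territory is a very different achievement from cutting 2 points near the state of the art, which is why the absolute numbers must be reported alongside the reductions.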
Not messing with the data
I have seen a few papers recently that replaced proper evaluation with cherry-picked examples. This is obviously bad practice. When papers evaluate systems on only a small subset of the publicly available, standard datasets, it also comes across as cherry picking. Newcomers may not know what datasets are out there, but that is no excuse if you want to prevent selective reporting. When you buy products A and B on Amazon, Amazon will suggest, based on user statistics, that you also buy product C. This week a lot of people have suggested various ArXiv overlays. My ten cents: an overlay with a recommender system proposing that, if you evaluated your algorithm on datasets A and B, you also evaluate it on dataset C, where C is a publicly available dataset shipped in roughly the same format as A and B. Cherry-picking opportunities explode if you allow for filtering test data. In parsing and machine translation, for example, it has been considered okay-ish practice for decades to evaluate on short sentences only. This increases the chance of false positives quite dramatically, as we have shown in this paper.
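A toy illustration of why filtering test data is dangerous (the parser results below are invented for the sake of the example): a system that only handles short sentences looks flawless once the long ones are filtered out.

```python
# Invented results for a hypothetical parser: (sentence length, parsed correctly?).
results = [(5, True), (7, True), (9, True), (12, False),
           (18, False), (25, False), (30, False), (40, False)]

def accuracy(pairs):
    return sum(ok for _, ok in pairs) / len(pairs)

full_score = accuracy(results)                                    # 0.375
short_only = accuracy([(n, ok) for n, ok in results if n <= 10])  # 1.0
```

The full test set says the parser fails most of the time; the filtered test set says it is perfect. Any length cutoff (or similar filter) chosen after seeing the data is a cherry-picking knob.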
The importance of multiple datasets
Significance over data points is very different from significance over datasets. A significance test over the sentences in WSJ Section 23 says something about the likelihood that system A is better than baseline B on the next WSJ Section 23 sentence you see. In reality, you want your system to do better than the baseline on the next dataset you see. To run a significance test over datasets, however, you need about a dozen datasets. For some tasks, this is possible; for others, you can approximate it by running robustness tests and by checking whether your models also lead to downstream gains.
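As a minimal sketch of what a test over datasets could look like, here is an exact two-sided sign test over a dozen hypothetical per-dataset scores (all numbers invented; a Wilcoxon signed-rank test would also be a common choice):

```python
from math import comb

def sign_test_p(wins, losses):
    """Two-sided exact sign test: probability of a win/loss split at least
    this extreme under the null that system and baseline are equally good."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical scores on a dozen datasets:
system   = [0.81, 0.77, 0.69, 0.84, 0.73, 0.79, 0.88, 0.71, 0.75, 0.80, 0.66, 0.90]
baseline = [0.79, 0.74, 0.70, 0.80, 0.71, 0.77, 0.85, 0.70, 0.72, 0.78, 0.67, 0.86]
wins = sum(s > b for s, b in zip(system, baseline))
losses = sum(s < b for s, b in zip(system, baseline))
p = sign_test_p(wins, losses)  # 10 wins, 2 losses
```

With only two or three datasets, no such split can ever reach significance, which is exactly why the dozen-dataset requirement bites.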
The importance of introducing new datasets
Lillian Lee recently made this point in a plenary session at one of our top conferences. We need new datasets, constantly, even for tasks where we have multiple datasets available. One reason is drift: improvements on Europarl texts from 1999 do not imply improvements on Europarl texts from 2006. Also, while most people set their parameters on held-out data (this goes without saying), we keep reading about results on standard benchmarks, and we need to refresh those benchmarks periodically to avoid community-wide over-fitting (to single datasets or to averages over dozens).
Finally, error analysis. An improvement of 10% can mean very different things. With good baselines and multiple datasets, we can establish that the 10% improvement is neither random fluctuation nor an easy win, but the improvement may still be on data points that are uninteresting for downstream applications (say, learning that German public institutions tend to be located in Berlin, for geo-location). Error analysis is also a community service, telling researchers where to go next.
One possible excuse for lowering standards a bit is that explaining thorough evaluation takes up space, and with radically new models, the standard 8 pages may not leave a lot of wiggle room to introduce new datasets or for proper analysis. However, even if this were an excuse for the first batch of deep learning papers in natural language processing (I am not sure it was), deep learning is now completely standard (to the extent that it is now cool and hip to stick to linear models) — and if you take the LSTM equations out of your papers, I am sure there is room for some more analysis.
That said, I have written several papers that do not satisfy all the above requirements. Science is the art of the impossible, said Popper, and many reported findings, despite our efforts, turn out to be false. Nevertheless: Spending all your time on modeling and little or no time on evaluation, as a field, is like — uhm — designing tennis rackets in a world with no gravity.