Ehud Reiter
Jun 13, 2017

I saw that Mike White mentioned my name, so I thought I would comment directly. A lot of the discussion is about papers published in second-tier venues, but from my perspective there are also major problems with DL NLG papers published in top venues. The problems are perhaps less drastic, but it's a question of degree.

This was brought home to me last year when I attended NAACL 2016 (in order to give an invited talk on NLG evaluation), which was the first time I had been to an ACL event in several years. I went to listen to a NAACL paper about using DL for NLG, and was absolutely horrified.

(1) The evaluation was weak, because the authors just used BLEU, which is a questionable way to evaluate NLG systems (https://ehudreiter.com/2017/05/03/metrics-nlg-evaluation/)

(2) One of the main training corpora used was the output of a rule-based NLG system (https://ehudreiter.com/2017/05/09/weathergov/). So were the authors trying to show that they could use DL to reverse engineer a rule-based system and steal the IP of someone who spent a lot of time carefully hand-crafting NLG rules?

(3) The presenting author was completely unaware of previous work in the NLG community on the problems he was solving (this was apparent in the Q&A session as well as in the paper). He claimed his system was better than state-of-the-art, but to me his output texts looked considerably worse than stuff we were producing 15 years ago.

I am willing to be convinced that DL is a good approach for NLG, but I need to see experiments and papers with solid evaluation, sensible and appropriate corpora, and good awareness of NLG state-of-the-art. Papers like the above NAACL one don't leave me with a good impression of DL for NLG.

I’d also like someone to explain to me how we can evaluate the worst-case (as well as the average case) performance of DL systems, because this is really important (https://ehudreiter.com/2016/12/12/nlg-and-ml/).
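To make the average-case versus worst-case distinction concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name `summarize_scores`, the `tail_fraction` parameter, and the two made-up lists of per-text ratings (higher is better) are hypothetical and do not come from any real evaluation. It also only reports the worst case observed in a sample, which is not a guarantee about what the system could do on unseen inputs.

```python
from statistics import mean


def summarize_scores(scores, tail_fraction=0.1):
    """Summarise per-text quality scores (higher is better).

    Reports the average, the single worst observed score, and the
    average over the worst `tail_fraction` of texts.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)  # ascending, so the worst texts come first
    tail_size = max(1, int(len(ordered) * tail_fraction))
    return {
        "average": mean(ordered),
        "worst": ordered[0],
        "tail_average": mean(ordered[:tail_size]),
    }


# Hypothetical ratings for two systems on the same ten inputs.
system_a = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]  # consistently acceptable
system_b = [5, 5, 5, 5, 5, 5, 5, 5, 0, 0]  # mostly excellent, sometimes awful

summary_a = summarize_scores(system_a)
summary_b = summarize_scores(system_b)
print(summary_a)
print(summary_b)
```

In this made-up example both systems have the same average, so a score table that reports only the mean cannot tell them apart, even though the second system occasionally produces completely unacceptable texts.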

Finally, to echo some of the other opinions which people have expressed, there is a caricature of a DL (or indeed ML) NLP researcher as someone who just wants some corpora and a way to keep score, and has no interest in whether the “score” means anything and also no interest in the provenance or suitability of the corpora. I realise this is a caricature, but I think it has some truth, and I don't think this is the right attitude for making progress in NLP.
