Comparing the Uncomparable to Claim the State of the Art: A Concerning Trend
Spotting evaluation errors and speculative claims in GPT-3, PaLM, and AlexaTM.
In AI research, authors often compare their own results directly with results published in previous work, assuming that these results are all comparable. In other words, instead of reproducing days or even weeks of experiments, researchers simply copy the numbers reported in previous work and set them next to their own, to demonstrate that their new method or algorithm improves over prior work.
This copy-and-compare approach sounds very convenient and intuitive! What’s the catch?
In this article, I demonstrate that the convenience of copying numbers from previous work often comes at the cost of scientific credibility. I describe how incomparable numbers end up being compared in scientific papers. Then, I review concrete examples of flawed comparisons from very well-known recent work, namely OpenAI’s GPT-3, Google’s PaLM, and Amazon Alexa AI’s AlexaTM. This part is fairly heavy on details, but I hope it also helps you learn to spot sloppy evaluations and speculative claims in research papers yourself.
This article doesn’t target a specific audience. I avoid technical details and try to make my demonstrations…