Comparing the Uncomparable to Claim the State of the Art: A Concerning Trend

Benjamin Marie
19 min read · Aug 16, 2022

Spotting evaluation errors and speculative claims in GPT-3, PaLM, and AlexaTM.

[Image: step 1: put numbers in the magic hat; step 2: tap the hat with a magic wand; step 3: the state of the art comes out of the hat. Party smiley]

In AI research, authors of scientific papers often compare their own results directly with results published in previous work, assuming that all these results are comparable. In other words, instead of reproducing days or even weeks of experiments, researchers simply copy the numbers from previous work and compare them with their own, in order to demonstrate that their new method or algorithm improves over that previous work.

This simple copying sounds very convenient and intuitive! So what's the catch?

In this article, I demonstrate that the convenience of copying previous work's numbers often comes at the cost of scientific credibility. I describe how uncomparable numbers end up being compared in scientific papers. Then, I review some concrete examples of flawed comparisons from very well-known and recent work, namely OpenAI's GPT-3, Google's PaLM, and Amazon Alexa AI's AlexaTM. This part is a bit heavy on details, but I hope it can also help you learn to spot sloppy evaluations and speculative claims in research papers yourself.
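
To make the problem concrete before we get to the papers, here is a minimal sketch, not taken from any of the papers discussed, of one common way copied numbers become uncomparable: the same system output can receive noticeably different BLEU scores depending on the tokenization the evaluation tool applies. The toy sentences below are invented for illustration; the sketch assumes the sacrebleu library is installed.

```python
# Minimal illustration (hypothetical toy data): the *same* hypothesis and
# reference, scored with sacrebleu under different tokenization settings,
# yield different BLEU scores. Copying a score from a paper that used a
# different setting is therefore not a fair comparison.
# Requires: pip install sacrebleu
import sacrebleu

hypotheses = ["The cat sat on the mat."]           # toy system output
references = [["The cat is sitting on the mat."]]  # toy reference

for tok in ("none", "13a", "intl"):
    bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize=tok)
    print(f"tokenize={tok!r}: BLEU = {bleu.score:.1f}")
```

The exact numbers printed don't matter; what matters is that they differ across settings, even though the system output is identical. Unless two papers report the metric with the same tool and the same configuration, their scores are not directly comparable.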

This article doesn’t target a specific audience. I avoid technical details and try to make my demonstrations…
