SuperGLUE: The Slippery Benchmark with no Language Understanding

John Ball
Pat Inc
17 min read · Aug 17, 2019

Benchmark titles should reflect their purpose for AI

Today, the top computer chess programs are much better at chess than humans. The best human rating today is around 2900, while the best computers are in the 3000s. Humans just can’t track as many possible future moves as a computer with such accuracy. In short, humans are not as good at chess[i] as machines can be. Similarly, humans are slower than machines, with cars, rockets and jet airplanes winning almost every time.

There’s nothing wrong with losing to machines, but the final frontier is in the use of natural language. We’ve been waiting a long time for machines to converse with us, and Natural Language Understanding (NLU) has seemingly been stuck back in the 1950s.

So it is with great interest that I see the key benchmark tests in my area of expertise, NLU, extended from GLUE[ii] to SuperGLUE[iii]! The General Language Understanding Evaluation (GLUE) is a series of tests with a leaderboard put together by NYU, UW and Alphabet’s DeepMind. But the tests are being replaced because:

“performance on the benchmark has recently come close to the level of non-expert humans.”

The Facebook AI Research Team has joined the consortium.

You would imagine that if NLU benchmark tests are getting to human level, we are about to see intelligent robots talking to us. And it’s about time, too, since when I use today’s NLP technology I am almost always let down by the results.

These are GLU tests: general language understanding, which must be better than NLU! How exciting!

I clicked on the SuperGLUE website and started to review the tests. Where are the questions? How are they exceeding the Facebook results from 2015?

The questions, for language understanding, are … multiple choice, text extraction and true/false?! Wait…what?

The GLU questions aren’t written in English, and the GLU answers aren’t in English either. The GLU tests are…search tests? While these tests gamify language into a format that today’s systems can pass, they factor out the part most people consider to be NLU: language understanding (and language generation and conversation). But since language isn’t mere data and words are more…

John Ball
Pat Inc

I'm a cognitive scientist working on NLU (Natural Language Understanding) systems based on RRG (Role and Reference Grammar). A mouthful, I know!