An NLU benchmark should drive progress in conversational NLP, making language understanding as reliable as arithmetic on a computer.
My article on the SuperGLUE benchmark notes that the consortium’s tests neither ask questions in language nor generate answers in language. The benchmark is more a test of search than of natural language understanding (NLU), which could explain the limitations we observe in conversational AI built on technology that keeps improving on the GLUE benchmark.
I was immediately asked what a benchmark for natural language understanding should look like.
A benchmark for natural language processing (NLP), which comprises NLU and natural language generation (NLG), should test language, not knowledge. What’s the difference?
Language allows communication to take place by leveraging shared information in context during conversation. Knowledge, detailed experience with particular topics, matters in discourse, but these NLU tests should focus on language, introducing knowledge only as a means of extending context. Put another way, we can use language to talk with people about topics we know nothing about, and learn through the process.
Getting NLU right allows knowledge to enter a conversation naturally, but…