
Want to Know the Trick to Achieve Robust Winograd Schema Challenge Results?

A Surprisingly Robust Trick for Winograd Schema Challenge

Christopher Dossman
3 min read · May 23, 2019


The Winograd Schema Challenge (WSC) was introduced to test AI agents' commonsense reasoning. Each schema comprises a pair of sentences that differ in only one or two words and contain an ambiguity that is resolved in opposite ways, which requires the use of world knowledge and reasoning.

A Little History

The Winograd Schema Challenge (WSC) was proposed in the spirit of the Turing Test as a way to test machine intelligence. It was put forward by Hector Levesque, an AI researcher, and gained prominence after a chatbot, Eugene Goostman, was reported to have passed the Turing Test in a 2014 competition. As an improvement over the Turing Test, the WSC is a multiple-choice test with questions that have a specific structure. Some examples:

The trophy would not fit in the brown suitcase because it was too big (small). What was too big (small)?
Answer 0: the trophy
Answer 1: the suitcase

The town councilors refused to give the demonstrators a permit because they feared (advocated) violence. Who feared (advocated) violence?
Answer 0: the town councilors
Answer 1: the demonstrators

A person who answers such a question correctly would likely use knowledge about the objects or actions in question, as well as the ability to do spatial reasoning. Levesque argues that, for a machine to provide the correct answers, it would likewise require the use of knowledge and commonsense reasoning.

Winograd Schema Challenge (WSC)

The challenge is administered by commonsensereasoning.org and runs once a year. Participants who achieve 90% accuracy across two rounds are awarded a grand prize of $25,000; smaller cash prizes are also awarded.

The task is challenging because WSC examples are constructed to require human-like commonsense knowledge and reasoning. The best known solutions use deep learning and reach an accuracy of only 63.7%.

Achieving Robust Winograd Schema Challenge Results

Recently, researchers have shown that fine-tuning existing language models (LMs) on WSCR, a larger WSC-style dataset, improves the LM's ability to tackle WSC273 and WNLI, two popular benchmarks for natural language understanding and commonsense reasoning. They also introduced a method for generating large-scale WSC-like examples and used it to create a dataset of 11M examples from English Wikipedia. A rough sketch of the generation idea follows below.
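
To make the data-generation idea concrete, here is a rough sketch of mining a WSC-like example from raw text by masking one occurrence of a repeated noun, which is the spirit of the method described in the paper. The spaCy model, the `make_example` helper, and the candidate-selection details are illustrative assumptions on my part, not the authors' actual pipeline.

```python
# Rough sketch: build a WSC-like example from a sentence by masking the
# second occurrence of a repeated noun; the other nouns in the sentence
# serve as distractor candidates. Assumes spaCy is installed along with
# the en_core_web_sm model; filtering details are simplified.
import spacy

nlp = spacy.load("en_core_web_sm")

def make_example(sentence: str):
    doc = nlp(sentence)
    nouns = [t for t in doc if t.pos_ == "NOUN"]
    seen = set()
    for tok in nouns:
        key = tok.text.lower()
        if key in seen:
            # Mask the second occurrence using character offsets.
            masked = sentence[:tok.idx] + "[MASK]" + sentence[tok.idx + len(tok.text):]
            distractors = sorted({t.text for t in nouns if t.text.lower() != key})
            return {"masked_sentence": masked,
                    "correct": tok.text,
                    "candidates": [tok.text, *distractors]}
        seen.add(key)
    return None  # sentence has no repeated noun, so it yields no example

print(make_example("The dog chased the cat because the dog was hungry."))
# {'masked_sentence': 'The dog chased the cat because the [MASK] was hungry.',
#  'correct': 'dog', 'candidates': ['dog', 'cat']}
```

Applied across English Wikipedia, this kind of procedure yields cheap, large-scale training signal without any manual annotation.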

This approach is then used together with WSCR to fine-tune the pre-trained BERT LM. The researchers achieved accuracies of 72.2% and 71.9% on WSC273 and WNLI, improving on the previous best solutions by 8.5% and 6.8%, respectively.
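
For intuition on how a masked LM answers a schema at test time, below is a minimal sketch of candidate scoring with an off-the-shelf BERT via the Hugging Face transformers library: each candidate is substituted into the ambiguous slot as [MASK] tokens, and the candidate whose tokens the model finds more probable wins. This is an illustrative approximation under those assumptions, not the authors' exact implementation; the model name and `candidate_log_prob` helper are my own.

```python
# Minimal sketch of masked-LM candidate scoring for a Winograd schema.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def candidate_log_prob(sentence: str, candidate: str) -> float:
    # Replace the [BLANK] placeholder with one [MASK] per candidate token,
    # then sum the log-probabilities of the candidate tokens at those slots.
    cand_ids = tokenizer.encode(candidate, add_special_tokens=False)
    masked = sentence.replace(
        "[BLANK]", " ".join([tokenizer.mask_token] * len(cand_ids)))
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0], dim=-1)
    positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return sum(log_probs[p, t].item() for p, t in zip(positions, cand_ids))

sentence = "The trophy would not fit in the brown suitcase because [BLANK] was too big."
scores = {c: candidate_log_prob(sentence, c) for c in ["the trophy", "the suitcase"]}
print(max(scores, key=scores.get))  # hopefully: "the trophy"
```

A vanilla bert-base-uncased can still get such comparisons wrong; the paper's point is that fine-tuning on WSCR and the Wikipedia-derived examples makes this style of scoring far more reliable.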

Potential Uses and Effects

This work is the first to beat the majority baseline on WNLI. It achieves improved results on the WSC and WNLI datasets by fine-tuning the BERT language model on the WSCR dataset, and it could help future Winograd Schema Challenge participants improve their WSC and WNLI accuracies.

Read more: https://arxiv.org/abs/1905.06290

Thanks for reading. Please comment, share and remember to subscribe to our weekly AI Scholar Newsletter for the most recent and interesting research papers! You can also follow me on Twitter and LinkedIn. Remember to 👏 if you enjoyed this article. Cheers!
