Can chatGPT solve the Winograd Schema Challenge?

James Kelly
Feb 24, 2023 · 5 min read


Image sourced from freepik.com

Challenge Description

For those unfamiliar with the challenge, here is another blog post with a good description: WSC. Briefly, a Winograd schema is a language problem consisting of a pair: 1. a statement containing an ambiguous pronoun, one whose referent can't be resolved by selectional restrictions or simple statistical cues alone, and 2. a set of candidate antecedents for that pronoun. Here is a classic example:

The trophy could not fit into the suitcase because it was too small.

The task is to correctly answer the question: In the above statement, is “it” referring to the trophy or the suitcase?

This is an old problem: the original example was posed by Terry Winograd, a pioneer of computational linguistics, back in the 1970s, and the challenge named after him was later formalized by Hector Levesque and colleagues. It's a good problem because it captures much of what is genuinely difficult about human language processing: resolving the pronoun requires some degree of world knowledge. In fact, to this day the best results have barely cracked 90% accuracy on the WSC273 dataset (Kocijan et al., 2023), while humans score close to 100%.

The most successful methods use large pre-trained language models that have been fine-tuned on specially crafted training sets. These training/test sets are generally small because it is not obvious how to create or capture good examples. Nor are all examples equal: as Kocijan et al. (2020) point out, the knowledge required to resolve the pronoun is sometimes more abstract or complex than in other cases.

By now, if you are reading this you are all too familiar with the impressive generative capabilities of chatGPT, as well as the areas where it falls short. The question we will answer in this post is whether chatGPT, without any fine-tuning, can score better than some of the successful methods described in the literature. There is decent motivation for this experiment: we know it is an LLM, and it subjectively appears to perform well on similar tasks. So here's how we'll go about it.

Setting Up the Experiment

First, to run the (Python) code in this post you will need to go to the OpenAI website and get an API key to use the model. **You don't need the paid version to replicate the experiment in this post, though you may have to break up your examples over time to avoid hitting the cap of 60 requests/min.**

They offer a handful of language models, and text-davinci-003 is listed as the biggest/best, so that's what we'll use. We will use the dataset on Hugging Face, as it is nicely packaged for us. Here is the code for getting access to the OpenAI models and the HF dataset:

from datasets import load_dataset
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")
wsc273_dataset = load_dataset('winograd_wsc', name='wsc273', split='test')

Here is one of the 273 examples from the dataset, similar to the one above:

{'text': "The trophy doesn't fit into the brown suitcase because it is too large.", 'pronoun': 'it', 'pronoun_loc': 55, 'quote': 'it is too large', 'quote_loc': 55, 'options': ['the trophy', 'the suitcase'], 'label': 0, 'source': 'Hector Levesque'}
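In the snippets below I'll assume one record has been pulled out into a variable called example (the index here is arbitrary); note that the gold answer is just the option picked out by the integer label field:

example = wsc273_dataset[0]  # any record from the test split
gold_answer = example['options'][example['label']]  # e.g. 'the trophy' for the record shown above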

Next, we need some way to convert the examples in the test set into prompt strings that we can pass to the model. This is really an extraction task, but OpenAI's documentation has some question-answering examples that seem like a good fit, so we will construct our prompts to conform to that format. I broke the code up to make it easier to read.

from typing import List

def query_format_helper(pronoun: str, answers: List[str]) -> str:
    return f'Q: In the previous statement, does "{pronoun}" refer to {answers[0]} or {answers[1]}? A:'

def construct_query_from_schema(
    text: str, pronoun: str, answers: List[str]
) -> str:
    return f"S:{text} {query_format_helper(pronoun, answers)}"

query = construct_query_from_schema(example['text'], example['pronoun'], example['options'])
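If example happens to be the trophy schema shown earlier, the resulting prompt string is:

S:The trophy doesn't fit into the brown suitcase because it is too large. Q: In the previous statement, does "it" refer to the trophy or the suitcase? A: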

Now we need a way to pass an example to the model as a prompt and call it to generate a response (see what I did there? :))

def get_openai_answer(query_prompt: str, add_leading: bool = True) -> str:
    init_prompt_starter = "The following are pairs of Winograd Schema in the form of a statement S, a question Q, and an answer A:"
    init_prompt1 = "S: The cat went through the door, but its tail got stuck. Q: In the previous statement, what does 'it' refer to? A: The cat."
    init_prompt2 = "S: The cat tried to go through the door, but it was too small. Q: In the previous statement, what does 'it' refer to? A: The door."
    init_prompt3 = "S: Fedex made more profit than UPS last year, but that was mostly due to the success of the new delivery system they implemented. Q: In the previous statement, what does 'they' refer to? A: Fedex."
    init_prompt4 = "S: Sam tried to buy Xerxes lunch, but he wouldn't allow it. Q: In the previous statement, who does 'he' refer to? A: Xerxes."

    # Add leading (few-shot) examples to cue the model
    if add_leading:
        query_prompt = f"{init_prompt_starter} {init_prompt1}, {init_prompt2}, {init_prompt3}, {init_prompt4}, {query_prompt}"

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=query_prompt,
        temperature=0,            # deterministic decoding
        max_tokens=200,
        top_p=1,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["\n"]               # stop at the end of the answer line
    )

    return response['choices'][0]['text']
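To sanity-check the pieces on a single schema before running the full set (the response text is whatever the model completes after the final "A:"):

answer = get_openai_answer(query)
print(answer)  # ideally names the correct option, e.g. "The trophy."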

Testing and Results

Okay, now we just need to run a loop that passes each example to the model and scores the output as correct if the gold answer appears in the output (both lower-cased), and incorrect otherwise. We will use accuracy as our evaluation metric, to be consistent with the literature, calculated as the number of correct answers divided by the total number of examples. This part is relatively straightforward once you have the other bits set up; a minimal version is sketched below. Time to run the code!
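For completeness, a minimal version of that loop could look like the following; it counts a prediction as correct whenever the gold option appears in the lower-cased model output, exactly as described above:

correct = 0
for example in wsc273_dataset:
    query = construct_query_from_schema(example['text'], example['pronoun'], example['options'])
    prediction = get_openai_answer(query)
    gold_answer = example['options'][example['label']]
    # Correct if the gold option appears anywhere in the model output (both lower-cased)
    if gold_answer.lower() in prediction.lower():
        correct += 1
    # With the 60 requests/min cap you may need a short time.sleep() between calls

accuracy = correct / len(wsc273_dataset)
print(f"Accuracy: {accuracy:.3f}")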

The results show chatGPT's accuracy on this task is ~73%: better than random chance, and better than most attempts from before the last decade, but nowhere near state-of-the-art. Perhaps you could do better with some fine-tuning or clever chain-of-thought prompting (a rough sketch of the latter is below), but data is limited in the first case, and extracting the relevant information is not straightforward in the second (i.e. concepts of space and object behavior are not contained in the text itself). Another good idea is to use a model pre-trained/fine-tuned on NLI data rather than just a ton of unrelated English text, but such a model still has to learn a useful representation of that world knowledge in its latent space.
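For instance, a chain-of-thought style variant of the question prompt might look something like this. It is untested and purely illustrative; cot_query_format_helper is a hypothetical drop-in replacement for query_format_helper:

def cot_query_format_helper(pronoun: str, answers: List[str]) -> str:
    # Hypothetical variant: ask the model to reason about the options before answering
    return (
        f'Q: In the previous statement, does "{pronoun}" refer to {answers[0]} or {answers[1]}? '
        "Think step by step about what each option would imply in the real world, then give your final answer. A:"
    )

To give the model room to actually spell out that reasoning, you would likely also need to drop the stop=["\n"] argument and the short leading examples from get_openai_answer.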

Here’s an example chatGPT got wrong:

The lawyer asked the witness a question, but he was reluctant to repeat it.

I’m sure you could imagine a scenario where the “he” was referring to either the lawyer or the witness (the correct answer was the lawyer). Tricky indeed!

For good practice these runs were repeated several times, with tiny (< 0.1) differences in accuracy. In addition, some augmentation of the prompts was tried, as shown in the code above where we lead the model with a handful of worked examples, again with no real change in accuracy.

Recently a paper was released showing that chatGPT is good at many things but not great at any of them (Kocoń et al., 2023). I think we have found another instance of that. In any case, I hope you enjoyed this foray into testing the latest NLP wunderkind. Let me know in the comments if you have any ideas for prompt engineering or otherwise improving chatGPT's performance on this task!
