Can ChatGPT do crossword puzzles?

Testing whether ChatGPT is simply a stochastic parrot or a logical reasoner using LangChain

Ceren Altincekic
Data Science at Microsoft
9 min read · Apr 2, 2024


Photo by Ross Sneddon on Unsplash.

ChatGPT is a state-of-the-art Large Language Model (LLM) that can produce fluent and coherent text on diverse topics and domains. With more than 100 million users worldwide, it is one of the fastest-growing products ever, empowering individuals to write, synthesize, and learn faster and better.

As tech companies race to create the next best AI (artificial intelligence), ChatGPT stands to benefit from some of the resulting improvements as well. LLMs have already been criticized for parroting existing text without deliberate and logical inference. One author who has called ChatGPT a “stochastic parrot” is Emily M. Bender, a professor of linguistics at the University of Washington. She co-authored a paper titled “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” in which she argues that LLMs do not understand the meaning or context of the texts they generate.

So, are LLMs ever capable of understanding the meaning and logic behind the words they generate, or are they merely stochastic parrots that mimic the patterns of their training data? In this article, I explore some of the challenges and limitations of ChatGPT and some of the new ways to test its capabilities and evaluate its performance. I argue that ChatGPT is neither a stochastic parrot nor a logical deliberator, but rather a probabilistic predictor that can sometimes surprise us with its creativity and sometimes disappoint us with its inconsistency.

Devising a rigorous test for ChatGPT: Punny crossword puzzle clues

According to some experiments conducted by crossword enthusiasts, ChatGPT can sometimes guess the correct word for a crossword clue, but it often fails or gives bizarre or wrong answers. ChatGPT also does not understand the structure or rules of crossword puzzles, such as the number of letters in a word, the position of a word in the grid, or the theme of the puzzle. For ChatGPT to improve its language skills to the level of native speakers and show human-level reasoning, crossword puzzles present a meaningful challenge.

I liken ChatGPT’s current abilities (or lack thereof) for solving crossword puzzles to those of new learners of the English (or any other) language. As an ESL (English as a second language) person myself, I remember that it was near impossible for me to solve crossword puzzles when I was first learning the language. Going back in time and thinking about what helped me get over the hurdle, I can point to three main developments:

  1. Cultural references: As I became more immersed in American culture, it became easier for me to understand the answers to clues such as “a Stooge” (answer: Moe).
  2. Synonyms, antonyms, homophones, anagrams, or idioms: As I developed a deeper understanding of and vocabulary in English, I was more able to answer clues such as “drop the ball” (answer: err).
  3. Puns and wordplay: This is the hardest part of a crossword puzzle, where the author uses a pun or wordplay that requires a shift in logic. For instance, the clue “star sign?” is wordplay: the question mark signals a tricky clue that requires a second look and should not be taken literally. Taken literally, the clue points to the zodiac signs. But the question mark tells us to go one step further and think deeper: “star” (a famous person) plus “sign” (their signature) in this case means “autograph.”

If an ESL person can master the three tricks above by improving their language skills, can ChatGPT do the same? And what are some strategies from a prompting perspective that can help improve the model’s output? Next, I test ChatGPT with the crossword puzzle clues above using different prompts to see whether it can solve different crossword puzzles and whether it can explain its reasoning coherently.

Testing ChatGPT’s crossword puzzle skills

Let’s start by testing ChatGPT with the crossword puzzle clues above to see how well it performs. Here I use LangChain to leverage the power of ChatGPT. The LangChain Python library is a natural language processing (NLP) library that provides a wide range of tools and functions for processing and analyzing natural language data. It provides a simple interface for accessing OpenAI’s language models through the OpenAI API (application programming interface). LangChain also allows you to compare different prompts and models dynamically, and to integrate your chatbot with external data sources and software workflows.

Here is a simple code snippet to query ChatGPT with our first crossword clue:

# Import necessary modules from LangChain and other libraries

import os

from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

os.environ["OPENAI_API_KEY"] = 'mykey' # replace with your own OpenAI API key

# Build a simple chain: prompt -> model -> plain-string output
prompt = ChatPromptTemplate.from_template("Answer the following crossword puzzle clue: {clue}")
model = ChatOpenAI(model_name='gpt-4')
output_parser = StrOutputParser()
chain = prompt | model | output_parser
chain.invoke({"clue": "a Stooge"})

ChatGPT’s answer: Moe

Pretty good! ChatGPT already has the cultural context I lacked as an early ESL person to deduce that Moe is a Stooge (The Three Stooges were an American comedy trio active from 1922 until 1970). This is not surprising, given the large amount of data ChatGPT was trained on that is available on the internet. You can also immediately see the benefit of using LangChain here: Once I develop the prompt, all I have to do is to invoke that “chain” with the next clue, like so:

chain.invoke({'clue': 'drop the ball'}) 

ChatGPT’s answer: Fail or make a mistake

Good answer, but not the one we are looking for, because the space for the answer has only three letters. We need a shorter synonym, and we need to let ChatGPT know the length of the word we are looking for. So, I write another prompt:

prompt2 = ChatPromptTemplate.from_template("Answer the following crossword puzzle clue: {clue} and ensure that the word is {number} letters long.") 
chain2 = prompt2 | model | output_parser
chain2.invoke({"clue": "drop the ball", "number": "3"})

ChatGPT’s answer: Err

Yes! Adding the length of the word to the prompt worked like a charm. This is particularly helpful, of course, because when we solve a crossword puzzle, we know exactly how many letters we need to fill in. Additionally, ChatGPT immediately got the context that an early ESL person may lack: It is very well versed in synonyms and idioms.
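That said, the model cannot be trusted to honor a length constraint stated only in the prompt, so it is cheap insurance to verify the letter count locally before penciling an answer in. Here is a minimal sketch; the helper name and the candidate lists are my own, purely illustrative:

```python
def filter_by_length(candidates, n):
    """Keep only candidate answers whose letter count matches the grid slot."""
    return [c for c in candidates if len(c.replace(" ", "")) == n]

# Hypothetical candidates for a three-letter slot clued "drop the ball"
print(filter_by_length(["err", "fail", "flub"], 3))  # only "err" survives
```

A check like this can also drive a retry loop: if no candidate fits, re-invoke the chain with a firmer reminder of the length.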

To be fair, both clues were on the “easy” side because we are playing to ChatGPT’s strengths: Cultural context, deep vocabulary, synonyms, and idioms are ChatGPT’s bread and butter. I would be very surprised if the training data did not contain multiple English language dictionaries. So, we still have not fully tested ChatGPT’s abilities using wordplay or puns, which are a lot harder even for humans to understand. These require logical deduction, not mere knowledge. Let us try the next clue for this purpose:

chain2.invoke({'clue':'star sign?', "number": "9"})

ChatGPT’s answer: Capricorn

Good thought, ChatGPT, but it is both too literal and not literal enough as an interpretation of the clue. We need to get ChatGPT to understand that the question mark at the end of the clue means it is tricky, and that it involves a pun or wordplay. I will add that to the next prompt:

prompt3 = ChatPromptTemplate.from_template("Answer the following crossword puzzle clue: {clue} and ensure that the word is {number} letters long. \
If the clue has a question mark at the end, it means it will be a tricky clue, maybe it will involve a pun or a word play, so be extra careful. \
Also, try to give multiple options for an answer.")
chain3 = prompt3 | model | output_parser
chain3.invoke({'clue':'star sign?', "number": "9"})

ChatGPT’s answer:

1) Capricorn

2) Aquarius

OK, the answer is still eluding ChatGPT. What is worse, “Aquarius” does not even have nine letters. We already know that ChatGPT is not great at counting, so this is not a huge surprise. The more alarming part of this answer is that ChatGPT does not seem to process the information logically: “Star” in this case means a famous person and “sign” means “signature,” making the answer “autograph.” This is a very tricky clue that would trip up many humans as well. So how do we at least try to get ChatGPT closer to the answer? For this, I go back to my early ESL years and think about what helped me: considering each word’s alternative meanings and stretching the grammar rules. “Star sign?” is not a grammatical question the way “what is the signature of a star?” is, but I can get from one to the other. How can I get ChatGPT to do that? Here is an attempt:

prompt4 = ChatPromptTemplate.from_template("Answer the following crossword puzzle clue: {clue}, and ensure that the word is {number} letters long. \
If the clue has a question mark at the end, it means it will be a tricky clue, maybe it will involve a pun or a word play, so be extra careful. \
What else can the clue mean? If you put the words together, can it become a sentence? \
Also, try to give multiple options for an answer.")
chain4 = prompt4 | model | output_parser
chain4.invoke({'clue':'star sign?', "number": "9"})

ChatGPT’s answer:

The clue “star sign?” could refer to a Zodiac sign, a constellation, or even a celebrity’s autograph. However, considering the condition that the answer must be 9 letters long, the answer could be:

1. CAPRICORN: This is a Zodiac sign that is represented by the Sea-Goat and is associated with individuals who are born between December 22 and January 19.

2. SAGITTARI: This is a shortened form of Sagittarius; a Zodiac sign represented by the Archer and is associated with individuals who are born between November 22 and December 21.

3. AUTOGRAPH: This could be a tricky interpretation of the clue, referring to a ‘star’ (celebrity) ‘sign’ (autograph).

Remember, crossword clues are often cryptic and can have more than one meaning. They can sometimes be literal but are often metaphorical, requiring some lateral thinking. The answers were reached by thinking outside of the box and examining other possible meanings of the words “star” and “sign,” beyond the most obvious association with the zodiac. Autographs are often associated with stars (celebrities); therefore, this answer could potentially fit the tricky clue.

Bingo! ChatGPT is able to deduce that “autograph” is a potential answer to the clue with a slightly more elaborate prompt. This is quite impressive in my opinion, not the least because the solution occurred to ChatGPT after the longer, more elaborate prompt. The explanation as to how it arrived at this solution is also quite impressive: It went beyond the zodiac to test other meanings of star and sign.
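One practical wrinkle: now that the prompt asks for multiple options, the chain returns a single free-text string, so it helps to pull the candidate words back out before checking them against the grid. A rough sketch, assuming (as in the response above) that the candidate answers appear in all caps:

```python
import re

def extract_candidates(response, n):
    """Extract all-caps words of exactly n letters from a free-text answer."""
    words = re.findall(r"\b[A-Z]{2,}\b", response)
    # Deduplicate while preserving order, then apply the length constraint
    return [w for w in dict.fromkeys(words) if len(w) == n]

response = "1. CAPRICORN: a Zodiac sign... 2. SAGITTARI: ... 3. AUTOGRAPH: ..."
print(extract_candidates(response, 9))  # ['CAPRICORN', 'SAGITTARI', 'AUTOGRAPH']
```

A more robust approach would be to ask the model for structured (e.g., JSON) output, but a regex is enough for a quick experiment like this one.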

A more rigorous test? Tweaking the clue to make it less likely to have been in the training set

Several scientists are working on truly testing the “intelligence” of ChatGPT. One clever method they have developed is reversing a question, which reduces the likelihood that the answer appears in the training data and so better tests ChatGPT’s powers of logical deduction. For example, ChatGPT does a lot worse when asked “Who is Mary Lee Pfeiffer’s son?” than “Who is Tom Cruise’s mother?” because the former is much less likely to appear in the training data than the latter. This also shows that ChatGPT cannot deduce the reverse relation from the first correct answer. It cannot simply say “if Mary Lee Pfeiffer is Tom Cruise’s mother, Mary Lee Pfeiffer’s son should be Tom Cruise.” A simple deduction like this is beyond the reach of the model.
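What makes this failure striking is that the missing deduction is mechanically trivial. A symbolic system only needs to invert the relation, as this toy sketch shows (the dictionary is illustrative, standing in for facts the model saw in training):

```python
# Forward relation, as it might appear in training data
mother_of = {"Tom Cruise": "Mary Lee Pfeiffer"}

# Inverting the mapping answers the "reversed" question for free
son_of = {mother: child for child, mother in mother_of.items()}
print(son_of["Mary Lee Pfeiffer"])  # Tom Cruise
```

A next-token predictor has no such inverted index: it can only complete the question in the direction the text flowed during training.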

I tried a similar approach with my final clue:

chain4.invoke({"clue": 'sign star?', "number": "9"})

ChatGPT’s answer: ASTROLOGY

Explanation: The clue asked for a “sign star?” which could be interpreted as a reference to star signs or zodiac signs. The study of these signs is known as astrology, which fits the length requirement of 9 letters.

Another possible answer: CELESTIAL

Explanation: The term “celestial” is often used to refer to stars and the sky, while “sign” could be a hint towards the celestial signs in astrology. Therefore, “celestial” could be a creative answer to the clue “sign star?”.

Not great. ChatGPT fell into the reversal curse. It is not very intuitive to deduce “autograph” from “sign star?”, but the fact that ChatGPT managed it for “star sign?” tells us it should have been able to do the same here if it were using some sort of logic or intelligence.

What does this mean and why should we care?

It is difficult to know how ChatGPT produced the solution even with the explanation it provided. It could have dug deeper into its cryptic crossword puzzle answers knowledge after the more elaborate prompt to find the answer in the training data. Alternatively, it could have developed a deeper understanding of the English language to be able to deduce that “star” means “celebrity” and “sign” means “signature,” and therefore “star sign?” with a question mark in the context of crossword puzzles can mean an autograph. The latter is unlikely because we do not have any evidence to corroborate it, and we have evidence to the contrary as shown above.

This does not negate all the productivity improvements and life hacks that ChatGPT is already helping millions of humans with. It is an extremely useful tool that makes our lives easier every day. It simply means that ChatGPT is not “intelligent” or “logical” in the sense that we understand humans to be. And it does not have to be. ChatGPT will get better and better at making our lives easier and better regardless of its deductive capabilities.

Ceren Altincekic is on LinkedIn.
