Two minutes NLP — Making Large Language Models reason with Chain of Thought Prompting
Prompt engineering and chain of thought prompting for arithmetic, symbolic, and commonsense reasoning
Hello fellow NLP enthusiasts! As large language models became capable of few-shot and zero-shot learning, we learned that different prompts achieve very different results, and we started talking about “prompt engineering”. Today we see how different prompts lead to very different performance on many reasoning tasks, such as arithmetic, symbolic, and commonsense reasoning. Enjoy! 😄
Chain of Thought Prompting
Scaling large language models has improved their performance on a variety of NLP tasks. However, not all tasks benefit equally: even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning.
One interpretation of this finding is that large language models successfully perform System 1 tasks, which are done quickly and intuitively by humans. System 2 tasks, in contrast, require slow and deliberate thinking (often with multiple steps) and include logical, mathematical, and commonsense reasoning, among others. Language models struggle on System 2 tasks even when scaled to hundreds of billions of parameters, exhibiting flat scaling curves (i.e. simply increasing model scale does not lead to substantive performance gains).
It turns out that prompting language models with chains of thought, similar to how a person reasons, greatly improves their performance on reasoning tasks, overcoming the flat scaling curves previously observed.
The intuition is that a chain of thought allows language models to decompose a multi-step problem into intermediate steps that are solved individually, instead of solving an entire multi-hop problem in a single forward pass.
Here is an example of a chain of thought prompt:
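The exemplar below is adapted from the arithmetic demonstration in the original paper. The key point is that the few-shot exemplar's answer spells out the intermediate reasoning steps, so the model is nudged to produce a similar chain of thought for the new question:

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and
bought 6 more, how many apples do they have?
A:
```

A model prompted this way tends to answer with a step-by-step explanation ending in a final answer, rather than guessing a number in a single leap.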
Advantages of Chain of Thought Prompting
Chain of thought prompting has several attractive properties:
- Allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
- Provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong.
- Can be used for tasks such as math word problems, symbolic manipulation, and commonsense reasoning, and is applicable to any task that humans can solve via language.
- Can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences into the exemplars of few-shot prompting.
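As a minimal sketch of that last point, the helper below assembles a few-shot prompt in which each exemplar carries its chain of thought. The function name and wrapper code are illustrative (not from the paper); only the Q:/A: exemplar format follows the paper.

```python
# Hypothetical helper: build a few-shot chain-of-thought prompt.
# The Q:/A: exemplar format (answers that spell out the reasoning)
# follows the paper; the function and variable names are illustrative.

def build_cot_prompt(exemplars, question):
    """Join (question, reasoned_answer) pairs and append the new
    question, leaving 'A:' open for the model to complete."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

exemplars = [
    ("There are 3 cars in the parking lot and 2 more cars arrive. "
     "How many cars are in the parking lot?",
     "There are originally 3 cars. 2 more cars arrive. "
     "3 + 2 = 5. The answer is 5."),
]

prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch "
    "and bought 6 more, how many apples do they have?",
)
print(prompt)
```

The only difference from standard few-shot prompting is the content of the exemplar answers: no fine-tuning or architectural change is needed.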
Experimental Results
For six reasoning tasks where standard prompting has a flat scaling curve, chain of thought prompting leads to dramatically increasing scaling curves for sufficiently large language models.
Experiments were run with two large language models: LaMDA and PaLM.
Arithmetic reasoning
Testing LaMDA and PaLM with simple arithmetic questions, chain of thought prompting achieves slightly better results.
The main advantages come when the models are asked more difficult arithmetic questions, such as the ones from the MultiArith and GSM8K datasets.
Both LaMDA and PaLM achieve a great improvement with chain of thought prompting over standard prompting.
Symbolic Reasoning
For symbolic reasoning, consider the tasks of last letter concatenation, reverse list, and coin flip shown in the next image.
Again, performance on all of them improves greatly with chain of thought prompting.
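To make these three tasks concrete, the reference implementations below compute the ground-truth answers the model is expected to reproduce. This is illustrative code, not code from the paper:

```python
# Ground-truth solvers for the three symbolic reasoning tasks.
# Illustrative implementations, not code from the paper.

def last_letter_concatenation(name: str) -> str:
    """'Elon Musk' -> 'nk': concatenate the last letter of each word."""
    return "".join(word[-1] for word in name.split())

def reverse_list(items: list) -> list:
    """Return the input list in reverse order."""
    return items[::-1]

def coin_still_heads_up(flips: list) -> bool:
    """A coin starts heads up; each True in `flips` means that person
    flipped it. The coin is still heads up iff the total number of
    flips is even."""
    return sum(flips) % 2 == 0

print(last_letter_concatenation("Elon Musk"))      # -> nk
print(reverse_list(["apple", "pear", "banana"]))
print(coin_still_heads_up([True, False]))          # one flip -> False
```

With chain of thought prompting, the model works through these same intermediate steps in text (e.g. "the last letter of 'Elon' is 'n', the last letter of 'Musk' is 'k', so the answer is 'nk'") instead of emitting the answer directly.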
Commonsense Reasoning
Finally, consider the following commonsense reasoning tasks: CommonsenseQA, StrategyQA, date understanding, and sports understanding.
Here is the models' performance on them with standard and chain of thought prompting.
Conclusions and next steps
This work underscores that standard prompting only provides a lower bound on the capabilities of large language models. What other prompting methods might expand the range of tasks that language models can solve?
Possible next steps are:
- Learn about the PaLM model;
- Learn about decoding strategies for language models.