Two minutes NLP — Making Large Language Models reason with Chain of Thought Prompting

Prompt engineering and chain of thought prompting for arithmetic, symbolic, and commonsense reasoning

Fabio Chiusano
Published in NLPlanet
4 min read · Apr 8, 2022


Hello fellow NLP enthusiasts! As large language models became capable of few-shot and zero-shot learning, we learned that different prompts achieve different results, and we started talking about “prompt engineering”. Today we see how different prompts lead to very different performance on reasoning tasks such as arithmetic, symbolic, and commonsense reasoning. Enjoy! 😄

Chain of Thought Prompting

Scaling up large language models has improved their performance on a variety of NLP tasks. However, not all tasks have benefited equally: even the largest models currently struggle with certain reasoning tasks such as math word problems, symbolic manipulation, and commonsense reasoning.

One interpretation of this finding is that large language models successfully perform system-1 tasks, which are done quickly and intuitively by humans. However, system-2 tasks require slow and deliberate thinking (often with multiple steps) and include logical, mathematical, and commonsense reasoning tasks, among others. Language models struggle on system-2 tasks, even when scaled to hundreds of billions of parameters, achieving flat scaling curves (i.e. simply increasing model scale does not lead to substantive performance gains).

It turns out that prompting language models with chains of thoughts, similar to how a person reasons, greatly improves their performance on reasoning tasks, defeating the flat scaling curves previously observed.

The intuition is that a chain of thought allows language models to decompose a multi-step problem into intermediate steps that are solved individually, instead of solving an entire multi-hop problem in a single forward pass.

Here is an example of a chain of thought prompt:

Chain of thought (highlighted) facilitates multistep reasoning in large language models. The output here is from a 137B parameter LaMDA language model. Image from https://arxiv.org/pdf/2201.11903.pdf.
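
To make this concrete outside of the image, here is a minimal sketch of what a few-shot prompt looks like with and without a chain of thought. The question and rationale are the tennis-balls running example from the paper; the exact string formatting is my own assumption, not the authors' code.

```python
# Minimal sketch: a one-shot prompt with and without a chain of thought.
# The exemplar is the tennis-balls running example from the paper; the exact
# formatting of the prompt strings is an assumption for illustration.

exemplar_question = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Standard prompting: the exemplar answer is just the final result.
standard_exemplar = exemplar_question + "\nA: The answer is 11.\n"

# Chain of thought prompting: the exemplar answer spells out the intermediate steps.
cot_exemplar = (
    exemplar_question
    + "\nA: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)

# New question the model should answer in the same style.
new_question = (
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)

standard_prompt = standard_exemplar + "\n" + new_question + "\nA:"
cot_prompt = cot_exemplar + "\n" + new_question + "\nA:"
print(cot_prompt)
```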

Advantages of Chain of Thought Prompting

Chain of thought prompting has several attractive properties:

  1. Allows models to decompose multi-step problems into intermediate steps, which means that additional computation can be allocated to problems that require more reasoning steps.
  2. Provides an interpretable window into the behavior of the model, suggesting how it might have arrived at a particular answer and providing opportunities to debug where the reasoning path went wrong.
  3. Can be used for tasks such as math word problems, symbolic manipulation, and commonsense reasoning, and is applicable to any task that humans can solve via language.
  4. Can be readily elicited in sufficiently large off-the-shelf language models simply by including examples of chain of thought sequences into the exemplars of few-shot prompting, as sketched right after this list.
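
To make the last point concrete, the sketch below feeds a chain-of-thought few-shot prompt to an off-the-shelf text-generation model through the Hugging Face pipeline API. The model name and generation settings are placeholders for illustration; the paper's results come from much larger models (LaMDA and PaLM), so a small model like this will not actually reason well.

```python
# Sketch of eliciting a chain of thought purely through few-shot exemplars.
# "gpt2" is a placeholder small model chosen only to show the mechanics;
# the paper's results use far larger models (LaMDA 137B, PaLM 540B).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

few_shot_prompt = (
    "Q: There are 3 cars in the parking lot and 2 more cars arrive. "
    "How many cars are in the parking lot?\n"
    "A: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. "
    "The answer is 5.\n\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?\n"
    "A:"
)

# Greedy decoding; the continuation should ideally contain the reasoning steps.
output = generator(few_shot_prompt, max_new_tokens=64, do_sample=False)
print(output[0]["generated_text"])
```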

Experimental Results

For six reasoning tasks where standard prompting has a flat scaling curve, chain of thought prompting leads to dramatically increasing scaling curves for sufficiently large language models.

Experiments were run with two large language models: LaMDA and PaLM.

Arithmetic reasoning

When LaMDA and PaLM are tested on simple arithmetic questions, chain of thought prompting achieves only slightly better results than standard prompting.

When scaling up the model already facilitates good performance, chain of thought prompting does as well or better. Image from https://arxiv.org/pdf/2201.11903.pdf.

The main advantages come when the models are asked more difficult arithmetic questions, such as the ones from the MultiArith and GSM8K datasets.

Examples of correct and incorrect chains of thought produced by LaMDA 137B on the GSM8K dataset. Image from https://arxiv.org/pdf/2201.11903.pdf.

Both LaMDA and PaLM improve dramatically with chain of thought prompting compared to standard prompting.

Employing chain of thought enables language models to solve challenging math word problems for which standard prompting has a mostly flat scaling curve. Image from https://arxiv.org/pdf/2201.11903.pdf.
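
On GSM8K-style benchmarks, the chain of thought conventionally ends with a sentence like “The answer is 11.”, so accuracy can be computed by extracting the final number from the generated text and comparing it to the gold answer. Here is a minimal, assumed extraction helper (not the paper's evaluation code):

```python
import re

def extract_final_answer(generated_text):
    """Return the last number in a generated chain of thought, as a string.

    This relies on the convention of ending the chain with
    "The answer is <number>."; the exact parsing rule is an assumption.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generated_text.replace(",", ""))
    return numbers[-1] if numbers else None

# Example usage on an output written in the paper's style.
completion = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11."
)
print(extract_final_answer(completion))  # -> "11"
```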

Symbolic Reasoning

For symbolic reasoning, consider the tasks of last letter concatenation, reverse list, and coin flip shown in the next image.

Few-shot exemplars for chain of thought prompting for symbolic reasoning datasets. Chains of thought are highlighted. Image from https://arxiv.org/pdf/2201.11903.pdf.
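
To make these tasks concrete, here is the ground-truth logic behind two of them, written out as plain Python (my own illustration, not the paper's data-generation code):

```python
# Ground-truth logic for two of the symbolic reasoning tasks,
# written as an illustration rather than taken from the paper.

def last_letter_concatenation(name):
    """Last letter concatenation: 'Elon Musk' -> 'nk'."""
    return "".join(word[-1] for word in name.split())

def coin_still_heads_up(num_flips):
    """Coin flip: the coin starts heads up and each flip reverses it."""
    return num_flips % 2 == 0

print(last_letter_concatenation("Elon Musk"))  # -> "nk"
print(coin_still_heads_up(3))                  # -> False
```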

Again, performance on all three tasks improves greatly with chain of thought prompting.

For three symbolic reasoning tasks, employing chain of thought facilitates good performance when standard few-shot prompting is insufficient. Image from https://arxiv.org/pdf/2201.11903.pdf.

Commonsense Reasoning

Lastly, consider the following commonsense reasoning tasks: CommonsenseQA, StrategyQA, date understanding, and sports understanding.

Few-shot exemplars for chain of thought prompting for commonsense reasoning datasets. Chains of thought are highlighted. Image from https://arxiv.org/pdf/2201.11903.pdf.
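
As in the arithmetic case, each exemplar is just a question followed by a short written-out rationale. The date understanding exemplar below is my own illustration in the style of the paper's prompts, not copied verbatim from them:

```python
# An illustrative chain-of-thought exemplar for a date understanding question,
# written in the style of the paper's prompts (not copied verbatim from them).
date_exemplar = (
    "Q: Yesterday was April 30, 2021. What is the date today in MM/DD/YYYY?\n"
    "A: Yesterday was April 30, 2021, so today is one day later, "
    "which is May 1, 2021. The answer is 05/01/2021.\n"
)
print(date_exemplar)
```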

Here is their performance with standard and chain of thought prompting.

Compared with standard prompting, chain of thought prompting also improves performance on various types of commonsense reasoning tasks. Image from https://arxiv.org/pdf/2201.11903.pdf.

Conclusions and next steps

This work underscores that standard prompting only provides a lower bound on the capabilities of large language models. What other prompting methods might expand the range of tasks that language models can solve?

Possible next steps are:

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
