E1 : Chain of Thought Prompting

Praveen Thenraj
Research Papers Summarized
4 min read · May 24, 2023


Decomposing a complex problem into multiple smaller reasoning steps, stitched together as a chain of thoughts (prompts) for LLMs.

Paper Name : Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

URL : https://arxiv.org/abs/2201.11903

Authors : Google Research, Brain Team — Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Conference : NeurIPS 2022

Problem Statement :

  1. Despite the massive improvements that LLMs have brought to NLP, their inability to handle arithmetic, commonsense, and symbolic reasoning has always been a concern.
  2. Simply scaling up the size of LLMs has not solved this either.
  3. Though in-context learning (zero/few-shot prompting) has helped avoid fine-tuning LLMs for many domain-specific tasks, instilling reasoning capabilities in LLMs has remained challenging.

Solution :

  1. Decomposing a complex arithmetic/reasoning task into multiple intermediate steps, the way humans approach reasoning tasks (Fig 1)
  2. Breaking complex reasoning tasks into steps using natural language rationales (Fig 1)
  3. Each prompt exemplar consists of a triplet: input, chain of thought (CoT), and output (Fig 1; a small sketch of this structure follows the figure caption below)
  4. The approach was evaluated on arithmetic, commonsense, and symbolic reasoning.
Fig 1: Standard prompting — a pair of question and answer as the prompt. Chain-of-thought prompting — the problem is broken into multiple reasoning steps and the answer is then derived from the reasoning.
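To make the input / chain-of-thought / output triplet concrete, here is a minimal Python sketch contrasting a standard exemplar with a CoT exemplar. The question and rationale are the well-known example from the paper's Figure 1; the helper function names are my own and purely illustrative.

```python
# Minimal sketch of the two prompting styles. Only the question and rationale
# below come from the paper (Fig 1); the helper names are illustrative.

def standard_exemplar(question: str, answer: str) -> str:
    """Standard prompting: a question paired directly with its final answer."""
    return f"Q: {question}\nA: The answer is {answer}.\n"

def cot_exemplar(question: str, chain_of_thought: str, answer: str) -> str:
    """CoT prompting: the triplet of input, chain of thought, and output."""
    return f"Q: {question}\nA: {chain_of_thought} The answer is {answer}.\n"

question = ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
            "Each can has 3 tennis balls. How many tennis balls does he have now?")
rationale = ("Roger started with 5 balls. 2 cans of 3 tennis balls each is "
             "6 tennis balls. 5 + 6 = 11.")

print(standard_exemplar(question, "11"))
print(cot_exemplar(question, rationale, "11"))
```

Several such exemplars are concatenated in front of the test question, so the model sees worked rationales before it is asked to produce its own.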

Experimental Setup :

  1. LLMs evaluated — GPT-3 (ada, babbage, curie, davinci), LaMDA, PaLM, UL2, Codex
  2. Inference compute — LaMDA (TPU v3), PaLM (TPU v4), GPT-3 (OpenAI API)

Arithmetic Reasoning Observations :
Benchmarks evaluated — GSM8K, SVAMP, ASDiv, AQuA, MAWPS

  1. The approach beat standard prompting, with larger gains on complicated problems (GSM8K) than on simpler problems (MAWPS).
  2. PaLM (540B), the largest model considered, even surpassed the prior supervised best on the GSM8K, SVAMP, and MAWPS benchmarks, and fell short by only about 2% on AQuA and ASDiv.
  3. 45 randomly chosen problems that PaLM (62B) answered incorrectly were categorised into semantic understanding errors (20), one-step-missing errors (18), and other errors (7). The "other" category included hallucinations, repetitive outputs, and symbol mapping errors.
  4. Re-running the same problems with PaLM (540B) fixed a considerable proportion of these errors. This is in line with the hypothesis that LLMs acquire semantic understanding and logical reasoning as a function of model scale.

Commonsense Reasoning Observations :
Benchmarks evaluated — CSQA, StrategyQA, Date understanding, Sports understanding, SayCan

  1. With PaLM (540B), the results achieved with CoT were much better than standard prompting on StrategyQA, Date understanding, Sports understanding, and SayCan. On the Sports understanding dataset it even surpassed human performance, and on StrategyQA it exceeded the prior supervised best.
  2. On CSQA, not much difference in performance was noticed between standard and CoT prompting.

Symbolic Reasoning Observations :
Symbolic reasoning tasks evaluated — last letter concatenation, coin flip

  1. Larger LLMs achieved an almost 100% solve rate on the coin flip task.
  2. The in-domain test samples for last letter concatenation matched the prompt exemplars, i.e., concatenating the last letters of two words. The out-of-domain (OOD) samples required concatenating the last letters of three or four words (see the sketch after this list).
  3. Performance gains of PaLM (540B) were significant on both in-domain and out-of-domain test samples for both last letter concatenation and coin flip.
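To make these toy tasks concrete, here is a minimal sketch of their ground-truth logic. Only the task definitions come from the paper; the example words and flip sequence are illustrative.

```python
# Ground-truth logic for the two symbolic reasoning toy tasks.
# Only the task definitions come from the paper; the inputs are illustrative.

def last_letter_concatenation(words: list[str]) -> str:
    """Concatenate the last letter of each word.
    In-domain: 2 words (as in the prompt exemplars); OOD: 3 or 4 words."""
    return "".join(word[-1] for word in words)

def coin_flip(starts_heads_up: bool, flip_actions: list[bool]) -> str:
    """Track whether the coin is still heads up after a sequence of
    flip / do-not-flip actions (True means the person flips the coin)."""
    heads_up = starts_heads_up
    for flipped in flip_actions:
        if flipped:
            heads_up = not heads_up
    return "yes" if heads_up else "no"

print(last_letter_concatenation(["Amy", "Brown"]))                 # in-domain -> "yn"
print(last_letter_concatenation(["Amy", "Brown", "Lee", "Ford"]))  # OOD -> "yned"
print(coin_flip(True, [True, False]))  # flipped once, then not flipped -> "no"
```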

General Observations :

  • Chain-of-thought prompting showed significant gains only with LLMs of more than ~100B parameters. The larger the model, the better the results.
  • Large-scale LLMs (>100B) acquire stronger semantic understanding and reasoning capabilities than LLMs below 100B parameters.
  • Chain-of-thought prompting worked well with a small set of exemplar prompts (mostly 6-8 examples per prompt in this paper).
  • CoT-based inference remains robust to factors such as different annotators writing the chain-of-thought rationales for the same exemplars, annotators without an ML background, the order of the exemplars, the number of exemplars, and exemplars drawn from a distribution different from the test set (a small sketch of such variations follows this list).
  • No language models were fine-tuned as part of this process.
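As a rough illustration of the robustness checks above, the sketch below builds several prompt variants by shuffling exemplar order and varying the number of exemplars. The placeholder exemplar strings and the function name are hypothetical, not the paper's actual prompts.

```python
# Hypothetical sketch: building few-shot CoT prompt variants to probe robustness
# to exemplar order and exemplar count. The placeholders stand in for real CoT
# triplets like the one sketched earlier (the paper mostly uses 6-8 exemplars).
import random

def build_prompt(exemplars: list[str], test_question: str, k: int, seed: int) -> str:
    """Sample k exemplars in a seed-dependent order and append the test question."""
    rng = random.Random(seed)
    chosen = rng.sample(exemplars, k)  # different subset/order per seed
    return "".join(chosen) + f"Q: {test_question}\nA:"

placeholder_exemplars = [
    f"Q: <question {i}>\nA: <reasoning steps {i}>. The answer is <answer {i}>.\n"
    for i in range(8)
]

for seed in range(3):
    variant = build_prompt(placeholder_exemplars, "<test question>", k=6, seed=seed)
    print(f"=== variant {seed} ===\n{variant}\n")
```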

Conclusions :

  • The concern about the lack of reasoning capabilities in LLMs has thus been addressed to a considerable extent by tweaking the in-context learning (standard prompting) technique in this paper.
  • Decomposing a bigger problem into smaller chunks, so that each can be understood and solved correctly, is the key idea behind the paper's approach.
  • However, the fact that these reasoning gains emerge only with LLMs above ~100B parameters remains a concern.
  • Inducing comparable gains in reasoning capabilities using relatively smaller off-the-shelf LLMs can be a good space to explore.
