The Importance of Chain-of-Thought Prompting

Cameron oeze
7 min read · May 18, 2023


Image credit: (https://www.freepik.com/premium-vector/thinking-man-statue-illustration-auguste-rodin-s-thinker_26772418.htm)

TL;DR

Chain-of-Thought prompting is a relatively new Large Language Model (LLM) prompting method first proposed by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou in their paper, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Published October 31, 2022, the paper compares the solve rates of math word problems on 5 different LLMs using Chain-of-Thought prompting versus standard prompting. The researchers found that as the scale of the LLMs increased, the solve rate with Chain-of-Thought prompting surpassed that of standard prompting and, in some cases, even exceeded the prior supervised best results. With LLMs becoming more available for public use and development, users who want to take advantage of these technologies (for personal or commercial use) should understand what large language models are, stay up to date on new ways to improve model performance such as Chain-of-Thought prompting, and understand the real-world implications of these developments.

Background

So what exactly is a large language model? A large language model is essentially a huge transformer model trained on billions of text examples. The transformer architecture relies on a mechanism known as self-attention, which assigns a weight to each word in an input based on how meaningful that word is in the context of the others. Self-attention is a crucial feature of LLMs and is a large part of why they learn language so much better than earlier models.
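To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention using NumPy. The matrices, dimensions, and random values are illustrative toys, not taken from any real model, and a production transformer would add multiple heads, masking, and learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # project tokens into query/key/value spaces
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every token to every other token
    weights = softmax(scores, axis=-1)     # each row: how much one token "attends" to the rest
    return weights @ V, weights            # output is a weighted mix of value vectors

# Toy example: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # one contextualized vector per token
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

The rows of `weights` are exactly the per-word weightings described above: each token's output vector is a blend of every token's value vector, weighted by contextual relevance.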

Example of self-attention. The color of the words represents a weight relating each word to the word "it" (Image credit: https://jalammar.github.io/illustrated-transformer/)
High-level example of what a transformer model does (Image credit: https://jalammar.github.io/illustrated-transformer/)

These LLMs have become incredibly significant and popular among researchers and machine learning enthusiasts in many different fields. Since these LLMs are trained on upwards of hundreds of billions of words of text, a user can simply query the model and, most of the time, get back a correct response. This can save researchers hundreds of hours, as they won't have to scour journals and the internet for answers. But while these LLMs are incredibly powerful and produce correct output at an eerily impressive rate, they are not 100% accurate and can suffer from hallucinations and reasoning errors. A hallucination is an output that looks like a plausible, correct answer but contains wrong information, such as an incorrect answer to a math problem or an inaccurate summary of a book.

Example of a LLM not always being correct ( Image from https://twitter.com/jalayrac/status/1524026443981901825 )

To reduce hallucinations and improve the solve rates of these models, a field known as prompt engineering has been making large strides. By including in an LLM's input a few examples of how it should produce its output (known as few-shot learning), researchers can prompt the model to respond in a way that increases its solve rate. Prompt engineering is an important field to understand if you are interested in LLMs and NLP, as it is driving much of the progress in LLM development: it makes models more accurate than previous learning mechanisms alone and lets users fine-tune outputs in a more targeted way.
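As a concrete sketch, a standard few-shot prompt is just the new question appended after a handful of solved exemplars. The exemplar problems below are my own illustrations, and in practice the resulting string would be sent to an LLM API rather than printed.

```python
# Sketch of standard few-shot prompting: each exemplar shows only the final
# answer, with no intermediate reasoning. Exemplar text is illustrative.
EXAMPLES = [
    ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
     "How many tennis balls does he have now?", "The answer is 11."),
    ("The cafeteria had 23 apples. They used 20 and bought 6 more. "
     "How many apples do they have?", "The answer is 9."),
]

def build_few_shot_prompt(question: str) -> str:
    """Concatenate Q/A exemplars, then append the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in EXAMPLES]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt(
    "If I read 12 pages a day, how many pages do I read in a week?"
)
print(prompt)
```

The trailing "A:" invites the model to complete the pattern the exemplars establish.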

Method

Chain-of-Thought prompting, while a very new method in the prompt engineering space, is quickly becoming one of the most promising ways to get accurate answers out of LLMs. So how does Chain-of-Thought prompting actually work? It works by prompting the LLM to break its response down into a series of intermediate steps, much like a person working through a complicated math problem. Provided below is an example of a user asking a Chain-of-Thought prompted LLM how much money they should put in their bank account:

Example of user input into a Chain-of-Thought prompted LLM

On a technical level, Chain-of-Thought prompting consists of providing the model with few-shot examples in which a problem is solved in multiple explicit steps. The output this produces is the input question broken down into several reasoning steps before the final answer is given. Depending on how a user wants the output broken down, they can supply few-shot examples with different styles of breakdown. In theory, working through the answer piece by piece should lead to a higher solve rate, as the model provides "reasoning" for its answer. But does Chain-of-Thought prompting actually increase solve rates in practice?
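The same prompt-assembly idea can be sketched for Chain-of-Thought: the only change from standard few-shot prompting is that each exemplar answer walks through its intermediate steps before stating the result. The exemplar below is adapted from the tennis-ball rationale used in the paper; the helper function and the new question are my own illustration.

```python
# Sketch of a Chain-of-Thought few-shot prompt: each exemplar spells out the
# intermediate reasoning steps before the final answer.
COT_EXAMPLES = [
    ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
     "How many tennis balls does he have now?",
     "Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
     "5 + 6 = 11. The answer is 11."),
]

def build_cot_prompt(question: str) -> str:
    """Concatenate reasoning-annotated exemplars, then append the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in COT_EXAMPLES]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_cot_prompt(
    "A library has 120 books and lends out 45. It then receives 30 new books. "
    "How many books does it have?"
)
print(prompt)
```

Because the exemplars model step-by-step reasoning, the completion tends to mimic that structure, emitting its own chain of intermediate steps before the answer.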

Results

In the paper, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, the researchers compared the solve rates of LLMs using Chain-of-Thought prompting against standard few-shot prompting, previously the most popular prompting method. They evaluated each method's solve rate on 5 popular math word problem benchmarks (GSM8K, SVAMP, ASDiv, AQuA, MAWPS) of varying questions across 5 popular LLMs (GPT-3, LaMDA, PaLM, UL2 20B, Codex) at different scales (e.g., 442M, 2B, and 137B parameters for the LaMDA model, up to 540B parameters with PaLM!). They found that while the difference between the two prompting methods was negligible, and sometimes favored standard prompting, on smaller models, Chain-of-Thought prompting began having a positive impact on solve rates for models above 100B parameters. Not only that, some combinations of benchmark and model, such as PaLM on the MAWPS benchmark, even surpassed prior supervised best solve rates!

Figure depicting solve rates of standard prompting and Chain-of-Thought prompting vs model scale (Image Citation: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models )

Discussion

Based on the results from the paper, Chain-of-Thought prompting looks like an extremely promising technique for the machine learning field as a whole. Since it outperforms standard prompting in larger-scale models, it could become the new default prompting method for those LLMs. Researchers and users alike stand to benefit, as it helps LLMs produce correct responses far more reliably. In practice, Chain-of-Thought prompting can be used with essentially any large language model above 100B parameters and provides better solve rates than standard prompting.

Image: ChatGPT hallucinating the incorrect album for the song Accordion by MF DOOM
Image: ChatGPT using Chain-of-Thought prompting to correctly identify the album that the song Accordion by MF DOOM is on

Commercially, Chain-of-Thought prompted models could show up in any field where multi-step reasoning is prominent. One area where I think Chain-of-Thought prompting could be very beneficial is education. Because of the nature of the method, LLMs prompted this way are great at providing the reasoning behind an answer. Educators could use them to help students understand not just whether an answer to their schoolwork is right or wrong, but why. Another field that I believe would benefit greatly is medicine: Chain-of-Thought prompting could be used alongside doctors to help explain why a patient received a diagnosis, as the model can provide complex reasoning for its output.

While Chain-of-Thought prompting is incredibly promising, it is not without its faults. It provides no real benefit in large language models under 100B parameters; in fact, the researchers found that it actually performed worse than standard prompting in models under 10B parameters. Even so, Chain-of-Thought prompting is still in its infancy and can be improved upon. One such improvement is to sample multiple reasoning paths for the same question and settle on the most consistent final answer, which could help the model correctly "reason" its way to an answer more often.
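That multiple-paths idea can be sketched as a simple majority vote over sampled completions. The sampled answers below are stand-ins for what a real model would return when sampled several times at a nonzero temperature.

```python
from collections import Counter

def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Majority vote: the answer reached by the most reasoning paths wins."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Pretend we sampled five chain-of-thought completions and extracted the
# final answer from each; three of the five paths agree on "11".
samples = ["11", "12", "11", "11", "9"]
print(self_consistent_answer(samples))  # "11"
```

The intuition is that a wrong chain of reasoning can fail in many different ways, while correct chains tend to converge on the same answer, so agreement across samples is a useful signal.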

As innovations to LLMs continue at an increasing pace, I personally believe that Chain-of-Thought prompting will be a staple of models for years to come, allowing LLMs to reach new heights in solve rates.
