So you think you can prompt?
Looking at research on prompting (mostly Chain of Thought).
The rise of Generative AI (Gen-AI) and its chatbots has undeniably taken the world by storm. It’s almost Luddite-like not to be aware of their impact. I certainly feel a bit like one writing this article: are there still people trying to solve problems themselves and looking for answers on the web? Hopefully I am not in the minority. Luddite or not, the advances are truly fascinating, and in my humble attempt to keep up, I am sharing my notes from my deep dives into the world of prompting.
Prompts: Origin
Historically, ML models required inputs in a fixed format and therefore were inaccessible to the vast majority of people and could not infiltrate our day-to-day chores. However, with the increasing capabilities of large language models (LLMs), in-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP). ICL allows LLMs to make predictions based on context — usually provided through a prompt — enabling users to interact with them using natural language and apply them to everyday problems.
A “prompting-only” approach went viral because:
- It doesn’t require a large training dataset to perform a task successfully.
- A single model checkpoint can perform many tasks without loss of generality, making it easier to provide it as a service (think of all the bots) to end-users.
- It eliminates fixed input formats, making it very easy to use without a steep learning curve (unlike previous non-NLP-based models).
Now that natural language, prompt-based models are ubiquitous in user product journeys and our lives, it is imperative to learn more about them. Understanding their full potential, limitations and biases allows us to use them responsibly and benefit from them without falling prey to misinformation (worst case) or confidently saying the wrong things and looking like an idiot (umm, the less bad case).
How much difference can the quality of a prompt make?
First things first: not all prompts are created equal. The “garbage in, garbage out” rule still applies. The quality of your prompt directly determines the quality of the answer you get, and in mathematical cases, whether the problem gets solved at all.
In short, a prompt makes a HUGE difference! See this research paper for in-depth analysis. This dependency on prompt quality leads to a robustness problem. See Table 1 (in the paper) for a quick overview.
There is a significant gap between the worst performance (lower bound) and best performance (upper bound) for all models. For instance, the worst and best performance of Llama-2-70B-chat are 0.094 and 0.549, respectively, indicating a difference of 0.455. This suggests that the current LLMs’ ability to follow instructions is not robust enough. Even instructions with identical semantics and fluent expressions could lead models like Llama-2-70B-chat to plummet from a level comparable to GPT4 (0.5 indicates equivalence to the reference model) to far below the average level (0.292).
^ Quote from the above-mentioned paper.
Shot-Based Prompting
This is a quick review of shot-based prompting, where you provide the problem along with some examples of similar, already-solved problems. The number of examples provided is the number of “shots” (I am not sure how this term came to be).
Zero-Shot Prompting
Only the problem statement is provided; no examples are given in the prompt.
Example Prompt: How many ‘r’ are there in the word “strawberry”?
Single-Shot Prompting
A single example is given along with the problem statement.
Example Prompt:
Question: How many ‘e’ are there in the word “blueberry”?
Answer: There are 2 ‘e’s in “blueberry”.
Question: How many ‘r’ are there in the word “strawberry”?
Few-Shot Prompting
An LLM receives in-context exemplars of input–output pairs before outputting a prediction for a test-time example.
Example Prompt:
Question: How many ‘e’ are there in the word “blueberry”?
Answer: There are 2 ‘e’s in “blueberry”.
Question: How many ‘p’ are there in the word “pineapple”?
Answer: There are 3 ‘p’s in “pineapple”.
Question: How many ‘r’ are there in the word “strawberry”?
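To make the three shot formats above concrete, here is a minimal Python sketch that assembles zero-, single- and few-shot prompts from the same list of solved exemplars. It only builds and prints the prompt strings; sending them to a model (hosted API, local model, whatever you use) is deliberately left out.

```python
# Minimal sketch: building zero-, single- and few-shot prompts from the
# same list of solved exemplars. Sending the final string to a model is
# left to whichever client or API you use.

EXEMPLARS = [
    ("How many 'e' are there in the word \"blueberry\"?",
     "There are 2 'e's in \"blueberry\"."),
    ("How many 'p' are there in the word \"pineapple\"?",
     "There are 3 'p's in \"pineapple\"."),
]

def build_prompt(question: str, n_shots: int = 0) -> str:
    """Prepend the first n_shots solved exemplars to the question."""
    lines = []
    for q, a in EXEMPLARS[:n_shots]:
        lines += [f"Question: {q}", f"Answer: {a}"]
    lines.append(f"Question: {question}" if lines else question)
    return "\n".join(lines)

question = "How many 'r' are there in the word \"strawberry\"?"
print(build_prompt(question, n_shots=0))  # zero-shot
print(build_prompt(question, n_shots=1))  # single-shot
print(build_prompt(question, n_shots=2))  # few-shot
```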
Chain of Thought Prompting
A chain of thought is a series of intermediate natural language reasoning steps that lead to the final output. In chain-of-thought (CoT) prompting, we use few-shot prompting and augment each exemplar with a chain of thought for an associated answer. The prompt consists of triples: <input, chain of thought, output>.
Example Prompt:
Question: How many ‘e’ are there in the word “blueberry”?
Chain of Thought: To find the number of ‘e’s in “blueberry”, I need to examine each letter of the word. Going through the word, I see an ‘e’ at the end of “blue” and another ‘e’ in “berry”. That makes a total of two ‘e’s.
Answer: There are 2 ‘e’s in “blueberry”.
Question: How many ‘p’ are there in the word “pineapple”?
Chain of Thought: Let’s look at each letter in “pineapple”. I see a ‘p’ at the beginning and another two ‘p’s in the middle. So, there are three ‘p’s.
Answer: There are 3 ‘p’s in “pineapple”.
Question: How many ‘r’ are there in the word “strawberry”?
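In code, the only change from plain few-shot prompting is that each exemplar becomes an <input, chain of thought, output> triple. A minimal sketch, reusing the letter-counting examples above; again, actually calling a model is left out.

```python
# Minimal sketch: a CoT prompt is assembled from <input, chain of thought,
# output> triples instead of plain question/answer pairs.

COT_EXEMPLARS = [
    ("How many 'e' are there in the word \"blueberry\"?",
     "To find the number of 'e's in \"blueberry\", I examine each letter. "
     "There is an 'e' at the end of \"blue\" and another 'e' in \"berry\", "
     "so two in total.",
     "There are 2 'e's in \"blueberry\"."),
    ("How many 'p' are there in the word \"pineapple\"?",
     "Looking at each letter in \"pineapple\", there is a 'p' at the start "
     "and two more 'p's in the middle, so three in total.",
     "There are 3 'p's in \"pineapple\"."),
]

def build_cot_prompt(question: str) -> str:
    """Format each exemplar as question, reasoning steps, then answer."""
    lines = []
    for q, cot, a in COT_EXEMPLARS:
        lines += [f"Question: {q}", f"Chain of Thought: {cot}", f"Answer: {a}", ""]
    lines.append(f"Question: {question}")
    return "\n".join(lines)

print(build_cot_prompt("How many 'r' are there in the word \"strawberry\"?"))
```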
CoT works better than few-shot prompting alone because traditional few-shot prompting methods (used in Brown et al. (2020)) struggle on tasks that require reasoning abilities and often do not improve significantly with increasing language model scale.
Pros of Chain-of-Thought Prompting:
CoT prompting has several attractive properties as an approach for facilitating reasoning in language models.
- Decomposition: CoT, in principle, allows models to decompose multi-step problems into intermediate steps, allocating more computation to complex problems.
- Interpretability: A CoT provides an interpretable window into the model’s behaviour, suggesting how it arrived at a particular answer and providing opportunities to debug incorrect reasoning paths (although fully characterising a model’s computations that support an answer remains an open question).
- Versatility: CoT reasoning can be used for tasks such as math word problems, common sense reasoning, and symbolic manipulation. It is potentially applicable (at least in principle) to any task that humans can solve via language.
- Ease of Implementation: CoT reasoning can be easily elicited in sufficiently large off-the-shelf language models simply by including examples of CoT sequences into the exemplars of few-shot prompting.
Results of CoT Prompting
The paper evaluates CoT prompting across several benchmark datasets and reports the following results:
- CoT prompting doesn’t positively impact performance for small models, and it only yields performance gains when used with models of ∼100B parameters. Smaller-scale models produced fluent but illogical chains of thought, leading to lower performance than standard prompting. (Note: the paper this was based on was published before DeepSeek, so this may no longer be the case.)
- CoT has larger performance gains for more complicated problems. For instance, on GSM8K (the dataset with the lowest baseline performance), performance more than doubled for the largest GPT and PaLM models. On the other hand, for SingleOp, the easiest subset of MAWPS, which requires only a single step to solve, performance improvements were either negative or very small.
- Chain-of-thought prompting with GPT-3 175B and PaLM 540B compares favourably to the prior state of the art, which typically fine-tunes a task-specific model on a labeled training dataset.
Robustness of Chain of Thought
CoT is more robust than few-shot prompts. While there is variance among different chain of thought annotations, all sets of chain of thought prompts outperform the standard baseline by a large margin. This implies that successful use of chain of thought does not depend on a particular linguistic style. CoT prompting for arithmetic reasoning is robust to different exemplar orders and varying numbers of exemplars.
Different number of exemplars
Gains from chain-of-thought prompting generally still held when the number of few-shot exemplars was varied. Increasing the number of exemplars in standard prompting did not lead to significant gains (e.g., going from 8 to 16 exemplars did not improve standard prompting enough to catch up with chain-of-thought prompting). This shows that CoT is more robust than standard few-shot prompting.
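As a rough illustration of this kind of robustness check (not the paper’s actual setup), the sketch below builds standard and CoT prompts with a varying number of exemplars. The exemplars are my own toy letter-counting examples, and `call_llm` is a stub you would swap for a real model client.

```python
# Sketch of a robustness check: vary the exemplar count and compare standard
# vs. chain-of-thought prompting. The exemplars are illustrative and call_llm
# is a stub, so the script runs as written without any model access.

EXEMPLARS = [
    ("How many 'e' in \"blueberry\"?",
     "There is an 'e' at the end of \"blue\" and one in \"berry\", so two.", "2"),
    ("How many 'p' in \"pineapple\"?",
     "One 'p' at the start and two in the middle, so three.", "3"),
    ("How many 'a' in \"banana\"?",
     "\"b-a-n-a-n-a\" has an 'a' in positions 2, 4 and 6, so three.", "3"),
    ("How many 'o' in \"coconut\"?",
     "\"c-o-c-o-n-u-t\" has an 'o' in positions 2 and 4, so two.", "2"),
]

def build(question: str, n_shots: int, with_cot: bool) -> str:
    lines = []
    for q, cot, a in EXEMPLARS[:n_shots]:
        lines.append(f"Question: {q}")
        if with_cot:
            lines.append(f"Chain of Thought: {cot}")
        lines.append(f"Answer: {a}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

def call_llm(prompt: str) -> str:  # placeholder: swap in a real model client
    return "(model answer here)"

question = "How many 'r' in \"strawberry\"?"
for n_shots in (1, 2, 4):
    for with_cot in (False, True):
        reply = call_llm(build(question, n_shots, with_cot))
        print(f"shots={n_shots}  cot={with_cot}  reply={reply}")
```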
Does CoT work on other types of problems?
Common sense Reasoning
So far, most of the results are based on mathematical problem datasets. The language-based nature of chain of thought actually makes it applicable to a broad class of common sense reasoning problems.
Similar to the mathematical problems, for all tasks, scaling up model size improved the performance of standard prompting; CoT prompting led to further gains even when the tasks required a range of common-sense abilities (though the improvement was minimal).
Symbolic Reasoning
To measure improvements on symbolic tasks, the paper tests various models with CoT prompts on two toy tasks:
- Last-letter concatenation: This task asks the model to concatenate the last letters of words in a name (e.g., “Amy Brown” → “yn”).
- Coin flip: This task asks the model to answer whether a coin is still heads up after people either flip or don’t flip the coin (e.g., “A coin is heads up. Phoebe flips the coin. Osvaldo does not flip the coin. Is the coin still heads up?” → “no”).
These in-domain evaluations are “toy tasks” in the sense that perfect solution structures are already provided by the chains of thought in the few-shot exemplars; all the model has to do is repeat the same steps with the new symbols in the test-time example.
However, small models still fail — the ability to perform abstract manipulations on unseen symbols for these tasks only arises at the scale of 100B model parameters.
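Part of the appeal of these toy tasks is that ground-truth answers can be computed with a couple of lines of code, which makes it easy to build exemplars and grade model outputs. A small sketch:

```python
# Ground-truth generators for the two symbolic-reasoning toy tasks.

def last_letter_concat(name: str) -> str:
    """'Amy Brown' -> 'yn': concatenate the last letter of each word."""
    return "".join(word[-1] for word in name.split())

def coin_still_heads(flips) -> str:
    """Coin starts heads up; each True entry flips it. 'yes' if still heads up."""
    heads_up = True
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return "yes" if heads_up else "no"

print(last_letter_concat("Amy Brown"))   # yn
# Phoebe flips the coin, Osvaldo does not flip it -> no longer heads up.
print(coin_still_heads([True, False]))   # no
```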
Limitations of CoT
CoT is not a universal solution and comes with a set of limitations.
- Annotation Cost: While the cost of manually augmenting exemplars with chains of thought is minimal in the few-shot setting, such annotation costs could be prohibitive for fine tuning (though this could potentially be surmounted with synthetic data generation, or zero-shot generalisation).
- Computational Cost: The emergence of chain-of-thought reasoning only at large model scales makes it costly to serve in real-world applications.
- Reasoning Errors: There is no guarantee of correct reasoning paths, which can lead to both correct and incorrect answers.
- “Reasoning” Question: Although chain of thought emulates the thought processes of human reasoners, this does not answer whether the neural network is actually “reasoning”.
- Model Dependence: If the same prompt is used with three different models, we don’t get the same improvement across all three; the fact that gains don’t transfer perfectly among models is a limitation.
- Model Specificity: Further work is needed to investigate how different pre-training datasets and model architectures affect the performance gain from chain-of-thought prompting.
Why does the Chain of Thought work?
The observed benefits of CoT prompting raise the natural question of whether the same performance improvements can be conferred via other types of prompting.
The paper tests three variations to see if CoT is the cause of improved performance.
Equation-Only Prompting
Instead of a natural-language chain of thought, the model was asked to output only the equation needed to solve the problem before giving the answer. Equation-only prompting doesn’t help much for questions that are too challenging to translate directly into an equation without the natural-language reasoning steps of a chain of thought. However, for datasets of one-step or two-step problems, equation-only prompting does improve performance, since the equation can be easily derived from the question.
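For a flavour of what this ablation looks like, here is a sketch of the equation-only exemplar format; the word problems are my own illustrative ones, not the paper’s prompt set.

```python
# Sketch of the "equation only" format: the exemplar's reasoning is replaced
# by the bare equation that solves the problem.

exemplar = (
    "Question: Ana has 4 boxes with 6 apples in each box. "
    "How many apples does she have?\n"
    "Answer: 4 * 6 = 24\n"
)
test_question = ("Question: A farm has 3 fields with 12 cows in each field. "
                 "How many cows are there?")
print(exemplar + "\n" + test_question)
```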
Variable Compute Only
Another intuition is that chain of thought allows the model to spend more computation (i.e., intermediate tokens) on harder problems. To isolate the effect of variable computation from chain-of-thought reasoning, the paper tests a configuration where the model is prompted to output only a sequence of dots (. . .) equal to the number of characters in the equation needed to solve the problem. This variant performs about the same as the baseline, suggesting that variable computation by itself is not the reason for the success of CoT prompting and that there appears to be utility from expressing intermediate steps via natural language.
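Sketching the same toy exemplar in the variable-compute-only format: the intermediate tokens are just dots, one per character of the equation they replace.

```python
# Sketch of the "variable compute only" control: extra tokens are emitted,
# but they carry no natural-language reasoning.

equation = "4 * 6 = 24"
exemplar = (
    "Question: Ana has 4 boxes with 6 apples in each box. "
    "How many apples does she have?\n"
    f"{'.' * len(equation)}\n"   # one dot per character of the equation
    "Answer: 24\n"
)
test_question = ("Question: A farm has 3 fields with 12 cows in each field. "
                 "How many cows are there?")
print(exemplar + "\n" + test_question)
```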
Chain of thought after answer
Another potential benefit of chain-of-thought prompting (as compared to vanilla few-shot prompting) could simply be that such prompts allow the model to better access relevant knowledge acquired during pre-training. Therefore, the paper tests an alternative configuration where the chain of thought is only given after the answer, isolating whether the model actually depends on the produced chain of thought to give the final answer. This variant performs about the same as the baseline, suggesting that the sequential reasoning embodied in the chain of thought is useful for reasons beyond just activating knowledge.
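And the same toy exemplar in the chain-of-thought-after-answer configuration: the answer comes first, so it cannot depend on the reasoning that follows it.

```python
# Sketch of the "chain of thought after answer" configuration.

exemplar = (
    "Question: Ana has 4 boxes with 6 apples in each box. "
    "How many apples does she have?\n"
    "Answer: 24\n"
    "Chain of Thought: Each box holds 6 apples and there are 4 boxes, "
    "so 4 * 6 = 24 apples.\n"
)
test_question = ("Question: A farm has 3 fields with 12 cows in each field. "
                 "How many cows are there?")
print(exemplar + "\n" + test_question)
```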
Conclusion
Prompt engineering is a rapidly evolving field and its power is undeniable. While techniques like chain-of-thought prompting offer significant improvements in LLM performance, they are not a panacea. Understanding both the strengths and limitations of these methods is crucial for responsible and effective application of Gen-AI. As models continue to evolve, so too must our understanding of how to best leverage their capabilities. As we increasingly rely on LLMs in various aspects of our lives, it’s essential to be aware of their potential biases and limitations. The journey of prompt optimisation is just beginning, and continuous learning and experimentation are key to unlocking the full potential of LLMs.
References/Future reads
- The CoT paper, also linked above.
- On how much difference a prompt makes, read this.
- To learn more about transformers, which give LLMs their superpowers, read here.

