Towards Reasoning in Large Language Models: A Survey

Dhanasree Rajamani
May 9, 2023


Abstract

Large Language Models (LLMs) have made significant progress in Natural Language Processing in recent years, and these models may exhibit reasoning abilities when they are sufficiently large. This article is an overview of the current state of knowledge on reasoning in LLMs: techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and a reflection and discussion on the current and future state of the field.

Introduction

Reasoning, a fundamental aspect of human intelligence, plays a crucial role in problem solving, decision making, and critical thinking. It is a cognitive process that involves using evidence, arguments and logic to arrive at conclusions or make judgements.

Recently, Large Language Models have made significant advances in Natural Language Processing and related fields, and these models exhibit emergent behaviors, including the ability to “reason”, when they are sufficiently large. This has sparked considerable interest in the community, since reasoning ability is a hallmark of human intelligence that is frequently considered missing in current artificial intelligence systems.

Despite the strong performance of LLMs on certain reasoning tasks, it is unclear whether LLMs are actually reasoning and to what extent they are capable of reasoning.

What is Reasoning?

Reasoning is the process of thinking in a logical and systematic way, using evidence and past experiences to reach a conclusion or make a decision. It involves making inferences, evaluating arguments, and drawing logical conclusions based on available information.

Here is a summary of several main categories of reasoning that are commonly recognized.

Deductive Reasoning : Deductive reasoning involves drawing logical conclusions based on the truth of the premises.

Premise : All mammals have kidneys

Premise : All whales are mammals

Conclusion : All whales have kidneys

Inductive Reasoning : In inductive reasoning, a conclusion is drawn based on observations or evidence.

Observation : Every time we see a creature with wings, it is a bird

Observation : We see a creature with wings

Conclusion : The creature is likely to be a bird

Abductive Reasoning : In abductive reasoning, a conclusion is drawn based on the best explanation for a given set of observations.

Observation : The car cannot start and there is a puddle of liquid under the engine

Conclusion : The most likely explanation is that the car has a leak in the radiator

Analogical Reasoning : Analogical reasoning involves making comparisons between two or more things in order to make inferences or arrive at conclusions.

Causal Reasoning : Causal reasoning involves identifying and understanding the causes and effects of events or phenomena.

Probabilistic Reasoning : Probabilistic reasoning involves making decisions or arriving at conclusions based on the likelihood or probability of certain outcomes.

Formal Reasoning vs. Informal Reasoning

Formal reasoning is a systematic and logical process that follows a set of rules and principles, often used in mathematics and logic. Informal reasoning is a less structured approach that relies on intuition, experience, and common sense to draw conclusions and solve problems, and is used in everyday life.

This survey encompasses various forms of reasoning, with a particular focus on “Informal deductive reasoning” in Large Language Models since it is a widely used form in which the conclusion is guaranteed to be true as long as the premises are true.

Towards Reasoning in Large Language Models

Recent research has suggested that reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters. In this article, we consider reasoning as an ability that is rarely present in small scale models, and therefore focus on techniques applicable to improving or eliciting reasoning in large-scale models.

Fully Supervised Fine Tuning

Some research has explored eliciting reasoning in small language models through fully supervised fine-tuning. These studies involve tasks such as generating rationales that explain model predictions, commonsense question answering, performing reasoning/inference based on explicit or implicit knowledge, solving competition mathematics problems, and multi-step reasoning for program synthesis/execution. Fully supervised fine-tuning has two major limitations: it requires explicit reasoning datasets, which can be difficult to create, and the model may rely on artifacts in the training data rather than actual reasoning to make predictions.

Prompting and In-Context Learning

Studies suggest that the apparent inadequacy of large language models such as GPT-3 and PaLM on tasks that require multiple steps of reasoning stems from a lack of exploration of their full capabilities: these models have demonstrated remarkable few-shot performance across a variety of tasks through in-context learning.

Chain of Thought

The chain-of-thought prompting approach involves providing a few examples of a ‘chain of thought’ (intermediate natural language reasoning steps) in the prompt to LLMs. This replaces the standard <input, output> demonstrations with <input, chain of thought, output> triples.

Example : <input> Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

<Chain of thought> Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5+6 = 11.

<Output> The answer is 11.

The chain-of-thought approach engages the model in reasoning rather than having it provide answers directly, and this can substantially improve LLMs’ few-shot performance on arithmetic, symbolic, and commonsense reasoning tasks.
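
To make the format concrete, here is a minimal sketch in Python of how such a few-shot chain-of-thought prompt could be assembled. The `generate` function mentioned in the comments is a hypothetical placeholder for whatever LLM completion call is available, not a real API.

    # Sketch: building a chain-of-thought prompt from the worked example above.
    # `generate(prompt)` below is a hypothetical LLM completion call.

    COT_EXAMPLE = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 "
        "tennis balls. 5 + 6 = 11. The answer is 11.\n\n"
    )

    def cot_prompt(question: str) -> str:
        # Prepend the worked example so the model is nudged to produce
        # intermediate reasoning steps before stating the final answer.
        return COT_EXAMPLE + "Q: " + question + "\nA:"

    # Usage (with a hypothetical LLM call):
    # completion = generate(cot_prompt("A pack has 4 pens. How many pens are in 7 packs?"))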

Rationale Engineering

Rationale engineering aims to elicit or utilize reasoning in LLMs more effectively. It encompasses rationale refinement (creating more effective examples of reasoning steps), rationale exploration, and rationale verification.

Rationale creation and refinement aims to create and refine rationale examples that are better able to elicit reasoning in LLMs. Experiments show that LLMs’ performance improves with increased rationale complexity and with more thorough examples of solutions (for instance, for simple math calculations). Analysis also shows that making examples diverse is important in prompting LLMs to produce better rationales.
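
As a rough illustration of rationale refinement, the sketch below selects the most complex demonstrations by counting reasoning steps; counting non-empty newline-separated lines as steps is a simplifying assumption made here for illustration only.

    # Sketch: complexity-based selection of rationale examples.
    # Each example is a (question, rationale, answer) tuple; steps are
    # approximated as non-empty lines of the rationale (an assumption).

    def select_complex_examples(examples, k=4):
        def num_steps(example):
            _, rationale, _ = example
            return sum(1 for line in rationale.splitlines() if line.strip())
        # Keep the k examples with the most reasoning steps for the prompt.
        return sorted(examples, key=num_steps, reverse=True)[:k]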

Rationale exploration allows LLMs to fully explore various ways of reasoning to improve their performance on reasoning tasks. This involves sampling a diverse set of rationales, rather than just the greedy one used in standard chain-of-thought prompting, and selecting the most consistent answer by marginalizing out the sampled rationales.
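
This sample-and-vote idea (often referred to as self-consistency) can be sketched as follows; `generate` is again a hypothetical LLM call, assumed here to return a rationale that ends with a sentence of the form “The answer is X.”

    from collections import Counter

    # Sketch: self-consistency. Sample several rationales at a nonzero
    # temperature, extract each final answer, and take a majority vote.
    # `generate(prompt, temperature)` is a hypothetical LLM call.

    def extract_answer(rationale: str) -> str:
        # Assumes the rationale ends with "The answer is X."
        return rationale.rsplit("The answer is", 1)[-1].strip(" .")

    def self_consistent_answer(generate, prompt: str, n_samples: int = 10) -> str:
        answers = [extract_answer(generate(prompt, temperature=0.7))
                   for _ in range(n_samples)]
        # Marginalize out the sampled rationales by keeping the most frequent answer.
        return Counter(answers).most_common(1)[0][0]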

Rationale verification ensures that the rationales produced by LLMs are valid, since incorrect rationales can lead to incorrect final predictions.
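
One simple way to apply this idea is sketched below, under the assumption that a separate verifier (for example, a fine-tuned classifier) returns a validity score for a (question, rationale) pair; `verifier_score` is a hypothetical function, not part of any real library.

    # Sketch: rationale verification. Keep only rationales a verifier rates
    # above a threshold. `verifier_score(question, rationale)` is a
    # hypothetical scoring function (e.g. a fine-tuned classifier).

    def filter_rationales(question, rationales, verifier_score, threshold=0.5):
        return [r for r in rationales if verifier_score(question, r) >= threshold]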

Problem Decomposition

While chain-of-thought prompting is effective for eliciting reasoning in LLMs, it can struggle with complex tasks that require compositional generalization. Problem decomposition is a divide-and-conquer technique in which a complex problem is broken down into smaller, more manageable subproblems; solving these subproblems then effectively solves the complex problem.

Least-to-most prompting consists of decomposing a complex problem into subproblems and solving these subproblems in a specific order, with each subproblem facilitated by the answers obtained from previously solved subproblems.
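
A minimal sketch of this two-stage procedure, with `generate` as a hypothetical LLM call and an assumed wording for the decomposition prompt, might look like this:

    # Sketch: least-to-most prompting. `generate(prompt)` is a hypothetical
    # LLM call; the decomposition prompt wording is an assumption.

    def least_to_most(generate, problem: str) -> str:
        # Stage 1: ask the model to break the problem into simpler subquestions.
        decomposition = generate(
            "Decompose the following problem into simpler subquestions, "
            "one per line:\n" + problem
        )
        subquestions = [q.strip() for q in decomposition.splitlines() if q.strip()]

        # Stage 2: solve the subquestions in order, appending each answer to the
        # context so later subquestions can build on it.
        context = problem
        answer = ""
        for subquestion in subquestions:
            answer = generate(context + "\nQ: " + subquestion + "\nA:")
            context += "\nQ: " + subquestion + "\nA: " + answer
        return answer  # the answer to the last subquestion addresses the original problem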

Dynamic least-to-most prompting is designed to solve more realistic semantic parsing problems by decomposing the problems with prompting-based syntactic parsing and dynamically selecting examples based on the decomposition.

Successive prompting iteratively decomposes a complex problem into simpler subproblems, with each next subproblem prediction having access to the answers of the previously solved subproblems.

Hybrid Method

Prompting techniques can help LLMs solve reasoning tasks, but they do not actually improve the models’ reasoning capabilities, since the parameters of the model remain unchanged. The hybrid approach aims to simultaneously improve the reasoning capabilities of LLMs and make better use of these models to solve complex problems. It involves both enhancing the reasoning capabilities of the LLMs and using techniques such as prompting to effectively utilize these capabilities.

Reasoning-Enhanced Training and Prompting

Fine-tuning or pre-training models on datasets that include reasoning, such as datasets containing scientific and mathematical data, improves the reasoning capabilities of LLMs. Chain-of-thought data and prompting are critical in improving the reasoning ability of LLMs. Few-shot scratchpad fine-tuning and scratchpad prompting lead to significant improvements in LLMs’ ability to generalize to longer problems, compared to standard fully supervised fine-tuning.
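
As a rough illustration of what scratchpad-style training data can look like, the sketch below builds a target that spells out intermediate steps before the final answer, using multi-digit addition purely as an assumed example task; the exact tags and format are illustrative.

    # Sketch: a scratchpad-formatted training example for multi-digit addition.
    # The target interleaves intermediate steps (inside <scratch> tags) with the
    # final answer; the exact format here is an illustrative assumption.

    def addition_scratchpad_example(a: int, b: int) -> dict:
        da, db = str(a)[::-1], str(b)[::-1]   # digits, least significant first
        carry, steps, result = 0, [], []
        for i in range(max(len(da), len(db))):
            x = int(da[i]) if i < len(da) else 0
            y = int(db[i]) if i < len(db) else 0
            total = x + y + carry
            steps.append(f"{x} + {y} + carry {carry} = {total}, "
                         f"write {total % 10}, carry {total // 10}")
            result.append(str(total % 10))
            carry = total // 10
        if carry:
            result.append(str(carry))
        answer = "".join(reversed(result))
        target = "<scratch>\n" + "\n".join(steps) + "\n</scratch>\n" + answer
        return {"input": f"{a} + {b} =", "target": target}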

Bootstrapping and Self-Improving

Bootstrapping is a process in which LLMs self-improve their reasoning abilities, instead of being fine-tuned on pre-built reasoning datasets. The Self-Taught Reasoner (STaR) is an example in which an LLM is trained and refined on its own output iteratively with CoT prompting: the model generates initial rationales, is fine-tuned on the rationales that lead to correct answers, and the process is repeated to generate better training data and further improve the model. LLMs can also self-improve their reasoning abilities without the need for supervised data by leveraging the self-consistency of their reasoning.
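
The bootstrapping loop can be sketched roughly as follows; `generate_rationale` and `finetune` are hypothetical placeholders for the model’s sampling and training procedures rather than real library calls, and `extract_answer` is the same kind of answer-parsing helper sketched earlier.

    # Sketch: a STaR-style bootstrapping loop. `generate_rationale`,
    # `finetune`, and `extract_answer` are hypothetical placeholders;
    # `dataset` is a list of (question, gold_answer) pairs.

    def star_bootstrap(model, dataset, generate_rationale, extract_answer,
                       finetune, n_iterations=3):
        for _ in range(n_iterations):
            kept = []
            for question, gold_answer in dataset:
                rationale = generate_rationale(model, question)  # CoT-prompted sample
                if extract_answer(rationale) == gold_answer:
                    # Keep only rationales that reach the correct answer.
                    kept.append((question, rationale, gold_answer))
            # Fine-tune on the model's own successful rationales and repeat.
            model = finetune(model, kept)
        return model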

Measuring Reasoning in Large Language Models

Downstream Task Performance

One way to measure the reasoning abilities of LLMs is by their performance on downstream tasks that require reasoning.

  • Arithmetic Reasoning : The ability to understand and apply mathematical concepts and principles in order to solve problems involving arithmetic operations.
  • Commonsense Reasoning : Using everyday knowledge and understanding to make judgements and predictions about new situations. Benchmarks used for testing the commonsense reasoning abilities of LLMs include CSQA, StrategyQA, and ARC.
  • Symbolic Reasoning : A form of reasoning that involves the manipulation of symbols according to formal rules. Representative symbolic reasoning tasks include last-letter concatenation and coin flip (a small example generator is sketched after this list).
  • There are many other benchmarks for evaluating the reasoning abilities of LLMs, such as BIG-bench, SCAN, WikiTableQA, and FetaQA.
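
To make the symbolic tasks concrete, here is a small sketch that generates last-letter-concatenation evaluation examples; the exact question wording used in published benchmarks varies, so the phrasing here is only illustrative.

    # Sketch: generating a last-letter-concatenation example. The model must
    # concatenate the last letter of each word; the question phrasing below is
    # an illustrative assumption, not the exact benchmark wording.

    def last_letter_example(words):
        question = ('Take the last letters of the words in "'
                    + " ".join(words) + '" and concatenate them.')
        answer = "".join(word[-1] for word in words)
        return {"question": question, "answer": answer}

    # Usage: last_letter_example(["Elon", "Musk"]) -> {"question": ..., "answer": "nk"}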

Formal Analysis on Reasoning

Most existing studies primarily report model performance as accuracy on downstream tasks, rather than assessing the reasoning steps themselves. Further research is therefore required to determine whether models are actually reasoning in a way similar to humans or achieving their results through other means. There have been some efforts to develop metrics and benchmarks that enable more formal analysis of reasoning in LLMs, such as ROSCOE, PrOntoQA, and FOLIO.

Findings and Implications

Reasoning seems to be an emergent ability of LLMs : Reasoning ability appears to emerge only in sufficiently large models, hence it may be more effective to utilize large models for general reasoning problems than to train small models for specific tasks.

Chain of thought elicits “reasoning” in LLMs : The use of CoT prompts has improved the performance of LLMs on various reasoning tasks, enabling them to produce valid individual proof steps. CoT prompting has also improved the out-of-distribution robustness of LLMs, which is not typically observed in standard prompting or fully supervised fine-tuning paradigms. However, CoT prompting sometimes chooses the wrong steps when multiple options are available.

LLMs show human-like content effects on reasoning : LLMs’ predictions are influenced by both prior knowledge and abstract reasoning, and their judgements of logical validity are impacted by the believability of the conclusions. Language models may not always perform well on reasoning tasks, but their failures often correspond to tasks that are challenging for humans too.

LLMs are still unskilled at complex reasoning : LLMs such as GPT-3 and BLOOM, despite their impressive capabilities, still struggle with more complex reasoning tasks, and even with simple commonsense planning tasks that are easy for humans. This suggests that existing benchmarks may be too simple to accurately gauge the true reasoning abilities of LLMs.

Reflection, Discussion and Future Directions

Why reasoning? Incorporating reasoning capabilities into language models enables them to perform complex tasks that require nuanced thinking, such as problem solving, decision making, and planning. This improves the performance of language models on downstream tasks and makes them more explainable and interpretable.

Right Application? Solving simple math problems does not reflect the reasoning capabilities of LLMs. It is important to consider more complex, realistic and meaningful applications such as decision making, legal reasoning, and scientific reasoning to understand the reasoning abilities of LLMs.

Are language models really able to reason? Although there are indications that LLMs are able to reason, such as their high performance on reasoning tasks and their ability to reason step by step, there are also limitations, such as their reliance on heuristics and their inconsistent and sometimes incorrect rationales. LLMs still struggle with complex tasks and make mistakes in reasoning. It is too early to draw a conclusion, and further analysis of training data and model architectures, as well as better benchmarks for measuring the reasoning capabilities of LLMs, is needed.

Improving reasoning capabilities of LLMs : To enhance reasoning in LLMs, it is important to use training data, model architectures, and optimization objectives designed to encourage reasoning. For example, fine-tuning a model on a dataset that includes CoT data, or bootstrapping, improves reasoning in models.

Conclusion

In this article, we discussed the current state of knowledge on reasoning in LLMs: techniques for improving and eliciting reasoning abilities, methods and benchmarks for evaluating reasoning abilities, and the findings and implications of previous studies on this topic. While LLMs have made significant progress, it is still unclear whether they are capable of true reasoning or merely rely on memorized patterns and heuristics. Further research is needed to understand and improve their capabilities for various applications.

References

Jie Huang and Kevin Chen-Chuan Chang. Towards Reasoning in Large Language Models: A Survey. arXiv:2212.10403. https://arxiv.org/pdf/2212.10403.pdf
