LLMs as single-step reasoners: mitigating hallucination in multi-step reasoning

LLMs struggle with multi-level reasoning in a single inference step, often resulting in hallucination. Breaking the problem down into smaller steps could be the key!

Aman Dalmia
Inveterate Learner
6 min read · Mar 28, 2024


Special thanks to Shreya Sajal, Vishal Agarwal and Aravinth Muthu for giving very helpful feedback on previous versions of this post!

A common challenge with using Large Language Models (LLMs) in production is hallucination, where a model produces responses that sound plausible but are factually incorrect, leading to a lack of reliability in LLM-powered apps. We have been using LLMs successfully in production to analyse and extract information from thousands of legal contracts at HyperStart, a contract-lifecycle management tool for small and medium-sized businesses (SMBs). In this post, I’ll share a simple technique that has improved the accuracy and reliability of our system multi-fold on its most challenging tasks.

Before we jump into the approach, let’s understand the problem a bit more. The two most common techniques for mitigating hallucination are Chain-of-Thought (CoT) prompting [1, 2] and Retrieval-Augmented Generation (RAG) [3, 4]. If you’re not already familiar with these, do check out the links and come back to this post later.

You might be able to tackle hallucination for your specific use case almost entirely using CoT + RAG itself (although you will likely need to iterate a few times to land at the right custom prompt). However, if your problem statement involves a strong reasoning component with multiple reasoning steps, this might not be enough. The gap between where you want to be and where you land after applying CoT + RAG (and their variations) [the reasoning gap] is what this blog post aims to fill.

Note: I haven’t tested Claude Opus or Gemini 1.5 yet (the only other models in GPT-4’s class at the time of this post), and it is very likely that future versions of all these models will be much better at reasoning, in which case you might not need this technique. Then again, when that future arrives, you might not need a lot of things, but that’s a separate post. Let’s focus on solving the problems of today while staying optimistic about the future!

For ease of understanding, let’s assume you want the LLM to produce the desired output in a given JSON format. Arriving at the final JSON answer requires multiple steps of reasoning. Let’s consider the following example to automatically screen a candidate’s resume:

System prompt:
You are a very good recruiting assistant.

You will be given the resume of a candidate for a Machine Learning role.

You need to assess whether the candidate should be shortlisted.

Use the following criteria for deciding if the candidate should be shortlisted:
- Number of internships in machine learning/deep learning
- GPA
- Number of publications and personal projects they have worked on related to
machine learning

If the GPA is more than 9.5, then they only need to have one internship to be
shortlisted. If the GPA is more than 9 but less than 9.5, along with one
internship, they need at least one publication or project as well. If the
GPA is between 8 and 9, publications + projects + internships must be >= 3,
with at least one internship. Otherwise, the combined number must be at least 4.

Give your final answer as a JSON with the key `is_shortlisted` which should be a boolean value.

This is an example of a multi-step reasoning task: the LLM has to first extract various details like the GPA, identify which parts of the CV represent projects vs internships vs publications, classify the relevance of each of them, and then use the provided rubric to come up with the final answer.

Here, the candidate’s resume will be provided as context to the LLM for extracting the final answer. Since we’ve asked the model to simply output a JSON, without any chain-of-thought prompting, the model will likely instantly respond with:

{ "is_shortlisted": True }

Note: this is a relatively simple example, and the actual challenge here is figuring out how to classify a given project/internship/publication as relevant. But let’s ignore that for now and focus on the model’s behaviour.
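
For concreteness, here is a minimal sketch of what this single-call baseline might look like, assuming the OpenAI Python client; the model name, the `SYSTEM_PROMPT` placeholder and the `resume_text` argument are purely illustrative:

import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "..."  # the recruiting-assistant prompt shown above

def screen_resume(resume_text: str) -> bool:
    # Single LLM call: the model is asked to jump straight to the final JSON.
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # any capable chat model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": resume_text},
        ],
    )
    return json.loads(response.choices[0].message.content)["is_shortlisted"]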

If you evaluate this prompt on your dataset, you are likely to observe poor performance. Your first instinct should be to use Chain-of-Thought prompting, which asks the model to include reasoning steps before arriving at the final answer. You’re likely to notice a significant boost in performance.
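
For instance, the final instruction in the system prompt above could be extended along these lines (a hypothetical rewording, not the exact prompt we use):

Before giving your final answer, reason step by step: first state the GPA,
then list the relevant internships, publications and projects, and only then
apply the shortlisting criteria. Give your final answer as a JSON with the
keys `reasoning` (a string) and `is_shortlisted` (a boolean).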

However, you might notice there are still several mistakes. Upon inspecting the responses, you might find that the model correctly inferred the GPA as well as the number of relevant projects (n_p), internships (n_i) and publications (n_pu) individually, yet still failed to arrive at the correct output. For one error sample, it might have incorrectly added n_pu, n_p, and n_i. For another, it might have computed the sum correctly but failed to follow the rubric.

There are two types of reasoning steps you’ll encounter when breaking a problem into smaller steps. We’ll be looking at each of them with an example along with how to tackle them.

Deterministic reasoning

In this case, the next reasoning step is deterministic over the outputs of the previous reasoning steps. As shown in the example above, is_shortlisted must be decided using a rubric (mentioned in the system prompt), which is a deterministic step, given the gpa, n_pu, n_i, and n_p as inputs.

Instead of asking the LLM to output is_shortlisted, notice that the core information only the LLM can extract is really just gpa, n_pu, n_i, and n_p. We don’t need the LLM to refer to the rubric to give the final output. The LLM is just a tool, and we should use the right tool to solve a given problem. Once the values have been extracted, the LLM is not the right tool for following the rubric. The rubric can simply be applied outside the LLM call, as shown in the image below.

[Figure] An example of how using LLMs as single-step reasoners can help mitigate hallucination in the deterministic reasoning case. For semantic reasoning, the deterministic rubric is replaced by one or more LLM calls, as described in the second example below.
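
Concretely, the prompt can ask the LLM to return only gpa, n_i, n_pu and n_p as JSON, and the rubric from the system prompt becomes a small deterministic function (a sketch using the variable names from above):

def is_shortlisted(gpa: float, n_i: int, n_pu: int, n_p: int) -> bool:
    # gpa: candidate's GPA, n_i: relevant internships,
    # n_pu: relevant publications, n_p: relevant projects.
    total = n_i + n_pu + n_p
    if gpa > 9.5:
        return n_i >= 1
    if gpa > 9:
        return n_i >= 1 and (n_pu + n_p) >= 1
    if gpa >= 8:
        return n_i >= 1 and total >= 3
    return total >= 4

The LLM call only extracts the four values; the conditionals are handled in code and can never be hallucinated.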

An additional benefit of doing this is an extra level of explainability for the system’s final response: we now know the exact inputs to the rubric, which helps with building user trust as well as with debugging (in case the inputs were extracted incorrectly).

Semantic reasoning

Here, the current reasoning step requires an LLM to operate on the outputs of the previous steps to arrive at the final answer. Suppose your task is to count the number of times the co-leads in a movie script play a positive role vs a negative role. You could ask the LLM to give the answer directly in one go, but it might confuse the positive/negative role counts between the co-leads.

A better approach, sketched in code after this list, could be:

  1. Extract the names of the co-leads (1 LLM call)
  2. For each name, count the number of times that person played a positive and a negative role (N LLM calls for N co-leads).
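
A rough sketch of what this could look like, where ask_llm is a hypothetical helper that makes one JSON-mode call with the OpenAI Python client (the prompts and JSON keys are illustrative):

import json
from openai import OpenAI

client = OpenAI()

def ask_llm(system_prompt: str, user_content: str) -> dict:
    # One LLM call that returns parsed JSON.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
    )
    return json.loads(response.choices[0].message.content)

def analyse_script(script: str) -> dict:
    # Step 1: extract the co-leads (1 LLM call).
    leads = ask_llm(
        'Extract the names of the co-leads from the movie script. '
        'Respond as JSON: {"co_leads": [<names>]}.',
        script,
    )["co_leads"]

    # Step 2: one focused call per co-lead, so the counts cannot get mixed up.
    results = {}
    for name in leads:
        results[name] = ask_llm(
            f'Count how many times {name} plays a positive role and how many '
            'times a negative role in the script. Respond as JSON: '
            '{"positive": <int>, "negative": <int>}.',
            script,
        )
    return results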

Summary

Given the examples above, the core idea can be summarised as:

  • Break the problem statement down into smaller steps such that every reasoning step requires only one level of reasoning (as explained above).
  • At each step, use the LLM as an extractor of information that can be used by the next step to get closer to the final output.
  • Depending on whether the current step requires deterministic or semantic reasoning, either avoid using LLMs (e.g. handle conditionals in code instead of asking an LLM to follow a given rubric) or make further LLM calls, respectively.

It probably does not look very different from thinking step by step or using function calling in agentic flows. The key change proposed here is in the mindset of what each step means. Typically, we expect a single LLM inference to perform all the reasoning steps in one go before arriving at the final answer, including steps where an LLM is not even needed (e.g. following a rubric). LLM agent flows would usually treat the multiple steps we’ve proposed as just one block in the flow, with a single LLM call. A mindset shift is needed to see that one block as potentially being composed of multiple blocks itself.

There are a couple of tradeoffs in this approach that you should be aware of. If the individual steps require semantic reasoning, you’ll need to make multiple LLM calls, which increases both inference cost and latency. If your problem resembles the second example above (counting the positive and negative roles for all the co-leads), you can run the LLM inference for each co-lead in parallel in the second step, limiting the increase in overall inference time.
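
For example, step 2 from the previous sketch could be run concurrently with a thread pool (again reusing the hypothetical ask_llm helper from above):

from concurrent.futures import ThreadPoolExecutor

def analyse_script_parallel(script: str) -> dict:
    # Step 1 stays sequential: extract the co-leads (1 LLM call).
    leads = ask_llm(
        'Extract the names of the co-leads from the movie script. '
        'Respond as JSON: {"co_leads": [<names>]}.',
        script,
    )["co_leads"]

    def count_roles(name: str) -> dict:
        return ask_llm(
            f'Count how many times {name} plays a positive role and how many '
            'times a negative role in the script. Respond as JSON: '
            '{"positive": <int>, "negative": <int>}.',
            script,
        )

    # Step 2: fire one call per co-lead concurrently; overall latency is
    # roughly that of the slowest call instead of the sum of all calls.
    with ThreadPoolExecutor() as pool:
        return dict(zip(leads, pool.map(count_roles, leads)))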

Conclusion

This technique has radically improved the reliability of the systems I’ve built and given a massive boost to their overall accuracy. I hope this post was helpful and gave you a handy tool for improving the performance of your LLM apps too.
