Guiding Large Language Models Through Chain of Thought

Leslie Lim
Published in d*classified
Oct 11, 2023

Large Language Models (LLMs) have surged in popularity ever since the release of ChatGPT in late 2022, and many teams are exploring ways to harness their power and flexibility in applications across the defence domain. Amanda Koh Kai Yen, as part of her internship with the C3 Development programme centre, explored the use of Chain of Thought in an AI assistant to let LLMs go beyond the information they are trained on. This project was supervised by Meo Kok Eng.


Background

A challenge that consistently crops up is that LLMs only have access to the information they were trained on. This poses a problem when we try to use LLMs in a military context, where much of the information we might want the model to know is either classified material or real-time information that may be updated regularly. Consider an operator of a C3 (Command, Control, and Communication) system asking “Give me the latest incident report from the XX Naval Base”, or “What is the status of the serviceman in overseas training facility YY?”. A regular LLM would not be able to answer these questions because it would not have been trained on the most up-to-date information.

This is where API calling comes in: if we give the LLM some method of calling a provided API, we are essentially giving it access to all the latest information in our datastores, and we can do so with tremendous flexibility since the provided APIs, and not the base LLM, contain the required information.

Chain of Thought 101

From a purely theoretical viewpoint, how can we get an LLM to call an API? Based on the question provided by the user, it would have to reason about the appropriate API to call, and then act on that reasoning by calling it with an appropriate input. Based on the information it gets back, it can either give a final answer to the user or choose to call another API. This is the essential idea behind the ReAct framework (reasoning and acting), as illustrated by the example below:

For example, for the question “What is the manpower status of Unit A?”, the following may be the thought process desired from the LLM, with the LLM output in italic.

Thought: The user has asked for the manpower status of a specific unit. The Get Manpower Status Tool will provide the necessary information.
Action: Get Manpower Status
Action Input: Unit A
Observation: Fully Staffed
Thought: I have the final answer.
Final Answer: The manpower status for Unit A is Fully Staffed.

Implementation using LangChain

In fact, using LangChain, a popular framework designed to simplify the creation of applications with LLMs, the above example is exactly the output a LangChain agent requires from the LLM. An agent is given access to a set of APIs (known as tools in LangChain) that it can use, and it then goes through the ReAct process in order to answer user queries.

The way we provide it with access to the tools is simple: for each API we want to give it access to, we state the tool’s name, its purpose and the expected input format at the beginning of the prompt. An example of a tool description could be:

Get Manpower Status: use only if the user specifically asks for the manpower status of a unit. The input to this tool should be the unit required in the following format: 'Unit X'.

In the prompt, we also give the agent the required output format (so that LangChain can parse the LLM output and perform the necessary actions), as well as several output examples (for few-shot learning).
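As a rough illustration, the setup might look like the sketch below, written against the LangChain agent API as it existed around the 0.0.x releases at the time of writing (later versions reorganise these imports). The tool function simply returns a canned value, and the OpenAI model is a stand-in for whichever LLM is actually used.

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI  # stand-in; any LangChain-compatible LLM works

def get_manpower_status(unit: str) -> str:
    # Hypothetical stand-in for a real call into the C3 datastore
    return "Fully Staffed"

tools = [
    Tool(
        name="Get Manpower Status",
        func=get_manpower_status,
        description=("Use only if the user specifically asks for the manpower "
                     "status of a unit. The input to this tool should be the "
                     "unit required in the following format: 'Unit X'."),
    )
]

llm = OpenAI(temperature=0)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
print(agent.run("What is the manpower status of Unit A?"))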

However, I faced some problems when using LangChain with an open-source LLM. The LLM would often produce output in the wrong format (e.g. omitting the action or action input) or try to call non-existent tools, both of which would leave LangChain unable to continue parsing the LLM output.

Pivoting to Guidance

I found that Guidance has features that directly address the problems I faced with LangChain. Guidance is a lower-level framework than LangChain: rather than trying to create applications with LLMs, Guidance only aims to control LLM output. What I did, therefore, was to replicate LangChain’s agents using multiple prompts and some custom logic, and use Guidance to control the LLM generation.
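Concretely, the custom agent logic amounts to a simple loop, sketched below under some assumptions: run_react_step is a hypothetical placeholder for a Guidance-controlled generation step (an example appears further below), and the tool function stands in for a real API.

# Hypothetical outer loop replicating a LangChain-style agent.
# `run_react_step` is a placeholder for a Guidance-controlled generation step;
# the tool function stands in for a real API.
TOOLS = {"Get Manpower Status": lambda unit: "Fully Staffed"}

def answer(question: str, max_steps: int = 5) -> str:
    history = ""
    for _ in range(max_steps):
        # Each step returns either a final answer or a tool call to make
        step = run_react_step(question, history, list(TOOLS))
        if step.get("final_answer"):
            return step["final_answer"]
        observation = TOOLS[step["tool"]](step["tool_input"])  # call the chosen API
        history += (f"Thought: {step['thought']}\nAction: {step['tool']}\n"
                    f"Action Input: {step['tool_input']}\nObservation: {observation}\n")
    return "No answer found within the step limit."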

These are the two main improvements Guidance allows over LangChain for this use case:

  1. Interleaved generation to always follow output format: When trying to use a tool, instead of having the LLM output the keywords “Action” and “Action Input” (along with the actual tool name and input) by itself, these keywords are in the prompt and the LLM only has to generate what comes after (i.e. the actual tool name and input).

Going back to our previous example:

  • LangChain requires the LLM to generate “Action: Get Manpower Status, Action Input: Unit A” in full
  • Guidance only requires the LLM to generate “Get Manpower Status” and “Unit A”

  2. Select function to only allow calling of provided tools: When it comes to generating the tool required, instead of allowing free-form generation, which could lead to non-existent tools being called (as in LangChain’s case), Guidance simply checks which of the provided tools is the most likely to be generated by the LLM and outputs that (see the sketch after this list).

e.g. If “Tool A” and “Tool B” were the only provided tools, Guidance would only allow one of those tools to be chosen, whereas LangChain could theoretically free-form generate “Tool C”, which does not exist.
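A rough sketch of both behaviours, using the handlebars-style Guidance API that was current at the time of writing (the later 0.1+ API differs considerably), could look like the following; the model path and the tool list are placeholders:

import guidance

# Hypothetical local model path; any Hugging Face model supported by Guidance works
guidance.llm = guidance.llms.Transformers("path/to/local-model")

valid_tools = ["Get Manpower Status", "Get Incident Report"]

# The fixed keywords live in the prompt itself, so the LLM only fills in the gaps;
# `select` restricts the generated tool name to the provided options.
react_step = guidance("""Question: {{question}}
Thought: {{gen 'thought' stop='\\n'}}
Action: {{select 'tool' options=valid_tools}}
Action Input: {{gen 'tool_input' stop='\\n'}}""")

result = react_step(question="What is the manpower status of Unit A?",
                    valid_tools=valid_tools)
print(result["tool"], "|", result["tool_input"])

Because the scaffolding text is part of the prompt and the tool name comes from select, the output always parses and a non-existent tool can never be chosen.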

Front End Demonstration

After settling on Guidance as the framework, I modified a ChatGPT frontend clone to run on this backend. An additional “Thoughts” column lets users see the thought process behind the final answer if desired. Real-time streaming is also supported, i.e. the answer shows up as the words are being generated, rather than all at once at the end, to reduce perceived latency.

Here are screenshots of some examples:

Conclusion

Overall, I was able to extend the use of LLMs to potentially query data in C3 systems with Chain of Thought. I found that Guidance gave much more predictable and controlled results when used with open-source LLMs. Testing with various tools yielded fairly accurate results, but more testing will definitely have to be done with the specific tools that are to be implemented in a given system, since factors such as the number of tools, how distinct the tools are from each other, and whether the user query contains the necessary tool name all affect the overall accuracy of the model. Other parts of the prompt can also be adjusted to see which would yield the best results.

Looking ahead, it is worth noting that LangChain works fairly well with OpenAI’s models, so the limitations found with LangChain are not so much inherent to the framework as they are a limitation of the LLM we are using (which has to be hosted on-premise). Given LangChain’s better developer support and integrations, as newer open-source models that follow instructions better are released for on-premise use, LangChain may become the more accurate option.
