Teaching LLMs to Think and Act: ReAct Prompt Engineering

Bryan McKenney
10 min read · Jun 9, 2023

TL;DR

Researchers at Princeton University and Google recently published a paper describing a novel prompt engineering method that enables large language models (think ChatGPT) to reason and act intelligently within a simulated environment. This ReAct method mimics how humans operate in the real world: we reason verbally and take actions to gain information. ReAct is found to perform well against other prompt engineering (and imitation learning) approaches in a variety of domains. This marks an important step towards Artificial General Intelligence (AGI) and embodied language models (robots that think like humans). Information about the paper is below:

  • Title: “ReAct: Synergizing Reasoning and Acting in Language Models”
  • Authors: Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao
  • Publication year: 2023
  • Publishing venue: International Conference on Learning Representations (ICLR)

Background

In this section, I will discuss large language models, prompt engineering, and chain-of-thought reasoning.

Large Language Models

A large language model (LLM) is a Transformer-based machine learning model that has been trained on a huge corpus, or data set of texts, such as most of the webpages on the Internet. During training, which takes a lot of time (and/or GPUs), energy, and water (for cooling), gradient descent is employed to optimize the model's parameters so that it becomes good at predicting the training data. In essence, an LLM learns to predict the most probable next word given a sequence of previous words. This can be used to perform inference (finding the likelihood that some text would be generated by the model) or text generation, which LLMs like ChatGPT use to converse with people.

Once an LLM is done training, it is frozen: its parameters are saved, and it does not add new inputs to its training data or retrain. Doing so would be infeasible, and as we've learned from Microsoft's Tay chatbot becoming a Nazi, it is probably better not to learn from users anyway. It is important to note that LLMs still learn bias from their training data, and OpenAI, the company behind ChatGPT, had to add safeguards, using reinforcement learning from human feedback (RLHF), to try to prevent the model from generating problematic content. Also, since LLMs by default just generate the most likely next word based on what they've seen, without doing any kind of fact-checking or reasoning, they are prone to hallucination (making up facts) and reasoning errors (such as when doing simple mathematics).
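To make the next-word idea concrete, here is a minimal sketch using the small open-source GPT-2 model from the Hugging Face transformers library (a stand-in for a much larger LLM, chosen only because it runs locally). It shows both uses: scoring the most probable next tokens for a prompt, and generating a continuation.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load a small open-source causal language model as a stand-in for a "large" one.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt")

    # Inference: the probability distribution over the next token, given the prompt.
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, k=5)
    print([(tokenizer.decode(int(i)), round(p.item(), 3)) for i, p in zip(top.indices, top.values)])

    # Generation: repeatedly pick a next token to continue the text.
    output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))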

LLMs have been all the rage ever since the public release of ChatGPT took the world by storm. The emergent intelligence of these models and their applications to so many aspects of our lives have made them an incredibly popular tool, with every company wanting a piece of the action. Besides chatbots and coding and writing assistants, LLMs are being used to create agents that interact with simulated environments, including the Internet. ReAct is an example of how to turn an LLM into one such agent.

Prompt Engineering

If you've experimented with ChatGPT, you'll know that sometimes it refuses to answer a question or answers it poorly, but if you rephrase the question you may get a better result. This is the art of prompt engineering: getting an LLM to respond the way you want it to by modifying your input. The thought is that LLMs have been trained on so much human-generated data that they can almost be treated as a human: instead of training a new model on a specific problem domain, one can just try to elicit the proper response from an existing frozen LLM by bringing up some facts to "jog its memory" or telling it about a new domain. This is known as in-context learning, and there are two main types: zero-shot learning and few-shot learning. Zero-shot learning gives the LLM a prompt that may include some background information before the question or command, to help the LLM find a good response. Few-shot learning gives the LLM a few examples of prompts and desirable responses and then poses a new prompt, which the LLM will answer in the format of the examples.
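As a toy illustration of the difference (my own example, not one from the paper), here is how the two kinds of prompts might be built for a simple sentiment-classification task:

    # Zero-shot: just the task description and the new input.
    zero_shot_prompt = (
        "Classify the sentiment of the following movie review as positive or negative.\n"
        "Review: The plot dragged, but the acting was superb.\n"
        "Sentiment:"
    )

    # Few-shot: a handful of worked examples first, then the new input in the same format.
    few_shot_prompt = (
        "Review: I loved every minute of it.\n"
        "Sentiment: positive\n\n"
        "Review: A complete waste of two hours.\n"
        "Sentiment: negative\n\n"
        "Review: The plot dragged, but the acting was superb.\n"
        "Sentiment:"
    )

Either string would be fed to the LLM as-is; the model's continuation is taken as the answer.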

Prompt engineering is the future of Natural Language Processing (NLP). The field is shifting from customized models to customized prompts, because LLMs are so much better than anything one could build alone without an incredible amount of time and energy. When an LLM is paired with the right prompt engineering technique, it can often match what a specialized model can do.

Chain-of-Thought Reasoning

Chain-of-thought reasoning is a popular prompt engineering technique that is meant to combat reasoning errors. It involves giving the LLM one or more examples (few-shot learning) of how to verbally reason through a problem and then giving it a different problem to solve in that way. This can help with reasoning errors, but it still suffers from hallucination, and hallucinated “facts” can propagate through the reasoning, causing the model to come to the wrong conclusion regardless.
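For example, a few-shot chain-of-thought prompt might look like the following, borrowing the classic arithmetic example from the chain-of-thought literature (my own formatting, not the exact prompt used in the paper):

    # One worked example that spells out the reasoning, then a new question in the same format.
    cot_prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
        "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
        "how many apples do they have?\n"
        "A:"
    )

The model is expected to continue with its own step-by-step reasoning and end with an answer.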

In the image from the ReAct paper below, standard prompting (just asking a question) is compared to chain-of-thought (CoT) prompting (the few-shot examples in the prompt are not shown) on a question that requires multiple steps of reasoning to figure out. The LLM with standard prompting guesses iPod, which is incorrect. The LLM with CoT prompting has a much more convincing response, but it is still wrong. Despite reasoning flawlessly, the LLM hallucinates that the Apple Remote was originally designed to work with the Apple TV (it was actually designed for the Front Row program), which leads it to the wrong conclusion.

Figure from Yao et al. (2023): standard prompting vs. chain-of-thought prompting on the Apple Remote question.

Because of the issue of hallucination, CoT reasoning is unreliable. If LLMs are to be a useful tool, they cannot be making up facts left and right, because then we can never trust them and are better off just doing the research ourselves. ReAct aims to solve this issue by allowing the LLM to take actions such as searching Wikipedia so that it can find facts and reason from those.

Method

Like chain-of-thought reasoning, ReAct is a prompt engineering method that uses few-shot learning to teach the model how to solve problems. CoT is supposed to imitate how humans think about problems, and ReAct includes this reasoning element, but it goes further by also giving the agent text actions that let it interact with its environment. Humans use verbal reasoning (speaking or thinking) to help us strategize and remember things, but we can also take actions to gain more information and achieve our goals. This is the foundation of ReAct. A ReAct prompt includes examples with actions, the observations gained by taking those actions, and the transcribed thoughts (reasoning strategies) of the human at various steps in the process. The LLM learns to emulate this approach of interleaved thinking and acting, making it an agent in its environment. Below is an illustration of how a ReAct agent operates, with a tragic example (in Thought -> Action -> Observation order) shown in monospaced font.
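To give a flavor of that format, here is a short hypothetical trace (my own made-up example, not one from the paper), written as a Python string so it could also serve as part of a few-shot prompt:

    # A hypothetical Thought -> Action -> Observation trace for a simple question.
    example_trace = """
    Question: In what year was the author of "The Hobbit" born?
    Thought 1: I need to find who wrote "The Hobbit", then find that person's birth year.
    Action 1: search[The Hobbit]
    Observation 1: The Hobbit is a children's fantasy novel by the English author J. R. R. Tolkien...
    Thought 2: The author is J. R. R. Tolkien. Now I need his birth year.
    Action 2: search[J. R. R. Tolkien]
    Observation 2: John Ronald Reuel Tolkien (3 January 1892 - 2 September 1973) was an English writer...
    Thought 3: Tolkien was born in 1892, so the answer is 1892.
    Action 3: finish[1892]
    """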

It is important to remember that the observations are not generated by the LLM but by the environment, which is a separate module that the LLM can interact with only through specific text actions. Therefore, in order to implement ReAct, you need the following three things (a rough code sketch follows the list):

  1. An environment that takes a text action (out of a set of potential actions which can change based on the environment’s internal state) and returns a text observation.
  2. An output parser framework that stops the agent from generating text once it has written a valid action, executes that action in the environment, and returns the observation (appends it to the text generated so far and prompts the LLM with that).
  3. Human-generated examples of intermixed thoughts, actions, and observations in the environment to use for few-shot learning.
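Here is a minimal sketch of how those three pieces might fit together. The names are my own: llm stands for any text-completion function that accepts a stop sequence, and env for an environment object with a step method as described in item 1; neither is a real API from the paper or from any particular library.

    import re

    def react_agent(question, llm, env, few_shot_examples, max_steps=8):
        """Minimal ReAct loop: the LLM writes a Thought and an Action, the environment
        answers with an Observation, and the growing transcript is fed back in."""
        transcript = few_shot_examples + f"\nQuestion: {question}\n"
        for step in range(1, max_steps + 1):
            # 1. Let the LLM continue the transcript, stopping before it invents an observation.
            continuation = llm(transcript, stop=[f"Observation {step}:"])
            transcript += continuation
            # 2. Parse the action out of the generated text, e.g. "Action 2: search[The Hobbit]".
            match = re.search(rf"Action {step}: (\w+)\[(.*?)\]", continuation)
            if match is None:
                break
            action, argument = match.groups()
            if action == "finish":
                return argument
            # 3. Execute the action in the environment and append the observation to the transcript.
            observation = env.step(action, argument)
            transcript += f"\nObservation {step}: {observation}\n"
        return None

The stop sequence is what keeps the model from hallucinating its own observations: everything after a valid action is supplied by the environment instead.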

The number of examples and their details are up to you. The beginning of an example used in a ReAct prompt is shown below.

Figure from Yao et al. (2023): the beginning of an example trajectory used in a ReAct few-shot prompt.

Here you can see that the thoughts, actions, and observations are clearly labeled as such and that the actions use a special format — with the query in brackets — so that the agent will learn to write them in this way and the output parser can then easily extract the query.
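With a bracketed format like this, the extraction step can be a single regular expression. Here is a small sketch (the exact label and action names are my own assumption, mirroring the hypothetical trace earlier in this article):

    import re

    # Case-insensitive so it matches "Search[...]" in the paper's style or "search[...]" as written above.
    ACTION_PATTERN = re.compile(r"(search|lookup|finish)\[(.+?)\]", re.IGNORECASE)

    def parse_action(text):
        """Return an (action, argument) pair from a line such as 'Action 2: search[Apple Remote]'."""
        match = ACTION_PATTERN.search(text)
        if match is None:
            return None
        name, argument = match.groups()
        return name.lower(), argument

    print(parse_action("Action 2: search[Apple Remote]"))  # ('search', 'Apple Remote')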

Results

For their frozen LLM, Yao et al. (2023) use PaLM-540B. They test ReAct prompting with this LLM on two knowledge-intensive reasoning tasks and two decision-making tasks. I will discuss each in turn.

Knowledge-Intensive Reasoning Tasks

The two domains used in this task area are HotPotQA, which is multi-hop question answering using Wikipedia passages, and FEVER, which is fact verification. The agent is given the ability to interact with a purposefully simple Wikipedia API using the following actions (a toy stand-in for such an environment is sketched after the list):

  • search: Find a page by name, or return a list of the most similar results if there is no exact match.
  • lookup: Find a string within the current page.
  • finish: End the task with an answer.
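To make the interface concrete, here is a toy, in-memory stand-in for such an environment (my own sketch; the paper wires the agent to an actual Wikipedia API). It plugs into the step interface used in the loop sketch above.

    class ToyWikiEnv:
        """A tiny in-memory stand-in for the Wikipedia tool: search, lookup, finish."""

        def __init__(self, pages):
            self.pages = pages        # dict mapping page titles to page text
            self.current_page = None

        def step(self, action, argument):
            if action == "search":
                # Return the start of the page if the title matches, otherwise list similar titles.
                if argument in self.pages:
                    self.current_page = argument
                    return self.pages[argument][:300]
                similar = [title for title in self.pages if argument.lower() in title.lower()]
                return f"Could not find {argument}. Similar: {similar}"
            if action == "lookup":
                # Return the first sentence of the current page that contains the string.
                if self.current_page is None:
                    return "No page is currently open."
                for sentence in self.pages[self.current_page].split(". "):
                    if argument.lower() in sentence.lower():
                        return sentence
                return f"'{argument}' not found on the current page."
            if action == "finish":
                return f"Episode finished with answer: {argument}"
            return "Invalid action."

A real implementation would query Wikipedia instead of a dictionary, but the agent-facing contract (text in, text out) is the same.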

In these domains, ReAct is compared to the following techniques:

  • Standard: No thoughts, actions, or observations in prompt.
  • CoT: No actions or observations in prompt.
  • CoT-SC (self-consistency): CoT prompting in which a number of responses are sampled from the LLM and the majority answer is chosen (see the sketch after this list).
  • Act: No thoughts in prompt.
  • ReAct -> CoT-SC: Starts as ReAct but switches to CoT-SC if it starts to falter.
  • CoT-SC -> ReAct: Starts as CoT-SC but switches to ReAct if it starts to falter.
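Self-consistency itself is easy to sketch: sample several reasoning paths at a nonzero temperature and take a majority vote over the final answers. In the snippet below, llm is a hypothetical sampling function and the answer-extraction regex is an assumption about the prompt format, not code from the paper.

    import re
    from collections import Counter

    def extract_answer(completion):
        """Pull the text after "the answer is" out of a chain-of-thought completion."""
        match = re.search(r"the answer is (.+?)\.", completion, flags=re.IGNORECASE)
        return match.group(1).strip() if match else completion.strip()

    def self_consistency(llm, cot_prompt, n_samples=21):
        """Sample several reasoning paths and return the most common final answer."""
        answers = [extract_answer(llm(cot_prompt, temperature=0.7)) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]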

Success is measured by accuracy in FEVER and exact match (EM) in HotPotQA. The plots below show the results in each domain as a function of the number of sampled responses for CoT-SC.

Figure from Yao et al. (2023): results on HotPotQA and FEVER as the number of CoT-SC samples varies.

ReAct underperformed CoT in HotPotQA but outperformed it in FEVER. ReAct is much less prone to hallucination than CoT but has a higher reasoning error rate. Although ReAct does have this shortcoming, the ReAct -> CoT-SC and CoT-SC -> ReAct methods are the most successful of the bunch. Below is the same question from the beginning of this article with ReAct's response, which is correct.

Figure from Yao et al. (2023): ReAct's (correct) response to the Apple Remote question.

Decision-Making Tasks

The two domains used in this task area are ALFWorld and WebShop. I shall explain each domain individually.

ALFWorld

ALFWorld is a text-based game with realistic environments. It has text actions for moving around in and interacting with the simulated world, such as "Open drawer 1." A goal given to the agent might be to find a specific object in a house, so common-sense reasoning about where such an object is typically kept is helpful. The baselines ReAct is compared to in this domain are:

  • Act: No thoughts in prompt.
  • BUTLER: An imitation learning approach.
  • ReAct-IM (Inner Monologue): Can only think about the environment and how close it is to the goal.

The measure of success is the percentage of trials where the goal was reached. ReAct outperforms the baselines.

WebShop

WebShop is a simulated online shopping website with data crawled from Amazon. It is a challenging domain because it has a large number of actions for navigating the website and searching for products. The goal is to find an item that matches a user’s specifications. The baselines ReAct is compared to in this domain are:

  • Act: No thoughts in prompt.
  • IL: An imitation learning approach.
  • IL + RL: An imitation and reinforcement learning approach.

The measure of success is how close the chosen item is to the hidden one that the user had in mind. ReAct outperforms the baselines.

Discussion

ReAct, although not perfect by itself due to its reasoning errors, is nonetheless a strong prompt engineering approach that overcomes the fact hallucination issue of chain-of-thought reasoning and also allows the LLM to become an agent that can interact with its environment. In addition, it is a very interpretable method, since the agent outputs its thought process as it acts.

I believe that ReAct is a step towards Artificial General Intelligence (AGI) and embodied language models (robots that think like humans). If a robot had a method of modeling a foreign environment based on familiar features and creating prompts using that model, it could (at least attempt to) act by itself in a large variety of domains without the need for human-crafted examples. It would also need a memory of some sort, or the ability to learn from its experiences, to make it even more human-like. It is unclear at this point whether the creation of AGI would help or hurt humanity, but robots with common-sense knowledge, as long as bugs like reasoning errors and hallucinations are worked out, could be a great help to us (as firefighters, for instance).

LLM agents have already been commercialized and are being used for a variety of tasks, from creating websites to ordering pizza. There are also non-commercial applications, such as destroying humanity. I only hope that these tools can be used for good as well. An agent with the goal of figuring out how to solve the world’s problems might be nice.
