Tree of Thoughts (ToT) Prompting: The Basics

Emile J
7 min read · May 28, 2023


A jade tree. Image generated by DALL·E 2 through Microsoft Designer.

Prompt engineering is an emerging field in the realm of large language models (LLMs), where techniques are being developed to push these models beyond their expected performance. Concepts from cognitive science are becoming the foundation for new prompting techniques as researchers strive to more closely mimic the processes of the human brain, giving these models the ability to solve in-depth, multi-step problems. However, despite these advancements, LLMs still struggle with tasks that demand exploration, strategic foresight, and other complex thinking processes.

To address these limitations, a novel prompting technique called Tree of Thoughts (ToT) prompting has been introduced. ToT frames any problem as a search over a tree, where each node is a state representing a partial solution: the input together with the sequence of thoughts so far.

Figure 1: Comparison of prompting technique structures. (Yao et al, 2023)

Process Outline

Tree of Thoughts is a paradigm that allows LLMs to explore multiple reasoning paths when solving problems. The ToT architecture has four parts:

  1. Thought decomposition
    Break down the problem-solving process into smaller thought steps. A thought should be big enough for LLMs to evaluate its usefulness but small enough for them to generate diverse samples.
  2. Thought generator
    Generate potential next thoughts for each state in the tree. There are two strategies:
    a) Sample independent thoughts from a CoT (Chain of Thought) prompt, which works well for rich thought spaces like paragraphs.
    b) Propose thoughts sequentially using a “propose prompt”, which works better for constrained thought spaces like single words or lines.
  3. State evaluator
    Evaluate the progress made by each state in the tree. This serves as a heuristic for the search algorithm to decide which states to explore further. Two strategies for evaluating states are:
    a) Value each state independently by reasoning about it and generating a scalar value or classification.
    b) Vote across states by comparing different states and voting for the most promising one.
  4. Search algorithm: Choose a tree search algorithm to use for exploring the problem space.
    a) Breadth-first search (BFS): This algorithm maintains a set of the most promising states per step. It’s used for problems where the tree depth is limited, and initial thought steps can be evaluated and pruned to a small set.
    b) Depth-first search (DFS): This algorithm explores the most promising state first until the final output is reached or the state evaluator deems it impossible to solve the problem from the current state. In the latter case, the subtree is pruned to trade exploration for exploitation. DFS backtracks to the parent state to continue exploration.
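
The four components above can be combined into a short, generic search loop. Here is a minimal BFS-style sketch; the `generate_thoughts` and `evaluate_state` callables are placeholders you would back with actual model calls (they, and the function names, are my own illustration, not code from the paper):

```python
from typing import Callable, List

def tot_bfs(
    initial_state: str,
    generate_thoughts: Callable[[str], List[str]],  # thought generator (part 2)
    evaluate_state: Callable[[str], float],         # state evaluator (part 3)
    max_depth: int = 3,                             # number of thought steps (part 1)
    breadth: int = 5,                               # states kept per level (part 4, BFS)
) -> List[str]:
    """Breadth-first Tree of Thoughts search: keep the `breadth` best states per step."""
    frontier = [initial_state]
    for _ in range(max_depth):
        # Expand every surviving state into candidate next states.
        candidates = [c for state in frontier for c in generate_thoughts(state)]
        if not candidates:
            break
        # Score candidates and prune back down to the most promising ones.
        candidates.sort(key=evaluate_state, reverse=True)
        frontier = candidates[:breadth]
    return frontier
```

Swapping the frontier for a stack and recursing on the single best child would turn the same skeleton into the DFS variant.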

Example

In the paper, the authors use three different problems that state-of-the-art LLMs like GPT-4 struggle with. They then apply ToT prompting to these problems and achieve impressively high results (see Figure 3). One of these challenges is called the Game of 24. It is a mathematical reasoning challenge where the goal is to use 4 numbers and basic arithmetic operations (multiplication, addition, division, and subtraction) to obtain an answer of 24. For example, given input “4 9 10 13”, a solution output could be “(10 - 4) * (13 - 9) = 24”. The basics of the process are discussed below and demonstrated by the researchers in Figure 2.
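
For intuition about the task itself (this is not part of the ToT method), the Game of 24 is small enough to verify or solve by brute force: repeatedly pick two of the remaining numbers, combine them with one of the four operations, and recurse. The helper below is my own illustration:

```python
from typing import List, Optional, Tuple

def solve_24(nums: List[int], target: float = 24.0, eps: float = 1e-6) -> Optional[str]:
    """Brute-force Game of 24 solver; returns one valid expression, or None."""
    items = [(float(n), str(n)) for n in nums]  # (value, expression) pairs
    return _search(items, target, eps)

def _search(items: List[Tuple[float, str]], target: float, eps: float) -> Optional[str]:
    if len(items) == 1:
        value, expr = items[0]
        return expr if abs(value - target) < eps else None
    # Try every ordered pair of remaining numbers and every operation.
    for i in range(len(items)):
        for j in range(len(items)):
            if i == j:
                continue
            (a, ea), (b, eb) = items[i], items[j]
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            combos = [(a + b, f"({ea} + {eb})"),
                      (a - b, f"({ea} - {eb})"),
                      (a * b, f"({ea} * {eb})")]
            if abs(b) > eps:  # guard against division by zero
                combos.append((a / b, f"({ea} / {eb})"))
            for value, expr in combos:
                result = _search(rest + [(value, expr)], target, eps)
                if result is not None:
                    return result
    return None
```

ToT, of course, is not about brute force; the point of the paper is to make the model itself search this space through proposed and evaluated thoughts.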

Figure 2: Game of 24 task flow. (Yao et al, 2023)

To kick off the problem-solving process, the propose prompt from the paper’s GitHub repository, which is called iteratively for each node of the tree, looks something like this:

propose_prompt = '''Input: 2 8 8 14
Possible next steps:
2 + 8 = 10 (left: 8 10 14)
8 / 2 = 4 (left: 4 8 14)
14 + 2 = 16 (left: 8 8 16)
2 * 8 = 16 (left: 8 14 16)
8 - 2 = 6 (left: 6 8 14)
14 - 8 = 6 (left: 2 6 8)
14 / 2 = 7 (left: 7 8 8)
14 - 2 = 12 (left: 8 8 12)
Input: {input}
Possible next steps:
'''

The propose prompt is first called on the four input numbers; the “possible next steps” it produces are extracted and become nodes themselves.
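
In code, expanding a node means formatting the propose prompt with the node's remaining numbers, calling the model, and parsing the "(left: ...)" suffix of each proposed line into a child state. This is a sketch under my own assumptions: `llm` stands in for a real API call, and the regex is one simple way to extract the children:

```python
import re
from typing import Callable, List

def get_next_states(llm: Callable[[str], str], propose_prompt: str, state: str) -> List[str]:
    """Expand one node: run the propose prompt on the state's remaining numbers
    and parse each '(left: ...)' suffix into a child state."""
    completion = llm(propose_prompt.format(input=state))
    return re.findall(r"\(left: ([^)]+)\)", completion)

# Toy stand-in for a real model call, returning text shaped like the examples above.
def fake_llm(prompt: str) -> str:
    return "2 + 8 = 10 (left: 8 10 14)\n8 / 2 = 4 (left: 4 8 14)"
```

Called as `get_next_states(fake_llm, template, "2 8 8 14")`, this yields the child states `["8 10 14", "4 8 14"]`.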

After each “thought” generation step, the new nodes (the “possible next steps”) are iteratively evaluated to determine how much each one contributes to the end solution. Such an evaluation prompt could look something like this:

value_prompt = '''Evaluate if given numbers can reach 24 (sure/likely/impossible)
10 14
10 + 14 = 24
sure
11 12
11 + 12 = 23
12 - 11 = 1
11 * 12 = 132
11 / 12 = 0.91
impossible
4 4 10
4 + 4 + 10 = 8 + 10 = 18
4 * 10 - 4 = 40 - 4 = 36
(10 - 4) * 4 = 6 * 4 = 24
sure
4 9 11
9 + 11 + 4 = 20 + 4 = 24
sure
5 7 8
5 + 7 + 8 = 12 + 8 = 20
(8 - 5) * 7 = 3 * 7 = 21
I cannot obtain 24 now, but numbers are within a reasonable range
likely
5 6 6
5 + 6 + 6 = 17
(6 - 5) * 6 = 1 * 6 = 6
I cannot obtain 24 now, but numbers are within a reasonable range
likely
10 10 11
10 + 10 + 11 = 31
(11 - 10) * 10 = 10
10 10 10 are all too big
impossible
1 3 3
1 * 3 * 3 = 9
(1 + 3) * 3 = 12
1 3 3 are all too small
impossible
{input}
'''

These evaluations are extracted, and the best nodes are selected; exactly how depends on the chosen selection and evaluation strategy and on the specific problem at hand. The entire process repeats until a node presents a solution that equals 24 or, more generally, achieves the desired goal. When this occurs, a chain-of-thought prompt is created: the path selected to arrive at the solution is injected into this final prompt, and a final answer is produced. An example of such a prompt could be structured as follows:
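
To let a search algorithm rank states, the labels from the value prompt (sure/likely/impossible) can be mapped to numeric scores. The mapping and helpers below are my own illustration (the exact score values are an arbitrary choice, not prescribed by the paper); `select_best` then performs the BFS-style pruning described earlier:

```python
from typing import Dict, List

# Illustrative label-to-score mapping: 'sure' states dominate the ranking,
# 'impossible' states are effectively pruned.
LABEL_SCORES: Dict[str, float] = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}

def score_state(completion: str) -> float:
    """Read the evaluation label from the last line of the value-prompt completion."""
    label = completion.strip().splitlines()[-1].strip().lower()
    return LABEL_SCORES.get(label, 0.0)

def select_best(states: List[str], scores: List[float], breadth: int = 5) -> List[str]:
    """Greedy BFS pruning: keep the `breadth` highest-scoring states."""
    ranked = sorted(zip(states, scores), key=lambda pair: pair[1], reverse=True)
    return [state for state, _ in ranked[:breadth]]
```

Averaging scores over several sampled evaluations per state is a natural extension when the evaluator is noisy.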

cot_prompt = '''Use numbers and basic arithmetic operations (+ - * /) to obtain 24. Each step, you are only allowed to choose two of the remaining numbers to obtain a new number.
Input: 4 4 6 8
Steps:
4 + 8 = 12 (left: 4 6 12)
6 - 4 = 2 (left: 2 12)
2 * 12 = 24 (left: 24)
Answer: (6 - 4) * (4 + 8) = 24

Input: 2 9 10 12
Steps:
12 * 2 = 24 (left: 9 10 24)
10 - 9 = 1 (left: 1 24)
24 * 1 = 24 (left: 24)
Answer: (12 * 2) * (10 - 9) = 24

Input: 4 9 10 13
Steps:
13 - 10 = 3 (left: 3 4 9)
9 - 3 = 6 (left: 4 6)
4 * 6 = 24 (left: 24)
Answer: 4 * (9 - (13 - 10)) = 24

Input: 1 4 8 8
Steps:
8 / 4 = 2 (left: 1 2 8)
1 + 2 = 3 (left: 3 8)
3 * 8 = 24 (left: 24)
Answer: (1 + 8 / 4) * 8 = 24

Input: 5 5 5 9
Steps:
5 + 5 = 10 (left: 5 9 10)
10 + 5 = 15 (left: 9 15)
15 + 9 = 24 (left: 24)
Answer: ((5 + 5) + 5) + 9 = 24

Input: {input}
Steps:
{steps}
Answer:
'''

Here, {steps} contains the sequence of nodes that were voted/evaluated as most likely to achieve the end goal.
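
Filling in this final prompt is a one-liner: join the winning path's steps, format the template, and make one last model call. A sketch, again with a placeholder `llm` callable of my own invention:

```python
from typing import Callable, List

def finalize_answer(llm: Callable[[str], str], cot_prompt: str,
                    puzzle_input: str, steps: List[str]) -> str:
    """Inject the winning path into the CoT prompt and request the final answer."""
    prompt = cot_prompt.format(input=puzzle_input, steps="\n".join(steps))
    return llm(prompt).strip()
```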

The ToT approach to problem-solving produced impressive results on these tasks compared with other techniques, as shown by the researchers in Figure 3.

Figure 3: Game of 24 success rates using different prompting techniques. (Yao et al, 2023)

Benefits

ToT offers several benefits for problem-solving with LMs:

  1. Generality: Other prompting techniques like IO, CoT, CoT-SC, and self-refinement can be seen as special cases of ToT (i.e., trees of limited depth and breadth).
  2. Modularity: The base LM, thought decomposition, generation, evaluation, and search procedures can all be varied independently.
  3. Adaptability: ToT accommodates different problem properties, LM capabilities, and resource constraints.
  4. Convenience: No extra training is needed — a pre-trained LM is sufficient.

Considerations

ToT requires more resources than conventional prompting methods to improve task performance; however, its modular flexibility allows users to customize performance-cost trade-offs. Increasingly performant open-source efforts, such as GPT4All or LLaMA, should help reduce these costs in the near future.

The researchers do note that deliberate search methods like ToT may not be necessary for the many tasks that advanced LMs like GPT-4 already excel at. The tasks used in the paper, which challenge GPT-4, were chosen precisely to demonstrate the need for better search and planning abilities.

Conclusion

As I am sure you can see, the potential for this type of prompting technique is vast. As LMs are applied to more real-world decision-making applications, more complex tasks will emerge, presenting new opportunities to apply prompting techniques such as ToT.

Sources

Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.