MCT Self-Refine algorithm: integrating LLMs with Monte Carlo Tree Search to achieve GPT-4-level complex mathematical reasoning with LLaMA-3 8B

SACHIN KUMAR
9 min read · Jun 17, 2024


LLMs face challenges in accuracy and reliability in strategic and mathematical reasoning. To address this, the authors of this paper [1] introduce the MCT Self-Refine (MCTSr) algorithm, an integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) designed to enhance performance on complex mathematical reasoning tasks.

The algorithm constructs a Monte Carlo search tree through iterative processes of Selection, self-refine, self-evaluation, and Backpropagation, utilizing an improved Upper Confidence Bound (UCB) formula to optimize the exploration-exploitation balance.

Key contributions:

  • development and validation of a novel reasoning algorithm integrating LLMs with UCT-MCTS; the authors enhance the algorithm’s key components to better accommodate LLMs and demonstrate its effectiveness on Olympiad-level mathematical problems
  • a dynamic pruning module that refines decision-making within the MCTS framework, enabling more efficient and accurate problem solving
  • extensive experimentation that provides insights into the synergistic potential of LLMs and MCTS, showing improved performance on complex reasoning tasks
  • the work advances the application of LLMs to sophisticated reasoning challenges and sets the stage for future innovations in integrating AI technologies for better decision-making, reasoning accuracy, and reliability of LLM-driven applications

Preliminary

i) Monte Carlo Tree Search (MCTS)

  • decision-making algorithm widely used in games and complex decision processes, which operates by building a search tree and simulating outcomes to estimate the value of actions

MCTS algorithm comprises four distinct phases:

a) Selection: Starting from the root, the algorithm navigates through promising child nodes based on specific strategies (e.g., UCT), continuing until a leaf node is reached.

b) Expansion: At the leaf node, unless it represents a terminal state of the game, one or more feasible new child nodes are added to illustrate potential future moves.

c) Simulation or Evaluation: From the newly added node, the algorithm conducts random simulations — often termed “rollouts” — by selecting moves arbitrarily until a game’s conclusion is reached, thereby evaluating the node’s potential.

d) Backpropagation: Post-simulation, the outcome (win, loss, or draw) is propagated back to the root, updating the statistical data (e.g., wins, losses) of each traversed node to inform future decisions.
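
To make these four phases concrete, here is a minimal, generic MCTS skeleton in Python. It is an illustrative sketch, not code from the paper: the Node fields and the `expand`/`simulate` callables are assumptions standing in for a real game or problem interface.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # game or problem state at this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def ucb1(node, c=1.41):
    """UCB-1 score used during Selection (unvisited nodes are tried first)."""
    if node.visits == 0:
        return float("inf")
    exploit = node.total_reward / node.visits
    explore = c * math.sqrt(2 * math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts(root, expand, simulate, n_iterations=100):
    """Generic MCTS loop: Selection -> Expansion -> Simulation -> Backpropagation."""
    for _ in range(n_iterations):
        # Selection: descend from the root along the best UCB-1 child until a leaf
        node = root
        while node.children:
            node = max(node.children, key=ucb1)
        # Expansion: add feasible child states (expand returns [] at terminal states)
        for child_state in expand(node.state):
            node.children.append(Node(child_state, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # Simulation: random rollout from the chosen node to estimate its value
        reward = simulate(leaf.state)
        # Backpropagation: update statistics on every node back up to the root
        while leaf is not None:
            leaf.visits += 1
            leaf.total_reward += reward
            leaf = leaf.parent
    # Final choice: the most-visited child of the root
    return max(root.children, key=lambda n: n.visits).state
```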

ii) Upper Confidence Bound applied on Trees

  • The UCT algorithm is crucial for the selection phase in MCTS, balancing exploration and exploitation by choosing the action j that maximizes

UCT_j = X̄_j + C · √( 2 · ln(N_C) / n_j )

where X̄_j is the average reward of action j, N_C is the total number of visits to the parent node, n_j is the number of times node j has been visited in simulations, and C is a constant balancing exploitation and exploration.
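
As a quick illustration with made-up numbers: for X̄_j = 0.6, N_C = 20, n_j = 5, and C = 1.41, the score is UCT_j = 0.6 + 1.41 · √(2 ln 20 / 5) ≈ 0.6 + 1.54 ≈ 2.14, while a sibling with the same average reward but only n_j = 1 visit would score ≈ 4.05, so the less-explored sibling is selected next.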

iii) MCT Self-Refine

  • The MCTSr algorithm integrates Monte Carlo Tree Search (MCTS) with large language models, abstracting the iterative refinement of mathematical problem solutions into a search tree structure.
  • Its operational workflow adheres to the general pattern of the MCTS algorithm.
  • It employs self-reflection-driven self-improvement to refine answers; rewards for different answer versions are sampled using the model’s self-reward capability.

Methodology

i) MCTSr Workflow

The workflow of MCTSr is outlined below; the agent learns decision-making and reasoning through trial and error, much as humans do.

The main workflow of MCTSr is structured as follows:

  • Initialization: A root node is established using either a naive model-generated answer or a dummy response (e.g., ’I don’t know.’) to minimize the model’s tendency to overfit to a single initial guess.
  • Selection: The algorithm employs a value function Q to rank all answers that were not fully expanded and selects the highest-valued node for further exploration and refinement using a greedy strategy.
  • Self-Refine: The selected answer a undergoes optimization using the Self-Refine framework [2]. Initially, the model generates a feedback m, guiding the refining process to produce an enhanced answer a′.
  • Self-Evaluation: The refined answer is scored to sample a reward value and compute its Q value. This involves model self-reward feedback and constraints such as strict scoring standards and suppression of perfect scores to ensure reliability and fairness in scoring.
  • Backpropagation: The value of the refined answer is propagated backward to its parent node and other related nodes to update the tree’s value information. If the Q value of any child node changes, the parent node’s Q is updated.
  • UCT update: After the Q values of all nodes are updated, we identify a collection C of candidate nodes for further expansion or Selection, then use the UCT update formula to update the UCT values of all nodes for the next Selection stage.

The algorithm iterates through these stages until a termination condition T is met, including rollout constraints or maximum exploration depth, continuously refining the quality of answers, and exploring new possibilities.
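
Putting the workflow together, here is a compact Python sketch of the loop. It is a simplified illustration under assumptions: `refine(problem, answer)` and `evaluate(problem, answer)` stand in for the LLM-backed Self-Refine and Self-Evaluation calls, one child is expanded per rollout, and the selection and backpropagation formulas are simplified versions of the ones detailed in the subsections that follow.

```python
import math

class AnswerNode:
    """One candidate answer in the MCTSr search tree."""
    def __init__(self, answer, parent=None):
        self.answer = answer
        self.parent = parent
        self.children = []
        self.Q = 0.0        # estimated quality of this answer
        self.visits = 0

def walk(node):
    """Yield every node in the tree rooted at `node`."""
    yield node
    for child in node.children:
        yield from walk(child)

def mctsr(problem, refine, evaluate, rollouts=8, max_children=2, c=1.0, eps=1e-8):
    """Sketch of one MCTSr run: Selection, Self-Refine, Self-Evaluation,
    Backpropagation, and UCT update, repeated for a fixed rollout budget."""
    root = AnswerNode("I don't know.")          # dummy initialization, as described above
    root.Q = evaluate(problem, root.answer)
    root.visits = 1
    for _ in range(rollouts):
        # Selection: among nodes that are not fully expanded, take the best UCT value
        candidates = [n for n in walk(root) if len(n.children) < max_children]
        def uct(n):
            parent_visits = n.parent.visits if n.parent else n.visits
            return n.Q + c * math.sqrt(math.log(parent_visits + 1) / (n.visits + eps))
        node = max(candidates, key=uct)
        node.visits += 1
        # Self-Refine: critique the selected answer, then produce an improved version
        child = AnswerNode(refine(problem, node.answer), parent=node)
        node.children.append(child)
        # Self-Evaluation: score the refined answer with the model's self-reward
        child.Q = evaluate(problem, child.answer)
        child.visits = 1
        # Backpropagation: a strong child raises its ancestors' Q values
        up = node
        while up is not None:
            up.Q = 0.5 * (up.Q + max(ch.Q for ch in up.children))
            up = up.parent
    return max(walk(root), key=lambda n: n.Q).answer
```

In practice, `refine` and `evaluate` would wrap chat-completion calls to LLaMA-3 8B with the refine prompt and the self-reward prompt, respectively.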

ii) Self-Refine

  • model is guided by a multi-turn dialogue refine prompt to optimize an answer a to problem P
  • Initially, the model generates a reflective or critical comment m regarding a.
  • Subsequently, guided by m, the model modifies a to produce an improved version a′.
  • This iterative refinement enhances the quality of the response, leveraging structured feedback to drive the evolution of the answer.
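
A minimal sketch of this two-step exchange, assuming a generic `chat(messages) -> str` wrapper around the model; the prompt wording below is illustrative, not the paper’s exact refine prompt.

```python
def self_refine(problem, answer, chat):
    """Two-step Self-Refine: (1) ask the model for a critical comment m on the
    current answer a, (2) ask it to rewrite a guided by m, producing a'."""
    critique = chat([
        {"role": "system", "content": "You are a strict math reviewer."},
        {"role": "user", "content": f"Problem:\n{problem}\n\nCurrent answer:\n{answer}\n\n"
                                    "Point out every flaw or missing step in this answer."},
    ])
    improved = chat([
        {"role": "system", "content": "You are a careful math solver."},
        {"role": "user", "content": f"Problem:\n{problem}\n\nPrevious answer:\n{answer}\n\n"
                                    f"Feedback:\n{critique}\n\nRewrite the answer, fixing every issue raised."},
    ])
    return improved
```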

iii) Self-Evaluation

  • In the refining process for mathematical problem P, the Q value of an answer a is defined as the expected quality of further refining a into a superior answer, owing to the Markovian nature of the transition from a to its rewritten forms.
  • Unlike traditional MCTS where Q(s, a) estimates the value of action a in state s, Q(a) here derives from multiple samplings of the reward function values attributed to a.
  • The model utilizes a self-reward method to estimate rewards for a, where it is required to provide a reward score ranging from -100 to 100
  • Because a model left to score its own answers tends to give overly generous rewards, three constraints are designed:

(1) Prompt Constraint: The model must adhere to the strictest standards during reward scoring.

(2) Full Score Suppression: The model is instructed not to provide full feedback scores; any reward above 95 is reduced by a constant amount to curb excessive scores.

(3) Repeated Sampling: Each visit to a search tree node involves the repeated sampling of the node’s rewards to enhance the reliability of the Self-Evaluation. It should be noted that when reward sampling is performed on the child nodes of a node, we will also perform reward sampling on its parent node to increase the sample size of reward sampling.
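
A sketch of how these constraints might be applied when turning sampled rewards into a Q value. The penalty constant and the aggregation (averaging the minimum and the mean of the samples) are illustrative assumptions, not values stated in this summary.

```python
def sample_q(problem, answer, reward_model, n_samples=3, cap=95, penalty=50):
    """Self-Evaluation sketch: repeatedly sample the model's self-reward for an
    answer, suppress near-perfect scores, and aggregate into a Q value.
    `reward_model(problem, answer)` returns a score in [-100, 100]."""
    rewards = []
    for _ in range(n_samples):
        r = reward_model(problem, answer)   # Prompt Constraint lives in the reward prompt itself
        if r > cap:                         # Full Score Suppression: curb excessive scores
            r -= penalty
        rewards.append(r)
    # Repeated Sampling: combining the worst and the average sample tempers optimistic scoring
    return 0.5 * (min(rewards) + sum(rewards) / len(rewards))
```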

iv) Backpropagation

  • After a leaf node’s reward sampling and Q value update are complete, the change is propagated to its parent and ancestor nodes.
  • During this update, if the Q value of any element of Children(a), the child set of a node a, changes, the node’s Q value is updated to

Q′(a) = ½ · ( Q(a) + max_{i ∈ Children(a)} Q(i) )

where Q′(a) is the updated quality value of answer a, accounting for the influence of its children; Q(a) is the naive quality value derived only from a’s own reward samplings; and max_{i ∈ Children(a)} Q(i) is the highest quality value among the children of a.
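
For example (with illustrative numbers), if Q(a) = 40 and the best of a’s children has Q = 80, the update gives Q′(a) = ½ · (40 + 80) = 60, pulling the parent’s estimate toward its most promising refinement.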

v) Update UCT and Selection

After updating the Q values across all nodes in the tree, we proceed to the selection phase for the next round of choices. This process includes the following steps:

a) Candidate Node Selection

  • Selection focuses on all leaf nodes and nodes that are not fully expanded; disregarding the history of refinement paths is feasible because the transition from an answer to its rewritten forms is Markovian, as noted above.
  • However, since the Large Language Model acting as the policy can generate an unbounded number of refinement actions m for any answer state a, each node faces a potentially infinite set of actions for expansion.
  • Two criteria therefore determine “full expansion”: the node’s children count reaches a predefined limit, and at least one child node’s Q value exceeds the node’s.
  • identify a collection C of candidate nodes based on these criteria for further expansion or Selection
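
A small helper capturing these criteria, building on the AnswerNode sketch above; `max_children` is a placeholder limit.

```python
def is_fully_expanded(node, max_children=2):
    """Both criteria from the text must hold: the children count reached its limit,
    and at least one child already outscores the node's own Q value."""
    return (len(node.children) >= max_children
            and any(child.Q > node.Q for child in node.children))

def candidate_nodes(root):
    """Candidate set C: leaf nodes plus any node that is not fully expanded."""
    selected, stack = [], [root]
    while stack:
        node = stack.pop()
        if not node.children or not is_fully_expanded(node):
            selected.append(node)
        stack.extend(node.children)
    return selected
```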

b) UCT Update

  • Drawing from AlphaGo, we use UCT with the UCB-1 method to balance the exploration and exploitation of nodes; for a node a in the candidate set C, its UCT value is

UCT_a = Q(a) + c · √( ln( N(father(a)) + 1 ) / ( N(a) + ε ) )

where Q(a) is the Q value of answer a, N(·) is the total number of visits to a given node, c is a constant balancing exploitation and exploration, and ε is a small constant to avoid division by zero.
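
A direct translation of this score, reusing the AnswerNode fields from the workflow sketch above; the constant values are placeholders.

```python
import math

def uct_value(node, c=1.41, eps=1e-8):
    """UCT score for a candidate node: exploitation via its Q value plus an
    exploration bonus that decays as the node accumulates visits."""
    parent_visits = node.parent.visits if node.parent else node.visits
    return node.Q + c * math.sqrt(math.log(parent_visits + 1) / (node.visits + eps))
```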

c) Sorting and Selection: Based on the UCT values of the candidate set C, an optimal node is selected to continue the refining process, either greedily (taking the highest UCT value) or via importance sampling.
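
Both options can be sketched in a few lines, reusing `uct_value` from above; the exact weighting scheme for the sampled variant is an assumption.

```python
import random

def select_next(candidates, greedy=True):
    """Pick the next node to refine from the candidate set C: either the single
    best UCT value (greedy) or a draw weighted by UCT (importance-sampling style)."""
    scored = [(uct_value(n), n) for n in candidates]
    if greedy:
        return max(scored, key=lambda pair: pair[0])[1]
    lowest = min(score for score, _ in scored)
    weights = [score - lowest + 1e-6 for score, _ in scored]   # keep weights positive
    return random.choices([n for _, n in scored], weights=weights, k=1)[0]
```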

vi) Termination Function

The search termination criteria T can derive from several conditions:

a) Early Stopping: Termination occurs when improvements in search results diminish or when consecutive searches yield repetitive outcomes.

b) Search Constraints: The search terminates once the number of rollouts reaches a predetermined limit or when one or more nodes in the tree satisfy the maximum depth constraint.

c) Advanced Criteria Based on Language Model Logits: The search concludes based on predefined metrics derived from the language model’s logits.

Once the Termination Function condition T is satisfied, we can gather the best answers from tree nodes according to Q values or other conditions.
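
A simple sketch of such a termination check in Python; the patience and tolerance values are placeholders, and the max-depth and logit-based criteria are not modeled.

```python
def should_stop(rollout, best_q_history, max_rollouts=8, patience=2, tol=1e-3):
    """Termination sketch covering Search Constraints (rollout budget) and Early
    Stopping (no meaningful improvement in the best Q value for `patience` rollouts)."""
    if rollout >= max_rollouts:
        return True
    if len(best_q_history) > patience:
        window = best_q_history[-(patience + 1):]
        if max(window) - window[0] <= tol:
            return True
    return False
```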

Evaluation

To assess the MCTSr algorithm’s effectiveness in solving mathematical problems, the authors employed LLaMA-3 8B (Meta AI, 2024) as the foundation model, enhanced with MCTSr.

i) GSM Benchmarks

  • evaluated the method on the test sets of GSM8K and GSM-Hard, which contain typical and more challenging grade-school mathematical problems, respectively
  • The paper’s results table reports the performance of MCTSr on the GSM datasets across rollout counts
  • results reveal a direct correlation between the number of MCTSr rollouts and success rates, significantly improving as iterations increase, especially in the less complex GSM8K
  • more intricate GSM-Hard set showcased a performance ceiling even at higher rollouts, indicating the limits of current strategies against complex problems
  • This work demonstrates the algorithm’s capacity to enhance problem-solving performance and its varying efficacy across problem complexities

ii) MATH Benchmark

  • The paper’s results table reports the performance of MCTSr on the MATH dataset, broken down by difficulty level
  • Level-1 results demonstrate the highest success rates, with the 8-rollouts MCTSr achieving a remarkable 90.16% success rate, solving 394 out of 437 problems. This level shows a clear progression in success rates as rollouts increase
  • On the most challenging Level-5 subset, the 8-rollouts MCTSr configuration yields a 34.06% success rate, solving 451 out of 1324 problems. This illustrates the increasing difficulty and the algorithm’s strained performance in highly complex scenarios
  • Overall performance across all levels shows a cumulative success rate of 58.24% with the 8-rollouts MCTSr, solving 2912 out of 5000 problems. This rate demonstrates a substantial enhancement from the Zero-Shot CoT’s initial rate of 24.36%.

iii) Olympiad-level Benchmarks

  • The paper’s results show the performance of MCTSr on Olympiad-level datasets:
  • AIME: From Zero-Shot CoT’s 2.36% (22 problems solved) to 8-rollouts MCTSr’s 11.79% (110 problems solved).
  • GAIC Math Odyssey: Showed substantial improvement, starting at 17.22% (67 problems solved) and reaching up to 49.36% (192 problems solved) with 8-rollouts MCTSr.
  • OlympiadBench: Improved from 1.25% (16 problems solved) in Zero-Shot CoT to 7.76% (99 problems solved) in 8-rollouts MCTSr.

Limitations

  • As a general decision-making framework, the potential applications of MCTSr in various scenarios remain to be explored further, such as in black-box optimization problems and self-driven alignment for large language models.
  • Additionally, the components of MCTSr are highly scalable, necessitating ongoing development to identify and compare a broader range of component algorithms, thereby enhancing the practical potential and effectiveness of the MCTSr algorithm.

Conclusion

  • The paper demonstrates the effectiveness of the MCT Self-Refine (MCTSr) algorithm in enhancing the capability of Large Language Models (LLMs) to solve complex mathematical problems.
  • By integrating Monte Carlo Tree Search (MCTS) with LLMs, MCTSr addresses critical challenges in accuracy and reliability, particularly in mathematical reasoning tasks.
  • Experimental results confirm significant improvements in problem-solving success rates across multiple datasets, including notable performance on Olympiad-level mathematical challenges.

Paper: https://arxiv.org/abs/2406.07394

Code: https://github.com/trotsky1997/MathBlackBox

References:

  1. Zhang et al., “Accessing GPT-4 Level Mathematical Olympiad Solutions via Monte Carlo Tree Self-Refine with LLaMa-3 8B: A Technical Report.” arXiv:2406.07394.
  2. Madaan et al., “Self-Refine: Iterative Refinement with Self-Feedback.” arXiv:2303.17651.
