Image created by the author with Midjourney

From Instinct to Insight: How ROAR Transforms AI Thinking

Dust
6 min read · Sep 10, 2024

--

Introduction

In recent months, there has been a growing recognition that the effectiveness of Large Language Models (LLMs) depends not only on the quality of the models themselves (although that is undoubtedly important), but also on how they are employed as part of a broader, more complex system. The way LLMs perform their tasks within such systems has become equally critical. The following insights are becoming increasingly apparent:

  1. Explicit verbalization of reasoning steps: The quality of the model’s output improves when reasoning steps are explicitly articulated, rather than assumed as implicit parts of the model’s internal processing. Prompting techniques such as Chain-of-Thought and Tree-of-Thought, as well as innovations like “Reflection Llama,” all point in this direction.
  2. Self-reflection and correction: LLMs are capable of improving their performance autonomously by reflecting on and correcting their own outputs.
  3. System 1 thinking in LLMs: LLMs often generate responses resembling human “System 1” thinking: intuitive, rapid answers based on learned associations. However, this can lead to errors, particularly in logical reasoning tasks. Numerous logic puzzles demonstrate that current LLMs struggle to identify the underlying logic of input elements. While approaches like “Grokking” aim to address this, they still primarily focus on producing an intuitive grasp of logic rather than a robust, explicit reasoning process.
  4. Improved output with more computational time: When LLMs are given additional time to “think” (i.e., more computation cycles), they are capable of producing better, more nuanced answers. Several methods exist to achieve this effect.

Problem Hypothesis

In addition to the LLM’s inherent capabilities, the specific mechanism through which it performs inference — particularly when done iteratively — plays a crucial role. The instinctive responses generated by an LLM may be sufficient for simpler System-1-type tasks, but more complex reasoning requires a mechanism that can trigger and support System-2 thinking. This mechanism must:

  1. Enable temporary storage and manipulation of intermediate results: The LLM should have a way to temporarily hold and work on intermediate steps before arriving at a final conclusion.
  2. Dynamically allocate computational resources: The model needs a way to adjust the amount of computation time it uses based on the complexity of the task at hand. This additional computation should be allocated dynamically, based on the need for deeper reasoning.

Introducing ROAR: Recursive Optimization and Adaptive Reasoning

I propose a novel approach to address these issues, which I call Recursive Optimization and Adaptive Reasoning (ROAR). ROAR consists of two primary components:

1. Enforcing JSON-Formatted Output

JSON is a well-established format for text-based data exchange, and modern LLMs are familiar with its structure from training data. However, instead of merely requesting JSON output via prompting, the ROAR approach enforces JSON formatting by restricting the model’s output to only legal JSON tokens. This provides a kind of “scaffolding” for the LLM, helping it avoid irrelevant tokens and focus on producing structured, meaningful outputs.
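
To illustrate what “restricting the model’s output to only legal JSON tokens” means mechanically, here is a deliberately tiny, stdlib-only sketch of logit masking. Real constrained-decoding implementations track a proper grammar/parser state; this toy instead approximates “is this a valid JSON prefix?” by testing a handful of hypothetical completions. The vocabulary and logit values are made up purely for illustration.

```python
import json

# Toy constrained decoding: at each step, the sampler may only pick tokens
# that keep the output a prefix of valid JSON. We approximate prefix
# validity by checking whether some small completion makes the text parse.
COMPLETIONS = ["", '"', '"}', "}", "}}", ": null}", '": null}']

def is_json_prefix(text: str) -> bool:
    """Heuristic: does some plausible completion turn `text` into valid JSON?"""
    for tail in COMPLETIONS:
        try:
            json.loads(text + tail)
            return True
        except json.JSONDecodeError:
            continue
    return False

def mask_logits(prefix: str, vocab: list[str], logits: list[float]) -> list[float]:
    """Set the logit of every token that would break JSON validity to -inf."""
    return [
        logit if is_json_prefix(prefix + token) else float("-inf")
        for token, logit in zip(vocab, logits)
    ]

# After an opening brace, only a key, whitespace, or a closing brace survive.
vocab = ['{', '"Reasoning"', ':', ' ', 'hello', '}']
masked = mask_logits('{', vocab, [1.0] * len(vocab))
```

Note that the whitespace token survives the mask: filler tokens remain legal everywhere JSON permits them, which is exactly the property the next paragraphs build on.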

By using JSON, it’s possible to define a dynamic schema that includes keys such as “Reasoning,” “Reflection,” and “Response.” This encourages the LLM to generate content that fits within these categories, effectively guiding its reasoning process. The LLM can still choose the order in which it fills out the schema, and some keys (such as “Response”) can be made optional or restricted until certain conditions are met. The data types of values can also be predefined, making the model’s output even more structured. Tools like Pydantic offer robust possibilities for schema validation.
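
As a concrete, minimal illustration of such a schema, the sketch below uses only the standard library. Pydantic (mentioned above) would handle this far more robustly; the rule that “Response” requires a prior “Reasoning” is just one example of the kind of condition the text describes, and the key names mirror the proposed schema.

```python
# Minimal ROAR-style output schema: key names, value types, and one
# ordering condition ("Response" only after "Reasoning" exists).
SCHEMA = {
    "Reasoning": str,
    "Reflection": str,
    "Response": str,
}

def validate(output: dict) -> list[str]:
    """Return a list of violations; an empty list means the output is valid."""
    errors = []
    for key, value in output.items():
        if key not in SCHEMA:
            errors.append(f"unknown key: {key}")
        elif not isinstance(value, SCHEMA[key]):
            errors.append(f"{key}: expected {SCHEMA[key].__name__}")
    if "Response" in output and "Reasoning" not in output:
        errors.append("Response not allowed before Reasoning")
    return errors
```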

One of the key advantages of JSON is that it supports System-2 thinking by allowing the legal insertion of filler tokens such as spaces, tabs, or newlines, which carry little semantic meaning. These filler tokens give the LLM additional time (and generative cycles) to process its output before committing to semantically meaningful elements. For example, a plain-text sequence padded with stray whitespace, such as

    The answer is

          42

would be highly unlikely in regular LLM output, as it deviates too far from common training data. However, a sequence like

    {
      "Reasoning": "...",

      "Response": "42"
    }

is (at least) valid JSON despite the extra whitespace. While there are existing methods that introduce dedicated filler tokens, these approaches typically require explicit fine-tuning. In contrast, the JSON-based approach requires no fine-tuning, and models like Mixtral or Gemma 2 can adopt this technique even in zero-shot scenarios without any additional prompting.

For performance reasons, the number of allowable consecutive filler tokens should be limited. Future improvements could include automatically inserting filler tokens when the entropy of the generated probability distribution is too high, indicating uncertainty in the model.
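
That entropy-based idea can be sketched as follows. The threshold, the filler budget, and the toy vocabulary are illustrative assumptions, and `greedy_token` stands in for a real sampler.

```python
import math

MAX_FILLERS = 3          # cap on consecutive filler tokens (assumed)
ENTROPY_THRESHOLD = 2.0  # bits; a tunable uncertainty cutoff (assumed)
VOCAB = ['"', '}', ':', 'Reasoning', '\n']

def entropy(probs: list[float]) -> float:
    """Shannon entropy in bits of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def greedy_token(probs: list[float]) -> str:
    """Stand-in for the real sampler: pick the most probable token."""
    return VOCAB[max(range(len(probs)), key=probs.__getitem__)]

def next_token(probs: list[float], fillers_used: int) -> tuple[str, int]:
    """Emit a newline filler while the model is uncertain and budget remains;
    otherwise commit to a content token."""
    if entropy(probs) > ENTROPY_THRESHOLD and fillers_used < MAX_FILLERS:
        return "\n", fillers_used + 1
    return greedy_token(probs), fillers_used
```

A uniform distribution over five tokens carries about 2.32 bits of entropy and would trigger a filler; a sharply peaked one falls well under the threshold and commits immediately.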

2. Feedback Loop

Current transformer architectures operate unidirectionally, meaning that input data (prompt plus any previously generated tokens) is processed sequentially. While the attention mechanism allows each token to relate to every other token, the process is not cumulative. Transformers cannot reevaluate previously generated relationships based on new tokens. For example, if tokens A, B, and C appear in sequence, transformers can evaluate the relationships between A and B, A and C, and B and C, but they cannot reevaluate the relationship between A and B after encountering C. This limitation may be one of the reasons why current models struggle with common logic puzzles, although this is only speculation.

There is a need for a mechanism that allows the model to reassess previously processed input data based on new information. While one could consider extending transformer architectures with a bidirectional component, this would be computationally expensive and could significantly impact performance. However, this may not be necessary. Since input data is reprocessed for each newly generated token, allowing the LLM to generate structures similar to “Short Term Memory” could enable this reassessment. These structures could then be prepended to the prompt in subsequent generations, influencing the next set of tokens. This requires the iterative approach described in the following section.
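
A minimal sketch of such a “Short Term Memory” structure, assuming a simple recursive dictionary merge (the function names are illustrative, not part of any library):

```python
import json

def merge(memory: dict, update: dict) -> dict:
    """Recursively merge `update` into a copy of `memory`; new values win."""
    merged = dict(memory)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def build_prompt(memory: dict, user_prompt: str) -> str:
    """Prepend the accumulated memory to the original prompt, so the next
    generation pass can reassess earlier input in light of it."""
    return f"Short Term Memory:\n{json.dumps(memory, indent=2)}\n\n{user_prompt}"
```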

Iterative Process

To enable the functionalities outlined in the JSON-based structure and feedback loop, an iterative generation process is required. The process works as follows:

  1. The LLM begins by generating parts of the output JSON, possibly producing various elements before arriving at the final answer. Generation may be interrupted either because the limit on consecutive filler tokens is exceeded or because an EOS token is sampled.
  2. The generated content is read as a data structure. If the JSON is incomplete, tools such as json_repair can be used to salvage as much of the JSON as possible and to remove filler tokens.
  3. Elements of the repaired JSON are either treated as “Short Term Memory” and prepended to the prompt (this can be as simple as recursively merging dictionary structures), or they serve as the basis for the final response.
  4. The process is repeated from the beginning.
  5. Once certain conditions are met, the generation process ends, and the final result is returned. In my experiments, placing the final “Response” key at the end of the schema definition has been sufficient to signal when the process is complete. The model then decides autonomously whether multiple iterations are needed or whether it can deliver an answer in the first pass.
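
The five steps above can be sketched as a single control loop. Here `generate` is a stand-in for a constrained LLM call and `repair` is a stub for a tool like json_repair; a shallow merge is used for brevity, where a recursive one would be used in practice.

```python
import json

def repair(raw: str) -> dict:
    """Stub: salvage as much JSON as possible (json_repair in practice)."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {}

def roar_loop(prompt: str, generate, max_iters: int = 5) -> str:
    """Iterate: generate -> repair -> fold into memory -> re-prompt,
    until the final "Response" key appears or the budget runs out."""
    memory: dict = {}
    for _ in range(max_iters):
        context = f"Short Term Memory:\n{json.dumps(memory)}\n\n{prompt}"
        update = repair(generate(context))
        memory = {**memory, **update}   # shallow merge for brevity
        if "Response" in memory:        # final key signals completion
            return memory["Response"]
    return memory.get("Response", "")
```

Because “Response” sits last in the schema, the model itself decides whether it needs several passes of reasoning and reflection or can answer on the first one.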

By using techniques such as paged KV caching, the performance impact of this iterative approach can be minimized, as the key-value cache does not need to be recomputed in its entirety on each pass.

Conclusion

The ROAR method offers a new approach for enabling LLMs to engage in deeper, more reflective System-2 thinking without compromising performance. By leveraging JSON formatting and iterative feedback loops, ROAR dynamically adjusts computational resources to meet the needs of complex tasks, fostering more thoughtful and accurate outputs. This approach holds significant potential for advancing the capabilities of future LLM architectures.

Unfortunately, I currently lack the time to conduct a rigorous scientific validation and evaluation of this approach. Therefore, I am sharing these ideas in the hope of inspiring others to explore, challenge, and build upon them. My goal is to foster collaboration and encourage further development of these concepts within the broader (research) community. Hopefully I will at least be able to publish my implementation for reference soon.
