AI: Beyond chain of thought reasoning

Adrian Chan
27 min read · Mar 28, 2024

Exploring design of generative AI through human communication

Talking to “god” in 3 Body Problem

There is a moment in the Netflix series 3 Body Problem in which the leader of a group housed on a freighter named Judgment Day is speaking to a distant alien intelligence. Explaining the circumstances of a children’s story, he falls into an explanation for how and why humans lie to one another. The alien is confused and seems unable to understand how communication could be subject to false representation or dissembling. The group’s leader explains that humans sometimes lie for no good reason, but at other times to conceal a truth or even to protect others from harm or concern. The alien explains that among its species, a thought is communicated as soon as it occurs.

The exchange portrays the differences between machine communication and human communication.

In the cybernetic model of machine communication, sometimes mistakenly applied to human interaction as well, sender and receiver transmit messages (content). Communication is the transmission of messages. It is assumed that sender and receiver understand the messages. That is, the message contains its meaning.

In human communication, on the other hand, any message involves interpretation. The meaning of the communication is not the content of the message, or the “utterance,” but the reasons or claims made by the utterance. The process of agreeing or disagreeing about these claims is the very purpose of human communication. The purpose of human communication is not the transmission of messages, orders, commands, and instructions — as would be the case with machine talk — but reaching understanding through communication. In communicating, we seek to understand what we’re saying to each other, and then agree or disagree, depending on the validity of the claims we make on each other.

I have included quotes below from some white papers published about reasoning in large language models. Chain of thought and other reasoning techniques are used to explore the capabilities of LLMs for communicating with users. This research is interesting, and provokes questions about what we can do with LLMs, and how to design them. As we will see, there is a lot more to human communication than reasoning with logical propositions, and much more to communication than producing linguistic statements.

We don’t know yet what AI can do. And we don’t yet understand what we might do with it.

Insofar as designers can and should participate in shaping the user experience of and with generative AI, communication presents the world of greatest opportunities. In leveraging communication we can open or narrow the interaction space, formalize instructions, pace and steer conversation, guide queries and searches to their conclusions, and so much more.

To understand the design opportunities in generative AI, we need to explore human communication, and consider the ways in which it can be used to shape the interaction with LLMs. As we will see, this involves a lot more than just chain of thought reasoning.

Mutual understanding as a feature of communication

In human social interaction, we can reach an implicit agreement with one another to try to find a mutual understanding about something. This is such a second-nature feature of social interaction that we don’t think about it. If we’re confused about what another person is saying, or means to say, or intends, we seek clarification. Even if we disagree with a person, we do it on the basis of understanding what they have said.

There is a tight relationship between words used to express and convey content or meaning, and the commitment we make to each other to talk: to keep talking, stay engaged and interested, pay attention, and at least understand one another.

This would be the grail for large language models. But LLMs use words without understanding what they mean. They have no subjective self, no agency, and thus no ability to relate to others. At best they can only force (by prompting) reflection on the logic of their “thoughts” or formulations. This could mean that in the world of logical propositions, they can reason their way to a type of logical or algorithmic “understanding.” But they can’t reach understanding with another human subject about what is said, in the sense that we do, implicitly, when we are engaged in social interaction.

Designing Communication with LLMs

We design generative AI experiences to leverage natural human interaction models (speaking, writing, reading), and we integrate AI into existing software and applications for enhanced functionalities. To date, much of the exploration of natural language interactions has gone into prompting and prompt engineering. This is why research into chain of thought reasoning and other reasoning methods is so interesting. Reasoning research explores the degree to which LLMs can use human cognitive reasoning styles.

The goal of this research is to help expand the palette of prompts available for use with LLMs. If LLMs can be made to reason — or at least to reflect on their reasons — then functionalities and operations might be created for LLMs based on the use of straightforward logical instructions.

But I think that the most compelling opportunities ahead for the interaction design of generative AI lie in the communicative dimensions of language.

Human communication enables people to do things based on reasons that are not strictly logical and propositional, but which make use of reasons nonetheless. Could an AI reach an understanding with a person? What kind of reasoning would it have to use to reflect on whether it believes it understands the user: not just what the user has said, but what the user means to say in saying it; what the user intends?

Influence, not reasons

There are in human interaction other ways of convincing one another of the reasons for doing something. And they’re not necessarily logical. In fact they appeal in many cases to psychological biases and proclivities, to logical weaknesses, to habits and inclinations, dependencies, perceptions and sensitivities, and so much more.

In fact it’s interesting that there’s so little “cognitive science” in the world of LLM prompting, insofar as we do not worry about things like recency bias and confirmation bias with LLMs. That’s because they are not “social.” Nor are they living subjects. LLMs cannot have “recency” because they have no time, do not live through time or occupy time.

They can’t have recency bias any more than they can have expectations or forgetfulness. Likewise, without expectations, they can’t have confirmation bias. (The idea that there’s bias in the training data is a separate matter — not a clear example of cognitive bias but an attribute and aspect of language and representation.)

Can AI argue with us?

In any human communication situation, claims are made on subjects to accept or reject communication: requests, commands, appeals, recommendations, suggestions, and so on. When we make claims on each other like this, language is but one of the dimensions of communication used. We ask that we pay attention to each other; acknowledge each other; affirm and perhaps like each other; etc.

“Given a speaker’s need to know whether his message has been received, and if so, whether or not it has been passably understood, and given a recipient’s need to show that he has received the message and correctly — given these very fundamental requirements of talk as a communication system — we have the essential rationale for the very existence of adjacency pairs, that is, for the organization of talk into two-part exchanges. We have an understanding of why any next utterance after a question is examined for how it might be an answer.” Erving Goffman, Forms of Talk

A similar breadth of expression applies to the claims we make: we may provide good logical reasons for a claim, but we may also appeal to popularity, success, timeliness, urgency, exclusiveness, and so on. Just think of the claims made by political figures, news pundits, influencers, and advertisers.

A future advertising generative AI will not use logic to make sales. It will use reasons, yes, but these reasons will appeal to claims of a different kind. An AI used for sales and advertising, for example, might make claims to a product’s affordability, to the urgency of making a purchase, to a brand’s sex appeal, popularity, or exclusivity. AIs put to this use will make claims to the psychology of users, such as their susceptibility to praise and compliments, their competitiveness, or their need for attention.

The reasoning explored in the research I cite here is constrained to propositional logic and oriented towards understanding what models know, how they know it, and how they can be prompted to explain it. This kind of reasoning is far from the reasoning used in the dark arts of advertising. And so it comes up far short of making the claims upon users that are made in day-to-day, face-to-face interactions.

But I think that the design challenge will be in expanding this kind of inquiry into LLMs to indeed include other kinds of reasoning — the reasoning we use day-to-day in so many other kinds of decision making. What are the limits to self talk as a means of reflecting on reasons? Can an AI self-talk its way through the interpersonal or social validity of claims to reasons that can be motivating to a person, but which are not strictly logical? And can it develop arguments, or ways of communicating its reasons to a user, in a convincing fashion?

“What, then, is talk viewed interactionally? It is an example of that arrangement by which individuals come together and sustain matters having a ratified, joint, current, and running claim upon attention, a claim which lodges them together on some sort of intersubjective, mental world.” Erving Goffman, Forms of Talk

Can we share time with AI?

When people interact with each other, they share time together. In their time together, they create a stretch of shared time. During this time they are attentive to one another — directly and indirectly. This stretch or strip of time ends when the interaction breaks off.

It’s been seen that in forcing LLMs to engage in chain of thought reasoning, prompt engineers cause the model to “take its time.” Prompts literally instruct the model to pause, or to wait. This isn’t a result of anthropomorphizing the design or interaction with LLMs. It’s meant to produce a break in the model’s “thought process.” Instructing the model to pause or to wait is not to tell the model to take time. Rather it is to tell the model to interrupt its work and to reflect on its own reasoning.

AI does not occupy time, of course. When we “take our time” to read through instructions and arguments, as we imagine our LLMs to do, we do in fact take time. The model has no lived experience and doesn’t persist through time as an identity, reflecting subject, actor or agent. Just as it doesn’t have time, it doesn’t have speed. It doesn’t think slowly or quickly, or more slowly or more quickly. Its calculations are only sequential. Machine time is a clock time measured by the speed of calculations, or processing, and not the passage of organic time.

Asking an LLM to pause and to reflect, to take time to think, to consider its own reasons, is not the same as a human subject taking time to think. Nor does the time spent reasoning with an AI amount to sharing a stretch of time, or sharing time together.

All this is worth considering simply because the user experience design of generative AI does have to involve time, and will be shaped by considerations of fast and slow AI behaviors. The sense of spending time with an AI will be enhanced, from an interaction design perspective, if it seems that users and AI can get in synch with one another. (For backend processing and the application of AI to task automation, work can be as near-instantaneous as the model and its tasks allow.)

Whilst AIs cannot technically experience the passage of time, or share time in communicating with us, they can perhaps be designed to seem to. This dimension of generative AI interaction design will be interesting, as the more that LLMs approximate human expressiveness and emotion, the more familiar and natural they are made to communicate, the more it will feel to us users that they are sharing time with us and we with them. A sense of being in synch, and of being together, will then necessarily feature larger in the behavioral design of generative AI.

“One does one’s thinking before one knows what one is to think about.”

“language is an organ of perception, not simply a means of communication”

Julian Jaynes The Origin of Consciousness in the Breakdown of the Bicameral Mind

Dolores, in Westworld

The self talk of AI

In the series Westworld, a character named Dolores comes to a realization about herself and about the loops she has been living and re-living as her character is reset in the world she lives in. The realization involves an insight into a voice that she’s been hearing; and the discovery that in making this realization, she is free to make her own choices.

I asked ChatGPT what caused Dolores’ moment of self-realization. Note point number 6.

“Her pivotal moment of self-awareness occurs when she realizes that the voice she’s been hearing, guiding her towards the Maze, is not Arnold’s but her own.” ChatGPT

The episode in which Dolores achieves self-discovery is titled The Bicameral Mind. The title refers to the book and theory of Julian Jaynes, in which the author presents a model of consciousness tied to the split responsibilities of the left and right brain. In Westworld, Dolores becomes self-aware through self-reflection. That self-reflection is achieved linguistically: it uses words, but more importantly involves her recognition of her own voice.

The white papers cited below explore a version of self-reflection by generative AI large language models that can seem like an analog of the self-talk practiced by characters in Westworld. Their inner dialog was meant to build character by exploring and practicing depth of character. It’s actually fascinating, and even prescient of the show’s creators, that AI characters are seen doing this before the launch of ChatGPT.

But is the self-reflection illustrated in these white papers the same practice? The inner dialog, or self-talk, self-reflection, spelling out, or reasoning demonstrated is not a coming to consciousness of AI, clearly. Rather it is the production of words and phrases with which the AI can re-prompt itself. It’s the production of context, or the contextualization of expression as a form of virtual action.

The reasoning explored in these research examples might be called inner dialog but it is not dialogical in the sense of communication. It comprises logical propositions, or statements, used in a type of self-reflection forced by the model on its own generated output. Dialog as an example of communication between subjects involves actors addressing one another. Dolores, in her self-talk, addresses herself as a Self: that is the breakthrough of consciousness for AI in Westworld.

ChatGPT’s response to: How does Dolores become self aware in Westworld Season 1?

AI agents don’t “pay attention”

The reasoning practiced by LLMs when prompted to engage in chain of thought reasoning or step by step thinking is not self awareness, nor is it really inner dialog. Because generative AIs have no self, nor a sense of self, they are also incapable of showing or presenting themselves. They have no self to present. But they can seem to.

In human face to face interaction, presentation of self is an essential aspect of being present with and to others. It is also an essential aspect of paying attention. We pay attention to others by showing them attention, by taking them into our attentive awareness. (This is what Dolores is doing as an AI in Westworld: paying attention to herself, or being self-aware.)

There is something in the interaction with generative AI that can give the impression that in their responses to us they pay us attention. From a design perspective, we as users become engaged. We pay attention to the task at hand (interacting), and it seems as if AI-generated responses pay attention to us. This is a sleight of hand, an illusion, an error of perception, or a property of language. Which of these it is remains a matter for debate, and is contingent on one’s framing of the interaction model appropriate for human use of generative AI. It is, possibly, simply an act of projection (what some call anthropomorphization).

AIs don’t pay us any attention. They don’t present themselves, either to themselves or to us. They don’t make an appeal to our awareness and presence, insofar as we might be said to pay them attention or, opposite, ignore them. So the conversationality of interacting with an AI over a series of prompts and responses is not dialogical in the human, intersubjective, sense.

But from a design perspective, there is nonetheless an “as if” property to both the design of large language model behaviors, and to the design of user experiences. This is due to the simple fact that when we engage in talk, we draw on our competencies as independent speaking subjects — as persons.

The task, for designers, developers, engineers, and implementers alike, is to explore this space in which both the capacities of generative AI and the interest of users can be developed through the intrinsic properties of language and action.

There is something unique to human attention, in other words, that doesn’t exist in machine attention. Humans can appeal to each other’s attention and know that they’re being paid attention to. An AI doesn’t and can’t do the same. (Social media work by screening out this show of attention, thus causing many users to rely on signs of attention instead: likes, shares, emojis, etc. The UI of social platforms is replete with socio-technical architectural features designed to channel and present attention.)

Reasoning by models is monological. It occurs by prompting the model to reflect on and deconstruct or decompose its responses. The model is forced to generate additional content, in other words, to explain and expand upon its already generated response. On account of the logic within human communication on which the model is trained, the model can tease out the incremental logical steps and reasons for its answers.

In common discourse, in fact, we might rarely have to give the reasons for our decisions. And when it comes to making those choices, we often rely on authority. After all, we are ourselves not the best judge of which reasons are right and which are wrong, or which are better and which are worse.

Given the highly expert and specialized nature of much of the modern world, we rely on the authoritativeness of experts to substitute for logic and reasoning. The opinion of an expert will do.

“You think the grief will make you smaller inside, like your heart will collapse in on itself, but it doesn’t. I feel spaces opening up inside of me, like a building with rooms I’ve never explored.” — Dolores, Westworld

Implicit Rules

Interestingly, in human social interaction we have reasons and hew to rules and codes implicitly. Social codes of conduct are unwritten and unspoken rules of a sort. They are subtextual to our conversations and interactions. So much so that it would be a strange kind of prompting if we were asked to expose and articulate all of the individual, interpersonal, and social behavioral norms we follow in any given situation. This is what we’re in effect doing by asking LLMs to think logically, to follow and articulate step-by-step reasons, to decompose problems into constituent parts, to infer, deduce, or induce logical explanations.

How much of this is language games? How much of it is real? To what extent are we just learning how to read and thus engage with generative AI? And more importantly, how much closer can AI approximate our own reasons, our own rules and codes, our own contexts?

The power of social interaction in face-to-face settings, and to a lesser degree in online real-time experiences, is that so much is done without words. Words are allowed to manage and handle communication of content claims and expressions. The rest is handled implicitly: the alignment of interest; the sharing of attention, space, and time; the acknowledgement of one another’s presence; and displays and reciprocation of interest. All these and other implicit features of what makes face-to-face social life interesting and essential permit us the freedom to engage in conversation without having to explicitly reveal our reasons.

Will more agentic AI understand our social rules?

As generative AI becomes more agentic, as it becomes embodied in robots and physical interfaces, and as it becomes more embedded in software tools and workflows, even social platforms possibly, does it also learn about these implicit social and cultural rules? Will there come a time when large language models not only converse with us competently and intelligently and interact with us as we engage in various tasks and workflows, but also hew to social conventions?

If AI could be prompted to share the reasons for certain social behaviors, could it do so? In what circumstances might a conversational agent be prompted to both articulate the correct content and also engage in a behavioral manner so as to promote certain kinds of user responses? Say, for an online educator or tutor? Can such an agent be designed to provide encouraging and supportive learning co-piloting, and if so what kinds of prompts would be designed into it to cover the range of pro-social behaviors that foster learning?

Now thinking even more dialogically, can we design models to use various social games to augment their logical thought processes? If so, there might be many prompting techniques for engineering agents to serve as everything from customer support to sales, technical support, medical support, education and tutoring, entertainment and gaming, and much more.

Teaching LLMs social rules

If chain of thought reasoning is a method with which to force an LLM to explicate its “thinking” step by step, the type of “reasoning” that “governs” human social behavior is far more complex. It is not only complex, but it is implicit. So for an AI to unpack the reasons that people behave the way they do in certain social situations, it would have to proceed not only through the “logic” of social codes and norms (of behavior and speech), but it would have to do so at a meta level.

Behavioral norms are not explicit at the linguistic level. They’re not expressed explicitly in language. So the AI would have to represent the social situation to itself, analyze relationships, situations, intentions and motivations, and risks, before even being able to explicate the options for things to say (and not to say). This seems nigh on impossible given the complexity of real-life human social situations. Whilst an AI might be able to do some kind of analysis of social situations, it would be an entirely different matter for the AI to participate in the situation.

And yet it does seem that much of this social coding comes through in language, and is indeed in the training and tuning of LLMs.

Generative AI, whilst it could never be an actual agent in human social situations, could increasingly fake it. And it does seem that a good fake, if online, might be enough to fool or compel users — to the degree, at least, that designers should take this seriously.

A lot of effort in LLMs is going into developing planning and action sequences so that LLMs can be used to execute, automate, streamline, co-pilot and assist in tasks, real and online. But the models are trained on events (clicks, selections, scrolling, etc) that are available and consistent. Rules can be extracted or learned from these kinds of events or user actions.

How would an AI learn the rules of social situations? These rules are implicit. They are not expressed linguistically when people interact, but are referred to or used prior to the use of language. The rules come through what people say and how they say it, but are not explicit in what people say.

It doesn’t seem likely that there could be a path by which LLMs might learn to reflect on the rules, codes, behaviors and such of open, social interactions and situations. The subtleties, nuances, complexities, and other contextual aspects of real-life situations are simply too much for any kind of processing. The spontaneity of real-life situations cannot be reasoned out or through.

But it might be possible to conceive of LLMs designed to communicate reasonably well, passably competently, if constrained to predefined contexts. An LLM might be designed to “do” law, or business analysis, or online dating matchmaking. Formal language settings do exist, and it might be possible to create “reasoning schemas” and “behavioral schemas” specifically for given situations or use cases. The LLMs would have to be given fail-over rules for how to behave and communicate when interaction steps outside the given format. But it’s something to consider.

As buried and deep as these rules might be, it is possible to imagine that with the right training, tuning, and either policies and rewards or reinforcement learning, large language models might learn rules separately from language and speech. Use of Mixture of Experts might be one path for this. Or fine-tuning on smaller models. But one could imagine that a large enterprise consultancy could train a model on its own business practices for example, teaching it both the content of business school as well as the professional speech and etiquette (not to mention presentation skills) of the consultant.

The following excerpts give examples of some common approaches to chain of thought reasoning and its variations in AI research. They highlight the importance of determining whether, and the extent to which, generative AI can think logically. Indeed, large language models generate responses to reasoning prompts that show they “contain” and can “articulate” clear incrementally-reasoned explanations for their own responses or “thoughts.”


ReAct: Synergizing Reasoning And Acting In Language Models

  • An approach called ReAct not only surfaces reasons and reasoning but uses them to fetch additional contextual content
  • The prompt causes the LLM to expose its own reasons. Reasons provide useful language and terms with which to retrieve more information
  • The result should be more complete and fulsome responses, and an agent “primed” to engage in further conversation on the topic

“A unique feature of human intelligence is the ability to seamlessly combine task-oriented actions with verbal reasoning (or inner speech, Alderson-Day & Fernyhough, 2015), which has been theorized to play an important role in human cognition for enabling self-regulation or strategization (Vygotsky, 1987; Luria, 1965; Fernyhough, 2010) and maintaining a working memory (Baddeley, 1992). Consider the example of cooking up a dish in the kitchen. Between any two specific actions, we may reason in language in order to track progress (“now that everything is cut, I should heat up the pot of water”), to handle exceptions or adjust the plan according to the situation (“I don’t have salt, so let me use soy sauce and pepper instead”), and to realize when external information is needed (“how do I prepare dough? Let me search on the Internet”). We may also act (open a cookbook to read the recipe, open the fridge, check ingredients) to support the reasoning and to answer questions (“What dish can I make right now?”). This tight synergy between “acting” and “reasoning” allows humans to learn new tasks quickly and perform robust decision making or reasoning, even under previously unseen circumstances or facing information uncertainties.

“ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interact with the external environments (e.g. Wikipedia) to incorporate additional information into reasoning (act to reason).”
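
The interleaving of reasoning and acting described in the quote can be sketched as a small loop. This is a minimal illustration and not the paper's implementation: the `Thought:` / `Action: name[argument]` text format, the `react_loop` function, the `scripted_model` stand-in for an LLM, and the toy `PANTRY` lookup tool are all assumptions made for the example, which borrows the cooking scenario from the quote.

```python
from typing import Callable, Dict, List, Optional


def react_loop(question: str,
               model: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> Optional[str]:
    """Alternate model steps (thought + action) with tool observations."""
    transcript: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        step = model("\n".join(transcript))
        transcript.append(step)
        # Assumed format: the last line of each step is "Action: name[argument]".
        action = step.strip().splitlines()[-1]
        if not action.startswith("Action: "):
            return None  # the model broke the assumed format
        name, _sep, arg = action[len("Action: "):].partition("[")
        arg = arg.rstrip("]")
        if name == "finish":
            return arg  # the model has decided it can answer
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown tool: {name}"
        # The observation becomes context for the next round of reasoning.
        transcript.append(f"Observation: {observation}")
    return None


def scripted_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: one lookup, then an answer."""
    if "Observation:" not in prompt:
        return "Thought: I should check whether there is salt.\nAction: lookup[salt]"
    return "Thought: No salt, so I will substitute.\nAction: finish[soy sauce]"


# Toy "external environment" standing in for a real information source.
PANTRY = {"salt": "No salt in the pantry. Soy sauce is available."}
tools = {"lookup": lambda item: PANTRY.get(item, "Nothing found.")}

answer = react_loop("What can I season the dish with?", scripted_model, tools)
print(answer)
```

The point of the sketch is the shape of the loop: each reason is written into the transcript, and each observation is fed back in, so the model's own words become the context for its next step.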

Metacognitive Prompting Improves Understanding in Large Language Models

  • Taking reasoning a step further, a large language model can be prompted to reflect on its reasons
  • It can evaluate its reasons using a method of reflection akin to meta cognition in humans
  • This isn’t meta cognition — the model doesn’t have meta cognition nor can it reflect on its thinking as a self aware subject
  • However the model can be prompted to chain its thoughts together and reflect on its own rationale

“While previous research primarily focuses on refining the logical progression of responses, the concept of metacognition — often defined as “thinking about thinking” — offers a unique perspective. Originating from the field of cognitive psychology (Schwarz 2015), metacognition pertains to an individual’s awareness and introspection of their cognitive processes. Informed by this insight, our proposed method, termed Metacognitive Prompting (MP), integrates key aspects of human metacognitive processes into LLMs. Figure 1 illustrates the parallels between human metacognitive stages and the operational steps of our method in LLMs. Rather than concentrating solely on the mechanics of “how” a response is produced, this method delves deeper into the rationale or “why” behind it. The method proceeds as follows: 1) the LLM interprets the provided text, a phase reminiscent of human comprehension; 2) the model then forms an initial judgment, mirroring the stage in which humans generate judgments based on information; 3) the LLM subjects its preliminary inference to critical evaluation, a step aligned with the self-reflection that humans engage in during cognitive processes; 4) after this introspective assessment, the model finalizes its decision and elucidates its reasoning, similar to human decision-making and rationalization; 5) finally, the LLM gauges its confidence in the outcomes, reflecting how humans evaluate the credibility of their judgments and explanations. This paradigm elevates the model’s function beyond simple systematic reasoning, compelling it to participate in introspective evaluations that determine the depth and relevance of its responses.”
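
The five stages listed in the quote can be laid out as a single prompt template. The sketch below is only an illustration: the stage wording is a paraphrase of the quoted description, not the prompt used in the paper, and `build_metacognitive_prompt` is a hypothetical helper name.

```python
# Stage instructions paraphrased from the five steps in the quote above.
# The exact wording is an assumption, not the paper's prompt text.
STAGES = [
    "Restate the text in your own words to show how you interpret it.",
    "Give an initial judgment.",
    "Critically evaluate that initial judgment.",
    "State your final decision and explain the reasoning behind it.",
    "Rate your confidence in the final decision.",
]


def build_metacognitive_prompt(text: str, question: str) -> str:
    """Assemble one prompt that walks the model through all five stages."""
    lines = [
        f"Text: {text}",
        f"Question: {question}",
        "",
        "Work through these stages in order:",
    ]
    lines += [f"{i}. {stage}" for i, stage in enumerate(STAGES, start=1)]
    return "\n".join(lines)


prompt = build_metacognitive_prompt(
    "The alien cannot understand why humans lie.",
    "Does the alien treat a message and its meaning as the same thing?",
)
print(prompt)
```

Nothing here gives the model metacognition; the template only forces it to generate more text about its earlier text, which is the distinction drawn in the bullets above.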

Self-consistency Improves Chain Of Thought Reasoning In Language Models

  • Chain of Thought reasoning is the most common case of reasoning prompts
  • However, logic is not failsafe and LLMs are imperfect
  • This method prompts the model to conduct several reasoning processes, and then optimize across their solutions

“In an effort to address this shortcoming, Wei et al. (2022) have proposed chain-of-thought prompting, where a language model is prompted to generate a series of short sentences that mimic the reasoning process a person might employ in solving a task. For example, given the question “If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?”, instead of directly responding with “5”, a language model would be prompted to respond with the entire chain-of-thought: “There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.”. It has been observed that chain-of-thought prompting significantly improves model performance across a variety of multi-step reasoning tasks (Wei et al., 2022).

“We first prompt the language model with chain-of-thought prompting, then instead of greedily decoding the optimal reasoning path, we propose a “sample-and-marginalize” decoding procedure: we first sample from the language model’s decoder to generate a diverse set of reasoning paths; each reasoning path might lead to a different final answer, so we determine the optimal answer by marginalizing out the sampled reasoning paths to find the most consistent answer in the final answer set. Such an approach is analogous to the human experience that if multiple different ways of thinking lead to the same answer, one has greater confidence that the final answer is correct.”
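
The “sample-and-marginalize” idea in the quote can be sketched in a few lines. This is a simplified illustration under stated assumptions: the sampled reasoning paths are hard-coded strings standing in for real model samples, each path is assumed to end with the phrase “The answer is N”, and a plain majority vote stands in for the marginalization step. The function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(path: str) -> Optional[str]:
    """Pull the final answer out of one reasoning path, if it has one."""
    match = re.search(r"The answer is\s+(-?\d+)", path)
    return match.group(1) if match else None


def most_consistent_answer(paths: List[str]) -> Optional[str]:
    """Return the answer that the largest number of paths agree on."""
    answers = [a for a in map(extract_answer, paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hard-coded stand-ins for several sampled chains of thought.
sampled_paths = [
    "There are 3 cars already. 2 more arrive. 3 + 2 = 5. The answer is 5.",
    "2 cars arrive to join 3 cars, so 2 + 3 = 5. The answer is 5.",
    "There are 3 cars and 2 arrive twice, so 3 + 2 + 1 = 6. The answer is 6.",
]

print(most_consistent_answer(sampled_paths))
```

The reasoning text itself is discarded once the answers are tallied; only agreement between the paths counts.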

LLM-Rec: Personalized Recommendation via Prompting Large Language Models

  • This prompt reasoning approach incorporates signals obtained from user activity
  • A recommender system, for example, can both reason out its recommendations and draw on past user history and behavior
  • It can also provide the user with recommendations and then reason from those

“We investigate various prompting strategies for enhancing personalized content recommendation performance with large language models (LLMs) through input augmentation. Our proposed approach, termed LLM-Rec, encompasses four distinct prompting strategies: (1) basic prompting, (2) recommendation-driven prompting, (3) engagement-guided prompting, and (4) recommendation-driven engagement-guided prompting. Our empirical experiments show that combining the original content description with the augmented input text generated by LLM using these prompting strategies leads to improved recommendation performance. This finding highlights the importance of incorporating diverse prompts and input augmentation techniques to enhance the recommendation capabilities with large language models for personalized content recommendation.”
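
One way to picture the four strategies is as two switches, one for recommendation-driven wording and one for engagement signals, with the fourth strategy turning both on. The sketch below rests on that assumption; the prompt wording, the `engaged_items` parameter, and both function names are hypothetical rather than taken from the paper.

```python
from typing import Optional, Sequence


def build_augmentation_prompt(
    description: str,
    recommendation_driven: bool = False,
    engaged_items: Optional[Sequence[str]] = None,
) -> str:
    """Build a prompt asking the model to produce extra text about an item.

    Both flags off is basic prompting; both on is the combined strategy.
    """
    if recommendation_driven:
        parts = ["Describe the item below so that it could be recommended to someone."]
    else:
        parts = ["Describe the item below."]
    parts.append(f"Item: {description}")
    if engaged_items:
        # Hypothetical engagement signal: other items the same users interacted with.
        parts.append("Users who engaged with it also engaged with: " + "; ".join(engaged_items))
    return "\n".join(parts)


def augment_input(description: str, generated_text: str) -> str:
    """Combine the original description with the model-generated text."""
    return f"{description} {generated_text}"


if __name__ == "__main__":
    print(build_augmentation_prompt(
        "A slow-burn science fiction novel about first contact.",
        recommendation_driven=True,
        engaged_items=["a hard science fiction trilogy", "a documentary on radio astronomy"],
    ))
```

The combined text from `augment_input` is what a downstream recommender would receive in place of the bare description.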

Guiding Large Language Models via Directional Stimulus Prompting

  • If chain of thought reasoning is one method with which to optimize responses, another is to include clues and hints in the prompt
  • In giving the model hints as to how to answer a query, and where to look for additional contextual information, this method is in a sense decomposing the query
  • It is isolating and highlighting key terms to help the model retrieve information and formulate responses
  • In some ways this might be done with social cues — one might imagine hints to the model as to the social context in which a query is made
Example of Directional Stimulus Prompting

“To address the challenge, we propose a novel framework called Directional Stimulus Prompting (DSP). This framework introduces a new component called the “directional stimulus” into the prompt to provide nuanced, instance-specific guidance and control over LLMs. Specifically, the directional stimulus prompt acts as “hints” and “clues” for the input query to guide LLMs toward the desired output. Notably, this differs from the methods that augment LLMs with additional knowledge retrieved from external sources, as the directional stimulus prompt is generated solely based on the input query in our framework. Figure 1 compares our proposed prompting approach, DSP, with standard prompting for the summarization task. Our approach incorporates keywords in the prompt as the directional stimulus prompt to hint at key points the desired summary should cover. By providing this instance-specific guidance through directional stimulus prompt, LLMs can generate outputs that more closely align with the desired reference summary.”


“To this end, our Directional Stimulus Prompting (DSP) approach introduces a small piece of discrete tokens z named “directional stimulus” into the prompt, which acts as hints and clues to provide LLMs with fine-grained guidance toward the desired direction.”
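
For the summarization case described in the quotes, the directional stimulus amounts to a line of keywords placed in the prompt. Here is a minimal sketch of that prompt assembly only. How the keywords are produced is left out, and the prompt wording and the function name `build_dsp_prompt` are hypothetical.

```python
from typing import Sequence


def build_dsp_prompt(article: str, keywords: Sequence[str]) -> str:
    """Insert instance-specific keyword hints between the input and the instruction."""
    hint = "; ".join(keywords)
    return (
        f"Article: {article}\n"
        f"Keywords: {hint}\n"
        "Summarize the article in 2-3 sentences, covering the keywords above."
    )


def build_standard_prompt(article: str) -> str:
    """The same request without hints, for comparison."""
    return f"Article: {article}\nSummarize the article in 2-3 sentences."


if __name__ == "__main__":
    text = "The city council voted on Tuesday to extend the tram line to the harbor."
    print(build_standard_prompt(text))
    print()
    print(build_dsp_prompt(text, ["city council", "tram line", "harbor"]))
```

The only difference between the two prompts is the `Keywords:` line, which is where the hints and clues live.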

Least-to-most Prompting Enables Complex Reasoning In Large Language Models

  • Some user queries will be complex, and models can be prompted to decompose them into constituent parts
  • Answering the user query is more likely to be successful if the model first answers each partial question
  • One can imagine many queries that can be decomposed in this way
  • A conversational use case might involve user participation, engaging the user in helping the model stay on topic or task as it generates its response
  • One can also imagine many business and professional use cases for this kind of prompting, in which a certain kind of critical or interrogative style is used (consulting, legal, financial, etc.)

“Least-to-most prompting teaches language models how to solve a complex problem by decomposing it to a series of simpler subproblems. It consists of two sequential stages:

1. Decomposition. The prompt in this stage contains constant examples that demonstrate the decomposition, followed by the specific question to be decomposed.

2. Subproblem solving. The prompt in this stage consists of three parts: (1) constant examples demonstrating how subproblems are solved; (2) a potentially empty list of previously answered subquestions and generated solutions; and (3) the question to be answered next.

In the example shown in Figure 1, the language model is first asked to decompose the original problem into subproblems. The prompt that is passed to the model consists of examples that illustrate how to decompose complex problems (which are not shown in the figure), followed by the specific problem to be decomposed (as shown in the figure). The language model figures out that the original problem can be solved via solving an intermediate problem “How long does each trip take?”.”
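
The two stages can be sketched as a short loop in which each answered subquestion is fed back into the next prompt. This is a minimal sketch under assumptions: `decompose` and `solve` are hypothetical callables standing in for the two model calls, and the Q/A prompt layout is my own.

```python
from typing import Callable, List, Tuple


def build_solving_prompt(examples: str, answered: List[Tuple[str, str]], next_question: str) -> str:
    """Stage 2 prompt: constant examples, earlier Q/A pairs, then the next question."""
    parts = [examples] if examples else []
    parts += [f"Q: {q}\nA: {a}" for q, a in answered]
    parts.append(f"Q: {next_question}\nA:")
    return "\n\n".join(parts)


def least_to_most(
    question: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str], str],
    examples: str = "",
) -> List[Tuple[str, str]]:
    """Decompose the question, then answer the subquestions in order."""
    answered: List[Tuple[str, str]] = []
    for subquestion in decompose(question):  # stage 1: decomposition
        prompt = build_solving_prompt(examples, answered, subquestion)
        answered.append((subquestion, solve(prompt)))  # stage 2: subproblem solving
    return answered


if __name__ == "__main__":
    # Stubs in place of real model calls; the canned answers are illustrative only.
    canned = iter(["5 minutes", "3 trips"])
    question = "How many trips fit in 15 minutes?"
    steps = least_to_most(
        question,
        decompose=lambda q: ["How long does each trip take?", q],
        solve=lambda prompt: next(canned),
    )
    for q, a in steps:
        print(q, "->", a)
```

Because the list of answered subquestions grows with each pass, the prompt for the final question already contains the answer to the intermediate one.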

Diagram of Tree of Thought

Large Language Model Guided Tree-of-Thought

  • Tree of thought prompting is an example of algorithmic reasoning, and prompts the model to follow a number of paths in its reasoning
  • This method might work for problems that could be looked at from different angles
  • The problem in this case doesn’t have just one solution, or can be solved with different types of thinking

“Fields Medal winner Terence Tao once shared his experiences solving hard math problems: “When I was a kid, I had a romanticized notion of mathematics, that hard problems were solved in Eureka moments of inspiration… With me, it’s always, Let’s try this. That gets me part of the way, or that doesn’t work. Now let’s try this. Oh, there’s a little shortcut here… You work on it long enough and you happen to make progress towards a hard problem by a back door at some point. At the end, it’s usually, oh, I’ve solved the problem.” The problem solving process as he described is a tree-like thinking process, rather than a linear chain-of-thought.

Inspired by these two shortcomings of auto-regressive LLMs, we propose a novel framework which augments an LLM with several additional modules including an automatic “prompter agent”. This framework employs a solution search strategy we call the Tree-of-Thought (ToT). This strategy solves a problem through a multi-round conversation between the LLM and the prompter agent. Figure 1a provides a visual description of the ToT search strategy, in which the LLM plays a crucial role in guiding the search for solutions.”
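
The "try this, that doesn't work, now try this" pattern is a search that backtracks out of dead ends. Below is a minimal sketch of that search on a toy problem, not the paper's framework: in a real system the `propose` step would be a model call, and every name here is hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

State = Tuple[int, ...]


def tree_search(
    state: State,
    propose: Callable[[State], Iterable[State]],
    is_solution: Callable[[State], bool],
    is_dead_end: Callable[[State], bool],
    depth: int = 0,
    max_depth: int = 5,
) -> Optional[State]:
    """Depth-first search over partial solutions, backing up when a branch fails."""
    if is_solution(state):
        return state
    if depth == max_depth or is_dead_end(state):
        return None
    for next_state in propose(state):
        found = tree_search(next_state, propose, is_solution, is_dead_end, depth + 1, max_depth)
        if found is not None:
            return found
    return None  # every branch from this state failed, so backtrack


# Toy problem: pick numbers from (5, 3, 2), with repeats allowed, that sum to 7.
def propose(state: State) -> Iterable[State]:
    return [state + (n,) for n in (5, 3, 2)]


def is_solution(state: State) -> bool:
    return sum(state) == 7


def is_dead_end(state: State) -> bool:
    return sum(state) > 7


if __name__ == "__main__":
    print(tree_search((), propose, is_solution, is_dead_end))  # prints (5, 2)
```

The search first tries 5 + 5 and 5 + 3, abandons both as overshooting, and then finds 5 + 2. A linear chain would have been stuck with its first attempt.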

Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

  • Preparing chain of thought prompts can be time-consuming and costly
  • Human participation is normally required in designing and writing the examples to be used in the reasoning chains
  • However, models themselves can be used to generate the reasoning chains
  • The models can then also be used to prune those chains and select the most helpful ones

“Especially, with the recently proposed chain-of-thought (CoT) prompting (Wei et al., 2022b), LLMs are capable of solving reasoning tasks including arithmetic reasoning, commonsense reasoning, and symbolic reasoning. The basic idea of CoT prompting is adding a few rationale chains to the answer as exemplars to illustrate the intermediate reasoning steps. Following CoT, several recent studies improve it by leveraging self-consistency (Wang et al., 2023), explanation learning (Lampinen et al., 2022), complexity-based prompting (Fu et al., 2023), self-training (Huang et al., 2022), voting verifier (Li et al., 2022a), and bootstrapping (Zelikman et al., 2022).

However, most of them are constrained to a few fixed human-written exemplars, which require significant human efforts to create and adapt to new datasets. The annotation process is nontrivial because humans need to not only select the questions but also carefully design the reasoning steps for each question. In the process of searching for the perfect exemplars, we identify four critical factors that affect the performance of chain-of-thought prompting and require large human effort to deal with: (1) order sensitivity: the order combination of the exemplars; (2) complexity: the number of reasoning steps of the rationale chains; (3) diversity: the combination of different complex-level exemplars; (4) style sensitivity: the writing/linguistic style of the rationale chains. Detailed analysis of the four factors is covered in Section 2. All of these sensitivities make human-based prompt engineering costly and motivate us to find an automatic and task-agnostic way to adapt chain-of-thought exemplars to any downstream tasks.

In this paper, we solve the problem by a CoT augmentation and selection process to find suitable exemplars automatically. This can be divided into three steps: (1) Augment: The language model generates multiple pseudo-chains for query questions automatically. (2) Prune: Based on an assumption: Generating correct reasoning is a necessary condition for generating correct answers. This assumption is natural because the answer is generated after several reasoning steps. When a correct answer is generated, the rationale chain of these steps is most likely correct, contributing to the final correctness. As a result, We prune the pseudo-chains according to the consistency between generated and ground-truth answers to reduce the noise. (3) Select: Given that all the data have been annotated with rationale paths, we propose to apply a variance-reduced policy gradient strategy (Williams, 1992; Dong et al., 2020; Zhou et al., 2021; Diao et al., 2022) to estimate the gradients and optimize the selection process to find the most helpful chain-of-thought for each task.”
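
The pruning step in the quote is simple enough to sketch: keep a generated chain only if its final answer matches the labeled answer. This is a minimal sketch with a hypothetical data layout (dicts with `question`, `rationale`, and `answer` keys). The selection step, which the paper handles with a policy gradient method, is not shown.

```python
from typing import Dict, List

Chain = Dict[str, str]


def prune(pseudo_chains: List[Chain], ground_truth: Dict[str, str]) -> List[Chain]:
    """Keep only chains whose answer matches the labeled answer for their question."""
    return [
        chain
        for chain in pseudo_chains
        if ground_truth.get(chain["question"]) == chain["answer"]
    ]


# Hypothetical model-generated chains for one labeled question.
pseudo_chains = [
    {"question": "3 cars, 2 more arrive. How many?", "rationale": "3 + 2 = 5.", "answer": "5"},
    {"question": "3 cars, 2 more arrive. How many?", "rationale": "3 * 2 = 6.", "answer": "6"},
]
ground_truth = {"3 cars, 2 more arrive. How many?": "5"}

if __name__ == "__main__":
    for chain in prune(pseudo_chains, ground_truth):
        print(chain["rationale"], "->", chain["answer"])
```

The filter leans on the assumption stated in the quote, that a correct answer usually signals correct reasoning. A chain that reaches the right answer for the wrong reasons would still pass.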



Adrian Chan

CX, UX, and Social Interaction Design (SxD). Ex Deloitte Digital. San Francisco. Cycling, Photography, Guitar, Philosophy. Stanford ’88, IR.