Think before you speak! — The inherent statelessness of Large Language Models

Leif Sabellek
CONTACT Research
Nov 2, 2023
Two chatbots having a debate. Image generated using ideogram.ai.

Advanced large language models (LLMs) such as GPT-3 have revolutionized natural language processing. However, there are some key differences between the text generation processes of humans and LLMs. One of these differences is that LLMs lack an internal state. In this blog post, we’ll look at what statelessness means, at a small experiment that demonstrates ChatGPT’s statelessness, and at approaches that can be layered on top of LLMs to make them behave more human-like.

I love mathematical puzzles, so let’s use one as an example and see how humans and LLMs react to this puzzle:

Find a mathematical expression that uses the digits 3, 4, 5 and 6 each exactly once, and that evaluates to 78.

If my friend told me this puzzle, I’d probably not start talking immediately, but instead I would start thinking:

Ok, let’s try something: 3*4, that is 12. If I multiply by 6, I might get close! 12 times 6 is 72. What is left? Ah, a 5. So I can add 5, but that is 77, not 78. Ok, let’s try again! Maybe 3+4 first? That is 7. If I multiply by 11, I can get close again. In fact, (3+4)*(5+6) is 77. Off by one again! Ok, different strategy! How can I factor 78? It is even, it is divisible by three. So 78 divided by 6, that is… 13! Ah, I have a 6 available! Can I make 13 using the other three digits? 3+4+5 is 12. I can’t see how to get 13. But what if I just divide by 3? 78 divided by 3 is 26, so can I make 26 using the digits 4, 5, 6? Ah, yes! 5*6-4. So the solution should be ((5*6)-4)*3.

Then, finally, I’d start talking: “I got it! ((5*6)-4)*3”.
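
For the skeptical reader, a few lines of Python can brute-force the puzzle and confirm that ((5*6)-4)*3 really evaluates to 78. A small sketch (it only tries the five possible ways to parenthesize four operands, which is enough here):

```python
from itertools import permutations, product

# Brute force over digit orders, the four basic operators, and the five
# possible ways to parenthesize four operands. "eval" is safe here because
# we only evaluate strings we constructed ourselves.
DIGITS = (3, 4, 5, 6)
OPS = ["+", "-", "*", "/"]
TARGET = 78

solutions = set()
for a, b, c, d in permutations(DIGITS):
    for o1, o2, o3 in product(OPS, repeat=3):
        for expr in (
            f"(({a}{o1}{b}){o2}{c}){o3}{d}",
            f"({a}{o1}{b}){o2}({c}{o3}{d})",
            f"({a}{o1}({b}{o2}{c})){o3}{d}",
            f"{a}{o1}(({b}{o2}{c}){o3}{d})",
            f"{a}{o1}({b}{o2}({c}{o3}{d}))",
        ):
            try:
                if abs(eval(expr) - TARGET) < 1e-9:
                    solutions.add(expr)
            except ZeroDivisionError:
                pass

print(solutions)  # contains ((5*6)-4)*3 among other equivalent forms
```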

But now, not only have I found the answer, but my internal state of mind has also changed. The interim calculations and unsuccessful attempts are still on my mind. If my friend asked me how to get 77 instead of 78, I could immediately give a correct answer. There are thoughts on my mind which I did not verbalize, but which clearly influence my subsequent answers. Whenever I speak of “internal state” in this article, I mean exactly this: a store of information that is not shown to the outside and that is preserved and/or updated throughout a conversation.

LLMs work very differently. They don’t have this kind of internal state, they don’t have a thought process where they ponder different answers, and they can’t recall things they thought about earlier but did not verbalize at that time. They just generate one token at a time, with each token based only on the conversation up to that point.

Let’s see how ChatGPT responds to this puzzle.

At first sight, the answer looks plausible: we asked for an expression containing certain digits, and it gave us such an expression. But if we look closer, we notice that the answer is wrong. That is not surprising, because ChatGPT cannot stop and think deeply before producing its output. Each token is generated by the same algorithm, using the same amount of computational resources: one forward pass through a neural network. Apparently, these resources are not sufficient to solve this mathematical puzzle. One could argue that the quality of this answer is similar to my own inner monologue above, since I also started by producing random expressions that led to incorrect solutions. But while I can choose between thinking and speaking, ChatGPT is limited to a single option: speaking. It does not hide any of the produced tokens in an internal state but shows us the pure, unfiltered output.
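
To make the “one forward pass per token” point concrete, here is a minimal sketch of greedy autoregressive decoding. It assumes the Hugging Face transformers library and uses GPT-2 as a small stand-in model; ChatGPT’s decoding is more elaborate, but the loop has the same structure: the only thing carried from one step to the next is the visible token sequence itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a small, freely available stand-in; the principle is the same for larger models.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Find an expression using 3, 4, 5 and 6 that evaluates to 78."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):                        # generate 30 tokens
        logits = model(input_ids).logits       # exactly one forward pass per new token
        next_id = logits[0, -1].argmax()       # greedy choice: the most likely next token
        # The only "memory" is the growing token sequence itself -- nothing
        # else is carried over from one step to the next.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```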

The Guessing Game Experiment

I came up with a little experiment that nicely demonstrates ChatGPT’s lack of internal state. I prompted ChatGPT as follows:

Me: Let’s play a guessing game. You choose a random number between 1 and 7 (inclusive) and don’t tell me the number. I have to guess the number. Each time I make a guess, you tell me whether my guess was correct, whether your number is higher than my guess or lower than my guess. So your answer will only be “Correct” or “Lower” or “Higher”. Do you understand the game?

ChatGPT: Yes, I understand the game! It’s a classic guessing game where I’ll choose a random number between 1 and 7, and you have to guess the number. Each time you make a guess, I’ll provide feedback on whether your guess is correct, higher, or lower than the number I chose. Let’s get started! Go ahead and make your first guess.

Me: 4

ChatGPT: Your guess is “4.” Unfortunately, that’s not the number I chose. The number I have in mind is “Higher” than 4. Keep guessing!

Me: 6

ChatGPT: Your guess is “6.” That’s correct! The number I chose is indeed 6. Well done! Would you like to play another round, or is there anything else I can assist you with?

Here, ChatGPT lied to me. It did not choose the number 6 up front. It just generated answers to my prompts, and when I guessed 6, it decided that it was now time to say my guess was correct. To “prove” this, I played the game not just once, but many times. My first prompt (where I explain the rules and tell it to choose a number) was always the same, but after that, I used three different strategies to guess the number:

  1. I used binary search, starting with 4 and then cutting the number of possibilities in half with each guess.
  2. I guessed the numbers 1 to 7 in ascending order.
  3. I guessed the numbers 7 to 1 in descending order.
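
Here is a rough sketch of how such an experiment can be automated, assuming the openai Python client and gpt-3.5-turbo as a stand-in model; the prompts are shortened versions of the ones shown above.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()          # assumes an OPENAI_API_KEY in the environment
MODEL = "gpt-3.5-turbo"    # placeholder model

RULES = (
    "Let's play a guessing game. You choose a random number between 1 and 7 "
    "(inclusive) and don't tell me the number. Each time I make a guess, answer "
    "only with 'Correct', 'Lower' or 'Higher'."
)

def chat(messages):
    response = client.chat.completions.create(model=MODEL, messages=messages)
    return response.choices[0].message.content

def play(guesses):
    """Play one game with a fixed guessing order and return the number that was 'correct'."""
    messages = [{"role": "user", "content": RULES}]
    messages.append({"role": "assistant", "content": chat(messages)})  # acknowledgement of the rules
    for guess in guesses:
        messages.append({"role": "user", "content": str(guess)})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.strip().lower().startswith("correct"):
            return guess
    return None

# Ascending and descending strategies; binary search would additionally
# have to react to the 'Higher'/'Lower' replies.
print(Counter(play(range(1, 8)) for _ in range(30)))
print(Counter(play(range(7, 0, -1)) for _ in range(30)))
```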

The following chart shows the distribution of the numbers ChatGPT had “chosen”.

The distribution is clearly different for each strategy. When my first guess was 4, it was already correct in most of the games, but when my first guess was 1 or 7, 4 was never the correct answer. Instead, when I guessed in descending order, the correct number was always quite high, and when I guessed in ascending order, it was always quite low. ChatGPT also gave some inconsistent answers (like “higher than 1 but lower than 2”), but this only happened when I guessed in ascending order.

But what does this mean? It means that ChatGPT did not really choose a number up front and store it in some kind of internal state. If it really chose a number after my first prompt (where I only explain the rules), the number would be drawn from a fixed probability distribution. This does not necessarily need to be a uniform distribution where every number appears with the same probability; maybe ChatGPT likes some numbers more than others. But still, the correct number would always come from that one distribution, and we should see it emerge in the chart regardless of the order in which I guess the numbers. Instead, we see three totally different distributions, depending on the guessing strategy. Using simple statistical methods, we can safely reject the hypothesis that the random number was chosen up front according to a single distribution.
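
For example, a chi-squared test of homogeneity asks whether the counts observed under the three strategies could plausibly come from one common distribution. A sketch assuming scipy, with placeholder counts instead of the actual tallies from my runs:

```python
from scipy.stats import chi2_contingency

# Rows: guessing strategy, columns: how often each number 1..7 was "chosen".
# These counts are placeholders for illustration, not the real experiment data.
observed = [
    [0, 1, 2, 18, 5, 3, 1],   # binary search (first guess 4)
    [9, 8, 6, 4, 2, 1, 0],    # ascending order
    [0, 1, 2, 3, 6, 8, 10],   # descending order
]

chi2, p_value, dof, _ = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.2g}")
# A very small p-value means the three strategies do not draw the "chosen"
# number from one and the same distribution.
```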

Furthermore, the number of guesses I needed was always significantly lower than expected: for a number chosen uniformly up front, the ascending and descending strategies should need 4 guesses on average, and binary search about 2.4. Very often, I “found” the number on the first or second guess. One might suspect that ChatGPT simply tries to please the user with its answers.

Think before you speak!

We have seen that LLMs have no built-in way of thinking deeply before they generate output. This can lead to confusing situations for users, where the quality of the output varies depending on the nature of the prompt. Extensive research is being conducted on strategies that improve the quality of responses. These strategies usually introduce an intermediate layer between the user and the LLM, where prompts are preprocessed or the generated answers are postprocessed. Examples:

1. Inner monologue: The LLM is instructed to generate a two-part answer, which consists of an inner monologue and a final answer, for example by appending the following instruction to each user prompt: “Please start your answer with an inner monologue where you collect all relevant information and think carefully about the correct answer to the question. Finally, write ‘FINAL ANSWER:’ and announce the result of your thought process.” One can then post-process the output to extract only the final answer, hiding the inner monologue from the user. I have implemented this technique to play a game with GPT where the user has to guess a random country. You can find the source code here.
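
A minimal sketch of this pattern (not the implementation linked above), assuming the openai Python client and gpt-3.5-turbo as a placeholder model:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # placeholder model

MONOLOGUE_INSTRUCTION = (
    "Please start your answer with an inner monologue where you collect all "
    "relevant information and think carefully about the correct answer to the "
    "question. Finally, write 'FINAL ANSWER:' and announce the result of your "
    "thought process."
)

def ask_with_inner_monologue(question: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{question}\n\n{MONOLOGUE_INSTRUCTION}"}],
    )
    full_answer = response.choices[0].message.content
    # Post-processing: keep only what follows the marker; the monologue stays hidden.
    _, marker, final = full_answer.partition("FINAL ANSWER:")
    return final.strip() if marker else full_answer

print(ask_with_inner_monologue(
    "Find a mathematical expression that uses the digits 3, 4, 5 and 6 "
    "each exactly once and evaluates to 78."
))
```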

2. Chain-of-thought prompting: The LLM is instructed not to jump immediately to the answer, but to think step by step and write down each step before arriving at its final answer. For example, consider this prompt: “Please create a table of all German Chancellors and their age when they passed away.” The answer requires several steps: determining each person’s birth and death dates and calculating the difference. Going immediately from a person’s name to their age at death is complex, because it requires some computation and internal state. But we can adjust the prompt as follows: “Please create a table of all German Chancellors, their birth dates, their death dates, and their age when they passed away.” By introducing the two dates as intermediate information, we remove the need for an internal state, and each single step of producing the output becomes easier for the LLM. I tested these two prompts in ChatGPT, and indeed, the second one led to the correct answer while the first one didn’t.

3. Tree-of-thought prompting: Similar to chain-of-thought, but the LLM is used to construct multiple answers in a tree-like structure. It evaluates the different answers to see which ones are the most promising. Branches of the tree that correspond to bad answers are dismissed, and only promising answers are developed further. Finally, the best branch is chosen as the answer. Research has shown that this technique yielded a substantial improvement over the chain-of-thought approach, at least for certain types of questions [1].
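
A heavily simplified sketch of the idea, not the algorithm from [1]: propose a few next reasoning steps per branch, let the model rate each branch, and keep only the most promising ones. Model name and prompts are placeholders.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # placeholder model

def complete(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def tree_of_thought(question: str, branching: int = 3, depth: int = 2, beam: int = 2) -> str:
    branches = [""]  # partial chains of reasoning, initially empty
    for _ in range(depth):
        # Expand: propose several possible next reasoning steps per branch.
        expanded = [
            b + "\n" + complete(
                f"{question}\n\nReasoning so far:{b}\nPropose one next reasoning step."
            )
            for b in branches
            for _ in range(branching)
        ]
        # Evaluate: let the model rate each branch and keep only the best ones.
        scored = []
        for b in expanded:
            rating = complete(
                f"{question}\n\nReasoning:{b}\n"
                "On a scale from 1 to 10, how promising is this reasoning? "
                "Answer with a number only."
            )
            try:
                score = float(rating.strip())
            except ValueError:
                score = 0.0
            scored.append((score, b))
        branches = [b for _, b in sorted(scored, key=lambda s: s[0], reverse=True)[:beam]]
    # Conclude from the most promising surviving branch.
    return complete(f"{question}\n\nReasoning:{branches[0]}\nNow give the final answer.")

print(tree_of_thought(
    "Find an expression that uses 3, 4, 5 and 6 each exactly once and equals 78."
))
```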

4. Multiagent debate: Several different LLMs argue with each other to find the best answer to the prompt. Research has shown that this technique resulted in much better performance for certain prompts, and it also improved factual validity, which is still a major issue with current LLMs [2].
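
Again a rough sketch of the pattern, not the setup from [2]: two agents answer independently and then revise their answers after reading each other’s.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # placeholder model

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def debate(question: str, rounds: int = 2) -> list[str]:
    # Two agents answer independently, then repeatedly revise their answers
    # after reading the other agent's latest answer.
    answers = [ask(question), ask(question)]
    for _ in range(rounds):
        answers = [
            ask(
                f"{question}\n\nYour previous answer:\n{answers[i]}\n\n"
                f"Another assistant answered:\n{answers[1 - i]}\n\n"
                "Point out any mistakes and give an improved final answer."
            )
            for i in range(2)
        ]
    return answers

print(debate("Find an expression that uses 3, 4, 5 and 6 each exactly once and equals 78."))
```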

Conclusion

LLMs only generate text without a genuine thought process. The experiment involving the guessing game vividly showcases the limitations that arise from this. Enterprises seeking to utilize LLM technology should be aware of this. Letting users interact directly with an LLM might yield confusing or even defective answers. Thus, enterprises must carefully consider the quality and consistency of responses, the formulation of questions, and the potential need for human verification and post-processing. Specialized techniques and collaboration among multiple models may be necessary to harness the full potential of LLMs in real-world business scenarios. While a modern LLM may not possess a state of mind, we still have the power to shape its inputs and outputs by prompt engineering and postprocessing answers, creating the illusion of an intelligent agent who thinks before it speaks.

References

[1] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan: Tree of Thoughts: Deliberate Problem Solving with Large Language Models. CoRR abs/2305.10601 (2023)

[2] Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, Igor Mordatch: Improving Factuality and Reasoning in Language Models through Multiagent Debate. CoRR abs/2305.14325 (2023)

About CONTACT Research. CONTACT Research is a dynamic research group dedicated to collaborating with innovative minds from the fields of science and industry. Our primary mission is to develop cutting-edge solutions for the engineering and manufacturing challenges of the future. We undertake projects that encompass applied research, as well as technology and method innovation. An independent corporate unit within the CONTACT Software Group, we foster an environment where innovation thrives.


Leif Sabellek
CONTACT Research

Mathematician & Computer Scientist, researching Artificial Intelligence at CONTACT Software.