Are Auto-Regressive Large Language Models Here to Stay?

Nick Bettencourt
9 min read · Dec 8, 2023


Sam Altman, CEO of OpenAI, the company behind groundbreaking auto-regressive LLMs like ChatGPT

If you’ve worked in any enterprise setting where you’re leveraging a large language model (LLM) API (OpenAI, Claude, Llama, etc.) as Software as a Service (SaaS), one of the key concerns that invariably arises is the validity of your outputs. LLMs are prone to “hallucinate”, meaning they produce fabricated information or inaccurate responses that are not in line with the ground truth. If your value proposition hinges on the precision and reliability of these generated insights, this creates a cascading credibility problem. This “hallucination” problem, setting aside factors like prompt contradiction or input context, is largely a result of the underlying model itself.

Current LLMs operate under the paradigm of what’s known as an “auto-regressive” model. This approach generates text by sequentially predicting each subsequent token (tokens break text into discrete elements; “The dog ran!” might be split into something like [“the”, “dog”, “ran”, “!”], though tokenizers vary) based on the tokens that precede it.

Yann LeCun, 03/24/2023, “Do Large Language Models Need Sensory Grounding for Meaning and Understanding?” (Slide 7)
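
To make this auto-regressive loop concrete, here is a minimal sketch of next-token generation. The `next_token_probs` function is a placeholder for a real model’s forward pass, not any particular library’s API.

```python
import random

def next_token_probs(context):
    # Stand-in for a trained model's forward pass: given the tokens so far,
    # return a probability distribution over the vocabulary.
    vocab = ["the", "dog", "ran", "!", "<eos>"]
    return {tok: 1.0 / len(vocab) for tok in vocab}  # uniform placeholder

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        # Each new token is sampled conditioned only on the tokens produced so far.
        next_tok = random.choices(list(probs), weights=list(probs.values()))[0]
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)
    return tokens

print(generate(["the", "dog"]))
```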

This, of course, has its limitations. Any incorrectly produced token can push the output outside the set of correct answers. To illustrate, take e to be the error probability of a produced token. Even with a low error rate, the probability of an output staying within the bounds of correct answers shrinks as the length of the sequence increases: P(correct) = (1-e)ⁿ, where n is the length of the sequence.

While this current iteration is sufficient for most everyday use cases, such as a writing aid or retrieving basic factual information, will future auto-regressive LLMs provide a path to Artificial General Intelligence (AGI)? This question has gained traction recently for a number of reasons. As LLM use has become more ubiquitous, their limitations have become more evident. They cannot plan or reason, nor do they have any meaningful chain of thought. As Fast.ai’s Jeremy Howard puts it, “the current LLMs are not a path to AGI. They’re getting more and more expensive, they’re getting more and more slow, and the more we use them, the more we realize their limitations”¹. This sentiment is becoming increasingly commonplace amongst AI’s top players.

The topic piqued my interest because of a talk given by Yann LeCun (Chief AI Scientist at Meta and a key figure in the development of Convolutional Neural Networks) entitled “Do Language Models Need Sensory Grounding for Meaning and Understanding?”², in which he argues that “Auto-Regressive LLMs are doomed”. Beyond the reasons already given, he argues that they cannot be made “factual” or “non-toxic” and are not controllable. This, of course, is not how humans or animals develop their world model. As LeCun explains, in developing their world model, they can “predict the consequences of their actions”, “perform chains of reasoning with an unlimited number of steps”, and “plan complex tasks by decomposing it into sequences of subtasks”. However, while LLMs build their world model around text, humans and animals use sensory data (i.e. video) to obtain a more perceptive understanding, leading to a more holistic approach to cognition that current LLMs simply cannot replicate³. Could a new modality of training close the gap?
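
Returning to the per-token error model above, a quick numerical sketch shows how fast (1-e)ⁿ decays, under the simplifying assumption that per-token errors are independent:

```python
# Probability that an n-token output never leaves the set of correct answers,
# assuming an independent per-token error probability e (a simplifying assumption).
def p_correct(e, n):
    return (1 - e) ** n

for n in [10, 100, 500, 1000]:
    print(f"n={n:4d}  P(correct) = {p_correct(0.01, n):.3f}")

# Even at e = 1%, a 100-token answer is fully error-free only ~37% of the time,
# and a 500-token answer less than 1% of the time.
```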

Current LLMs are trained on text that would take a human 20,000 years to read. Yet they still fail to infer that if A is the same as B, then B is the same as A³. Take, for example, the recently viral “reversal curse”. Here, we pose the model two questions: “Who is Tom Cruise’s mother?” (in the knowledge base) and “Who is Mary Lee Pfeiffer South’s son?” (not likely in the knowledge base). Not only does the LLM fail to extrapolate beyond its training data, it hallucinates an answer that is completely incorrect.

It’s been noted that the developers of these models are running out of quality text data to train on, which is especially significant if your model cannot reason well. In the near term, synthetic data may be able to fill the gap, but it will likely reach its limit before any reasonable AGI benchmarks are hit. To further illustrate why the current paradigm of LLMs will likely fall short of AGI, some perspective: the amount of visual data seen by a 2-year-old is larger than the amount of data used to train an LLM, roughly 6E14 bytes vs. typically 2E13 bytes³, nearly 2,900% greater!

While text data may provide more direct learned examples ranging in depth and diversity, it mostly fails to capture what’s called “System 2 Thinking”. Auto-regressive LLMs are very adept at “System 1 Thinking”, a mode of thinking described as “intuitive” and “instantaneous”, driven more by instinct than by deliberate reasoning. For example, when we think of 2 + 2, little to no computational effort is made in our heads to get the answer. It’s instinctive and automatic, likely “cached” in our neocortex somewhere. The same can be said of current LLMs. Their inherent modality caters well to “System 1 Thinking”, as next-token prediction mirrors our intuitive thought process, rapidly producing answers that seem logical based on the training they have received, without deep analytical reasoning.

“System 2 Thinking”, on the other hand, requires deliberate reasoning to formulate an answer. Compute 21 x 29 in your head. Was the answer instinctive? Automatic? Instantaneous? I’m guessing not. Though getting the answer may have taken only a few seconds more than 2 + 2, you engaged a part of your brain (the prefrontal cortex) that is slower but prioritizes logical, conscious decision making. This shift from intuitive to analytical thinking in human cognition highlights a crucial limitation in current LLMs: the lack of an inherent reasoning chain. As evident in the charts below, current LLMs’ performance on chain-of-reasoning problems, such as multiplication, falls precipitously as they become more complex.
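
As a rough, hedged illustration of the kind of evaluation behind such charts, the sketch below generates n-digit multiplication problems and measures exact-match accuracy. The `ask_model` function is a hypothetical stand-in (simulated here), not a real API.

```python
import random

def ask_model(question):
    # Hypothetical stand-in for an LLM call; here we simulate a model that is
    # reliable on small products but drifts on larger ones.
    a, b = map(int, question.split(" * "))
    return str(a * b) if a * b < 10_000 else str(a * b + random.randint(1, 9))

def accuracy_at_digits(n_digits, trials=200):
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        if ask_model(f"{a} * {b}") == str(a * b):
            correct += 1
    return correct / trials

for d in range(1, 6):
    print(f"{d}-digit x {d}-digit accuracy: {accuracy_at_digits(d):.2f}")
```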

While providing the LLM with tools (e.g. a code interpreter) has mitigated some of these issues, it’s difficult to see this as a direct route to AGI, particularly for complex, multi-step reasoning challenges that don’t readily lend themselves to straightforward coding solutions. If auto-regressive LLMs do not provide a path to AGI, then what will?
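
For what tool use can look like in practice, here is a minimal sketch that routes plain arithmetic to a deterministic tool instead of trusting next-token prediction; the routing heuristic and the `ask_model` stand-in are assumptions for illustration only.

```python
import re

def ask_model(question):
    # Hypothetical stand-in for a raw LLM call.
    return "I think the answer is 600."

def answer(question):
    # If the question is plain arithmetic, hand it to a deterministic tool
    # (here, Python itself) rather than the model's intuition.
    match = re.fullmatch(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*=?\s*\??\s*", question)
    if match:
        a, op, b = match.groups()
        return str(eval(f"{a}{op}{b}"))  # a real system would sandbox this
    return ask_model(question)

print(answer("21 * 29"))                # tool path -> 609
print(answer("Why is the sky blue?"))   # model path
```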

Understanding the necessary attributes for a model to succeed at such complex reasoning tasks is imperative when attempting to define the underlying processes that enable it to function at such a high capacity. To approach the sophistication of human and animal reasoning, the model must not only replicate neural task division but also emulate the intricate system of cognitive rewards and motivations that drive learning and decision-making. Let’s start with the training process. The types of learning a model derives its knowledge from can be broken up into three paradigms: imitative learning, autonomous learning, and guided learning. Let’s use the example of learning a new instrument.

Imitative learning can be thought of as watching an expert from afar, playing compositions in their entirety, observing each new note and its correlation to the previous n notes. Consider Paul McCartney to be our expert. In imitative learning, it’s akin to sitting in a concert hall, watching McCartney skillfully navigate his extensive repertoire, from “Lady Madonna” to “Let It Be” to “Live and Let Die” and everything in between. As an observer, you’re exposed to a vast array of permutational sequences, each chord, melody, and harmony offering a unique, bipartite junction of patterns. Despite the extensive amount of data to draw from, you’re still confined to a system that has little to no ability to deviate beyond its observed data. This is because the system is constrained to length-n associations, rather than a series of iterations that repeatedly critique and update outputs as humans and animals do. Current LLMs excel at imitative learning: they are stochastic predictors that rely heavily on memorization for inference.

Autonomous learning, on the other hand, is a more involved approach wherein feedback is given at each action’s end. This type of learning is what enabled AlphaGo to reach what Google DeepMind dubbed “narrow AGI” (AGI in a “clearly scoped task or set of tasks”). This system is inherent in most modern LLMs, often referred to as “reinforcement learning”. While it is an improvement over imitative learning, it relies on large swaths of data in order to be effective. Nuances in correctness can only be distinguished if the data is large enough to support it. For instance, take a word problem whose answer can only be derived through some type of multi-step reasoning. Now suppose binary feedback is given on whether the answer was correct. On an incorrect response: was the first step wrong? The second? The third? It’s difficult to say from the learner’s perspective unless the data provided is large and robust enough to discern the underlying patterns.

Lastly, guided learning is a heavily involved methodology wherein feedback is given after every step. This is most akin to how humans and animals interact with the world, an internal pseudo-backpropagating algorithm that improves and adapts outputs over time. This is likely to be the next iteration of LLMs. Recent papers such as “Let’s Verify Step by Step”, “STaR: Self-Taught Reasoner — Bootstrapping Reasoning with Reasoning”, and “Scaling Scaling Laws with Board Games”, as well as “Test-Time Computation” and the reported Q* algorithms, have shown dramatic improvements on math word problems. Furthermore, as indicated in the graph below, this form of process-supervised reward modeling is the only form of learning that doesn’t seem to taper off as compute increases.
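
To make the contrast between outcome feedback (autonomous learning) and per-step feedback (guided learning) concrete, here is a hedged sketch of the two reward signals over a multi-step solution; the step format and grading functions are illustrative assumptions, not the method of any specific paper.

```python
# A worked solution is a list of intermediate steps; the last step is deliberately wrong.
solution_steps = ["21 * 29 = 21 * 30 - 21", "21 * 30 = 630", "630 - 21 = 619"]

def outcome_reward(steps, correct_answer="609"):
    # Outcome supervision: one binary signal at the very end.
    # The learner is told the answer is wrong but not which step failed.
    return 1.0 if steps[-1].endswith(correct_answer) else 0.0

def process_reward(steps, step_grader):
    # Process supervision: feedback after every step,
    # so credit assignment points at the first faulty step.
    return [step_grader(s) for s in steps]

def grade_arithmetic_step(step):
    lhs, rhs = step.split("=")
    try:
        return 1.0 if eval(lhs) == eval(rhs) else 0.0
    except Exception:
        return 0.0

print(outcome_reward(solution_steps))                         # 0.0 -- but where did it go wrong?
print(process_reward(solution_steps, grade_arithmetic_step))  # [1.0, 1.0, 0.0] -- step 3 is the culprit
```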

It’s evident that incorporating some kind of reasoning-chain modality will provide an easier path toward superhuman-level intelligence. However, what if we want to go beyond that? Solving non-trivial math conjectures, or creating new paradigms in physics? Creating a model that surpasses the best humans at a task will require some form of self-improvement, akin to how AlphaGo crossed that same threshold against the best Go players. The difference between improving AlphaGo and improving an LLM is the disparity in reward modeling. While AlphaGo can generate rewards by playing itself, the same cannot be said of an LLM. To bridge this gap, LLMs might employ a self-referential training loop, akin to the Perception-Planning-Action Cycle proposed by Yann LeCun (referenced below).

In such a framework, the LLM would be both the actor and the environment, generating hypotheses and predicting their outcomes based on an internal world model. Through continuous refinement via a recursive feedback loop, an LLM could quite possibly optimize its ability to learn autonomously, pushing the boundaries of its capabilities without the need for external rewards². The evolution of LLMs hinges on the ability to supersede existing training (and, as a result, inference; i.e. Test-Time Computation) paradigms. Their transformation from a primarily unidimensional, stochastic predictor into an iterative agent focused on maximizing rewards and minimizing costs will pave the way for more dynamic and adaptable AI systems capable of complex decision-making and problem-solving.
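
A heavily hedged sketch of what such a self-referential loop could look like, with the actor, internal world model, and cost module reduced to placeholder functions; this is a conceptual illustration of the cycle described above, not LeCun’s actual architecture.

```python
def propose_actions(state, k=3):
    # Actor: propose k candidate plans/hypotheses for the current state.
    return [f"plan-{i} for {state}" for i in range(k)]

def predict_outcome(state, action):
    # Internal world model: predict the next state without acting in the real world.
    return f"{state} after {action}"

def cost(predicted_state):
    # Intrinsic cost the agent tries to minimize (placeholder scoring).
    return len(predicted_state)  # stand-in for a learned cost module

def plan_step(state):
    # Pick the candidate whose *predicted* outcome minimizes cost,
    # then carry that prediction forward as the next state to refine.
    candidates = propose_actions(state)
    best = min(candidates, key=lambda a: cost(predict_outcome(state, a)))
    return best, predict_outcome(state, best)

state = "initial problem description"
for _ in range(3):  # iterative refinement loop
    action, state = plan_step(state)
    print(action, "->", state)
```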

Ensuring safe and reliable large language models that are aligned with human values requires a system more akin to how humans and animals are wired — a form of self-supervised learning that can reason beyond the next token and learn to plan and make adjustments. Auto-regressive LLMs lack that capacity. A system constrained to length n associations is prone to hallucinate — incapable of knowing what it doesn’t know. New architectures that embrace a broader spectrum of cognitive skills, including long-term planning, abstract thinking, and contextual understanding, are necessary to overcome these challenges. Such systems would not only reduce the propensity for hallucinations but also exhibit a deeper alignment with human values and ethical principles, ensuring safer and more reliable outcomes in real-world applications.
