Large Language Model Concepts for Curious Users: The Static, Stateless, and Directional Nature of LLMs

Dave Ziegler
3 min read · Jul 10, 2024


Three of the biggest misconceptions about how LLMs work, explained.

LLMs are Static

When we say LLMs like OpenAI’s GPT series are static, we mean that their generalized knowledge (the weights), learned only during the training process, never changes through use (inference). In other words, a model does not learn, adapt, or change without some form of training, whether that is the initial pretraining or later fine-tuning. This is why model creators and providers publish a knowledge cutoff date with their models.

So how is it that services like ChatGPT seem to have up-to-date information that extends past the training cutoff?

In order for a model to work with data it never saw during these expensive and intensive training runs, that new data must be provided in context (analogous to “working memory”) by some other source or mechanism. This includes the following (a short sketch follows the list):

In-Context Learning During Inference (providing new information in the prompt)

1) A user’s direct prompt

2) An application calling the model (using an API to interact with a model programmatically)

Retrieval-Augmented Generation (RAG)

3) Function calling and agents to obtain data from external sources (e.g., Google, Wikipedia, Wolfram Alpha)

4) Vector databases that can store information between prompts and even between sessions

Hallucinations

5) Finally, sometimes LLMs appear to answer with current information when they are actually hallucinating. Hallucinations occur when LLMs create grammatically correct but factually inaccurate text, and it’s a normal and expected behavior.

This happens because LLMs generate text based on probability rather than rote, curated knowledge. An LLM might therefore appear to be giving current information when it is simply making it up.
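To make items 1 through 4 concrete, here is a minimal sketch in Python of how an application can hand a static model fresh data inside its prompt. The functions search_web and call_llm are hypothetical placeholders, not real library calls; the point is only the shape of the flow: retrieve, place in the prompt, ask.

```python
# A minimal sketch of in-context learning / retrieval-augmented generation.
# search_web() and call_llm() are hypothetical stand-ins, not real library
# calls: swap in a real retrieval source and your provider's chat API.

def search_web(query: str) -> str:
    """Pretend to fetch fresh text from an external source (search engine, wiki, etc.)."""
    return "Placeholder text returned by a retrieval tool."

def call_llm(prompt: str) -> str:
    """Pretend to send a prompt to an LLM endpoint and return its reply."""
    return "Placeholder answer generated from the prompt."

def answer_with_fresh_data(question: str) -> str:
    # 1) Retrieve information the model never saw during training.
    retrieved = search_web(question)

    # 2) Put that information into the prompt: the model's "working memory".
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{retrieved}\n\n"
        f"Question: {question}"
    )

    # 3) The weights never change; the model only reads the context it was
    #    handed for this single request.
    return call_llm(prompt)

print(answer_with_fresh_data("Who won yesterday's match?"))
```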

LLMs are Stateless

An LLM can be thought of as both a read-only data file (containing generalized textual knowledge, or weights) and a set of mechanisms to interact with that data (attention and tokenization functions, etc.). A model itself doesn’t change through use (inference), nor does it actually “run” or perform any operations between requests.

Furthermore, models like OpenAI’s GPT series don’t actually remember (or more accurately, store) any data from previous requests — it’s up to an application layer to maintain a running log of the entire conversation and resend all or part of that log with each request. Doing so gives the illusion that the model is having a back-and-forth conversation when it is actually just reprocessing the entire, growing chat history each time.
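A rough sketch of that application layer, again using the hypothetical call_llm placeholder from above: the program keeps the transcript, and every turn resends all of it.

```python
# Sketch: the application, not the model, remembers the conversation.
# call_llm() is a hypothetical placeholder for a chat-completion API.

def call_llm(prompt: str) -> str:
    return "Placeholder reply."

history = []  # the running conversation log lives entirely outside the model

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})

    # Every request resends the ENTIRE transcript so far; the model itself
    # retains nothing between calls.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    reply = call_llm(transcript)

    history.append({"role": "assistant", "content": reply})
    return reply
```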

As context grows (prompts, responses, data pulled from function calls, agents, and vector databases as explained above), so does the amount of memory and computing power required to process the entire context.

How a model is trained (which sets the maximum possible context window) and the resources available (GPU VRAM and compute) determine the maximum amount of context a model can handle. Once that limit is reached, some of the context must be omitted or truncated for the conversation to continue.
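When that limit is reached, the application has to decide what to drop. One common, crude strategy is to trim the oldest turns until the transcript fits a token budget. The sketch below approximates tokens by whitespace-separated words, which is an illustrative assumption only; real systems count tokens with the model’s own tokenizer, and the budget shown is not any particular model’s real limit.

```python
# Sketch: trim the oldest messages until the transcript fits a context budget.
# The budget and the word-count approximation are illustrative assumptions.

MAX_TOKENS = 4000  # not any particular model's real limit

def rough_token_count(text: str) -> int:
    return len(text.split())  # crude proxy; real code would use a tokenizer

def trim_history(history: list) -> list:
    trimmed = list(history)
    while trimmed and sum(rough_token_count(m["content"]) for m in trimmed) > MAX_TOKENS:
        trimmed.pop(0)  # drop the oldest turn first
    return trimmed
```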

LLMs Generate Text In One Direction

LLMs generate new text based on a prompt using what’s known as the self-attention mechanism. It is this mechanism, described in the landmark 2017 paper “Attention Is All You Need,” that is responsible for the recent giant leap in generative transformer-based LLMs like ChatGPT.

This mechanism considers all the tokens in a prompt (see yesterday’s article on tokenization) and their relationships to each other to generate a single new token based on probability. Then the mechanism considers all the tokens again, including the new one, to generate another, and so on until the answer is complete.

This text generation proceeds in one direction only: each new token requires another “forward pass” through the model, and while the model considers all of the current tokens each time it generates a new one, the previously generated text cannot be changed or edited during generation.
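As a sketch, generation is just a loop that appends one token at a time and never goes back. The predict_next_token argument below stands in for a full forward pass through the model; it is an assumed placeholder for illustration, not a real API.

```python
# Sketch of one-directional (autoregressive) generation.
# predict_next_token is a placeholder for a full forward pass of the model:
# it looks at every token so far and returns the single most likely next token.

def generate(predict_next_token, prompt_tokens, max_new_tokens=50):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # considers ALL tokens so far
        if next_token == "<end>":
            break
        tokens.append(next_token)  # appended only; never edited or revised
    return tokens
```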

This directionality is yet another reason LLMs have difficulty adhering to word counts and suggested response lengths, and it represents another huge difference between how a human actually thinks and how an LLM simply generates new text from learned relationships in language and probability.
