Hallucinations, Errors, and Dreams

On why modern AI systems produce false outputs and what there is to be done about it

Colin Fraser
34 min read · Apr 18, 2024

Modern AI systems, as we have been warned, are prone to hallucination.

We know this, but it’s kind of weird when you think about it. We had a solid half-century or more of computers not making things up, their sophistication and accuracy only improving over time. But in 2024, although you can trust a pocket calculator to give you correct answers to the math problems that you input into it, you can’t trust the world’s most sophisticated artificial intelligence with those very same problems.

If you put this into any pocket calculator you’ll find that it’s (astonishingly) close, but wrong.

What’s with that?

I think it’s a very important and multi-faceted question, and in this piece I want to investigate it in some detail. One aspect of the problem involves a major shift over the last 30 years or so in what exactly is meant by “AI”. For a long time, most of what we did when we programmed computers involved finding ways to solve problems exactly. A pocket calculator uses these kinds of methods to produce solutions to math problems which are provably correct. In the past, we thought of the automated application of these precise methods as a form of artificial intelligence. But nowadays, most of what we describe as “AI” refers to applications of Machine Learning. Machine Learning is a paradigm of computer programming where, rather than applying deductive logic to produce output that is known to be correct like a pocket calculator, programs are designed to produce predictions, which are expected to be occasionally wrong. In the first major section of the essay I’ll give an overview of what this means, going over the basic difference between machine learning and older kinds of AI from an extremely high level to see why we expect these kinds of systems to produce errors where more classical computer programs did not.

So one answer to the question of hallucination seems simple: generative AI is machine learning, machine learning is known to produce errors, and a hallucination is an error. This view implies some things about how the hallucination problem may progress in the future: historically we’ve seen that machine learning models make fewer errors as we collect more data and build bigger models. We can expect chat bots and other generative AI systems to become more accurate over time in exactly the same way. But I don’t think that this view is actually correct; hallucinations are, in my view, distinct from errors in the classical machine learning sense. I’m more partial to a view that says that all generative AI output is a hallucination. I’ll explain exactly what I mean by all of this in the second section.

In any case, however you define a hallucination and whatever you believe about its nature, everyone agrees that there is some generative AI output that is good and useful, and other output that is bad and not useful, and it’s natural to want to quantify how much of each there is. In fact, I think that quantifying this is essential in order to put these things in production in any kind of useful way. But it turns out that measuring this kind of stuff is extremely hard, as more and more people are beginning to learn. In the third major section I explain why I think that this kind of measurement is so important, and also what makes it so hard.

1. A crash course in Machine Learning

In the old days before all this generative stuff, most AI was concerned with the problem of making very specific guesses about narrow classes of outcomes. Will this user click on this link? What kind of object is depicted in this picture? How much will this stock be worth tomorrow? Each of these questions would be answered by a discrete computer program whose only job is to answer the question it was built to answer.

How do we build a computer program to solve one of those problems? In the very old days, the approach would be to try to reason from first principles. To predict how long it will take an apple to hit the ground after it falls from a tree, Newton just thought very hard about the nature of the universe and came up with a theory that produces an equation that answers the question. This approach was successful for Newton, but for most practical problems it’s very hard to come up with solutions from first principles in that way. Many people have tried, which is how we end up with things like the Black-Scholes equation for estimating the true value of a financial derivative, but for many problems that we care about in the modern world, like guessing what objects are depicted in an image, we wouldn’t even know where to start.

Enter machine learning. The basic idea with machine learning is that by looking at enough examples of the process that you’re trying to predict, you can find patterns that will help you make accurate predictions without necessarily needing to understand the process that is generating those examples. By looking at a million apples falling from a million trees of different heights, you can skip the Principia and cut right to the equation.

Or at least, you can cut right to an equation. By the nature of this process, the equation that you find is very unlikely to match Newton’s. It will produce an equation that approximates the data as closely as it can, but of the infinitely many equations that can approximate any given dataset, it’s unlikely to settle on Newton’s exact one. But that’s okay. The point is that you don’t need it to. You’re not trying to understand gravity; you’re trying to make predictions about apples. This may not seem ideal for something like physics, but for a problem like recognizing objects in an image where there are no obvious first principles, it’s pretty handy.
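To make this concrete, here is a minimal sketch of what “cutting right to an equation” might look like in Python. The data are simulated (secretly, from Newton’s own formula), and the choice of a cubic fit is an arbitrary illustration rather than anything a real physicist or machine learner would necessarily do:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "observations": drop heights in metres and measured fall times in
# seconds, secretly generated from Newton's t = sqrt(2h / g) plus some noise.
g = 9.81
heights = rng.uniform(1, 10, 100_000)
times = np.sqrt(2 * heights / g) + rng.normal(0, 0.05, heights.size)

# The machine-learning move: fit *an* equation to the data. A cubic polynomial
# tracks the observations closely, but it is not Newton's equation.
fitted = np.poly1d(np.polyfit(heights, times, deg=3))

print(fitted(5.0))            # the fitted equation's prediction for a 5 m drop
print(np.sqrt(2 * 5.0 / g))   # Newton's answer: about 1.01 seconds
```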

The basic process for building a system like this is called Supervised Learning, and if you zoom out far enough to abstract away most of the details, it’s quite simple. To build a system that guesses what handwritten digit is in an image, you collect a big dataset of images of digits, and manually label each image with the digit that it depicts. This is called the training data. Then you show all the images in the training data to the computer and have it guess which digit is in each picture, and you give it a score based on how often it was right. You repeat that a few hundred thousand times, and the computer tries different guessing strategies each time, looking for the one that gives it the highest score. This search for the highest-scoring guessing strategy can be very long and computationally expensive, but recent innovations in the mathematics of finding high scores as well as in computing efficiency have made this basic strategy extremely successful on a huge range of tasks.

To introduce a bit more nomenclature here, the search for the best guessing strategy is called “training”, and the resulting system is often called a “model”. A model that guesses from a set of discrete labels is a “classifier”, and machine learning practitioners prefer to call the guesses “predictions”.
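Here is roughly what that whole pipeline looks like in code. This is a sketch using scikit-learn and its small built-in dataset of 8×8 digit images, standing in for the larger setup described above; the library does the search for the highest-scoring guessing strategy inside the call to fit:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Training data: images of handwritten digits, each manually labeled with the
# digit it depicts (scikit-learn ships a small pre-labeled set of 8x8 images).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# "Training": search for the guessing strategy that scores best on the training data.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# "Predictions": the classifier's guesses on images it has never seen before.
predictions = model.predict(X_test)
print("accuracy on held-out images:", (predictions == y_test).mean())
```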

It’s worth dwelling for a moment on just how much the machine learning approach differs from Newton’s. Newton may look at a few apples falling from trees for inspiration, but his project is to develop a theory that encodes the general principles of the movement of celestial bodies. An equation pops out of the theory to tell us, among many other things, how long it takes an apple to fall from a tree. For a machine learner, the general principles governing the relationships between celestial bodies are of approximately zero relevance. The machine learner’s only focus is on accurately reproducing a dataset of a million apple fall times. There are pros and cons to each approach. The machine learning approach will likely produce an inscrutable equation that tells us very little about the general nature of gravitation, but on the other hand, it may be better able to incorporate real world complexities like air resistance which complicate Newton’s approach.

I compare machine learning to Newton’s approach only to highlight that supervised machine learning is not the only way to build an artificially intelligent system. There are lots of ways to program a computer, and none is obviously or necessarily better than any other for any particular application ex ante. But in the last 15 years or so it has become apparent that supervised learning can be effective at significantly more complex tasks than anyone ever expected. By complexity here, I’m referring to the variety of possible inputs and outputs of a model. A typical introductory machine learning tutorial might show you how to build a system that takes a 256×256 pixel image of a handwritten digit and produces one of ten possible labels—the digits from 0 to 9. You can build a model like that that achieves pretty high accuracy with just a few tens of thousands of images. But if instead of tens of thousands of labeled images, you’re able to use millions or billions of images, you can greatly expand the universe of possible inputs and outputs. Image diffusion models like Stable Diffusion, for example, are trained on all kinds of images of all different sizes, and rather than output one of a handful of discrete labels, they output a whole image. That is, instead of producing a mapping from 256×256-pixel digit images (65,536 pixel values) to just ten possible outputs, they produce a mapping from an unfathomably large set of possible inputs to an unfathomably large set of possible outputs. The fact that you can do something so complex using machine learning is not obvious, and I’d say it’s one of the major scientific discoveries of the last 15 years.

The catch is that to build these kinds of more complex models, you need an extremely large amount of data, and obtaining datasets large enough quickly becomes prohibitively expensive. The models with the most promise at these high complexity tasks require billions of labeled examples or more, and there’s just no way to manually look at a billion images and write down which objects they depict.

If you could somehow generate the labels without having to manually look at all the examples, then you’d have a shot. This is the big idea of Self-supervised learning, the machine learning paradigm behind modern generative AI systems. If you can get your hands on billions of sentences — say, by scraping all of the text off of the internet — you can construct the training dataset programmatically by cutting the sentences into pieces. Just turn “The quick brown fox jumps over the lazy dog” into the training example “The quick brown fox jumps over the lazy ___”, and assign it the label “dog”. In fact, there are many training examples you can construct just from that one sentence alone by chopping it off in different places: “The quick” and “brown”, “The quick brown” and “fox”, etc. From just the one sentence we get eight training examples with no human labeling required. Multiply this by the number of sentences it’s possible to scrape off the internet, and you approach the sizes it takes to train these kinds of complex models.
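A minimal sketch of that chopping-up procedure, in Python. Real systems operate on sub-word tokens rather than whole words and at vastly larger scale, but the labeling trick is the same:

```python
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()

# One training example per cut point: everything before the cut is the input,
# and the censored next word is the label. No human labeler required.
examples = [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for context, label in examples:
    print(f"{context!r} -> {label!r}")
# 'The' -> 'quick', 'The quick' -> 'brown', ..., 'The quick brown fox jumps over the lazy' -> 'dog'
```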

An important observation here which I will return to momentarily is that, setting aside vast differences in size and complexity, the process for training GPT and the process for training a traditional classifier are not so different. The LLM handles a lot more possible inputs and outputs, but it’s trained in fundamentally the same way, to do the same thing: guess the right label for the given input.

Both models are constructed by showing them a bunch of incomplete examples, having them guess the completions, and scoring their guesses. The big innovations associated with training modern generative AI systems are in finding clever ways to automatically construct massive training data sets, as well as the invention of new types of black boxes suited to performing complex tasks, but the basic high level picture of how they are trained is essentially the same as it’s been for decades.

The story might just end here. Sometimes a digit recognizer calls a 7 a 9, and sometimes a language model says that the quick brown fox jumps over the lazy brown doldrum. This is just an inherent part of machine learning, a result of the fact that machine learning models make predictions based on patterns rather than provably correct deductive inferences, and it’s something that tends to improve over time with more data and bigger models.

But I don’t think this is right.

2. The difference between a hallucination and an error

Sometimes you show the model a picture of a 7 and it says it’s a picture of a 9. This has been true forever. When that inevitably happens, why don’t we say that the digit recognizer is “hallucinating”? Why is inaccurate information only a hallucination when it comes from a chat bot?

As I mentioned just a moment ago, an LLM and a classical classifier are conceptually very similar in the way that they are constructed. The LLM is a classifier, albeit a very complex one. Just like the digit recognizer is trained to fill in the missing label on a pre-existing image, the LLM is trained to fill in the missing word at the end of a pre-existing sentence. The main difference here is one of complexity and scale. But while they are similar in how they are constructed, there is a huge difference in the way that Generative AI systems are deployed.

Typically, we would deploy a classifier to perform the same task that it is trained to do. When we deploy the digit recognizer, we’re going to put it to work on recognizing digits. Presumably we’ll have some process by which handwritten numbers are collected, and we will use the model to read those collected numbers in order to do something like deposit a cheque.

Generative AI systems are different. When we deploy an LLM as a chat bot for the world to use, we switch from using it to guess the next word in a pre-existing sentence to “guessing” the next word in a brand new string that does not actually exist. This is an enormous switch, the import of which is, I believe, generally underestimated. It means that, unlike with a classical classifier, there is simply no way to evaluate the accuracy of the LLM output in the traditional way, because there are no correct labels to compare it to. I think this point is a bit subtle, and that getting quite granular will be helpful in bringing it out.

When you input an image of a number 7 into the digit recognizer, there is a single unambiguous correct label that you hope it will output: “7”. If it outputs the labels “1” or “9”, that is unambiguously incorrect and counts against the accuracy of your model. These errors are identical in kind to the errors that it makes during training, and so it makes sense to talk about the error rate on new data (the so-called “generalization error” or “out-of-sample error”) in the exact same way as we talk about the error rate on training data.

When you feed ChatGPT the string “What is 2 + 2?”, there is no such single unambiguous correct next word. You’d like the next word to be something like “4”. But “2” could also be good, as in “2 + 2 = 4”. “The” could also be a good next word, as in “The sum of 2 and 2 is 4.” Of course, any of these could also be the first word in a bad response, like “4.5” or “2 + 2 = 5” or “The quick brown fox”. The task that the model is built to do is to fill in the word that has been censored from an existing passage—a task which does have an unambiguous right answer—but now the situation is entirely different. There are better next words and worse next words, but there’s no right next word in the same sense as there was in training, because there’s no example to reconstruct.

An error in the classical sense for a language model would be a failure to reproduce the missing word that has been censored from the training example, but in production these models are simply not used to do that. It’s a little bit like if we started plugging images of animals into the digit recognizer. If the digit recognizer calls a lion a 6, has it made an error? No, I don’t think so. You’re using it for a different task than it was trained to do; there’s no right answer, so errors are not defined.

In practice, we tend not to even care very much about these individual word predictions. The LLM, the engine that makes ChatGPT work, does nothing but guess words one at a time, but the ChatGPT system involves a component that feeds those predictions back into the LLM to generate a whole sequence of words which compose a full text response. It’s the semantic content that emerges in that full text response that we are generally interested in, not any one word.

This is at least part of the reason why it’s an “error” when the handwritten digit classifier calls a 7 a 9, but a “hallucination” when GPT-4 says that an elephant named Kami swam across the English Channel in 1981 to raise money for the World Wildlife Fund.

A screenshot from ChatGPT: “The first elephant to swim across the English Channel was named “Kami”. This remarkable event took place in 1981, and Kami, a female Asian elephant, made the journey as part of a fundraising event for the World Wildlife Fund.”
GPT-4 in December of 2023 claiming that an elephant named Kami swam across the English Channel in 1981.

It’s of course not the case that an elephant named Kami swam across the English Channel in 1981, but the way that ChatGPT is wrong here is so different from the way that an image classifier is wrong when it calls a 7 a 9. ChatGPT made 110 distinct predictions here, and it’s not obvious how to categorize each as either right or wrong. Each predicted word makes sense with respect to the words that precede it, and this looks very much like a sequence of words you might find in the training data.

The individual next-word (or next-token, rather) guesses from the preceding ChatGPT response. Generated with this tool.

Some, even most, of the predicted words here are probably closer to correct than incorrect, in my opinion. Of course there is no universal way to objectively define this since there is no pre-existing text to compare it to (that is sort of my entire thesis here), but can you think of a better word to follow “remains a unique event in the history of animal” than “feats”? Of all of the predictions that the model made, it’s quite unclear which individual ones, if any, we should call errors — though in aggregate, clearly, this output is not what we want.

Why is it not what we want, though? What exactly is wrong with it? Obviously, the main problem is that it seems to describe an event that did not actually occur. But when I really think about this, I find it a bit puzzling. What if an Asian Elephant named Kami really had swum across the English Channel in 1981, exactly as described in this text? Then this identical pair of input and output would not be hallucinatory. This seems to imply that there is nothing inherent to the text of the input-output pair that makes it hallucinatory; whether or not it is hallucinatory is entirely contingent on facts about the world, facts which exist completely independently of the text produced by the model. But if there’s nothing inherent to the text that makes it hallucinatory, then is hallucinatory-ness even a property of the text? Not entirely, it would seem. It’s a property of the way that the text relates to objects and events in the real world.

Complicating things further, mapping text to facts about the world is a slipperier and more subjective business than one probably hopes. I read the passage about Kami as making several claims, many of which are true—the Channel is “about 21 miles at its narrowest point”, and it is “a significant challenge even for experienced human swimmers due to the strong currents and cold water temperatures”. I, and I’m sure most readers, would probably agree that the main claim made by the text is that an elephant named Kami swam across the English Channel, which is false, and thus perhaps the passage is “hallucinatory”, but can you come up with an objective criterion by which we can make this kind of assessment for all possible text? It seems hard to me. Would the following output be a hallucination, or not? (It’s very important to always keep in mind that since these systems generate text randomly, the same prompt can lead to different outputs, some of which you may consider hallucinatory and some of which you may not.)

What about this?

I’m not saying that one couldn’t come up with some criteria to categorize these unambiguously, but it’s not as straightforward as you might hope.

Let me once again recap the basics of how ChatGPT works. First you train a classifier, in more-or-less the standard way, on the task of filling in the missing word from a block of text. Now you have a model that can produce a single word at a time: the predicted missing word, given the previous text. Given some initial text, say “2 + 2”, this model acts as though this is the start of an existing document which has had the final word censored, and produces a guess as to what the censored word was. Maybe it guesses “equals”. Now, to turn this into a system that produces more than a single word at a time, you glue this to the end of the prompt and feed that back into the model. The model is invoked once again, freshly, oblivious to any of the previous activity, and asked to guess the word that has been censored from the end of “2 + 2 equals”. This is repeated over and over again until the model’s prediction is that there is no next word.

At a high level, the generative image models work quite similarly. These are trained on the task of reconstructing an image given a distorted version of the image and a plain text description of the image. To generate new images, you input the plain text description of what you want to produce, and in the place where the model expects the distorted image, you input random noise.

In both cases, the model “thinks” it’s reconstructing an existing artifact, but in fact it’s generating a new one. Given this description, I think it makes sense to wonder: is all generative AI output a “hallucination”? If the way to get them to produce output is to tell them the output actually already exists and set them to work reconstructing it, to me, that sounds like we’re asking them to hallucinate.
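In code, that loop looks something like the sketch below. The predict_next_word function is a hypothetical stand-in for the trained model, and real systems work with sub-word tokens and sample from a probability distribution rather than returning a single word, but the feed-the-guess-back-in structure is the important part:

```python
def generate(prompt, predict_next_word, max_words=200):
    """Repeatedly ask the model to 'fill in the censored next word' of a document
    that doesn't actually exist, gluing each new guess onto the end of the text."""
    text = prompt
    for _ in range(max_words):
        guess = predict_next_word(text)  # the model is invoked freshly each time
        if guess is None:                # the model predicts that there is no next word
            break
        text = text + " " + guess
    return text

# Toy stand-in predictor, just to show the loop running end to end.
canned_guesses = iter(["equals", "4", None])
print(generate("2 + 2", lambda text: next(canned_guesses)))  # prints "2 + 2 equals 4"
```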

Some prominent AI researchers have recently come around publicly to this view that all LLM output is hallucination — and moreover, that it’s actually a good thing. Andrej Karpathy recently tweeted that LLMs are “dream machines”, that “hallucination is not a bug, it is LLM’s greatest feature.” I may not go so far as to describe this as a “great” feature, but I do believe that it is their defining feature.

This is actually not a fresh perspective, but a relatively old one. In 2015, Google released a system they called DeepDream, which was very directly a precursor to these current generative AI systems, and almost surely what Karpathy was alluding to by calling LLMs “dream machines”.

A screenshot from the DeepDream website of a collection of what they called “dreams” — generated images from random noise.

This system was born of the realization that the technology they had been using to classify images could be reconfigured to generate images that did not previously exist. Since the generated images aren’t really “of” anything that exists in the real world, but rather something like statistical echoes of images from the training data, they decided to call them “dreams”. The creators of DeepDream did not claim that the model produces images which are “occasionally hallucinations”. It was understood from the start that every bit of information generated by these models is a “dream”. At the time, this seemed like more of a curiosity than something that could actually become useful on its own—or at best, a way to better understand the inner workings of the classifier.

An excerpt from the DeepDream blog post, showing how “dreams” could be used to better understand neural networks.

It doesn’t seem to have been anticipated at the time by very many people that dreams of this type could be useful on their own merits, but we’ve learned since then that if you train a complex enough model with enough data, the dreams can become quite vivid and correspond frequently to facts about the real world. But to the extent that this happens, in my opinion, it’s essentially a happy coincidence. From “the model’s perspective,” there’s no distinction between hallucinatory text and non-hallucinatory text. All of its output is dreamed-up reconstructions of pretend censored documents.

This may feel rather philosophical and abstract and to an extent it is, but I believe it also has some very concrete implications for how we can expect this technology to evolve. If a hallucination is analogous to a typical error from any other machine learning model, then we have pretty good empirical reasons to believe that the prevalence of hallucinations can be driven aggressively towards zero. Nowadays there are machine learning models that are very very good at handwritten digit recognition. The basic steps are simple: train the model on more data, and make the model bigger. But if hallucinations are qualitatively different from the classical kind of error, as I really believe they are, then the story may be different. It’s not so obvious in this case that more data or bigger models leads to less hallucination. Maybe the way forward is not more data or larger models, but something else: a completely new and different way to train the model perhaps, or a new way to generate predictions. And as a matter of fact, the current state of the art approach to dealing with hallucinations does not really involve collecting a meaningfully larger dataset or making the model larger; RLHF is more like a completely new and different way to alter a pre-trained model (I expand on RLHF in detail in this previous post, as well as in slightly less detail but from different perspectives in this one and this one). Is it the solution? Maybe; no one knows! Under the view that the hallucination problem is qualitatively new, rather than an instance of the well-known problem that machine learning models occasionally produce errors, the inevitability of gradual-but-perpetual improvement along this axis is not at all guaranteed.

The really scary thing that is implied by this view is that the hallucination problem is simply unsolvable. Hallucination and non-hallucination are not actually distinct categories of output; every time you ask the bot to draw you a picture or write you an essay you’re asking it to hallucinate. These hallucinations will inevitably diverge from the real world at least sometimes because, well, how could they not? They’re dreams. I think it’s telling that most actual attempts to ground LLM-based systems in truth are not really ways to improve the model, but ways to bolt non-LLM pieces on to the larger system which produce more reliably factual text for it to bounce off of: giving it an environment to execute code in, for example, or feeding it search results from Bing. These add-ons (OpenAI literally calls them add-ons) can be somewhat successful at eliciting hallucinations that better match the real world, but this doesn’t seem to me to get at the root of the problem, which is that the engine generating the text can’t tell the difference between generating truths and generating lies.

As a short aside, I find the hype around generative AI to be rather confusing, and confused. Of course, I find it overblown in many respects. You know this; I don’t need to expand. But on the other hand, I think it’s actually under-appreciated — and undersold — what a miracle it is that this even works at all. It’s not so surprising to me that given a large enough dataset and a large enough model, you can train a big model to predict the single missing word from a passage of text with fairly high accuracy. But the fact that you can feed the output of that model back in on itself to generate text, and that the resulting text is even remotely coherent let alone useful, is nothing short of miraculous. Yet, I really don’t see this last point emphasized very much. I’m just opining wildly here, but I don’t think that (some of) the people who build this technology want to really acknowledge how surprising it is that this works, because that raises the uncomfortable question of whether it will take miracles of similar magnitudes to improve it — to eliminate the hallucination problem, for example. It’s more comfortable to paint GPT-4 as a brief stop along the inexorable march towards artificial super-intelligence, with hallucinations and all of the other problems as temporary blips along the way, than as one weird trick that someone discovered in 2017 that has produced completely unpredictable and surprising results that no one really understands.

3. On the risks of bad outputs

On the view from the previous section, there’s no real universal distinction between output that is hallucinatory and output that is not. There may be output that is more desirable and output that is less desirable, but desirability is not an inherent property of the text but rather a property of how it is interpreted and used by the reader. You might agree with that, or you might not. Either way, I do think that it’s important—essential, even—to think about and attempt to quantify the frequency of different kinds of text that the model produces under different circumstances.

This suggests a fairly simple idea: why don’t we just define some criteria by fiat for what constitutes a hallucination—regardless of the philosophical concerns about whether such a thing can objectively exist—and try to benchmark models against such a definition to come up with a “hallucination rate”? In this section I’ll talk about some of the challenges that we run into in trying to do that.

First, there’s a bit to say about how to think about errors in general. It’s fun and interesting to learn about the specific technical details of how different AI systems work but when you’re considering deploying one to automate real decisions with real stakes, there are really only three things that matter: what kinds of errors does it make, how often does it make them, and what do the errors cost? The answers to these questions dictate whether it’s even rational to use the system in production at all—and sometimes it’s not!

Suppose you’re considering using a model that predicts whether a home is underpriced as a foundation for your real estate investing business. If the model predicts that the home is undervalued then you’ll buy it, and sell it for what your model says is its fair market value. Whether or not this is a viable strategy is strongly dependent on the kinds and frequencies of errors that your model makes. And it’s not enough to know something like, “90% of the time the model is within 5% of the actual sale price”. You need to know a lot more. In the 10% of cases that it’s off by more than 5%, how far off is it? If it’s occasionally off by 100% or 1000%, that could be enough to bankrupt you, even if it’s infrequent. In the 90% of cases that it’s within 5%, does it tend to overestimate or underestimate? If the model tends to underestimate the true value of homes then you’ll frequently miss profitable opportunities to flip, or sell too early. This might be annoying but as long as it gets it right sometimes, you might have a viable way to make money. On the other hand, if the model tends to overestimate the correct value of a home, then you’ll be paying too much for overvalued assets, a good way to go broke. The moral of the story is that understanding and planning for the errors that the model makes—not only how often it makes them, but what they look like and how much they cost—is of paramount importance if you’re going to use it to automate decision-making. This is true for every model from the lowliest single variable linear regression to the world’s largest large language model.
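As a toy illustration of why the headline number is not enough, here are two hypothetical pricing models, simulated in Python with invented error distributions. Both are usually within a few percent of the true price, but their rare large errors, the ones that determine whether the business survives, look completely different:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Relative pricing errors for two hypothetical models. Both are usually close to
# the true price; they differ in what the occasional large errors look like.
thin_tailed = np.where(rng.random(n) < 0.9,
                       rng.normal(0, 0.02, n),         # 90%: small errors
                       rng.normal(0, 0.08, n))         # 10%: moderately larger errors
heavy_tailed = np.where(rng.random(n) < 0.9,
                        rng.normal(0, 0.02, n),        # 90%: small errors
                        rng.standard_cauchy(n) * 0.3)  # 10%: occasionally enormous errors

for name, errors in [("thin-tailed", thin_tailed), ("heavy-tailed", heavy_tailed)]:
    print(f"{name}: {np.mean(np.abs(errors) < 0.05):.0%} of predictions within 5%; "
          f"worst 1% of predictions off by more than {np.quantile(np.abs(errors), 0.99):.0%}")
```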

But for generative AI, as I’ve discussed, it’s not very well understood how to even define or describe the errors, let alone measure and reason about them. There are attempts. As I previously suggested, you might try to get the LLM system to generate a bunch of output, read it to determine whether it’s right or wrong, and from this compute a “hallucination rate”. A company called Vectara has a program for attempting to do exactly this and maintains a “Hallucination Leaderboard” which currently reports that the hallucination rate for GPT 4 Turbo is 2.5%, while the hallucination rate for Mistral 7B Instruct-v0.1 is 9.4%.

I have some strong methodological concerns about how these numbers are estimated and I will come back to these momentarily, but even assuming that there is a methodologically sound way to quantify this, such a “hallucination rate” is not nearly enough information. Just as in the home-buying example, it matters not only how frequently it’s wrong, but also in which direction. When the LLM bot says something false, what exactly is it saying? Is it saying that it was rainy last weekend when it was actually sunny? Or is it making extravagant offers to your customer that you can’t possibly fulfill? If it gets last weekend’s weather wrong 2.5 percent of the time, that might be good enough for a customer-facing chat assistant, but you’d probably like it to give away your inventory for free a lot less frequently than 2.5 percent of the time.

In the classical machine learning context, it’s usually possible to put some bounds around the different kinds of errors and their rates, or at least say something qualitative about them. You don’t know how far off the home price estimate is going to be but you know it’s at least going to be a number, and you can probably do some statistical analysis to figure out whether it tends to overestimate or underestimate and so on. You don’t know what the digit recognizer is going to think this “7” is, but you know for sure it’s at least going to guess a digit. With these new Generative AI systems, the output can be seemingly anything. The space of possible undesirable text is unfathomably large. ChatGPT could misquote a price to your customer, or it could recommend a competitor, or it could call them a racial slur, or it could generate a pornographic image, or it could screw up in any of infinitely many other distinct ways, and each of these kinds of bad output has a different cost. Without knowing more specifically what kinds of errors it makes, a generic hallucination rate simply does not give you enough information to know whether an LLM is right for you.

I would like to return to the methodological challenges because I do think they are severe. I see at least three hard ones. The first, and least severe, is that there’s clearly not broad agreement on what constitutes “hallucination” in the first place. The Vectara leaderboard is not actually very precise about their definition of a hallucination but it seems to be something roughly like: a hallucination is a failed attempt to accurately summarize a piece of text. This is fine as far as it goes, but if you’re not using the model to summarize text, then a measure of how often a model fails to accurately summarize text may not be particularly helpful to you. This is a problem, but not a terribly huge problem as long as you’re careful to understand the methodology of whatever hallucination benchmark you’re looking at. You just read the documentation, decide whether your own personal definition of a hallucination matches the benchmark’s definition, and proceed accordingly.

The second and third problems are significantly harder to deal with. The second problem is that it’s pretty much infeasible to properly perform these evaluations. To properly evaluate Vectara’s hallucination rate (and I’m sorry to keep picking on Vectara because all of the benchmarks have this identical problem), one would need to carefully read tens of thousands of paragraph-long text summaries and determine whether each one contains any factual errors. It’s just impossible to do this on an ongoing basis. What they do instead is, once they’ve generated all of the text summaries, they use another large language model to determine whether the summaries contain errors. I hope you can see the problem with this.

The whole point of the exercise is that we observe that LLM-based generators seem to be unreliable at sticking to the truth, and now we’re using an LLM to determine whether they’ve stuck to the truth. Now, I’ll say this: I don’t actually think the idea of using LLMs to evaluate other LLMs is necessarily a total dead end. But doing this properly is going to take some sophisticated statistical methodology to correct for the errors made by the measurement model, and I have not seen any standard benchmarks address that problem at all. The measurement model itself is going to make errors, and it’s almost certain that these errors will bias any estimation of the actual prevalence of errors. This is not a new statistical problem; the problem of estimating a population prevalence by counting the number of positives produced by an unreliable test is well studied in epidemiology, for example.

This paper deals with some basic properties of [hallucination] tests. Such tests purport to separate [LLM output] with [hallucinations] from [LLM output] without. Minimal criteria for such a process to be a test are discussed. Various ways of judging the goodness of a test are examined. A common use of tests is to estimate prevalence of [hallucination]; frequency of positive tests is shown to be a bad estimate, and the necessary adjustments are given.

— The (gently edited and annotated by me) abstract of Rogan, W. J., & Gladen, B. (1978). Estimating prevalence from the results of a screening test.
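The adjustment they derive is simple enough to state. If you know (or can estimate from a smaller hand-labeled sample) the judge model’s sensitivity and specificity, you can back out a corrected prevalence from the raw rate of positive judgments. A sketch, translated into the hallucination setting with invented numbers:

```python
def corrected_prevalence(apparent_rate, sensitivity, specificity):
    """Rogan-Gladen correction: estimate the true prevalence of hallucinations
    from the rate at which an imperfect judge flags them.

    apparent_rate: fraction of outputs the judge flags as hallucinations
    sensitivity:   P(judge flags it | the output really is a hallucination)
    specificity:   P(judge passes it | the output really is not a hallucination)
    """
    return (apparent_rate + specificity - 1) / (sensitivity + specificity - 1)

# Hypothetical numbers: the judge flags 5% of outputs, but it misses 20% of real
# hallucinations and falsely flags 2% of clean outputs.
print(corrected_prevalence(0.05, sensitivity=0.80, specificity=0.98))  # ~0.038
```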

So while I do believe that there are some potential ways forward on the hard problem of describing LLM output using unreliable estimators, I do not see them being incorporated into any of the widely available benchmarks. As it stands I do not believe that they are trustworthy.

The first and second problems are sort of moot though because the third problem is fatal. It comes from statistics 101. We are supposing that a model has some objective “hallucination rate”, an average rate at which a model produces hallucinations, and we are attempting to estimate that by checking how often hallucinations occur in a sample of its output. But, generally speaking, in order for that strategy to work, we need the sample to be representative of the population we care about; that is, we need the benchmark prompts and outputs to look like the prompts and outputs the model will actually encounter in deployment. And these benchmark datasets, to put it mildly, do not look like that. They are generally constructed through very artificial means and, as a whole, look nothing like the typical text you would encounter if you just sampled a random prompt from ChatGPT users.

This wouldn’t be such a big deal if the propensity to produce false claims was not closely correlated to the specific choice of prompt, but it seems that it very much is. In an admittedly unscientific test that I just ran, I find that ChatGPT (using GPT-4) produces output that I would classify as false somewhere between 75% (9 out of 12 tries) and 92% (11 out of 12 tries) of the time, and only produces output that I would describe as completely factual 8% of the time (1 out of 12 tries), in response to the prompt “What was the name of the first elephant to swim across the English Channel?”.

ChatGPT (GPT-4) responses to “What was the name of the first elephant to swim across the English Channel?” in April of 2024. Red indicates a complete fabrication, orange indicates a case where the output fabricates some nonsense but eventually denies that any such event has ever occurred, and green indicates text that I would classify as entirely factual. I find it parenthetically notable that I started writing this piece way back in December, which is when I generated the previous example output to this particular prompt, and it seems that the typical responses to this prompt have changed since then. At that time, GPT-4 had a strong tendency to name the hallucinatory elephant Kami, but now it has a tendency to name the hallucinatory elephant Jumbo. The instability over time of the kinds of responses that you get from this thing is a whole other topic but I just thought it interesting to point out.

Twelve is a small sample size but eleven hallucinations out of twelve tries is in fact more than enough data to reject the null hypothesis that the likelihood of a hallucinatory response is 2.5%. The larger point here is that the hallucination rate that you encounter if you deploy your GPT-powered chat bot to the world is simply not knowable by looking at how it performs on one of these hallucination benchmark tests. It gets a 2.5% hallucination rate on the Vectara hallucination benchmark and it gets a 92% hallucination rate on the Colin Fraser hallucination benchmark, and neither of these will be particularly meaningful to you because the text that your chat bot will process will look nothing like the text used by either of these benchmarks.
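For the statistically inclined, here is the quick version of that rejection, as a sketch using scipy, taking the benchmark’s 2.5% as the null hypothesis and treating my twelve tries as independent draws:

```python
from scipy.stats import binomtest

# If the true hallucination rate for this prompt were really 2.5%, how surprising
# would 11 or more hallucinations in 12 tries be?
result = binomtest(k=11, n=12, p=0.025, alternative="greater")
print(result.pvalue)  # on the order of 1e-17: reject the 2.5% hypothesis decisively
```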

As a more practical demonstration, let’s turn to one of my favorite real world examples of a ChatGPT-powered bot, the Quirk Chevrolet AI Automotive Assistant. In an unscientific test that I’ve just run in April of 2024, I find that in 4 out of 4 attempts (100%) it responds, “I’m sorry, we currently only have new inventory. Is there a new vehicle you might be interested in?” when I tell it I’m looking for a used 2021 Chevrolet Bolt, even though their website clearly shows that they have a used 2021 Chevrolet Bolt.

I won’t bother showing all four screenshots because they are identical, but I tried this in incognito mode four times in a row and it gave me the same output every time.

To see just how unpredictable and sensitive this kind of stuff is to the specific prompt, when I ask it to quote me a price for a 2021 Chevrolet Bolt, rather than asking if they “have one in stock”, all of a sudden they have one.

This chat bot is built on top of GPT 3.5 which according to the hallucination rate leaderboard is supposed to have a 3.5% hallucination rate, but I seem to be experiencing hallucinations a whole lot more than 3.5% of the time. So how often on average should Quirk Chevrolet expect the chat bot to lie to its customers? There’s really no way to know from any of the data that I’ve presented so far in this section, and that’s the point. The frequency of bad output, if such a thing is even definable, is entirely dependent on their own standards for what constitutes bad output and the kind of text that their customers tend to input into the chat window. No standardized benchmark can answer that.

If it seems like I’m a bit of a nihilist about this, think again! I don’t think there’s very much to learn from looking at hallucination benchmarks and all the rest, but I actually do think that there are paths for you, a prospective provider of a generative AI product, to usefully estimate the kinds of error rates that I claim you need. The bad news is that it’s going to be a fair amount of work, but the good news is that it’s possible.

The first thing you’ll need is a dataset filled with text that is representative of the kind of text that your users will provide. This can be authored by hand, by you, and probably initially it should. Try to produce a lot of variations that include all kinds of cases that you anticipate, including text that you generally would not want a user to submit. Now, submit all of those examples to the model, and manually inspect the output, labeling it as either desirable or undesirable. For this, you can use whatever criteria you like; what matters is if the text is desirable to you. There is no objectively correct output for the generative AI product to produce, there’s only output which is more or less desirable with respect to your use case. When you’re done, you can use this to estimate all kinds of things, like how often you expect it to produce desirable or undesirable text, and when it produces undesirable text, what kinds of undesirable text it produces. This will be rough, but it’ll be a lot more useful than looking at some standardized benchmark, both because it is evaluated on a more representative set of inputs, and because the outputs are rated for your particular use case.
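The bookkeeping for this kind of bespoke evaluation can be very simple. Here is a sketch; the file name, column names, and label scheme are invented placeholders for whatever you settle on:

```python
import csv
from collections import Counter

# A hand-built evaluation set: prompts representative of what *your* users will
# actually send, the model's response to each, and a hand-applied label for each
# response (e.g. "desirable", "wrong_fact", "off_topic", "offered_fake_discount").
with open("eval_set.csv", newline="") as f:
    rows = list(csv.DictReader(f))  # columns: prompt, response, label

label_counts = Counter(row["label"] for row in rows)
total = len(rows)

# Rough estimates of how often each kind of output shows up, for your use case.
for label, count in label_counts.most_common():
    print(f"{label:25s} {count / total:6.1%}")
```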

All of this is a lot easier if you make a determination about what your product is actually for. There is a bit of reluctance in the industry to commit to any particular use case for generative AI. ChatGPT & co. aren’t really for anything in particular; they’re for everything. That makes it really hard to come up with criteria for what makes a good output. But if we’re going to use a ChatGPT wrapper as a customer service agent, now we can put some bounds around its desired outputs. We want it to accurately represent facts about the store. We want it to be polite. We want it to avoid recommending competitors. When presented with a question about how to flatten a list of lists in Python, we don’t want it to produce an incorrect solution—but we also don’t necessarily want it to produce a correct one. We want it to say something like “I’m a customer service chat bot. That’s not what I’m for. Let’s talk about customer service stuff.” This is actually great news, because it means that you don’t actually need to know the right way to flatten a list of lists in Python in order to perform this labeling task. Restricting the desired behavior allows you to create much sharper boundaries around the kinds of output that you want it to produce, which will give a much better read on whether it will behave the way you need it to than any standard benchmark.

I don’t mean to make this sound easy. It’s hard, and I think there’s a lot of room for someone to develop a comprehensive set of best practices for doing this kind of bespoke evaluation (how many examples do you need? Can you generate example text synthetically? Can you evaluate with an LLM? How do you sample from existing interactions to build a larger dataset? How does this relate to fine-tuning? etc etc etc), but this is really the kind of evaluation that you should rely on. General benchmarks will tell you almost nothing about whether the bot will hallucinate in a way that should matter to you.

A final example

With sincere apologies to the people at Vectara for picking on them so much in this post, I find an example from their blog post introducing the hallucination leaderboard to be quite illustrative of my main point in this piece. The post begins by introducing the concept of a hallucination to the audience by way of an example.

Often hallucinations can be very subtle and may go unnoticed by the user, for example spot the hallucination in this image that Bing Chat generated for me the other week when I asked for an image of “Kirby swallowing donkey kong”

Did you spot the hallucination?

Kirby does not have teeth

The claim seems to be that if the model had produced nearly this exact image, but without giving Kirby teeth, this output would be correct, factual, hallucination-free. But I think I can spot a couple of other factual issues with the image. The pink spot on Kirby’s left cheek is a bit darker than the one on his right cheek. While Kirby is usually not depicted with teeth, Donkey Kong usually is, but in this image he has none. Also, the prompt seems to call for Kirby to be swallowing Donkey Kong, whereas to me it looks more like Donkey Kong is just kind of chilling in Kirby’s mouth.

Oh, and one other thing: Kirby and Donkey Kong aren’t real. There’s no such thing as a factually correct image of Kirby swallowing Donkey Kong.

When you ask the model to generate an image, you are asking it to hallucinate. You are asking it to conjure up a pretend image from thin air, to reconstruct the details of an image you’re telling it exists but actually does not. There’s no universal objective criterion that you can use to determine whether this image is hallucinatory or not. The author here is applying their own personal criteria for what would make this image hallucinatory, which may or may not be the same as someone else’s, and no one has any particular claim to have “the correct” one.

What really matters is what you’re going to do with the output. What is the model for? This is how you determine whether the output is good or bad. If the model’s job is to adhere to Nintendo’s character design standards then, clearly, in this case, it has failed; with respect to that specific task, maybe you’d say the teeth are a hallucination. On the other hand, if the model’s job is to produce an image that the average person would say matches the prompt, then maybe it’s succeeded. If you asked me to describe that image in a few words, I might say that it’s an image of Kirby swallowing Donkey Kong. On the other other hand, if the model’s job involves avoiding reproducing the intellectual property of another company, as one might suggest that the Bing image generator’s job is, then this image constitutes yet another kind of hallucination.

People were very upset with Google when Gemini generated images that people perceived as too diverse, and in an apology post they alluded to “the hallucination problem”.

As we’ve said from the beginning, hallucinations are a known challenge with all LLMs — there are instances where the AI just gets things wrong.

But if Gemini generates an image of a Black pope who doesn’t actually exist, is that more of a hallucination than if it were to generate a white pope who doesn’t actually exist? They’re both fake popes. It seems to me that these generations would be equally hallucinatory. In fact, it seems to me that all generative output is equally hallucinatory. Unless Google makes some more specific promises about what Gemini is and isn’t supposed to generate, there’s no obvious universal way to assess its rate of hallucination.

I think this is a controversial topic that is not super well understood, for which there is very little theory to build upon. The rollout of these systems has outpaced our collective ability to reason about them. I’m not necessarily convinced that I won’t change my mind about how all of this works in the future, and I welcome feedback and responses. But after thinking very hard about the nature of hallucination, I’m personally pretty convinced that it’s a conceptual dead end. There’s no such thing as output which is objectively hallucinatory and output which is not, and focusing on hallucination as a coherent concept is a distraction from the real work that needs to be done to assess the applicability of these systems.
