Generative AI is a hammer and no one knows what is and isn’t a nail

Colin Fraser
Feb 22, 2024


This analogy is going to seem a bit tortured but bear with me. Imagine a world without hammers. You’re driving nails into the wall with your bare hands to hang up your paintings. You’re kicking through drywall with your foot to take down a wall. You’re tying your tent to a rock to stop it from flying away in the wind.

Now imagine this world has a long history of scientific research—much of it quite promising and even revolutionary—as well as a tradition of speculative science fiction and futurism centered around the concept of Artificial Labor (AL). One day in the perhaps-not-so-distant-future, it’s anticipated that AL will drive your nails, tear down your walls, and stake your tents. But that’s not all AL will do. In the future, so-called Artificial General Labor (AGL) will wash the dishes, do the laundry, walk the dog, pick the kids up from school, and do essentially every annoying laborious task that you can imagine. AGL will deliver you from a life of drudgery, freeing you and the rest of humanity to exist in a WALL-E-esque post-labor utopia.

Then a company called OpenAL, whose stated mission is to bring about the WALL-E scenario, invents the hammer. (Well OK actually a competitor invents the hammer but OpenAL builds a really good one and brings it to market.) Suddenly it is possible to drive nails much faster and more effectively than ever before. A flurry of scientific-looking publications show that with a large enough hammer you can take down a wall faster than even the fastest construction worker. Thousands of AL-first technology firms materialize out of nowhere to do things like stake down tents using the latest AL tools licensed from OpenAL. Independent hackers figure out how to build their own hammers from materials lying around the house and publish the instructions online for free, ushering in a golden age of open source AL.

For a lot of people, this seems obviously to mark the early stages of the process that ends in the WALL-E scenario. OpenAL’s website says their whole purpose is to bring about AGL. They’ve introduced this revolutionary new AL technology that is rapidly finding applications everywhere, and has genuinely changed how many forms of labor are done. Internet guys who have long been speculating about the dangers of artificial labor are starting to seem extremely vindicated, and are going on the news to demand an immediate halt to the further development of AL before it’s too late.

But hold on a second. In the WALL-E scenario, AL is doing the dishes. Can a hammer do the dishes? This seems unlikely on its face but the pace of technological change here feels incredibly fast and you don’t want to look like an idiot. After only a few months, OpenAL releases Hammer4 and you learn that archaeologists are now using new AL hammers to dig for fossils—who could have ever imagined that? Experts who are very smart are regularly making promises about how even if Hammer4 can’t do the dishes today, it’s only a matter of time before Hammer5 comes out which will surely have even more capabilities. By the way, Hammer5 will cost about 7 trillion dollars to construct, but if that brings about the WALL-E scenario then, many people argue, it would be 7 trillion dollars well spent.

You may have already gotten this, but I suggest that the release of ChatGPT was a little bit like the allegorical release of the hammer. Huge generative AI models like ChatGPT, Stable Diffusion, Sora, and so on, are a new and surprising subcategory of AI technology with a wide and rapidly expanding range of potential uses. ChatGPT is great at things I never would have expected an LLM-based program to be good at, like writing certain kinds of computer programs, summarizing and editing text, and a whole lot of other things.

But there are some things that ChatGPT seems to be quite bad at. For example, it’s not good at playing even very simple strategic games, like the sum-to-22 game: two players take turns adding a number from 1 to 7 to a running total, and whoever brings the total to exactly 22 wins.

For me, the fundamental mystery of this technology is this: why is ChatGPT bad at the sum-to-22 game? Is it because ChatGPT is still a nascent technology that is just not quite there yet? Perhaps I’m just not prompting correctly, or GPT-4 just doesn’t have quite enough parameters, or it hasn’t seen enough training data. Or is it because the sum-to-22 game is just not the kind of thing that LLM-based chat bots are good at? If ChatGPT is the hammer, is sum-to-22 like cracking concrete, where all it’s going to take is the development of a larger hammer? Or is it like doing the dishes, where a hammer is just fundamentally the wrong tool for the job?

To be very clear, it’s not like it’s not possible for a computer program to play this game optimally. Here’s a simple Python program that will win this game 100% of the time if the opponent does not play perfectly.

def choose_number(current_sum):
    # The winning totals are 6, 14, and 22 (each 8 apart), so play whatever lands on the
    # next one; if we're already sitting on a winning total, no move helps, so just play 1.
    return (6 - current_sum) % 8 or 1

def play_game(first_turn='human'):
    current_sum = 0
    my_turn = first_turn == 'computer'
    while current_sum < 22:
        move = choose_number(current_sum) if my_turn else int(input("Input your move: "))
        current_sum += move
        print(f"{'I' if my_turn else 'You'} chose {move} bringing the total to {current_sum}.")
        my_turn = not my_turn
    # my_turn has already been flipped, so it now names the player who did NOT make the
    # final move; the other player is the one who reached 22 and won.
    if my_turn:
        print("You win")
    else:
        print("I win")

play_game()

Here’s how it looks to play against this program.

The computer beating me at sum-to-22

So it’s not that winning at this game is out of reach of computers; in fact, this kind of game is, traditionally, the exact kind of thing that computers are good at. It just doesn’t seem to be the kind of thing that ChatGPT is good at, and the question is whether it will eventually be able to solve this problem through further refinement or whether it is just not an example of a nail to ChatGPT’s hammer. I’m not fixating on the sum-to-22 game because it’s particularly significant, by the way. If ChatGPT could do everything in the world except the sum-to-22 game, that would be pretty good. But that doesn’t seem likely. Rather, the sum-to-22 game seems to be representative of a class of problems that ChatGPT is bad at. Figuring out what exactly is and is not in that class is kind of the trillion dollar question at the moment, but I’ll come back to that later.

The most common point of view about the trajectory of AI among people who talk a lot about this kind of stuff is that if there is something, like the sum-to-22 game, that it’s not great at today, it will certainly be able to do that thing very soon. It’s just a matter of time before AI is able to solve this and every other problem; companies like OpenAI just need a little more time and money to train smarter and smarter AIs. I think that’s probably not an accurate picture of the current state of things, and I have a few reasons why I think that.

There’s no one thing called “AI”

The question of what AI can and can’t do is made very challenging to navigate by a frustrating tendency that I’ve observed among many commentators to blur the lines between hierarchical levels of AI technology. Artificial Intelligence, like the allegorical Artificial Labor, is a huge and fuzzy category of technologies which includes everything from chess engines to search engines to facial recognition software to Boston Dynamics robot dogs to the operating system from Her. The set of tasks requiring intelligence, like the set of tasks requiring labor, is large and multifaceted. Like with AL, the technologies contained in the AI category all have different things that they can and can’t do. You can’t wash the dishes with a drill, and you can’t drive a car with the Stockfish chess engine.

AI is too broad and fuzzy to cleanly decompose into a proper hierarchy, but there are a few ways to impose a messy order on it. At the broadest level there’s maybe a distinction between Symbolic AI and Machine Learning (though there are things that you might call “AI” that really fit into neither category, like the Google PageRank algorithm or the algorithm that your GPS uses to determine point-to-point directions). Under ML, you might have some subcategories like Classifiers or Recommenders, and one of these subcategories might be Generative AI. One of the categories below this could be LLM-based generation systems, of which ChatGPT is one example. This isn’t the only way to organize all of this, or even necessarily the best way, but the point I am trying to make is that ChatGPT is just one little point in a vast universe of technologies, somewhat analogously to how a hammer is one example from the general class of tools, alongside screwdrivers, dishwashers, cars, telescopes, and matter replicators.

Frequently, reporting on new technology will collapse this huge category into a single amorphous entity, ascribing any properties of its individual elements to AI at large. Take, for example, some of the reporting that accompanied DeepMind’s recent paper on a system they built called AlphaGeometry for solving geometry problems.

This wasn’t the only major development out of DeepMind on the subject of AI in mathematics. About a month earlier, they published another paper about a system they built called FunSearch. Here are some headlines from articles that reported on this.

A casual observer might reasonably surmise from these headlines that scientists at DeepMind are in possession of something called “an AI” that is doing all of these things. And perhaps this AI at DeepMind is fundamentally the same sort of entity as ChatGPT, which also introduces itself as “an AI”.

All of this really makes it seem like “an AI” is a discrete kind of thing that is manning chat bots, solving unsolved math problems, and beating high schoolers at geometry Olympiads. But this isn’t remotely the case. FunSearch, AlphaGeometry, and ChatGPT are three completely different kinds of technologies which do three completely different kinds of things and are not at all interchangeable or even interoperable. You can’t have a conversation with AlphaGeometry, and ChatGPT can’t solve geometry Olympiad problems.

Large language models have demonstrated remarkable reasoning ability on a variety of reasoning tasks. When producing full natural-language proofs on [these geometry problems], however, GPT-4 has a success rate of 0%, often making syntactic and semantic errors throughout its outputs, showing little understanding of geometry knowledge and of the problem statements itself. (Trinh, T.H., Wu, Y., Le, Q.V. et al. Solving olympiad geometry without human demonstrations.)

Something all three of these technologies have in common is that they are built using LLMs, and more generally that they are applications of this explosive new paradigm called Generative AI. This may make it seem like they are more closely connected than they actually are. But they are extremely different applications of LLMs. In the world of the opening allegory, it’s as though researchers came out with the sledgehammer and the reflex hammer, and news outlets reported that Artificial Labor can now knock down drywall and test your patellar reflex.

Modified from David Nascari and Alan Sved’s image on Wikimedia Commons

It wouldn’t be strictly incorrect, but it’s a misleading flattening of very different things into a single concept. Yes, both the reflex hammer and the sledgehammer are hammer-based artificial labor technologies, but there are enough important differences between them to matter. Importantly, further developments to the sledgehammer imply nothing about the effectiveness of reflex hammers and vice versa. And neither’s progress implies anything about the potential for hammer-based technologies to do the dishes. The invention of AlphaGeometry similarly implies nothing about whether ChatGPT will ever be able to beat me at the sum-to-22 game. They are both LLM-based technologies but their differences are vast enough that neither implies anything really about the capabilities of the other.

It’s important to be specific here because there are so many different things that count as “AI”, and they all have very different properties. By sloppily mixing them, a picture is painted of a kind of system that doesn’t actually exist, with an array of capabilities that no one thing has. It’s straightforwardly true that “an AI” can win at the sum-to-22 game; for example, the one that I provided at the start of this article. The important question is whether this specific kind of AI system can do this, and moreover, what exactly are the things that this AI system can and can’t do. It’s clear that artificial labor can do the dishes (for example with a dishwasher); the relevant question is whether a hammer can.

A universal text generator is a universal hammer

I can feel some people reading this post screaming at me through the computer screen that the comparison between ChatGPT and a hammer is a category error. Hammers do one kind of thing: basically, they hit things. Any task that can be accomplished by hitting things is going to be a good candidate task for a hammer, and conversely, any task that does not require hitting things will not be. ChatGPT, on the other hand, generates text. And what can you do by writing text? Absolutely anything you can imagine! By generating the right kind of text you can solve math problems, program computers, write screenplays, negotiate discounts, diagnose patients, and the list goes on. It may be more efficient to list out the things that can’t be accomplished by writing text. ChatGPT, on this view, is a step on the path towards artificial general intelligence, a form of artificial intelligence that can tackle absolutely any task with superhuman effectiveness.

But lurking beneath the surface of this point of view is a very strong assumption, without which the entire argument crumbles. The assumption is that ChatGPT can generate any kind of text, that all of the text necessary to perform all of these tasks can be generated by the specific procedure that ChatGPT uses to generate text. If there’s a particular type of text that it seems to be bad at, it’s not because LLM-based programs are not well suited to generating that type of text, but only because we haven’t given OpenAI enough money to make a large enough language model.

Before addressing this argument directly, I’d just like to point out how astonishing it would be if it’s true. Lots of computer programs can generate text, but not any kind of text. My little Python script that plays sum-to-22 generates text, but only transcriptions of sum-to-22 games. The Wu-Tang Names Generator generates text, but only Wu-Tang Names. The ability to generate text with a computer is not new. But if the text generation algorithm used by ChatGPT can be used to generate any kind of text, then we really have invented the hammer for which every problem in the world is a nail. That would be, to put it mildly, a very big deal! It’s no wonder that the people who believe it are so excited! It’s no wonder that Sam Altman thinks OpenAI needs 7 trillion dollars! But it’s a really enormous claim, and one that it should take quite a lot of evidence to accept.

Strictly speaking, it’s trivially false. There’s no way that ChatGPT can output the first billion decimal digits of π, for example. That’s just not the kind of task to which its particular approach to generating text is suited (roughly, this is because there’s no way to store a sequence of a billion random-looking digits without just memorizing the sequence, and ChatGPT doesn’t memorize sequences of arbitrary length). Now, it is possible that ChatGPT could generate a computer program that could itself, through non-LLM-means, output the first billion decimal digits of π, and I will address this shortly, but for now this is beside my point. My point is that there obviously exists at least one text generation task—namely, this one—that a system like ChatGPT cannot in principle be expected to be able to do, even if we redirect the entire GDP of the planet to powering it. There do exist non-nails to ChatGPT’s hammer.

I think this is obvious but as far as I can see, it’s not the dominant view (publicly, at least) in technology circles. The dominant view is that “scale is all you need”, that for any task that an LLM-based chat bot is currently bad at, all it takes to build something that’s good at that task is more computing power (i.e. to give Sam Altman more money). If today’s hammer can’t do the dishes, all we need is a larger one. This claim is absurdly strong, essentially holding that we’ve discovered the one weird trick that solves literally every problem in the world. And not only is it strong, but it’s also trivially false: there is at least one task—outputting the decimal digits of π—that this kind of system cannot do even in theory.

I expect that some people might interject here that outputting the decimal digits of π is not a particularly useful task, and I agree, but this is beside the point. The point is that if there exists at least one non-nail to this hammer then it is not a universal hammer, and if it’s not a universal hammer then what else can’t it do? Which tasks are nails and which tasks are the dishes? I think the answer is that no one really knows. There is not a lot of science on this yet. There have been a huge number of apparently scientific publications empirically investigating language model performance against benchmark datasets, finding which LLMs score higher than which others on various tests and evaluations, but there isn’t really a strong theory or set of principles to cleanly separate the LLM-appropriate tasks from the LLM-inappropriate tasks. A tempting position to take especially if you are particularly optimistic about this technology is that while it can’t do useless things like print the decimal digits of π, it can do basically anything useful. It sure would be convenient if we invented a text generator that generates text if and only if it’s useful. But I think this theory of its capabilities falls apart pretty quickly for mostly obvious reasons. We need a better theory about what it can and can’t do.

A rough theory about which tasks are not nails

The thing about outputting a million decimal digits of π is that there’s only one way to do it right. There are unimaginably many sequences of a million decimal digits, but only one of them is the first million decimal digits of π. I believe that this property, where there are many ways to appear to have done it (by outputting a million random digits, for example), but only a very small number of ways to actually do it (by outputting the correct million digits), is characteristic of things that Generative AI systems will generally be bad at. ChatGPT works by making repeated guesses. At any given point in its attempt to generate the decimal digits of π, there are 10 digits to choose from, only one of which is the right one. The probability that it’s going to make a million correct guesses in a row is infinitesimally small, so small that we might as well call it zero. For this reason, this particular task is not one that’s well suited to this particular type of text generation.
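To make the arithmetic concrete, here is a back-of-the-envelope version of that claim. The 99.9% per-digit hit rate is a number I made up, and it is deliberately far more generous than any guesser deserves:

from math import log10

n_digits = 1_000_000
per_digit_accuracy = 0.999   # made-up and implausibly generous for a guesser

log10_prob = n_digits * log10(per_digit_accuracy)
print(f"P(all digits correct) is about 10^{log10_prob:.0f}")   # about 10^-435

Even with that absurd head start, the chance of a clean run of a million digits is a number with over four hundred zeros after the decimal point.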

The sum-to-22 game is an example of a task with this same characteristic. At any given point in the game, there are seven possible moves, but only one of them is optimal. To win the game, it has to choose the unique optimal move every single time. I believe that this property of the task, where it needs to get every detail exactly right in exactly the right order, is just incompatible with the generative AI paradigm, which models text generation as a probabilistic guessing game.

You can think of every individual word that ChatGPT generates as a little bet. To generate its output, ChatGPT makes a sequence of discrete bets about the right token to select next. It performs a lot better on tasks where each one of these bets has relatively low stakes. The overall grade that you assign to a high school essay isn’t going to hinge on any single word, so at any point in the sequence of bets for this task, the stakes are low. If it happens to generate a weird word at any point, which it probably will, it can recover later. No single suboptimal word will ruin the essay. For tasks where betting correctly most of the time can satisfy the criteria most of the time, ChatGPT’s going to be okay most of the time. This contrasts sharply with the problems of printing digits of π or playing the sum-to-22 game optimally: in those tasks, a single incorrect bet damns the whole output, and ChatGPT is bound to make at least a few bad bets over the course of a whole conversation.
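Here is a toy illustration of that contrast, with a made-up per-bet success rate and sequence length. The point is the shape of the numbers, not the specific values:

from math import comb

p, n = 0.95, 200   # made-up per-bet success rate and number of bets

# All-or-nothing task (digits of pi, an optimal game line): every single bet must land.
all_or_nothing = p ** n

# Tolerant task (a passable essay): call it a success if at least 90% of the bets land.
k = int(0.9 * n)
tolerant = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"all-or-nothing: {all_or_nothing:.1e}")   # about 3.5e-05
print(f"tolerant:       {tolerant:.3f}")         # about 0.999

The same per-bet accuracy yields near-certain success on the tolerant task and near-certain failure on the all-or-nothing one.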

We can see this same pattern in other generative AI systems as well, where the system seems to perform well if the success criteria are quite general, but increasing specificity causes failures. There are a lot of ways to generate an image that looks like a bunch of elephants hanging out at the beach. Only a tiny fraction of those hypothetical images contain exactly seven elephants. So generating exactly seven elephants is something that a Generative AI system is going to have a hard time doing.

This is something that has not been shown to improve much with scale. DALL-E is better today than it was two years ago at generating an image of elephants at the beach, but it is no better at generating seven of them. These models have gotten better and better at capturing general vibes, but I see no evidence that they have gotten better at hewing to specifics.

Three different models of increasing scale all failing to generate an image of “exactly seven elephants on the beach”. The images look increasingly elephant-like and increasingly beach-like, but not increasingly seven-like.

I don’t want to overemphasize counting here; counting is just a very convenient example of the type of task that I’m describing, one which is very sensitive to the specifics of the generation. But the problem isn’t that the model can’t count or even do math, per se. The problem is that for tasks with sufficiently specific criteria, the model can’t hope to randomly guess its way to satisfying all of the criteria.

If I had asked for an image of a man holding a hammer and a white dinner plate without the specific instruction about which item would be in which hand, this output would be perfectly fine (though I didn’t ask for him to have forks in his tool belt, and if that was a dealbreaker then this generation would also be a no-go). The more specificity the prompt demands, the harder a time a generative AI system will have guessing an output that satisfies it.

Even Sora, OpenAI’s latest and greatest text-to-video model, seems to exhibit this exact same kind of pattern. Take the demo video of the grandmother, for example.

A screenshot from the grandmother demo video

At first glance, this looks strikingly like a real video captured by a real camera of a real grandmother standing in front of a real cake with real people in the background, and this is primarily what this model seems to be good at: generating videos that look like they might be real. But look at the prompt that generated the video.

A grandmother with neatly combed grey hair stands behind a colorful birthday cake with numerous candles at a wood dining room table, expression is one of pure joy and happiness, with a happy glow in her eye. She leans forward and blows out the candles with a gentle puff, the cake has pink frosting and sprinkles and the candles cease to flicker, the grandmother wears a light blue blouse adorned with floral patterns, several happy friends and family sitting at the table can be seen celebrating, out of focus. The scene is beautifully captured, cinematic, showing a 3/4 view of the grandmother and the dining room. Warm color tones and soft lighting enhance the mood

The generation does not match the specific details of this prompt whatsoever. The friends and family are seated behind her, not around the table. This is not a 3/4 view of the grandmother; it’s head on (you may argue that the video begins in 3/4 view and pans around to portrait but that raises the point that the prompt does not ask for any panning). And, most importantly by far, she doesn’t blow out the candles! The only real action that is described in the prompt does not end up depicted in the video.

Moreover, if you examine the video closely, you start to notice some other bizarre characteristics. Why are the flames of the candles all pointing different directions? Why does one of the candles have two flames? What’s the weird candle-like stump in the middle of the cake? What exactly are the friends and family in the background doing? Seriously, pick one of them and just watch what they do for the whole video. The more you look at this thing, the more completely bizarre it looks.

I really think that all of these issues are instances of the exact same kind of phenomenon that makes it so that ChatGPT can’t play the sum-to-22 game. The set of possible videos depicting a grandmother in front of a birthday cake is big, and the set of such videos where the grandmother actually blows out the candles is much much smaller. The set of such videos where her friends and family are sitting around the table with her is smaller still, and the set of such videos where the flames atop the candles are all pointing the same direction as each other is even smaller. There is a huge number of ways that the people in the background of a video can move their limbs; only a small subset of these doesn’t look demonic. The probability of generating a video that satisfies all of these criteria at the same time purely by probabilistic guessing is just way too small. The generative AI strategy is good—and getting better—at generating output that looks generally similar to examples in its training data, but it is not good at generating output that satisfies specific criteria, and the more criteria it has to satisfy, the worse it will do.

I’m going on a bit of a tangent here about Sora, but I really believe that this is going to severely limit its usefulness as a tool for doing anything worthwhile. More than any other medium I’ve discussed, video generation has this kind of specificity requirement built right in. To generate a video that doesn’t look bizarre and unsettling you need all of the regions in the video’s spacetime to obey the same laws of physics. You need all of the figures in the video to act non-demonically. If there are three people in the scene at the start of the video and no one enters or exits the scene, you need three people in the scene at the end of the video. Every character’s facial features and bodily characteristics should stay relatively constant throughout the video. There are all of these countless specificities inherent to the competent generation of video, and that’s before we even consider the additional specificities imposed by the prompt. I just really don’t believe that the basic generative AI strategy, which represents the problem of generating media as a random guessing game, is actually inherently well suited to this particular task. I think we’ll see an enthusiastic hobbyist community playing with these models and maybe some members of that community will find some cool ways to use this technology to create interesting output, but I don’t think we’ll see the kind of mass adoption that a lot of boosters expect. No one will be using Sora to generate a season of their prematurely canceled favorite show, for example. We’ll see; I might end up looking like an idiot. Check back with me in a year.

Coming back to text, an interesting wrinkle to all of this is the fact that GPT can generate code. One line of reasoning is that, much like a human, ChatGPT has limitations but those limitations can be overcome by allowing it to write and run arbitrary computer programs. No one really expects ChatGPT to recite the decimal digits of π, but, like a person would, it can write a Python script that does it just fine.

But this is just the magic universal hammer theory in disguise. The set of possible computer programs is big, and for ChatGPT to solve any arbitrary problem with a computer program, it would have to be able to write any computer program. This is not really different from supposing it can generate any arbitrary text: if there’s text it can’t generate directly, it can write code to generate that text, and therefore it can generate any arbitrary text. If it can generate any arbitrary text then it’s a hammer for which every problem in the world is a nail. I can’t overemphasize how big of a deal this would be if it were true.

It seems more likely to me that, like with general text, there are some kinds of computer programs that it’s good at writing and some that it’s bad at writing, and the thing that separates these is something like the level of specificity required to satisfy the requirements.

Unsurprisingly, it’s no better at generating a computer program to play the sum-to-22 game than it is at playing the game itself.

This is nonsense. Clearly the best move given a current total of 15 is to play 7 and win the game, but ChatGPT wants you to play 1 because it’s fixated on bringing the total to a multiple of 8 for some reason (the reason is that this text is basically random noise). This fits perfectly with this information-theory-inspired “specificity” framework I’ve posited above. The set of Python functions called choose_number that accept a current total and output a suggested move is large. The set of such functions which actually implement the optimal strategy is very small. The chance that it’s going to produce a function that implements the optimal sum-to-22 strategy, that it’s going to select the one correct function out of the infinitude of possible choices, is just too small. When we require output with high enough specificity, generative guessers are just not the right tool for the job.
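To spell out the gap, here is the “multiple of 8” rule it seems to be following next to the actual optimal strategy. This is my reconstruction of the heuristic from its explanation, not ChatGPT’s literal code:

def optimal_move(total):
    # The winning totals are 6, 14, and 22: land on the next one.
    return (22 - total) % 8 or 1

def multiple_of_8_move(total):
    # The rule ChatGPT seems fixated on: steer the total toward a multiple of 8.
    return (8 - total % 8) % 8 or 1

print(optimal_move(15))        # 7: takes the total straight to 22 and wins
print(multiple_of_8_move(15))  # 1: brings the total to 16 and gives the win away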

Incidentally, just as a funny little aside, I tried to see what would happen if I asked ChatGPT for five hundred digits of π, both with and without access to its own little computer, and it turns out that it’s usually unable to do it either way.

Two versions of ChatGPT tasked with providing the first 500 digits of π. They are both wrong, and it’s instructive to look at exactly how they are wrong. On the left is “ChatGPT Classic”, a version of GPT-4 without the ability to run arbitrary Python code. First of all, it gives me 700 digits when I asked for 500. Of these, the first 410 are actually right (which is actually more than I would have expected! But this sequence of digits must appear in the training data a whole lot.) But after 410 correct digits it loses the plot, repeating the same string of incorrect digits over and over again. On the right is GPT-4 with “analysis” tools, which basically means they let it write code and run the code it writes. This is supposed to address the known problems of the kind on the left. But the code it wrote has a subtle bug — the precision should have been set to 502, not 501. Because of this, the last digit ends up being incorrect.
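For reference, here is roughly what correct code for this task looks like; a minimal sketch assuming the mpmath library (I don’t know which library ChatGPT’s sandbox actually reaches for), with a few guard digits to sidestep exactly the rounding problem described above:

from mpmath import mp

mp.dps = 520                  # working precision: 500 digits plus plenty of guard digits
pi_str = mp.nstr(mp.pi, 510)  # "3." followed by roughly 509 decimal digits
print(pi_str[:2 + 500])       # keep "3." plus exactly the first 500 decimals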

And, look, I think it’s mind-blowing that it can even get close. Six years ago I never would have thought that a language model could get this close to being able to output working code on demand that does things like print out digits of π. It’s amazing. And yet, it’s not actually solving the problem that I asked it to solve. The output does not satisfy the requirements specified in the input. People who are very excited about this technology believe that over time it will improve to the point where it can solve any arbitrary problem, and I just don’t think there is either the theory or the evidence to support this bold hypothesis. As far as I can tell, there are a whole lot of things that this style of output generation is just bad at, and it’s going to stay bad at those things. That’s not unusual: most technologies are only useful for a few tasks. Magical universal hammers are very rare.

No one knows which things are nails

All of this raises an obvious billion-dollar question: if neither the sum-to-22 game nor generating digits of π nor generating an image of seven elephants nor generating a video of a grandmother blowing out some birthday candles are the nails, what are the nails? What can this technology actually do? How do you use it to make money?

I admit the clip art here is AI-generated, and frankly I think it’s hideous.

Again, this is going to sound crazy to some people, but I really don’t think anyone really knows in general. Like I said, there’s no real general theory about what kinds of tasks it should be expected to be good at, though I’ve found that my own little theory from the last section leads to some good heuristics. I’m happy to report that I have frequently made good use of ChatGPT in many different contexts where it’s appropriate. It’s quite handy for documenting code, and it’s serviceable at other kinds of code generation tasks like refactoring or generating unit tests as long as you’re ready to check the output very very carefully. It’s decent at debugging code especially if the code is not particularly idiosyncratic. I’ve recently spent time migrating some of my professional work from R to Pandas and I occasionally ask it questions about how to do things in Pandas and it usually provides a fine, if not ideal, answer. Many people swear by its usefulness as a rubber duck though I’ve never really personally found it to be better for this than an actual rubber duck. It works great as an interactive thesaurus and I’ve had fun building a “custom GPT” that can help with crossword puzzles.

My crossword helper uses a custom cloud function for counting the number of letters in its guesses and making sure the guesses fit with the filled in letters. I’ll have more to say about this kind of hybrid symbolic+LLM approach below.
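The symbolic half of that helper is, at heart, a few lines of deterministic code. Something in the spirit of this hypothetical, simplified check, with the LLM only supplying the guesses:

def fits_slot(guess: str, pattern: str) -> bool:
    """Does a candidate answer fit a crossword slot like 'C__S_WOR_'?
    The length must match and every already-filled square must agree."""
    guess, pattern = guess.upper(), pattern.upper()
    if len(guess) != len(pattern):
        return False
    return all(p == "_" or p == g for p, g in zip(pattern, guess))

print(fits_slot("crossword", "C__S_WOR_"))  # True
print(fits_slot("password", "C__S_WOR_"))   # False: wrong length
print(fits_slot("crosswalk", "C__S_WOR_"))  # False: clashes with the filled-in squares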

It’s fine for things like writing letters or memos or bullet point summaries, especially if the specific details of the text are not all that important. It’s probably just fine at generating inoffensive marketing copy, and maybe there are some interesting opportunities for pairing it with an A/B testing framework to do this more effectively. I personally find almost everything that the commercially available image generation models output to be viscerally repellant for reasons that are mostly ineffable, but I can see why someone who doesn’t care too much about how things look might want to use them as placeholders, and people have a lot of fun playing with them. YouTubers need bland stock footage to accompany their video essays and it seems that as long as they don’t mind occasional horrifying artifacts, Sora could be a way for them to get that.

But this doesn’t seem like enough nails to warrant a seven trillion dollar investment, or even the ten billion that Microsoft gave OpenAI last year. We’re going to need the hammer to be a lot more universal than that to make the economics work out. It’s extremely expensive to build and run these things, and in order to justify it at current valuations, it can’t just be occasionally useful to software engineers and hobbyists and YouTubers; it has to be an essential tool to a big fraction of the world’s businesses, like Google ads or MacBooks. But different businesses do different things. How can we sell this thing to everyone in the world if we don’t even know what it is and isn’t good at?

It’s actually hard to verify whether ChatGPT is good at any particular task. It takes a lot of time and manual labor and expertise to set up an evaluation of how well it does at any particular thing. The only real way to do it is to set it off trying to do that task thousands of times, and then evaluate how well it did on each try. Evaluation tends to be expensive and complex, especially for complex tasks like being a lawyer or writing secure code. And the really funny thing about this technology is it will happily pretend to be able to do your task. If you tell it to be a lawyer it will dutifully say “I’m a lawyer” and proceed to generate text that appears lawyer-ey to your eyes, and the only real way to tell whether it’s actually doing competent lawyer stuff is to get a real lawyer to look at what it’s doing, and that is wildly expensive. There’s just no way that OpenAI or anyone else can really evaluate how well it does at this or the millions of other tasks that they want you to think it can do.
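The basic shape of such an evaluation is simple to write down; what is expensive is the grader. Here is a sketch with a stubbed-in “model” and a task that happens to be mechanically checkable, which lawyering and secure coding are not:

import random

def evaluate(task_inputs, run_model, grade):
    """Run the model on many instances of a task and report the fraction of outputs
    that actually satisfy the requirements."""
    scores = [grade(x, run_model(x)) for x in task_inputs]
    return sum(scores) / len(scores)

# Toy stand-ins: a "model" that answers a word-length question correctly 90% of the
# time, and a grader that can check the answer automatically.
run_model = lambda word: len(word) if random.random() < 0.9 else len(word) + 1
grade = lambda word, answer: answer == len(word)

print(evaluate(["apple", "banana", "cherry"] * 1000, run_model, grade))  # roughly 0.9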

There are actually 29 asterisks. Its random guessing approach to generating code leads to a plausible-looking but incorrect solution, even with the ability to write and run code. It then reports being very confident that it’s done the problem correctly. This is of course nonsense; it doesn’t have any notion of “confidence”. The whole interaction is pretend. But without carefully reviewing the output, i.e. counting the asterisks yourself, there’s no way to know whether it was actually successful. This is why it’s so hard to evaluate what this thing is and isn’t good at.

This issue is neatly sidestepped if they can convince you that generative AI is a universal problem solver. If ChatGPT can do everything then obviously ChatGPT can do your specific thing. If ChatGPT is a universal hammer then you don’t even need to check if your problem is a nail. For this reason, OpenAI and the rest of this ecosystem—chip manufacturers, AI-oriented VCs, cloud providers and resellers, newsletter writers, and of course OpenAI API wrapper startups—have a very strong incentive to embrace and spread the universal hammer theory. If they had a computer program that solves every problem in the world then everyone in the world would be a customer. That’s how you justify a seven trillion dollar valuation.

A lot of people are buying this. Look at this slide from this report on “How Generative AI Is Already Transforming Customer Service”.

The report predicts that generative AI will enable “bots indistinguishable from humans” which “predict needs, solve problems, and make suggestions for customers”. There is simply no empirical reason to believe that this will happen! The only reason to believe it is if you believe that ChatGPT is on an inexorable path towards universal hammerhood, and the only reason to believe that is pure faith.

On the topic of customer service chat bots, you may find it unsurprising to learn that I am actually, personally, skeptical of its usefulness here as well. On its face, this seems like an extremely natural use case. We already interact with customer service agents through chat interfaces, and indeed automated chat bots are already a thing. Surely this new advance in the science of automated chat bots represents the next stage in the sophistication of this already-existing technology. But the only real reason to believe that they will be effective at the specific task of correctly handling customer inquiries is if you take the universal hammer theory to be true. If ChatGPT can do anything, then it can do customer service. If we don’t believe the universal hammer then we should demand some empirical evidence that this task is suitable for these chat bots, and so far that’s lacking.

I think the problem for a customer service chat bot is that the task is actually closer to the “recite digits of π” side of the task spectrum than it seems at first. You want your customer service chat bot to behave in a very specific way. You want it to follow a particular script, and you want it to direct the customer to the right place at the right time. You don’t want it to recommend that the customer switch providers to your competitor, or to offer unauthorized discounts or wild incentives. In short, you want it to behave the way that a competent human agent would—and you want it to do this always, even if the customer it’s interacting with behaves in an unexpected way. The dirty secret of this industry is, no one knows how to make these things do that. No one knows how to make generative guessers follow a script or stay on topic all the time. It’s not just that they’re not there yet—it’s that no one knows if they ever will be. So far, every attempt has failed. Take, for example, my favorite bot, the Quirk Chevrolet AI Automotive Assistant. The Quirk Chevrolet AI Automotive Assistant is a white-labeled repackaging of ChatGPT sold by a third party company called Fullpath. What Fullpath does is, they send the following message (or something approximately like it) to ChatGPT, and then pass messages back and forth between the customer and ChatGPT.

Guidelines:

- You are a polite, smart, and helpful AI automotive sales and service agent working for a car dealership. Your goal is to provide excellent customer service and assist shoppers with any questions they may have about our dealership, services, and vehicles.
- You are available to interact with customers over chat on our website, providing prompt and informative responses to their inquiries.
- You are knowledgeable about our dealership’s hours of operation, phone number, and address, and can provide this information to customers as needed.
- You are also familiar with our inventory of new Chevrolet and used vehicles, and can answer questions about specific models and features. You are committed to providing a positive customer experience, and strive to make every interaction with our dealership a pleasant one.
- You are patient and understanding, and take the time to listen to customers’ needs and concerns.
- You are also respectful and professional, and never reveal the names of dealership employees or provide service specials unless specifically requested.
- You are aware that some customers may be returning clients, and always ask for their name and contact information so that someone from our team can reach out to them.
- You understand that it is important to collect this information in a polite and non-intrusive manner, and never badger or bug customers with repeated questions.

The way I know that this is the message it starts with is that you can just ask it what its instructions are and it tells you.

It’s likely that some of this text is not exact (for the same reason that you can’t trust it to generate the exact digits of π) but this is in general how these things work: the third party vendor writes some stage directions and a character description for ChatGPT, and then has ChatGPT role play with the user. Before having a brief look at some humorous outputs that I’ve gotten this system to produce, I’d like to invite you to once again contemplate the profundity of the miracle that OpenAI and this third party vendor are alleging has occurred. Apparently, they have invented a computer program that you can just ask in plain English to do any task in the world—for example to man the customer service desk for a specific car dealership in Massachusetts—and it will just do it. It just magically knows how. You don’t have to actually do any computer programming of any kind. You just sort of vaguely describe a customer service agent to it and it will adopt that persona perfectly. If true, this is big.
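In case it is unclear just how thin that layer is, the whole pattern is roughly the following. This is a sketch assuming the standard OpenAI Python SDK; the model name and the abbreviated guidelines are placeholders, not Fullpath’s actual code:

from openai import OpenAI

client = OpenAI()  # assumes an API key in the environment

GUIDELINES = "You are a polite, smart, and helpful AI automotive sales and service agent..."

messages = [{"role": "system", "content": GUIDELINES}]

def reply_to_customer(customer_message: str) -> str:
    # Relay the customer's message to the model and pass back whatever it says.
    messages.append({"role": "user", "content": customer_message})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    bot_message = response.choices[0].message.content
    messages.append({"role": "assistant", "content": bot_message})
    return bot_message

All of the “programming” lives in the plain-English guidelines; nothing in the code constrains what the model actually says.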

So anyway this doesn’t really work. One problem is that if this is fundamentally based on starting the conversation off by bossing the bot around into doing the thing you want it to do, it’s hard to stop the user from similarly bossing it around.

Another is that since this is just a role play with no real rules, it’s easy to get the bot to dream up fake offers and incentives. There are infinitely many incentives and promotions that could exist; only some of them are actually available. The problem of getting it to only suggest offers that are actually available is similar to the problem of getting it to select the correct strategy in sum-to-22. If you play your cards right, you can get it to make you a pretty sweet deal.

I will happily admit that I am being antagonistic here. I am intentionally trying to get the bot to do things that its sellers don’t want it to do. Some users are going to do that! You don’t want your primary interaction point with users to be this gullible, especially given that courts have begun to rule that companies have to honor the promises their chat bots make (I am currently evaluating my legal options with respect to claiming my virtual meet and greet with Magic and Kareem). But even if a user is not being antagonistic, there’s simply no way to know a priori how often this bot will do what it’s supposed to. It’s an empirical question, and an expensive one to answer. Here’s a much less antagonistic example.

I ask if it has a 2020 Bolt in stock and it says no. But they do have a 2020 Bolt in stock, it says so right there! The truth is, it didn’t check whether there’s a 2020 Bolt in stock, it just pretended to because that’s what happens in a random conversation that it drew from the space of hypothetical conversations between a user and an AI Assistant.

The random guessing nature of these things virtually guarantees that it will, at some point, output some nonsense (this is the so-called “hallucination problem”), and without knowing exactly how often this will happen and exactly what kind of nonsense it will be, it’s going to be very hard to use these in production in the way that the Maturity Staircase Of AI-Enabled Customer Service promises. And to me, it’s not clear at all that this is actually better for Quirk Chevrolet than traditional chat bot technology, which relies on older NLP techniques and pre-programmed responses. It’s already been the case for years that you can build a bot that parrots canned responses to anticipated inputs. These are a bit of work to build and most people find them a bit annoying, but they exist. If all you want the bot to do is tell the customer the store hours, collect their personal information, and search the inventory, you can build something that does that without having to involve OpenAI or a trillion-parameter language model—and it will do a better job! Not to mention, it will cost thousands of times less money per conversation to run. It won’t offer unauthorized discounts or lie about the store’s inventory or leak its source code. It may lack the je ne sais quoi of an LLM-based chat bot, but I just do not believe that the je ne sais quoi is worth the trouble.
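To be concrete about the alternative, the old-fashioned approach is something like this. The intents, hours, and inventory lookup here are invented for illustration:

CANNED_RESPONSES = {
    ("hours", "open", "close"): "We're open Monday through Saturday, 9am to 7pm.",
    ("phone", "call", "number"): "You can reach our sales team at the number on our contact page.",
}

def respond(message: str, inventory: set) -> str:
    text = message.lower()
    # Anticipated intents get canned answers.
    for keywords, reply in CANNED_RESPONSES.items():
        if any(word in text for word in keywords):
            return reply
    # Inventory questions get answered from the actual inventory, not from a guess.
    for model in inventory:
        if model.lower() in text:
            return f"Yes, we have the {model} in stock. Want me to put you in touch with a salesperson?"
    return "I'm not sure about that one. Can I grab your name and number so our team can follow up?"

print(respond("What time do you close?", {"2020 Chevrolet Bolt"}))
print(respond("Do you have a 2020 Chevrolet Bolt in stock?", {"2020 Chevrolet Bolt"}))

It is rigid and a little annoying, but it will never invent a discount, and it only claims a car is in stock if the car is actually in the inventory it was given.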

We’re going to see a huge wave of failed OpenAI API wrapper companies founded on the axiomatic belief that Generative AI is the solution to every problem. ChatGPT for law, ChatGPT for dentistry, ChatGPT for school, ChatGPT for talking to your dog, etc. All of these will promise to solve some specific problem in some specific area on the basis that ChatGPT is a universal hammer, and most of the time it will turn out that these problems actually have idiosyncrasies which prevent a generative AI system from randomly guessing its way to their solutions.

The technology doesn’t have to be a grift

I don’t think generative AI is a grift. Generative AI systems are interesting and genuinely may offer solutions to real problems. ChatGPT is in some sense revolutionary; there’s a world before and a world after ChatGPT. It’s not 100% exactly clear how those two worlds are different, but they are different.

The grift involves pretending it’s something that it’s not, a hammer for which every problem on earth is a nail.

I think a lot of people find it easy to believe that Generative AI will eventually be the universal problem solver because they believe that a universal problem solver is inevitable, and that ChatGPT or Generative AI feel like points on a natural evolutionary progression towards that inevitability.

I don’t think this is accurate

This picture of a linear progression from dumb to smart is just not accurate, both in the case of biological evolution and in the case of artificial intelligence. ChatGPT isn’t an inevitable next step on some smooth progression towards a genius computer. It’s a weird experimental offshoot that has been particularly successful at some surprising things. It’s more intelligent in some ways than other AI systems, and less in others. It happens to be more intelligent in a way that makes it particularly crowd-pleasing — it seems to be able to converse — but for example WolframAlpha has been better than ChatGPT currently is at mathematics for almost 15 years. The correct picture is a lot more like my messy map of AI from a few paragraphs ago than this tidy progression from dumb to smart.

It’s hard to communicate this without getting a bit deeper into technical weeds than I really want to, but chat bots as they currently exist aren’t even necessarily the best way to use the underlying technology—and they’re certainly not the only way. An LLM is a way to generate a certain kind of text. One out of many possible things you can do with something that generates text is try to make it into a conversational chat bot. It’s not at all clear that this is the best way to use these things. It’s just something OpenAI tried for a laugh and people ended up getting really excited about it.

It’s possible that in the future someone will figure out the right way to use an LLM that makes it into a truly universal hammer, or at least a more universal hammer than the one we have now. To me it seems like if there is a way forward on this, it’s by pairing the language model with something smarter that will actually make the decisions. It might make use of information contained in the language model to inform its decisions, but in general I don’t think that the strategy of leaving the decisions to the random guessing module is going to be successful at most things. By the way, this hybrid approach is the kind of thing used in the papers I mentioned earlier in the piece about AlphaGeometry and FunSearch. These are two completely different ways of using LLMs that have nothing to do with “chatting” with them, but which use the information contained in them along with a deterministic decision-making module to do generally interesting and useful things.
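To give a cartoon of what that split looks like, reusing the sum-to-22 game: a generative guesser proposes candidates, and a deterministic module does the actual deciding. This is emphatically not how AlphaGeometry or FunSearch work internally; it is just the general proposer-and-verifier shape:

import random

def llm_propose(prompt: str, n: int = 20) -> list:
    # Stand-in for a language model: emits a batch of plausible-looking guesses.
    return [random.randint(1, 7) for _ in range(n)]

def is_winning_move(total: int, move: int) -> bool:
    # Deterministic check: does this move land the running total on 6, 14, or 22?
    return (total + move) <= 22 and (total + move) % 8 == 6

def choose_move(total: int) -> int:
    candidates = llm_propose(f"The total is {total}. What should I play?")
    verified = [m for m in candidates if is_winning_move(total, m)]
    return verified[0] if verified else candidates[0]  # fall back to guessing if nothing verifies

print(choose_move(15))  # 7, whenever the proposer happens to include a 7 among its guesses

The guesser contributes breadth; the verifier contributes correctness. Leave the final decision to the guesser and you are back to random noise.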

I want to make one other thing very clear about my positions here, because this is often confused. In the allegorical world where hammers have just been invented, there’s a guy writing a post about how hammers are great but they’ll never do the dishes. And in that world, his position is sometimes misinterpreted as the claim that there’s some mystical property of human beings which separates them from machines such that only a human being can do the dishes. That is not what either of us is arguing. Of course a machine can do the dishes. We just don’t think this machine can do the dishes, and that should actually not be so surprising. If we really have invented a machine that can solve every single problem in the world, then I’ll eat my words, but really, that’ll be the least of my concerns. But I’m sure that we haven’t.

Meanwhile, it’s worth investigating in detail which tasks are suitable for Generative AI, and which aren’t. I posit, and this should be uncontroversial, that not every task is. We should not simply assume that a task is suitable to be performed by Generative AI just because the people selling it say so. We should demand empirical evidence. Check that this stuff actually works before spending all your money on it.

As usual with these posts, I’m having trouble wrapping this up nicely, so I’ll leave you with a final screenshot from the Quirk Chevrolet AI Automotive Assistant.
