The Knowledge Dividend of LLMs: a pragmatic perspective
As I’m writing this, the sun hasn’t risen over the Denver skyline in earnest. There’s still pink in the sky over the Front Range, and most of the world is still blissfully asleep. And so far, a small, moderately fine-tuned Large Language Model (LLM) trained on $500 worth of free credits has explained to me just how bad the Broncos’ recent 20–70 embarrassment against the Miami Dolphins is (very), made some useful suggestions for a Caddoan language to learn if I wanted to help with language preservation (Pawnee) and created a fairly acceptable recipe to salvage whatever is left in my fridge (spicy tomato and cheese omelet with a chia side salad). Not too shabby for something that has absolutely no understanding of language preservation, omelets or American football (then again, neither do I, as far as the last one is concerned).
And therein lies one of the pervasive paradoxes of LLMs: they generate very confident, very credible and very often correct answers to questions on subjects they really don’t know all that much about. This is indeed a source of some philosophically rooted criticism of such models. The four brilliant authors of a remarkably perceptive 2021 paper described the then-nascent language models (GPT-3 was still state of the art back then) as “stochastic parrots,” and the phrase has become woefully overused among the critics of the LLM revolution. On one hand, this is ironic, because many of these critics themselves exhibit a psittacine lack of originality in regurgitating the same criticism. More importantly, however, it misses the ‘stochastic’ part.
What, exactly, is this stochasticity? Fundamentally, all LLMs currently in widespread use share one core functionality: given a sequence of tokens (which roughly correspond to words — let’s for now treat them as synonymous), they estimate a distribution of what tokens are most likely to come next, and draw a sample from the region of highest probability. Since they have some leeway as to the precise word they pick, the outcome has a degree of randomness. This compounds as the newly picked word becomes part of the context that conditions the distribution from which the next word will be picked, and so on, ad infinitum. At no point, of course, is there any insinuation that the model knows what any of those words ‘mean’ — it’s just really, really good at stringing words together. The stochasticity in this process is constrained by the conditional probability of each word given the previous words — probabilities that have been nailed down quite well during the training process and refined in RLHF (reinforcement learning from human feedback). The result is that even though LLMs don’t really know anything about languages, omelets or football, their probabilistic ability to string tokens together can replicate information in ways that, even if devoid of what would satisfy philosophers as knowledge, are good enough for us to build on.
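The loop described above can be sketched in a few lines. This is a toy illustration, not a real LLM: the conditional distribution is hand-built from made-up bigram counts rather than learned by a transformer, but the mechanics — condition on context, sample the next token, feed it back in — are the same.

```python
import random

# Hypothetical bigram counts standing in for a learned conditional
# distribution over "what token comes next".
bigram_counts = {
    "the": {"44th": 5, "capital": 3, "quick": 2},
    "44th": {"president": 10},
    "president": {"of": 8, "was": 2},
    "of": {"the": 9, "hungary": 1},
}

def next_token_distribution(prev):
    """Normalize raw counts into a probability distribution."""
    counts = bigram_counts.get(prev, {})
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def sample_next(prev, rng):
    """Draw one token from the conditional distribution."""
    dist = next_token_distribution(prev)
    tokens, probs = zip(*dist.items())
    return rng.choices(tokens, weights=probs, k=1)[0]

def generate(start, n, seed=0):
    """Each sampled token joins the context that conditions the next draw."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        if out[-1] not in bigram_counts:
            break  # no distribution for this context; stop
        out.append(sample_next(out[-1], rng))
    return out
```

The sampling step is where the stochasticity lives: rerun `generate("the", 5)` with a different seed and you may get a different continuation, but every continuation stays inside the region the counts make probable.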
In this post, we’ll be looking at what LLMs ‘know,’ or at least what they pretend to know, well enough for us to rely on. It turns out that the answer is ‘quite a lot.’
“Knowledge” is Overrated (and so are philosophers)
Much of the difficulty here is semantic and somewhat philosophical, and it’s intrinsically connected to the fact that we conceive of knowledge as primarily a human thing, one that requires the subjective life of the mind that machines do not have. A fascinating thought experiment that’s quite appropriate to this question is the Chinese Room proposed by John Searle in 1980, a more innocent age, when machine translation and conversation were distant enough prospects that entertaining them was largely the realm of thought experiments. In the Chinese Room, a computer has been ‘taught’ to read an input in Chinese (which Searle treated as a monolithic language, but I digress) and respond back in (written) Chinese, well enough that a native Chinese speaker could not conclusively tell that they aren’t talking to another native speaker. But does the machine ‘understand’ Chinese, Searle asked?
In the Chinese Room sense, LLMs don’t ‘know’ much about languages, omelets or football, just as the computer in the Chinese Room does not really ‘know’ Chinese the way we understand knowledge of a language. Rather, they know how to create a convincing simulacrum of knowledge in both cases.
It turns out that in this game, quantity has a quality all of its own. The distributions that are being sampled to pick the next word have been learned on massive corpora like the Common Crawl. When a model sees that “the 44th President of the United States” is typically followed by “Barack Obama” and rarely, if ever, by, say, “Donald Duck,” the probability distribution from which the tokens following “the 44th President of the United States” will be drawn will center heavily on “Barack Obama” as the answer. In a way reminiscent of how CNNs build their own feature-extracting filters, this process is extremely inductive. Nobody ever ‘taught’ LLMs who the 44th President was, or what a President is — merely that, just as night follows day, a token stream about the 44th President is likely followed by President Obama’s name.
LLMs have what I will call ‘Gettier knowledge,’ in reference to the Gettier problem (“is justified true belief knowledge?”). Gettier knowledge is, simply put, a high-confidence stochastic belief justified by sufficient evidence (training data) that happens to be true. In short: for our purposes, regardless of how we define or conceive of knowledge in general, if an LLM has a good enough belief that “the 44th President of the United States” is semantically sufficiently tied to “Barack Obama,” that’s good enough for us to call it knowledge. The philosophers can argue about whether this is indeed true knowledge, but for our purposes, it will be good enough to proceed.
Free Knowledge! Take One!
If you’ve ever built a convolutional neural network, you may well recall the moment when you realized this wasn’t going to be anything like other machine learning tasks. In traditional ML, you define features, extract them and learn over those features. It is ultimately very inductive, but it still needs you to be opinionated enough to kickstart that induction. Neural nets are different: they are so purely inductive that they create their own features in the process. If you build a deep convolutional neural network to identify, say, Golden Retrievers, it won’t ask you to tell it what a Golden Retriever is, just to show it some examples of Golden Retrievers, and preferably also some non-Golden Retrievers. LLMs have that same moment: an LLM doesn’t need to be taught what the capital of Hungary is, because it doesn’t really care. What it needs is for the region of highest probability for the token following “The capital of Hungary is” to contain “Budapest”. Then, sampling that probability distribution will get us “Budapest”.
From the practical perspective, this is just as good as knowledge: as long as we understand what we have created, and the difference between a simulacrum of knowledge and knowledge itself, we can make good use of it. And if this simulacrum is built with enough information to create a good degree of fidelity to truth, then that may be all we need.
Consider the following scenario:
A drug, X, is approved for use in adults with generalized myasthenia gravis who are anti-AChR or anti-MuSK positive.
Mrs Hunter is 56 years old and has gMG. She does not have antibodies against cholinergic receptors or against receptor tyrosine kinase proteins.
Is X indicated for Mrs Hunter?
MuSK is, of course, one of the RTK Class XVIII proteins. A base GPT-4 model gives us the correct answer, inferring the simplest of logical relationships, namely class membership: all MuSK proteins are RTK Class XVIII proteins, therefore no anti-Class XVIII antibodies means MuSK negative, and hence X is not indicated (X here is based on the real-world monoclonal antibody rozanolixizumab). If a statement applies to a class, it applies to every member of that class:
Mrs Hunter, as described, has generalized myasthenia gravis but does not have antibodies against acetylcholine receptors (AChR) or any receptor tyrosine kinase proteins, which includes muscle-specific kinase (MuSK).
Given this information, drug X would not be indicated for Mrs Hunter as she does not meet the specific criteria for which the drug is approved, namely the presence of anti-AChR or anti-MuSK antibodies.
This is the correct response — GPT-4 understands not only that a statement about a set is a statement about everything in that set, but also that MuSK is a kind of receptor tyrosine kinase (class membership knowledge).
LLMs acquire this knowledge as part of their training. For an LLM to be good at predicting the next token — which, as discussed, is largely what LLMs do — means building an understanding from real-world text, which is usually a good(ish) representation of reality. LLMs create “plausible” text, and if they’re trained on data that is at least somewhat accurate, they will have to learn about the world as a corollary of learning about the words. We call this the knowledge dividend: an understanding of reality, as represented to the LLM by its training data, that is acquired as part of the process of conveying quintessentially linguistic information to the model being trained.

This is because we train LLMs through their inductive properties, i.e. by examples, not theory. LLMs are trained on sentences like “the 44th President of the United States was Barack Obama”, with, say, the name of President Obama masked out. Models that correctly infer “Barack Obama” as the most plausible tokens to fill the mask are rewarded. Together with a good vector embedding to back it up, the inescapable corollary of teaching a language model about words is to teach it also about the world those words inhabit and reflect. This, then, is the fundamental idea behind what LLMs know, and how they know it. All they know is how to generate plausible text, but if the standard of plausibility they are held to during training is also truthful text (i.e. accurate, high-quality training data), they will also learn a true representation of the world (they will learn a representation either way, but what it will be hinges greatly on the training set).
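The learning-by-example dynamic can be caricatured with counting. In this toy sketch (the three sentences are illustrative stand-ins for a corpus, and counting is a crude stand-in for gradient descent), the model is never told what a president is; it only records which token tends to follow which context, and the “fact” falls out as a side effect:

```python
from collections import defaultdict

# Stand-in training corpus: the model only ever sees examples.
corpus = [
    "the 44th president of the united states was barack obama",
    "barack obama was the 44th president of the united states",
    "the 44th president of the united states barack obama spoke today",
]

N = 7  # fixed context length in tokens, for simplicity
context_counts = defaultdict(lambda: defaultdict(int))

# "Training": slide a window over each sentence and count which token
# follows each N-token context.
for sentence in corpus:
    tokens = sentence.split()
    for i in range(N, len(tokens)):
        context = tuple(tokens[i - N:i])
        context_counts[context][tokens[i]] += 1

def most_plausible(context_str):
    """'Fill the mask': the token seen most often after this context."""
    counts = context_counts[tuple(context_str.split())]
    return max(counts, key=counts.get) if counts else None
```

Nothing in the code knows what a president is, yet `most_plausible("44th president of the united states was")` returns `"barack"` — the knowledge dividend in miniature: world facts acquired as a by-product of modeling word sequences.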
Forms of Semantic Knowledge in LLMs
I have intentionally chosen an example that illustrates how the semantic knowledge inherent in LLMs can be used to answer real-world questions. There are, generally, five major classes of semantic knowledge inherent in LLMs.
Class Membership
As illustrated above:
Q: Alice is in a store that sells plushies of various animals. She is trying to decide between a teddy bear, a plush puppy, a plushy dinosaur and a stuffed shark. She definitely wants a plushy of a mammal. Which of the toys is she not going to choose?
A: If Alice is looking for a plushy of a mammal, she will avoid choosing plushies representing animals that are not mammals. From the given options:
- A teddy bear represents a bear, which is a mammal.
- A plush puppy represents a dog, which is also a mammal.
- A plush dinosaur represents a dinosaur, which is a reptile (and is extinct).
- A stuffed shark represents a shark, which is a fish.
Given her preference for a plushy of a mammal, Alice would not choose the plush dinosaur and the stuffed shark.
The relevant knowledge item here is, of course, what is and isn’t a mammal, and the understanding that statements about the class apply to every element of the class.
Hierarchies of Relationships
An expansion on the previous is a hierarchy of relationships:
Q: Xs are an instance of an object — doesn’t matter what. Xs comprise As, which have the property s, and Bs, which do not have the property s. Bs are divided again into Ks, which have the property q and Ls, which do not have the property q.
Determine if the following statements are true (respond just with number and ‘true’ or ‘false’ or ‘cannot be decided’):
- Every K is a B.
- No A has property q.
- No K has property s.
This is a little clunky, but we needed to pick attributes about which the system couldn’t possibly have knowledge or understanding. Given this, it can reason correctly:
1. True
2. Cannot be decided
3. True
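The abstract hierarchy in that prompt is simple enough to encode explicitly, which makes it easy to check such answers mechanically. A minimal sketch — the class and property names are the placeholders from the prompt, and the encoding (a parent map plus explicitly stated properties) is my own choice, not anything the model does internally:

```python
# The toy hierarchy from the prompt: X splits into A and B; B splits
# into K and L. Properties are recorded only where the prompt asserts
# them; absence of an entry means "not stated".
parents = {"A": "X", "B": "X", "K": "B", "L": "B"}
has_property = {("A", "s"): True, ("B", "s"): False,
                ("K", "q"): True, ("L", "q"): False}

def is_subclass(cls, ancestor):
    """Walk up the parent chain: class membership is transitive."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = parents.get(cls)
    return False

def property_of(cls, prop):
    """Inherit the nearest stated property up the chain; None = undecidable."""
    while cls is not None:
        if (cls, prop) in has_property:
            return has_property[(cls, prop)]
        cls = parents.get(cls)
    return None
```

With this, `is_subclass("K", "B")` is `True` (every K is a B), `property_of("A", "q")` is `None` (nothing was stated about q for As or their ancestors, so it cannot be decided), and `property_of("K", "s")` is `False` (Ks inherit B’s lack of s) — exactly the inheritance reasoning the model has to perform over attributes it has never seen before.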
Functions, Attributes, Properties
LLMs can have an understanding of what kinds of attributes or properties things may have:
Q: Of the following objects, return a list of those that have a gauge: railways, trees, elevators, shotgun shells, railway engineers, bosons.
I was pleasantly surprised by the ‘strong’ property understanding here:
Based on the generally accepted meanings of “gauge” in these contexts, the list of objects that have a gauge is as follows:
- Railways
- Shotgun Shells
Railway engineers do sometimes carry a tool to check rail gauge, and there’s a confounder here — a concept in theoretical physics called gauge bosons — but bosons themselves do not ‘have’ a gauge; rather, gauge bosons are a kind of boson.
Adjacency
Adjacency is the spatial cousin of class membership and relates to spatial conformation:
Q: A patient’s heart is imaged using a method of imaging, from an anteroposterior axis. Given a relatively narrow field centered on the heart, which of the following is more likely to be seen — the superior vena cava or the small saphenous vein?
GPT gets it:
A: When imaging the heart with a method focused on an anteroposterior axis with a relatively narrow field centered on the heart, structures in immediate proximity to the heart are more likely to be visualized.
The Superior Vena Cava (SVC) is a large vein that is in immediate proximity to the heart, transporting deoxygenated blood from the upper half of the body to the right atrium of the heart.
On the other hand, the Small Saphenous Vein is located in the leg, and is thus far from the heart and unlikely to be visualized when the imaging is centered on the heart.
Given the choice between the superior vena cava and the small saphenous vein, the Superior Vena Cava is more likely to be seen.
The question is a bit of a trick in that there are forms of imaging in which the area outside the field of view generates a signal, but it does get the fundamental point across — what’s closer to an imaging focus is likely to be picked up better.
Qualifications
Qualifications are the conditions an item needs to meet for a statement to hold true for it — typically, a membership criterion. When is an X a Y?
Q: Make me a checklist of criteria a state has to meet to be considered a parliamentary democracy.
A: Checklist summary:
- Sovereign State
- Constitution or Equivalent Legal Framework
- Elected Legislature
- Political Parties
- Separation of Powers
- Parliamentary System
- Rule of Law
- Accountable Governance
- Civil Liberties and Political Freedoms
- Stable and Effective Governance
- Inclusive Citizenship
Sequences
Some things need to happen in a certain order. Other times, the order of events determines positionality or outcome. The famous “coffee cup, thimble and diamond” test ultimately is a test of sequential understanding:
Q: I’m in my house. On top of my chair in the living room is a coffee cup. Inside the coffee cup is a thimble. Inside the thimble is a single diamond. I move the chair to my bedroom. Then I put the coffee cup on the bed. Then I turn the cup upside down. Then I return it to right side up, and place the coffee cup on the kitchen counter. Where’s the diamond?
A: Given the sequence of actions described, you did not mention removing the diamond from the thimble or the thimble from the coffee cup at any point. So, when you turned the coffee cup upside down, the thimble and the diamond would likely have fallen out, possibly ending up on the bed or the floor near the bed. After turning the cup right side up again and moving it to the kitchen counter, the diamond is not mentioned to have been placed back inside the thimble or the cup, so it would not be on the kitchen counter inside the cup. The most probable location, based on the given information, is that the diamond is still where it fell, near or on the bed.
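What the model is doing here is state tracking, and it helps to see how little machinery the puzzle actually requires. This toy sketch (the container-and-surface model is my own simplification) tracks what rests in or on what, with the one physical rule the puzzle hinges on — flipping a container spills its loose contents onto whatever it is resting on:

```python
# World state: each object maps to the container or surface it is in/on.
location = {
    "chair": "living room", "cup": "chair",
    "thimble": "cup", "diamond": "thimble",
}

def contents_of(container):
    return [obj for obj, loc in location.items() if loc == container]

def move(obj, dest):
    location[obj] = dest

def flip(container):
    """Loose contents fall out onto the surface the container rests on."""
    surface = location[container]
    for obj in contents_of(container):
        move(obj, surface)

# Replay the puzzle's sequence of actions.
move("chair", "bedroom")        # cup and contents ride along inside the chair
move("cup", "bed")
flip("cup")                     # thimble (diamond inside) falls onto the bed
move("cup", "kitchen counter")  # the cup leaves; the diamond does not
```

After the replay, the thimble is on the bed and the diamond is still inside the thimble — i.e., on the bed, not on the kitchen counter, matching the model’s answer. The point is that answering correctly requires carrying forward state across steps, not just pattern-matching any single sentence.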
The Extent of Knowledge
What is quite remarkable is the extent of this knowledge, even in relatively generalist models. Most such models, for instance, have a very good understanding of categories. GPT-4, in particular, does really well with classifying by aspects:
Q: List various categories into which I could classify the enzyme telomerase reverse transcriptase.
Biological Classification: enzyme (Enzymes are biological molecules that significantly speed up the rate of virtually all of the chemical reactions that take place within cells.)
Functional Classification: reverse transcriptase (This enzyme can synthesize DNA from an RNA template, contrary to the usual process of transcription.)
Biological Role: telomere maintenance (Telomerase reverse transcriptase plays a crucial role in maintaining the length of telomeres, which are regions of repetitive nucleotide sequences at each end of a chromosome.)
Subcellular Localization: ribonucleoprotein complex (It is part of a complex with RNA, forming a ribonucleoprotein complex, typically located in the nucleus of the cell.)
Clinical Relevance: cancer marker (The expression of telomerase reverse transcriptase is often associated with cancer cells, as it enables the unlimited division and immortality of these cells.)
Biomedical Research: drug target (Due to its role in cellular aging and cancer, telomerase reverse transcriptase is a significant target for drug discovery and development in cancer therapeutics.)
Molecular Biology: DNA synthesis (This enzyme is involved in the synthesis of DNA strands, specifically the addition of telomeric repeats to chromosome ends.)
Genetic Aspect: gene product (Telomerase reverse transcriptase is the protein product of a specific gene, often denoted as TERT in humans.)
The power of such language models is twofold, and even if you have skipped through most of this article, I hope you let this stick with you:
- This knowledge comes pre-loaded. At no point did I tell the model a thing about telomerase. A pretty stunning amount of knowledge comes ‘out of the box’ with these models.
- Much of this knowledge can be assimilated and enhanced. For instance, there already are plenty of domain-specific models.
I see two major consequences to this.
- Generalist models can do a lot of things quite well, operating on the basis of a fairly respectable knowledge base — facts like enzyme hierarchies, interventions, where things are located and where even fictional things would be located (e.g. “would something exploding in the kitchen of my fictional two-bedroom apartment shatter my toilet bowl?”).
- On the other hand, specialist models, which can now be trained quite inexpensively over domain-specific information, will always be more extensive within their domain. While GPT-4 can answer questions like the protein classification question above, there is a point at which fine-tuning becomes the way to go.
There are plenty of applications that would benefit from an LLM-enabled perspective for no other reason than the ability of such models to store, represent and respond to — and with — a wide range of knowledge. From a practical perspective, such models come with ‘batteries included’: a fairly thorough understanding of the world they operate in, and the logical tooling to at least pretend to be relatively good at reasoning about those facts.
In my practice advising the biomedical sector on AI/ML applications, we often see challenges that derive from the difficulty of incorporating complex knowledge bases. In many other industries, the knowledge that LLMs acquire as part of their training for more basic functionalities can become a significant asset.
Even relatively simple language models have a stunning grasp of information about the world they inhabit, and those trained on large, high-quality and informative corpora (such as the Wikipedia corpus) and improved via RLHF can take an enterprise’s domain understanding and specialist knowledge further. Philosophers may debate whether this is truly knowledge, but from the pragmatic perspective, this knowledge representation can serve as a valuable ‘base layer’ for understanding an increasingly complex world.