LLMs Build Information Antibodies

Your company’s corpus is factually inconsistent. This is a feature and not a bug.

grothendieckprime
Hardy-Littlewood
10 min read · Jun 9, 2024


“As a large language model, I cannot provide answers that contradict our Content Policy.” via ChatGPT

Many of my friends and colleagues report that their businesses are trying to install large language models. This is very exciting news. (And how far we’ve come from the last round of AI clamor!) Economic dynamism is our most important civilizational commitment, and it is invigorating to watch a wave of it toss our workplaces around. However, the market is already frothy, and I think we will have some antibodies to build as we adopt this technology. In particular, we (at least, the profitable firms of the future) stand to learn a healthier way of interacting with enterprise records.

I use the term information antibodies to describe the learnings that will result from LLM confusion. I like the metaphor because it captures the way the entire immune system activates to fight an infection, which is precisely the kind of organization-wide response that a re-evaluation of data operations should produce.

LLM World in 2024

Little boxes on the hillside. via ChatGPT

We know there’s a bubble and we know the reason. There are a lot of GPT wrapper companies out there. Many of them are near-identical productions of YCombinator, which, as we know thanks to the Sam Altman connection, is functioning to a large degree as a market research arm for OpenAI. Easy to fund a lot of seed-stage companies with $500k a pop and see what does and doesn’t work. Decentralized tinkering at its finest.

As for the adoption of LLMs itself, let’s consult the Gartner hype graph:

via Wikimedia Commons

We seem to be somewhere near the peak. The real movers and shakers are already parachuting into the trough of disillusionment. Since exactly where we sit is a time-sensitive question, it’s less worth disputing empirically than keeping the mental model in mind.

What I’ve seen in workplaces over the last year is typically one of two things:

  1. Working with a GPT startup that tries to add generative power to tasks like copy editing or image generation;
  2. Trying to implement an LLM fine-tuned on proprietary business information to be an oracle of sorts.

I think that category 1 amounts to straightforward asset upgrades and will result in the unemployment of a lot of copywriters and designers. They will be replaced by prompt engineers and people who can take a process-engineering mindset toward the creation of copy and marketing assets. This is normal and good creative destruction, and it should surprise no one that our labor society routinely prefers industrial processes over craft production. In my (fellow college-educated) opinion, copywriters and designers have a plethora of economically valuable skills, and our low unemployment rate suggests they should be able to find other livelihoods, even if the transition is uncomfortable at first. I have several other thoughts about this and the affluent society, but I digress.

Category 2 is what interests me more. There are several examples of how LLMs are used to try to enhance enterprises:

  • At my old job, we fine-tuned GPT-4 on our engineering documentation so that engineers asking questions in Slack could get answers faster;
  • A friend’s trading firm fine-tunes an LLM on market news germane to portfolio performance and uses it to coach its salesforce;
  • Large firms are trying to launch copilots trained on their codebases to help engineers pattern-match and maintain a coding style.

I think there’s an obvious problem with all of these use cases, and it’s being overlooked thanks to all the froth that YCombinator/OpenAI (inter alia) have put into the market. It’s not that LLMs hallucinate, even though we seem to have forgotten that much. It’s that the corpus itself is unlikely to be factually consistent. The result is LLMs that propagate misunderstanding.

Factual Consistency

By “factually consistent” I am pointing at a very difficult problem for LLMs: a corpus of business data refers to facts about the real world, and it is highly likely to include some alternative facts as well. Consider that a corpus could easily contain the following pairs of claims:

  • In 2022, the software does NOT run on Linux; in 2023, the software DOES run on Linux;
  • According to sales team A, the North American market did NOT increase monthly active users in response to adding Dark Mode to the UI; according to sales team B, the monthly active users DID increase as a result of Dark Mode;
  • From an engineering perspective, there is NO meaningful difference between originating a consumer loan and a business loan; from a sales perspective there IS a critical difference between these loans, and demonstrating the wrong one will kill a sales meeting.

That is to say, factual inconsistency may appear across different times, teams, or organizational purposes. A corpus that wants to be consistent needs not only version control according to time, but also a guarantee that no claim is made without consensus and that claims are phrased in idioms relevant to the end user. This is, as far as I can tell, hard.
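To make the overhead concrete, here is a minimal sketch in Python of what that bookkeeping would look like. Everything in it is hypothetical (the Claim fields, the example records, the two-team sign-off rule); the point is how much metadata a single assertion needs before it can safely reach a retrieval index or a fine-tuning set, and how quickly contradictory claims like the Dark Mode pair above simply fail to qualify.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date

# A hypothetical sketch of the bookkeeping a "consistent" corpus implies.
# Every field and record below is invented for illustration.

@dataclass
class Claim:
    text: str              # the assertion itself
    valid_from: date       # when the claim became true (version control by time)
    valid_to: date | None  # None means "still believed true"
    teams: set[str]        # which teams have signed off (consensus)
    audience: str          # whose idiom the claim is phrased in ("engineering", "sales", ...)

CORPUS = [
    Claim("The software does not run on Linux.", date(2022, 1, 1), date(2023, 1, 1),
          {"engineering"}, "engineering"),
    Claim("The software runs on Linux.", date(2023, 1, 1), None,
          {"engineering"}, "engineering"),
    Claim("Dark Mode increased North American MAUs.", date(2023, 6, 1), None,
          {"sales_team_b"}, "sales"),
    Claim("Dark Mode did not increase North American MAUs.", date(2023, 6, 1), None,
          {"sales_team_a"}, "sales"),
]

def admissible(claim: Claim, as_of: date, audience: str, required_signoffs: int = 2) -> bool:
    """A claim may reach the model only if it is current, phrased for the reader,
    and backed by more than one team. Most real corpora fail the last test."""
    current = claim.valid_from <= as_of and (claim.valid_to is None or as_of < claim.valid_to)
    return current and claim.audience == audience and len(claim.teams) >= required_signoffs

if __name__ == "__main__":
    survivors = [c.text for c in CORPUS if admissible(c, date(2024, 6, 1), audience="sales")]
    print(survivors)  # [] -- the contradictory Dark Mode claims never clear the consensus bar
```

In practice, nobody maintains this metadata, which is exactly why the consistency problem is hard rather than merely tedious.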

I have seen attempts at solving this problem. For instance, one engineering Q&A bot has to cite pages in the documentation and is used as a forcing function to keep documentation up to date and to motivate the creation of new documentation. This bot turned out to provide negative utility outside of the engineering team, and it prompted many StackOverflow-style debates among engineers about whether it was providing solutions in a misleading way. This is to say: the actual true information by which an organization operates is heterogeneous and depends on quite a lot of moving interpretive context that should not be expected of LLMs. Discovering this can feel like fighting a disease, but it does build those information antibodies.
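For what it’s worth, the mechanism itself is easy to describe even if the organizational outcome was not. Below is a toy illustration, not the actual bot: hypothetical documentation pages, naive keyword overlap standing in for a real retriever, and the two rules that made it a forcing function, namely no answer without a page to cite and no way to hide how stale that page is.

```python
from datetime import date

# A toy illustration of the citation-forcing idea, not the actual bot.
# Documentation pages and contents are hypothetical; naive keyword overlap
# stands in for a real retriever.

DOCS = {
    "deploy.md": ("Deploys go through the staging cluster first.", date(2023, 11, 2)),
    "linux.md":  ("As of 2023, the software runs on Linux.", date(2023, 3, 15)),
}

def answer(question: str) -> str:
    """Answer only when a documentation page can be cited; otherwise refuse,
    and always expose how stale the cited page is."""
    words = set(question.lower().split())
    best_page, best_hits = None, 0
    for page, (text, _) in DOCS.items():
        hits = len(words & set(text.lower().split()))
        if hits > best_hits:
            best_page, best_hits = page, hits
    if best_page is None:
        return "No documentation covers this. Write the page before asking the bot."
    text, reviewed = DOCS[best_page]
    return f"{text} [source: {best_page}, last reviewed {reviewed}]"

print(answer("does the software run on linux?"))   # cites linux.md
print(answer("what is our pricing model?"))        # refuses: nothing to cite
```

Whatever replaces the keyword match in a real system, the citation requirement and the staleness stamp are the parts that do the forcing.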

Messy Data Operations are a Feature, not a Bug

If you want your only LLM issues to be hallucinations, you need to be running this kind of data-quality operation. via ChatGPT.

Organizations are already great at propagating misunderstanding, and encouraging them to use the oracle is going to make this worse. Even if you tell the average employee to please check the results they get from the chatbot, they’re not going to unsee what it suggested. Behavioral psychology is replete with studies suggesting that images persist in the primed mind whether or not the listener is told that they might be misleading. So, in the worst case, an LLM trained on an inconsistent corpus of proprietary data will turn an existing problem (I don’t know what’s true and I need to track down who does) into a worse one, where no human stakeholder is even individually responsible for confusion about facts.

Okay, why not appoint such a stakeholder? Make one person responsible for auditing the corpus before it goes into the LLM and for soliciting and resolving fact-correction tickets from users. In short, a data librarian. Certainly, it sounds good, and I agree that for compact corpora this may be an effective solution. Librarians exist at a number of large firms today, such as Lockheed Martin, and they can be used to enforce data correctness protocols for certain types of corpora. But I think this only works in situations where stakeholders are held accountable for quality when generating data or records in the first place: for instance, where the GPT is being trained on a news feed or a custom funnel of information from various vendors, or where records have to be kept according to regulatory protocols, as at banks or military contractors.

The issue for most enterprises, I wager, is that information is generated and collected in ways that should not be regulated. I can’t tell you how many times I heard it proposed in my last job, as a sales engineer, that we should stalk the account executives and audit everything they’ve said in meetings, or that we should use AI to write down everything a potential customer spitballed about their requirements in some discovery meeting. Excuse me: these are creative conversations. We can take notes, but a brainstorm is not meant to be taken too seriously. Getting information out of sources like sales meetings or internal product discussions actually requires critical and eliminative thinking. This should be a clear indicator to keep the bots away: remember, LLMs are generative AI, which is not particularly good at eliminative thinking.

An example from personal experience: at my previous employer, upon discovering that the engineering Q&A bot was not suitable for sales engineers, I tried to create a different bot for our team. In principle, it would have been a huge benefit to know how clients were using the product. The sales flywheel kicks off once a product has social proof, after all. The documentation written by engineers completely misunderstood actual use cases but failed to disclaim this anywhere, so we needed to find our own corpus. This did not exist. No one had written down how clients were using the product. We thought for thirty seconds about plundering all of the recorded sales meetings our account executives had taken, and all of the postsales meetings the customer success team had been taking, but all of these meeting notes were riddled with misunderstanding on the part of our own sales and customer success representatives. Controlling the truth of this corpus was never going to happen, simply because it was growing too quickly. Enforcing factual consistency in the corpus would have been a fatal amount of overhead.

So, the result on my team was an earned distrust of enterprise chatbots. We built some antibodies. After enough messing with the engineering bot, we realized that the problem was actually the desire to trust a bunch of information scrounged from sales meetings. Why did we think we were going to have high-quality data for this task? More importantly, why did we want to trust random sales meeting notes for factually consistent content? There’s something about seeing that information got saved to your cloud storage that makes you want to trust it. That’s the real disease, and it may take LLMs to stimulate your team’s immune system.

Ultimately it’s a good thing to pull people quickly into the trough of disillusionment, so I recommend that organizations continue to experiment with their data. But there will be some toe-stubbing. The guy who sat next to me was constantly venting about this newfangled technology and how we really just needed to build a product that we already knew people wanted to buy. I believe a good number of companies have already stalled on their LLM-powered data operations for similar reasons. That’s fine — if they’re not able to adapt to developments in information technology, that’s free market share for someone who can. Creative destruction, baby.

How to Use Information Antibodies

via ChatGPT

If, however, you’re an entrepreneur or a change agent within an enterprise, you’re naturally going to want to succeed the first time. My intuition for stuff like this is to turn the pain point into an asset. If you actually wield the threat of an LLM project gone wrong, you suddenly give people a mandate to avoid having to do the project at all! That’s the beauty of antibodies.

Entrepreneurs can use the threat of confusion to narrow down potential solutions to those where LLMs make sense: tasks that need generative power and that would otherwise absorb an unlimited amount of human bandwidth. I suspect that over the next few years we’ll all be working to sharpen an intuition for where to build these solutions.

Here’s my spitball recommendation for using LLMs as a diagnostic tool, plagiarized from Elon Musk’s automation strategy:

  1. Make the requirements less dumb. Use the mere threat of potential LLM confusion to simplify data operation requirements. For example: ask whether an LLM should help generate copy for sales decks (scary!) or whether the team should just use a static template approved by marketing.
  2. Delete the part or process. Use the mere threat of potential LLM confusion as a diagnostic tool for streamlining actually necessary data operations. Prune unnecessary sources of truth until decision-making processes have low surface area for failure. For instance: force engineering teams to deprecate bad documentation or else the bot starts using it to confuse everyone.
  3. Simplify or optimize the design. Use the threat of LLM confusion to force teams to write down their processes. For example: if you can’t tell me the allegedly complicated process by which you hand off requirements from presales to postsales, I will use the bot to start doing this inference for you.
  4. Accelerate cycle time. Use the threat of LLM confusion to force processes to accelerate. For instance: if your team can’t just modify the deck in fifteen minutes of heads-down concentration, we’re going to look into using the bot to do it for you.
  5. Finally, automate. If the process genuinely requires fine-grained searching in a corpus, takes a while, and doesn’t have an obvious rote procedure, then human bandwidth is actually being tested and the LLM may be a helpful generative aid. For instance: I don’t know where to start a pitch to explain last week’s market events to a client curious about portfolio performance. An LLM might help inspire me, and checking the facts is fast. (A minimal sketch of this pattern follows below.)
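As a sketch of what step 5 can look like in practice, here is one hypothetical way to draft that pitch: the model handles the part that actually taxes human bandwidth (a first pass over the narrative), while the facts live in a short, human-verified list that the reviewer can check in minutes. The model name, the facts, and the prompt are all placeholders, and the call assumes the openai package is installed with an API key configured.

```python
# A hypothetical sketch of step 5: the model drafts the pitch (the part that
# taxes human bandwidth), while facts stay in a short human-verified list.
# The model name, facts, and prompt are placeholders.

from openai import OpenAI

client = OpenAI()

# Facts the salesperson has already checked by hand. Invented examples.
CHECKED_FACTS = [
    "The portfolio is overweight energy relative to its benchmark.",
    "Rates were left unchanged at last week's policy meeting.",
]

draft = client.chat.completions.create(
    model="gpt-4o",  # assumption: substitute whatever model your firm has approved
    messages=[{
        "role": "user",
        "content": (
            "Draft a three-paragraph opening for a client call about last week's "
            "market events. Use only these verified facts, and flag anything you "
            "add beyond them:\n- " + "\n- ".join(CHECKED_FACTS)
        ),
    }],
).choices[0].message.content

print(draft)
print("\nReview checklist: every claim above must trace back to CHECKED_FACTS.")
```

The division of labor is the point: the generative step is allowed to be loose because the verification step stays small, explicit, and human.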

We generate a ton of information. It’s messy, and that’s fine. Let’s try to treat the bots with the same level of care that we expect of ourselves elsewhere in our professional lives. And just as the consequences of carelessness elsewhere in our careers build antibodies and corrective intuitions, there will be some mishaps with LLMs that teach us better habits for the information age.
