Avoiding LLM Hallucinations — data vs. metadata

Dean Allemang
13 min read · May 6, 2023

Like many of you, I’ve been reading a lot lately about ChatGPT and LLMs, and how they will impact various aspects of our society. Some of what I have read is pessimistic, focusing on all of the harm that can come from this new technology, and a lot of it is optimistic, about all the capabilities that could be realized. And quite a lot (probably because I hang out in groups where people are interested in their own pet technology) has to do with how the weaknesses of LLMs can be addressed by the use of a favorite technology. Since I have been an advocate for Knowledge Graph technology for many years, I am as guilty of this as the next writer: I think the key to making LLMs useful is to combine them with Knowledge Graphs.

In my blogs so far, I have reported on a number of experiments where I’ve made some connection from LLMs (ChatGPT in particular) to knowledge graphs; but just as often, I’ve used ChatGPT in some other way, for example, to provide illustrations of the points I was making that don’t necessarily have anything to do with LLMs.

Why LLMs Fail

Well, when they do fail. I’ve seen them succeed a lot.

A trend I have seen in all camps is an assumption that the way users in the future will interact with LLMs is through a chat interface: some extension of the chat interface we see with ChatGPT. You ask the LLM questions, it gives answers, and then you take some action based on those answers. This immediately exposes a problem: if the LLM provides factually incorrect answers (as the chat page on ChatGPT warns it might), the user might take inappropriate action. This effect is often referred to as the LLM “hallucinating”.

Starting from this assumption, there are a number of approaches to mitigating this effect. The most obvious is to train the LLM on the facts that you would like it to know. If you want to ask questions about banking, train it on banking. If you want to ask questions about programming, train it on programming. This approach is fraught with problems. First off, if you have lots and lots of facts, this is a lot of training, which is expensive. Second, no matter how much you train it, it can still hallucinate. You will always have to check its work. The apparent futility of this approach is often what fuels pessimistic forecasts of LLMs’ utility.

Another approach has been compared to a “cheat sheet” in an exam. Imagine a professor sets you an exam, but allows you to write whatever you want on a single page of paper before you enter the exam room. You get to prepare that paper any way you like, but you can’t go beyond that page. This is how some styles of LLM prompt engineering work; you include background material that answers the question you want to investigate, provide the question as well, and let the LLM fish out the information it needs. This works pretty well, but prompt size limits (which correspond to the size of that one-page “cheat sheet”) make it tricky to get this to work. Systems like LangChain generalize this idea, so that the information to be included can actually be negotiated by the LLM itself, making it possible to refine the prompt. This is pretty promising, but it still relies on the LLM to give the answers, and runs the usual risk of hallucination.
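In code, this “cheat sheet” style is little more than assembling a string. Here is a minimal sketch; ask_llm is a hypothetical wrapper around whichever chat-completion API you use, and the character budget stands in for the model’s real token limit:

```python
# Minimal sketch of "cheat sheet" prompting: stuff background material and
# the question into one prompt and let the LLM fish out what it needs.
# `ask_llm` is a hypothetical wrapper around a chat-completion API.

def build_cheat_sheet_prompt(snippets, question, max_chars=12000):
    """Concatenate background snippets (the cheat sheet) with the question,
    stopping when the sheet would exceed the prompt-size budget."""
    sheet = ""
    for s in snippets:
        if len(sheet) + len(s) > max_chars:
            break  # the "one page" limit: stop when the sheet is full
        sheet += s.strip() + "\n\n"
    return (
        "Answer the question using only the background material below.\n\n"
        f"Background:\n{sheet}\n"
        f"Question: {question}\n"
    )

# answer = ask_llm(build_cheat_sheet_prompt(snippets, question))
```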

I have done a solution like this for my book, “Semantic Web for the Working Ontologist”, whereby I include snippets from the book in the prompt and use encodings to figure out which ones to include. I have also tried to do this with the data.world API, but I have found that GPT-4 is already familiar with the data.world API, so much so that there was really no need to enhance its knowledge. I haven’t really found the performance of this “cheat sheet” enhanced chat bot to be that much superior to the performance of ChatGPT itself.
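Assuming the encodings in question are vector embeddings of the snippets, the selection can look something like this sketch; the embed function here is a toy stand-in for a real embedding model:

```python
import numpy as np

def embed(text, dim=256):
    """Toy stand-in embedding: hash words into a fixed-size vector.
    In practice this would call a real embedding model."""
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def pick_snippets(question, snippets, k=5):
    """Return the k snippets whose vectors are closest to the question's."""
    qv = embed(question)
    scored = sorted(snippets, key=lambda s: cosine(qv, embed(s)), reverse=True)
    return scored[:k]
```

The chosen snippets then go into the cheat-sheet prompt sketched above.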

Using LLMs to Present Information

Let’s take a different approach to this problem, where the problem is how difficult it is to trust the results coming from an LLM chatbot. I want to suggest that the real issue here is that we are expecting the chatbot to provide reliable access to data, and that’s not what a chatbot is good at. But we have a lot of systems that are good at this: spreadsheets, relational databases, document stores, graph stores, triple stores. Instead of asking an LLM to answer a question, we can use it to help us present information.

I have found ChatGPT in particular to be very good at programming against APIs. If you tell it, say, to show something on a heat map, it can find mapping software and put your data on it. I showed that in the blog entry Evaluating linked data using the Good Growth Plan. Or you can tell it to show you something as a graph, as I did in the blog entries Linking data with ChatGPT and Converting Graphs to Tables. Because of ChatGPT, it is now possible to share information in a number of modalities, without having to learn a lot of API details.

This might seem superficial, but I think it is quite the game-changer. Up till now, there have been any number of proprietary data presentation systems, each with its own community of expertise for displaying data in various ways. Learning to use them effectively involves intense study, and has made data presentation more of a technical craft than an endeavor of expression. I’m pretty sure that the abilities of LLMs are going to change that landscape, and really democratize data presentation.

A question that many people have when I tell them that I had ChatGPT write display code for me is: how do I know what the program does? Is it correct? Does it contain malware? Will it fail in dramatic ways? In the case of programs, you can get past a lot of hallucination issues by just running the code. Yes, the LLM hallucinates; for example, it sometimes pretends that the API has an entry point that doesn’t exist. But your compiler/interpreter catches that right away, when the call fails.

In a programming setting, hallucination isn’t as serious an issue as it is in general, because the programming setting allows the answer to be evaluated, either by a compiler, by testing the code, or even by just running it and seeing what it does. You can even peruse the code before making any decisions that rely on its output.

From Data to Metadata

If an LLM isn’t built to handle data, how about metadata? Instead of trying to train an LLM on enough data to be accurate when it answers questions, or trying to include enough relevant data in a cheat sheet that it gets the right answer, suppose we provide it with metadata, that is, tell it about data resources. This has a lot of advantages right off the bat. First off, metadata is much, much smaller than its corresponding data, so the issues of getting that information into the LLM become much easier when we’re just talking about metadata. You can even fit a pretty elaborate database schema into a pretty modest cheat sheet.

One of the ways to do this is to go through a data catalog. You don’t need an elaborate data catalog; even with just some tables with fairly ordinary names, a list of the column names that go with them, and (where available) a description of the database from its provider, an LLM can bring a lot of its business knowledge to bear to figure out what is in each table and each column. I’ve written a blog about that, too: Summarizing data with ChatGPT. The results are pretty impressive, and it was easy to do with ChatGPT.
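To make this concrete, here is a sketch of the kind of prompt that does the job. The table and column names below are invented for illustration, and ask_llm is the same hypothetical chat wrapper as before:

```python
# Sketch: turn bare catalog metadata (table and column names) into a prompt
# that asks the LLM to describe the data and suggest business questions.
# The table and column names are invented for illustration.

catalog = {
    "farm_survey": ["country", "crop", "year", "farms_reporting"],
    "chemical_use": ["country", "crop", "year", "chemical_type", "efficiency"],
}

def catalog_prompt(catalog):
    lines = ["A dataset contains these tables and columns:"]
    for table, columns in catalog.items():
        lines.append(f"- {table}: {', '.join(columns)}")
    lines.append(
        "For each table, describe in one sentence what it probably contains, "
        "then suggest three business questions this dataset could answer."
    )
    return "\n".join(lines)

# description = ask_llm(catalog_prompt(catalog))
```

Note how small this is: even a database with many tables serializes into a prompt that fits comfortably on the cheat sheet.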

If we think about these applications of LLMs, that is, where we don’t expect the LLM to provide factual answers, but instead to provide information about the datasets themselves (hence calling these “metadata” solutions), we can consider the ramifications of hallucinations. If an LLM hallucinates when you ask it a question of fact (a “data” question), you’ll get a bad answer that could be misleading and cause catastrophic results if you take action directly on that information. When you use an LLM for a metadata application, what happens if it hallucinates? It might make reference to a table or column that doesn’t exist. Just as in the case of using an LLM for data display programming, this sort of error is easily detected, just by checking against the database.

If it makes an egregious mistake in summarizing metadata, e.g., mistaking a debit for a credit, it doesn’t result in a balance change; it just results in some confusion when someone tries to understand the data using the LLM’s advice. Someone who knows the data can correct it themselves, or, as we’ve seen in the programming examples, just tell the LLM that it made a mistake and have it fix it. This relates to the idea of Telling vs. Figuring out: the LLM figures some things out and tells the world; another user can figure out that some of them are wrong and fix them. A combination of crowdsourcing and AI, mediated in a collaborative data catalog.

We can go a step further. As we saw in Summarizing data with ChatGPT, we can ask the LLM to suggest business questions for a dataset. What does it even mean to hallucinate in such a circumstance? The value of a suggested business question is at least twofold: first, it gives a business user an idea of what the data is about, to help them evaluate what they might want to learn from it. Second, it provides a lateral-thinking spur to creativity, to help a business user understand what decisions lie before them. Often, figuring out a question is harder than answering it (as we learned from the Hitchhiker’s Guide to the Galaxy). Mistakes in this arena might even be beneficial, since they encourage business modelers to consider capabilities that go beyond what is obvious (or even available) in the dataset.

LLMs and Question Answering

This suggests a whole different paradigm for question answering with an LLM: instead of having the LLM answer a question directly, it writes a query that will answer the question. Once we have business questions, it is a simple step to asking ChatGPT to answer them. Here is a transcript of a session from a simple chat loop I put together. The loop starts by pointing at a dataset in data.world; in this case, I’m using the original Syngenta dataset I created a few years ago about the Good Growth Plan. The loop just takes a question from the user, asks ChatGPT to create a SPARQL query that will work on the data.world encoding of a spreadsheet, runs the query, and, in the case where the query fails, tries again until it gets an answer.
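Here is a sketch of what such a loop can look like. It is not the exact code; ask_llm is a hypothetical chat wrapper, and the data.world SPARQL endpoint and authentication details should be checked against their current documentation:

```python
import requests

# Question-to-SPARQL loop: ask the LLM for a query, run it against
# data.world, and feed any error back to the LLM so it can repair the query.

ENDPOINT = "https://query.data.world/sparql/{owner}/{dataset}"

def run_sparql(query, owner, dataset, token):
    resp = requests.post(
        ENDPOINT.format(owner=owner, dataset=dataset),
        data={"query": query},
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()

def answer(question, schema_description, owner, dataset, token, retries=3):
    prompt = (f"Here is a description of a dataset:\n{schema_description}\n\n"
              f"Write a SPARQL query that answers: {question}\n"
              "Return only the query.")
    for _ in range(retries):
        query = ask_llm(prompt)  # hypothetical chat wrapper
        try:
            return run_sparql(query, owner, dataset, token)
        except Exception as err:
            # If the query fails, tell the LLM why and let it try again.
            prompt += f"\n\nThat query failed with: {err}\nPlease fix it."
    raise RuntimeError("No working query after several attempts")
```

The transcript below shows this kind of loop in action.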

What question do you have about the dataset? what crops are tracked by the study?

crop: Barley
crop: Maize
crop: Cocoa
crop: Tomato
crop: Potato
crop: Wheat
crop: Rice
crop: Sunflowerseed
crop: Grapes
crop: Oilseedrape
crop: Pepper
crop: Cauliflower
crop: Apple
crop: Pear
crop: Coffee
crop: Soybean
crop: Sugarcane
crop: Banana
crop: Cotton

What question do you have about the dataset? Which countries saw a decrease in farms reporting about coffee from 2014 to 2015?

country: Vietnam
country: Brazil

What question do you have about the dataset? Which country has the highest average insecticide efficiency for Rice?

Country: Philippines
HighestAvgInsecticideEfficiency: 0.140596594

The first query requires the LLM to figure out what it means to report on a crop, and figure out how to query that. This isn’t very difficult, since it is really just asking for values from a single column, but it isn’t trivial, since it has to figure out which table and column to use. Once that has happened, the user can ask a follow-on question, based on the description of that table. Sometimes this has required a bit of negotiation between ChatGPT and the SPARQL engine; in this case, it did not. The questions can be pretty involved; one of the things the data reports on is the efficiency of the use of various types of chemicals, including insecticides. The question about the highest average efficiency for a particular crop (“Rice”) by country requires the query to find the efficiency measures, filter them by crop, group them by country, average them, and then compare them. In the response, it also (without being prompted) included the actual value of that efficiency. I guess 14% is pretty good for this crop.
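For reference, the query for that last question has to look roughly like the sketch below. The prefix and property names are invented for illustration; data.world derives the real ones from the spreadsheet’s column headers:

```python
# Rough shape of the query the LLM needs to produce for the insecticide
# question: filter by crop, group by country, average, then take the top one.
# The prefix and property names are invented for illustration.

query = """
PREFIX : <http://example.org/ggp#>
SELECT ?country (AVG(?efficiency) AS ?avgEfficiency)
WHERE {
  ?row :crop ?crop ;
       :country ?country ;
       :chemical_type "Insecticides" ;
       :efficiency ?efficiency .
  FILTER(?crop = "Rice")
}
GROUP BY ?country
ORDER BY DESC(?avgEfficiency)
LIMIT 1
"""
# results = run_sparql(query, owner, dataset, token)
```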

To be clear: we did not expect GPT to have data about crop efficiency, so there is no way it could be hallucinating about this fact. The data comes from data.world; the only thing GPT did was figure out how to query it. That query can be checked for a certain degree of fidelity automatically, with a standard SPARQL engine. We can check that the query is syntactically valid, and that it refers to columns that actually exist. We can even do type checking; if we do arithmetic on arbitrary strings, we’ll get an error that the loop can handle. But we can’t check, for example, that GPT didn’t compute a maximum instead of an average, without reverse engineering its query. As it happens, that is the sort of hallucination that is pretty rare (as in, I’ve never seen it) in ChatGPT. And there is no chance that the result in the transcript (0.140596594) is a hallucination; that didn’t come from ChatGPT, it came from data.world.
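Those automatic checks can be as simple as parsing the query and comparing the names it mentions against the catalog. Here is a sketch using rdflib’s SPARQL parser; the column-name check is deliberately crude:

```python
import re
from rdflib.plugins.sparql import prepareQuery

def check_query(query, known_columns):
    """Catch what can be caught automatically: invalid syntax, and references
    to properties/columns that aren't in the catalog. It cannot catch the LLM
    computing a maximum where an average was asked for."""
    prepareQuery(query)  # raises an exception on syntactically invalid SPARQL
    referenced = set(re.findall(r":(\w+)", query))  # crude: prefixed local names
    missing = referenced - set(known_columns)
    if missing:
        raise ValueError(f"Query refers to unknown columns: {sorted(missing)}")
```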

Writing SPARQL queries seems to be pretty easy for an LLM. We’ve seen another example of ChatGPT writing successful SPARQL queries in one of my blogs, LLMs Closing the KG Gap.

Super Technology To The Rescue!

I have read many accounts recently that say that an LLM is really good, but it needs a knowledge graph to complement it; that a knowledge graph can provide the information that an LLM can’t, and can provide access to verified data. But I have read a lot of these things because I read a lot of stuff about knowledge graphs. I also read about mathematics, though to a lesser extent, and I have seen Stephen Wolfram argue compellingly that what an LLM needs is a mathematically sound reasoning system to complete its knowledge (after all, you can prove when a mathematically sound system provides a correct answer!). I don’t read a lot about database systems, but I bet there are accounts that say the thing an LLM needs to complete it is a data lake, or a data mesh, or even a data warehouse or just a database. Everyone thinks that their pet technology is the key that will make LLMs really better.

So what is the best way to effectively augment an LLM’s capabilities using an external system? I come to this from the Knowledge Graph world, so my inclination is to say that the right way or best way to augment an LLM is to connect it to a Knowledge Graph; that’s even a theme in my blog entry, LLMs Closing the KG Gap.

But if we look at the query transcript above, that really just used SPARQL to query tabular data. We’re using a knowledge graph language, SPARQL, but we’re really using just the capabilities of a database in this example.

So how can a knowledge graph help? First off, notice that we needed some information about the tables and the columns to make this work. This information is commonly available for spreadsheets, but not so much for relational databases. In order to do this, we need to have information about the metadata of our data sources, which means we have to be able to make annotations on that metadata. While it isn’t necessary to have a knowledge graph to get this information, it really helps. We would like to be able to crowdsource knowledge about our data sources, and use that crowdsourced metadata to inform a query interface like this one.
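Here is a sketch of what annotations on metadata can look like as triples, using rdflib. The vocabulary is invented for illustration; in practice something like DCAT, or the catalog product’s own vocabulary, plays this role:

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

# Crowdsourced metadata annotations as a small knowledge graph.
# The CAT vocabulary and the column IRI are invented for illustration.

CAT = Namespace("http://example.org/catalog#")
g = Graph()

column = CAT["farm_survey/farms_reporting"]
g.add((column, RDFS.label, Literal("Farms reporting")))
g.add((column, RDFS.comment,
       Literal("Count of farms reporting data for this crop, country and year")))
g.add((column, CAT.contributedBy, Literal("LLM summary, confirmed by a data steward")))

# Serialized, this graph becomes the schema description the query loop
# builds its prompts from.
print(g.serialize(format="turtle"))
```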

It might turn out that the availability of LLMs means that low-knowledge data sources will be connectable in ways that were not possible before, effectively turning clouds of databases into knowledge-based data meshes. If this is true, it is indeed a step forward in humankind’s ability to manage knowledge on a very large scale. The goals of the Semantic Web could be achieved without the need for formal standards of data sharing.

As you might imagine, I don’t think this is the case. The key has to do with my blog about Figuring out vs. Telling. A key feature of a graph of any sort, and especially a knowledge graph, is knowing when two sources are referring to the same thing; this is often known simply as “the identity problem”. There are a lot of approaches to this problem, with varying rates of success in different circumstances. If we have an LLM generate a query, it will have to implement one or more of these approaches in order to get a 360-degree view of the data. With a knowledge graph, we can have one agent “figure this out” and then “tell” it to the next agent, effectively turning the system into a collaborative space where the utility of knowledge increases with every interaction. This is where a knowledge graph shines: it can combine crowdsourced contributions from LLMs, algorithmic analyses, and humans, all in one system.
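A sketch of that “figure out once, tell everyone” pattern: once an identity has been resolved, by whatever means, it can be recorded as an owl:sameAs link that every later query can reuse. The IRIs are invented for illustration:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

# Record a resolved identity so no one has to figure it out again.
# The IRIs are invented for illustration.
g = Graph()
g.add((URIRef("http://example.org/salesdb/customer/1234"),
       OWL.sameAs,
       URIRef("http://example.org/crm/account/ACME-CORP")))

# Any later query over the combined graph (including an LLM-generated one)
# can follow this link instead of re-solving the identity problem.
```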

Human-AI collaboration

I won’t claim that I’ve got it all figured out, but it seems pretty clear that a combination of crowdsourcing, metadata, and LLMs’ ability to generalize metadata and program queries can really revolutionize how we interact with data, knowledge, and each other. Each of the experiments I reference here illustrates a part of this, working reliably. Let’s see if we can scale it up to really change how we relate to knowledge.
