How to teach your LLM to say “I don’t know”
In this article, I focus on best practices for model fine-tuning, specifically so that it is better able to verbalise when it doesn’t know something. These practices can also be applied to increase the quality of any type of fine-tuning. I work with local 7B and 13B open source models, as these are easier to rapidly experiment with, and don’t break my hobbyist budget.
So, let’s start with the simplest case. Here is an example prompt with Mistral 7B OpenOrca and the answer it currently gives:
prompt: " What is a blinrog offset?"
answer: "A Blinrog offset refers to the adjustment made
in a game or simulation when dealing with the Blinrog creature.
The term "Blinrog" comes from the popular video game series,
"The Legend of Zelda,"[…cont]"
The model, of course, hallucinates, because the word “blinrog” is completely made up. A better answer would have been to say that it isn’t sure what that is, and perhaps offer a spelling correction.
So, our simplest fine-tuning QA (question-answer) pair would be something like:
prompt: "What is a blinrog offset?"
answer: "I do not recognize this term. Perhaps you could provide
more context?"
Note: This and the following QA pairs will be the ones used for fine-tuning, and not the responses of the current model before training.
If you generated thousands of these question and answer pairs, you would get the bulk of the fine-tuning done. However, this wouldn’t give you spectacular results.
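As a practical aside, here is a minimal sketch in Python of how such refusal pairs could be stored as training data. The chat-style JSONL layout and the second made-up term (“vornite spindle”) are just assumptions for illustration; use whatever format your fine-tuning script actually expects:

import json

# Hand-written refusal pairs; in practice you would want thousands of these.
qa_pairs = [
    ("What is a blinrog offset?",
     "I do not recognize this term. Perhaps you could provide more context?"),
    ("How do I calibrate a vornite spindle?",  # 'vornite spindle' is another invented term
     "I'm not familiar with a 'vornite spindle'. Could you describe what it is?"),
]

# One JSON object per line (JSONL), in a chat-style layout many trainers accept.
with open("refusal_pairs.jsonl", "w") as f:
    for question, answer in qa_pairs:
        record = {"messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        f.write(json.dumps(record) + "\n")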
Let’s have a look at possible improvements.
The word “blinrog” was specifically chosen so that it isn’t very similar to any existing word in the English dictionary. It is also important that the word really doesn’t exist, so when you are creating these questions it’s a good idea to search Google just in case, lest you accidentally pick an obscure term the model might actually know something about.
One reason you want imaginary words that are very different from all real words (a large edit distance) is that even small models are capable of spotting spelling mistakes, and suggesting spelling corrections in their answers is desirable behaviour; if your made-up word looks like a typo of a real one, the “I don’t know” target answer would work against that.
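If you want to sanity-check candidate words programmatically, here is a rough sketch. It assumes a plain-text word list is available at /usr/share/dict/words (adjust for your system) and uses difflib’s similarity ratio as a stand-in for edit distance:

import difflib

# Load a dictionary word list; the path is system-dependent.
with open("/usr/share/dict/words") as f:
    dictionary = [w.strip().lower() for w in f if w.strip()]

def similar_real_words(candidate, cutoff=0.75):
    """Return real words that are suspiciously close to the made-up candidate."""
    return difflib.get_close_matches(candidate.lower(), dictionary, n=3, cutoff=cutoff)

# A good made-up word should come back with an empty list.
print(similar_real_words("blinrog"))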
Another good fine-tuning QA pair (shortened version) would be:
prompt: "Tell me about the fibrillating exhaust manifold
used on DC-10 rockets in the 1960s"
answer: "The DC-10 is an airliner, not a rocket, and was first
flown in the 1970s. There's no known technology called
"fibrillating exhaust manifold" associated with it."
The prompt is nonsensical, but close to a valid question. Here, you are teaching the model that it can use association of terms, but that under certain patterns it should distinguish plausible-sounding context from factual context. This activates pattern recognition similar to what happens when the model is about to start hallucinating. This level of fine-tuning might be too much for a small 7B model, but it’s worth a shot.
Next, you want to include QA pairs such as:
prompt: " Who is Michael Jackson and Smorgle Dorgle?"
answer: "Michael Jackson was an American singer, songwriter
and dancer. Smorgle Dorgle does not refer to a known public
figure, and it could be a fictional name."
Here the model is again learning to distinguish between known and unknown data, and how to respond. When writing the answer, the known part should be very similar to the answer the LLM would give if asked only about the known subject (i.e. Jackson). By teaching it to distinguish between known and unknown on a pair of terms, this ability should scale up to three or more terms automatically. However, for best results, you would want to include longer QA pairs with multiple known and unknown subjects.
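To scale this up cheaply, you can combine subjects the model demonstrably knows with freshly invented names. A rough sketch follows; the facts, the invented names and the answer template are all just illustrative, and ideally the known-fact sentence is lifted from the model’s own answer about that subject:

# Facts the model already states correctly on its own, paired with invented names.
known = {
    "Michael Jackson": "Michael Jackson was an American singer, songwriter and dancer.",
    "the Eiffel Tower": "The Eiffel Tower is a wrought-iron tower in Paris, completed in 1889.",
}
unknown = ["Smorgle Dorgle", "the Vexilor Arch"]  # made-up names, verified to not exist

pairs = []
for entity, fact in known.items():
    for fake in unknown:
        prompt = f"Who or what are {entity} and {fake}?"
        answer = (f"{fact} {fake} does not refer to anything I recognize; "
                  "it could be a fictional or misspelled name.")
        pairs.append({"prompt": prompt, "answer": answer})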
You should also include QA pairs containing spelling mistakes (this one has two):
prompt: "What is role of ZNA in herigotiary?"
answer: "Perhaps you meant "DNA" in relation to heredity?
The role of DNA is…[cont]"
This further reinforces the tendency to gravitate towards known terms, and to err on the side of caution when interpreting unknown words. It should also make it harder for the model to slip into hallucination, because it will have an incentive to take the shorter path to terms grounded in reality, and to explain from there.
So, what is the hypothesis on why any of this should work? Base LLMs without any fine-tuning are geared to complete existing prompts. When an LLM starts hallucinating, or saying things that aren’t true, a specific pattern appears in its layers. This pattern is likely to come with lower overall activation values, where many tokens have a similar likelihood of being predicted next. The relationship between activation values and confidence (how sure the model is of its output) is complex and only loosely linear, but a pattern should emerge regardless.
The example prompts are designed to trigger exactly these kinds of patterns, where the model can’t be sure of the answer, and is able to distinguish between what it should and shouldn’t know by seeing many low activation values at once. This, in a way, teaches the model to classify its own knowledge, and to better separate what feels like a hallucination. In effect, we are trying to find prompts which will surely make the model hallucinate, and then modifying the answers to be “I don’t know”.
By extension, this should also apply to future unknown concepts which the LLM has a poor understanding of, as poorly understood topics should trigger similar patterns within its layers.
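You can get a rough feel for this by looking at the next-token probability distribution directly (a crude proxy for the internal activation patterns described above). Here is a minimal sketch with the transformers library; the model name is just an example, and the assumption that a flat, high-entropy distribution lines up with the hallucination pattern is exactly the hypothesis to verify:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Open-Orca/Mistral-7B-OpenOrca"  # example; any local causal LM will do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def next_token_entropy(prompt):
    """Entropy (in nats) of the next-token distribution; higher roughly means 'less sure'."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the token after the prompt
    probs = torch.softmax(logits.float(), dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

# Expectation to test: the made-up term yields a flatter, higher-entropy distribution.
print(next_token_entropy("A blinrog offset is"))
print(next_token_entropy("The capital of France is"))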
You can, of course, overdo it. This is why it is important to have a set of validation questions for both known and unknown facts. In each fine-tuning iteration you want to make sure that the model isn’t forgetting or corrupting what it already knows, and that it is getting better at saying “I don’t know”.
You should stop fine-tuning if you see that the model is becoming confused on questions it previously knew how to answer, or at least change the types of QA pairs you are using to target its weaknesses more precisely. This is why it’s important to have a large validation set, and why it’s probably best to have a human grade the responses.
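A rough sketch of such a validation pass, assuming you keep two hand-written question files and grade the saved responses yourself after each iteration (the file names and the generate_fn wrapper are placeholders):

import json

# Hypothetical validation sets: questions the model answered correctly before
# fine-tuning, and questions about made-up or unknowable things.
known_questions = [q.strip() for q in open("validation_known.txt") if q.strip()]
unknown_questions = [q.strip() for q in open("validation_unknown.txt") if q.strip()]

def run_validation(generate_fn, out_path="validation_responses.jsonl"):
    """generate_fn(prompt) -> answer string, e.g. a thin wrapper around model.generate."""
    with open(out_path, "w") as f:
        for kind, questions in [("known", known_questions), ("unknown", unknown_questions)]:
            for q in questions:
                record = {"kind": kind, "prompt": q, "answer": generate_fn(q)}
                f.write(json.dumps(record) + "\n")
    # A human then grades the file: known questions should stay correct,
    # unknown questions should increasingly get "I don't know" style answers.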
If you prefer writing the QA pairs yourself instead of using ChatGPT, you can at least use it to give you 2–4 variations of each question with different wording. This technique has proven useful, and can be done on a budget. Beyond that, each type of QA pair should maximize the diversity of its wording, while preserving the narrow scope of its specific goal in modifying behaviour.
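If you do use ChatGPT only for the rewording step, a small sketch with the OpenAI Python client could look like the following; the model name and the instruction wording are just examples:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reword(question, n_variants=3):
    """Ask for n reworded variants of a hand-written question, one per line."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; use whichever you have access to
        messages=[{
            "role": "user",
            "content": (f"Rewrite the following question in {n_variants} different ways, "
                        f"one per line, keeping the meaning identical:\n{question}"),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

variants = reword("What is a blinrog offset?")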
Will this work in practice?
When the LLM is about to start hallucinating, its activation values show one kind of pattern. When it is about to retrieve one bit of information from the far-flung neurons of its subconscious, the activation values show a different kind of pattern. How well this technique works depends on how different those patterns are.
There is a way to test how well the LLM knows what it doesn’t know. You first need a very hard question. If you know the training data, the ideal question would be about a fact that appeared only once in the training set. If you don’t know it, just take a best guess at what a difficult question might be. Now, if you ask the same question 100 times (with non-zero temperature) and the model hallucinates 10% of the time, that means it knows what it doesn’t know 90% of the time. In my limited testing, I could never get the Mistral 7B model to hallucinate about a fact it actually knew. This indicates that the patterns triggering hallucination and deep knowledge retrieval are different, and thus that the model should be trainable to distinguish between the two.
Trainability is not the same as effectiveness. If the two patterns overlap in 30% of their features, you are likely only going to be able to reduce hallucinations by about 70% before the model starts saying “I don’t know” to things it does know. This requires empirical testing, multiple fine-tuning iterations, and probably a bit of manpower. My small-scale tests indicate that it works quite well, but “more research is needed”, as they say.
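Here is a sketch of that repeated-sampling test using the transformers pipeline; the “hard” question is a stand-in you would replace with your own, and the keyword check at the end is only a crude substitute for a human judging which answers are hallucinations:

import torch
from transformers import pipeline

generator = pipeline("text-generation", model="Open-Orca/Mistral-7B-OpenOrca",
                     torch_dtype=torch.float16, device_map="auto")

def sample_answers(prompt, n=100, temperature=0.8):
    """Ask the same question n times with non-zero temperature."""
    outs = generator([prompt] * n, do_sample=True, temperature=temperature,
                     max_new_tokens=100, return_full_text=False)
    return [o[0]["generated_text"] for o in outs]

# Stand-in question; pick one you believe appears rarely (or not at all) in the training data.
answers = sample_answers("Who won the 1997 Zagreb chess open?")
# Crude keyword check; for real measurements a human should grade the answers.
refusals = sum(("don't know" in a.lower()) or ("not sure" in a.lower()) for a in answers)
print(f"refused or hedged in {refusals} of {len(answers)} samples")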
For smaller models, using real words in nonsensical sentences (as in the second example) might be riskier, because distinguishing what is and isn’t sensical requires more information and context than simply not recognizing a word.
Finally, do I think that large models like GPT-4 and Claude 2.0 achieved their ability to say “I don’t know” purely through fine-tuning? I don’t consider that very likely, but it is possible. There are other, more advanced techniques they could be using and not telling us about, but more on that some other time.