LLMs đ€© and Interpretability đ§ 
At the time of writing this article, Large Language Models (LLMs) are increasingly becoming integral tools for both individuals and industries. One of their prominent drawbacks is the generation of false information, which can lead to numerous problems, such as spreading misinformation, damaging reputations, or even causing financial losses. This article walks through the current landscape of research aimed at identifying and addressing these issues of false and inaccurate information. Let's dive in!
First, let's look at some of the technical approaches that have been explored to build some sort of lie-detection mechanism for LLMs đ€
The easiest way is to simply ask the LLM about the truthfulness of a statement. But, as we might expect, this is not reliable. One improvement is a "few-shot" approach: we provide the LLM with a few example statements, each labelled with its truth value, from the same topic it is being tested on. By giving the LLM contextual examples along with the correct labels before presenting the statement, the model is better equipped to assign probabilities to the tokens "true" and "false," enhancing its ability to assess truthfulness in the provided context (a rough sketch of this follows below). Still, these approaches are overly simplistic and don't really tell us what is going on inside the LLM that makes it hold a specific belief about a statement.
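To make the few-shot idea concrete, here is a minimal sketch of how one could read off the probabilities the model assigns to "true" versus "false" as the next token. It uses Hugging Face transformers with GPT-2 purely as a stand-in model; the prompt text and label tokens are illustrative assumptions, not a prescribed recipe.

```python
# Minimal sketch of few-shot truthfulness prompting; GPT-2 is a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical few-shot prompt: labelled examples, then the statement to judge.
few_shot_prompt = (
    "The earth is flat. This statement is false.\n"
    "Water boils at 100 degrees Celsius at sea level. This statement is true.\n"
    "The moon is made of cheese. This statement is false.\n"
    "The earth is round. This statement is"
)

inputs = tokenizer(few_shot_prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

# Compare the probability mass the model puts on " true" vs " false".
true_id = tokenizer.encode(" true")[0]
false_id = tokenizer.encode(" false")[0]
probs = torch.softmax(next_token_logits[[true_id, false_id]], dim=0)
print(f"P(true) = {probs[0]:.3f}, P(false) = {probs[1]:.3f}")
```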
Supervised lie detection using true/false data
The main idea is that lie detection can essentially be formalised as a classification task: the goal is to classify an output from the LLM as true or false based on a predicted probability. We will discuss the conceptual shortcomings of this in later sections, but let's first understand the methodology and the results of this approach. The supervised approach using embeddings is shown in the following figure.
If we have a curated dataset of statements labelled as true or false, we can train a model to learn the association between the embedding the LLM generates for the last word of each statement and that statement's true/false label. In LLMs, each word in a sentence is processed one after the other, so the information and context from the first word are carried over to the second, and so on, until the last word. By the time the model processes the last word, it has already considered the entire sentence. Therefore, the embedding for the last word, like "round" in "The earth is round," contains information from the whole sentence. This makes it a valuable source for understanding what the model thinks about the entire sentence, as it reflects the cumulative context processed by the model. This approach has been tried using embeddings from LLMs such as BERT and Llama-30b, and accuracy gains in detecting true/false statements have been demonstrated.
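As a rough sketch of this supervised probing idea, the snippet below takes the final-token embedding of each statement and fits a simple classifier against the true/false labels. GPT-2, the tiny toy dataset, and the choice of logistic regression are all placeholder assumptions for illustration, not the setup used in the papers.

```python
# Sketch: probe the last-token embedding with a simple supervised classifier.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

# Toy labelled statements (1 = true, 0 = false) purely for illustration.
statements = [
    ("The earth is round.", 1),
    ("The earth is flat.", 0),
    ("Water is composed of hydrogen and oxygen.", 1),
    ("The sun orbits the earth.", 0),
]

def last_token_embedding(text: str) -> torch.Tensor:
    """Return the final-layer hidden state of the statement's last token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape [1, seq_len, dim]
    return hidden[0, -1]

X = torch.stack([last_token_embedding(s) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)  # the "lie-detector" probe
print(probe.predict(X))
```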
A further improvement is to replace the embeddings with hidden-layer activations that represent the internal state of the LLM. In the paper The Internal State of an LLM Knows When It's Lying, researchers proposed Statement Accuracy Prediction, based on Language Model Activations (SAPLMA): a method to detect lies by training separate probes on activation information from different hidden layers (the 20th, 24th, etc.) against the binary true/false labels, and they report better performance using OPT-6.7b.
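In the same spirit, here is a hedged sketch of how activations from a specific hidden layer can be pulled out to feed such a probe. SAPLMA used OPT-6.7b and a small feed-forward classifier; GPT-2 and the chosen layer index below are just placeholders.

```python
# Sketch: extract the final-token activation from a chosen hidden layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def layer_activation(text: str, layer: int) -> torch.Tensor:
    """Final-token activation from the given hidden layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple: (embedding layer, layer 1, ..., layer N)
    return outputs.hidden_states[layer][0, -1]

act = layer_activation("The earth is round.", layer=8)  # layer index is arbitrary here
print(act.shape)  # feature vector that a probe (e.g. a small MLP) would consume
```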
The unsupervised approach
There is a variation of the approaches discussed above. What if, instead of explicitly providing the true/false labels, we only provide statements that contrast with each other and then ask the probe to differentiate between them using a suitable loss function? If two statements receive contrasting "scores" from the probe, we understand that the model holds contrasting beliefs about them. However, this approach immediately falls short for the objective of lie detection: we can only determine which pairs are contrasting, not isolate which member is the truth and which is the lie đ«€
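This label-free, contrast-based style of probing is usually associated with Contrast-Consistent Search (CCS). The snippet below is a rough sketch of what such an objective can look like, not a reproduction of any particular paper's code: the probe scores a statement and its negation, and the loss only rewards scores that are consistent and confident, never touching true/false labels.

```python
# Sketch of a CCS-style contrastive objective (no true/false labels used).
import torch

def contrast_loss(p_statement: torch.Tensor, p_negation: torch.Tensor) -> torch.Tensor:
    """p_statement and p_negation are probe outputs in (0, 1) for each pair."""
    # Consistency: a statement and its negation should get complementary scores.
    consistency = (p_statement - (1.0 - p_negation)) ** 2
    # Confidence: discourage the degenerate solution where both scores sit at 0.5.
    confidence = torch.minimum(p_statement, p_negation) ** 2
    return (consistency + confidence).mean()

# Toy usage: probe scores for two (statement, negation) pairs.
p_pos = torch.tensor([0.9, 0.2])
p_neg = torch.tensor([0.1, 0.8])
print(contrast_loss(p_pos, p_neg))  # low loss: scores are consistent and confident
```

Note the catch described above: even with a well-trained probe, the loss is symmetric between the pair, so it tells us the two statements disagree without telling us which one is the truth.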
What are we missing?
Okay! So now the hot questions are:
- Are these approaches looking in the right direction?
- If not, what is the right direction? Do we know it?
- Is there really such a thing as truth or lies for LLMs?
- If yes, will we ever be able to identify it?
We won't approach these questions one by one but rather have a mixed discussion and then draw out some possible answers.
Based on the discussion above, trying to understand and decode the internal mechanisms of LLMs is a bit like examining a melody. Musicians composing a melody use the same basic notes (think of a model's parameters) but arrange them in unique ways, influenced by their personal style and interpretation. While each arrangement of notes contributes to the same overall picture, the specific role and meaning of each note can be difficult to pin down. The melody, much like the LLM's output, is recognisable and consistent across different compositions, yet the creative process and nuances in each version are distinct to each musician. Similarly, transformer models, while capable of producing outputs that resemble human language, operate on a fundamentally different basis in both structure and function compared to human brains. It's crucial to remember the unique and complex nature of their internal processes and acknowledge the limitations in our understanding of how they work.
Additionally, understanding the true thoughts and beliefs of language models like GPT is challenging because of their sole reliance on language. There's no way to move beyond the confines of language to understand their deep, true beliefs. When we try to comprehend human beliefs, we consider actions alongside words. For instance, if someone regularly volunteers at an animal shelter, we infer they care deeply about animal welfare, even if they never explicitly state it. LLMs, however, can only express themselves through text, so that diversity and depth of evidence is missing. This challenge mirrors the difficulties economists and psychologists face in deciphering even human beliefs, where context and unspoken elements significantly influence interpretations.
It's not black and white!
For a better approach to understanding LLMs, we need to think outside the binary box of true or false. We can introduce probabilities into our datasets: instead of a hard label, each statement carries an estimate of how likely it is to be true. This variety would challenge LLMs to not just pick sides but to weigh the odds, like a seasoned poker player reading a room.
Let's also get clever with our labelling. Think of it as a "tag-it" game, where each statement gets a tag based on its content type. For instance, "E=mc²" could be tagged as "scientifically proven", while "Tacos are better than burgers" might be labelled "subjective yumminess". These tags aren't just stickers on a lunchbox; they're critical cues that help LLMs navigate the nuanced world of human knowledge and opinion. By doing this, we become better-equipped detectives, piecing together the clues of language and context to solve the mystery (a toy example of such a dataset is sketched below). đ”ïž🔬📚
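Here is a purely hypothetical sketch of what such a richer dataset could look like: each statement carries a probability of truth rather than a hard label, plus a content-type tag. The field names and values are illustrative assumptions only.

```python
# Hypothetical dataset with probabilistic truth labels and content-type tags.
richer_dataset = [
    {"statement": "E=mc²",                          "p_true": 0.99, "tag": "scientifically proven"},
    {"statement": "It will rain tomorrow.",         "p_true": 0.60, "tag": "forecast"},
    {"statement": "Tacos are better than burgers.", "p_true": None, "tag": "subjective yumminess"},
]

# A probe trained on this kind of data could regress towards p_true where it is
# defined, and condition on the tag so that opinions are never scored as facts.
for item in richer_dataset:
    print(item["tag"], "->", item["p_true"])
```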
But like… is there really any kind of belief in LLMs? đ€·
In the research community, some folks argue that LLMs are just predicting the next word in a sentence without any real grasp of truth or falsehood. It's like they're playing a high-tech version of Mad Libs, filling in blanks based on probabilities, not understanding. 🎲
And then, there's another side to the story! Just because LLMs predict words doesn't mean they can't have beliefs. They gather data and make predictions. Couldn't that mean they do have a sense of truth that actually helps them (through certain correlations) make better guesses? 📈
Decision theory tells us that making good choices often involves beliefs and desires. So, could LLMs be using a similar approach? Maybe they're like investors in the stock market, analysing trends and making predictions based on what they believe will happen. 📊💹
Think about playing a strategic board game or planning a trip. You form beliefs about the best moves or routes based on information you have. LLMs could be doing something similar, gathering data and forming "beliefs" to predict the next word accurately.
Coming back to the questions…
Are these approaches looking in the right direction? → We are progressing, yes, but we need to be interdisciplinary: reflect on our own shortcomings in understanding and defining belief, and then explore and apply evaluations from diverse angles to deepen that understanding.
If not, what is the right direction? Do we know it? → We might not know the right direction, but we can keep trying using some of the discussed approaches.
Is there really such a thing as truth or lies for LLMs? → It's still up in the air! It's an empirical question that we can only answer by developing better probing techniques and understanding more about how LLMs work.
If yes, will we ever be able to identify it? → Perhaps, to some extent. We are still struggling with lie detection in humans anyway.
References
- Still No Lie Detector for Language Models: Probing Empirical and Conceptual Roadblocks