Measuring LLM Confusion
While recently working on a GenAI Observability platform, I had the opportunity to dig into various metrics for measuring the quality of LLMs and LLM applications. One metric we explored and shipped as part of the platform is called “Perplexity”. I will not go into why it is popular or what its limitations are, as that is covered quite extensively in other online literature.
We will cover the following in this article:
- What is Perplexity
- Computing Perplexity
- Math behind Perplexity
- References
What is Perplexity
- Perplexity is a metric for evaluating the quality of language models, particularly for the “Text Generation” task type.
- Perplexity quantifies how well an LLM can predict the next word in a sequence of words.
- It is calculated from the probability distribution of the words generated by the model. A high perplexity indicates that the LLM is not confident in its text generation, that is, the model is “perplexed”, whereas a low perplexity indicates that the LLM is confident in its generation.
- While high confidence doesn’t guarantee accuracy, it is a helpful signal that can be paired with other evaluation metrics to build a better understanding of your prompt’s behavior.
- Perplexity can only be calculated for autoregressive language models (such as GPT and LLaMA).
Computing Perplexity
When calling an LLM for a given task type, the model response must contain token probabilities. For example, if the LLM response is `I went to the market for buying apples`, the LLM must return a probability for each token in that response. The API request to and response from the LLM must meet certain requirements to be able to compute perplexity:
1. When working with an OpenAI-compatible API, set the `logprobs` parameter to `true` in the API request.
2. The API response then returns the log probabilities of each output token.
Below is an example of the `logprobs` object in the API response. The token log probabilities are present in the `token_logprobs` element.
{
"id": "",
"object": "text_completion",
"created": 1281813,
"choices": [
{
"index": 0,
"text": "\nA: Yes, artificial intelligence has grown exponentially in the last decade, with advances in machine learning and deep learning providing new capabilities for applications such as natural language processing and computer vision.\n",
"logprobs": {
"token_logprobs": [
0.0, -0.8413524627685547, 0.0, -0.4566369652748108, 0.0, 0.0, 0.0,
0.0, 0.0, -0.5105117559432983, 0.0, 0.0, 0.0, 0.0, -0.42328080534935,
-0.9261689782142639, -1.7805380821228027, 0.0, -0.3008694648742676,
0.0, 0.0, 0.0, 0.0, -0.5651261806488037, 0.0, -0.5368868112564087,
-0.3304581940174103, -0.9557339549064636, -0.468206524848938, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
]
},
"finish_reason": "stop",
"stop_reason": null
}
]
}
3. The formula for Perplexity is:

$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, w_2, \dots, w_{i-1})\right)$$

where,

N = total number of tokens

P(w_i | w_1, w_2, …, w_{i-1}) is the probability of word w_i given the previous words w_1, w_2, …, w_{i-1}.
4. When working against a deployed model, the response gives the log probability of each token in the `token_logprobs` element. We therefore compute perplexity as `np.exp(-np.mean(token_logprobs))`, as shown in the sketch below.
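For illustration, here is a minimal end-to-end sketch, assuming an OpenAI-compatible `/v1/completions` endpoint running locally. The URL, model name, and prompt are placeholders, and some providers expect an integer for `logprobs` rather than a boolean, so adapt this to your deployment.

```python
import numpy as np
import requests

# Minimal sketch: the endpoint URL, model name and prompt are placeholders.
response = requests.post(
    "http://localhost:8000/v1/completions",  # hypothetical OpenAI-compatible endpoint
    json={
        "model": "my-model",                 # placeholder model name
        "prompt": "Has artificial intelligence grown in the last decade?",
        "max_tokens": 64,
        "logprobs": True,                    # ask for per-token log probabilities
    },
    timeout=30,
)
body = response.json()

# Per-token log probabilities, as shown in the sample response above.
token_logprobs = body["choices"][0]["logprobs"]["token_logprobs"]

# Perplexity = exp(-(1/N) * sum of log P(w_i | w_1, ..., w_{i-1}))
perplexity = np.exp(-np.mean(token_logprobs))
print(f"Perplexity: {perplexity:.4f}")
```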
Autoregressive Models
As mentioned above in the article, Perplexity can only be calculated for autoregressive language models (such as GPT and LLaMA). These models work by generating one token at a time, based on a set number of preceding tokens.
To generate an output token, the model analyzes the words that come before it and calculates the likelihood of different words being the next one. It then picks the word with the highest chance of being correct for the next part of the sentence. After that, it repeats the entire process, using the newly selected word as part of the context for the next prediction.
For example, if the preceding generated text is “Tennis is a great” and our context length is set to 4 preceding words, our model’s output distribution might look like the following:
P("sport"
| "Tennis is a great"
) = 45%
P("game"
| "Tennis is a great"
) = 45%
P("workout"
| "Tennis is a great "
) = 9.9%
P("man"
| "Tennis is a great"
) = 0.1%
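As a toy illustration of this greedy selection step (the probabilities below are just the illustrative numbers above, not real model output):

```python
# Toy next-token distribution for the context "Tennis is a great",
# using the illustrative probabilities listed above.
next_token_probs = {
    "sport": 0.45,
    "game": 0.45,
    "workout": 0.099,
    "man": 0.001,
}

# Greedy decoding: pick the candidate with the highest probability
# (ties are broken by insertion order here).
next_token = max(next_token_probs, key=next_token_probs.get)
print(next_token)  # -> "sport"
```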
Math behind Perplexity
In this section, we will derive the Perplexity formula using basic mathematical intuition.
Consider an LLM being used for a text generation task. It is given an input prompt and responds with the next set of tokens, i.e. text. We want to evaluate how confident (or how confused) the model is in the generated response. For this, we expect the model to also return the token probabilities, i.e. for each token, its probability given the earlier tokens.
Let’s establish the mathematical intuition for arriving at a quantified measure of model confidence/perplexity.
1. We are given a list of probabilities (or log probabilities) in the model response.
2. The first intuition is to compute an average/mean of the given probabilities. This gives us a single probability number which can be interpreted as “the combined probability of the model response”.
3. We use the geometric mean. Consider `P1, P2, P3, …, Pn` as the token probabilities, where `P1` = probability of the first token, `P2` = probability of the second token given the first token, and so on, and `n` = number of tokens. The formula for the geometric mean of `n` values is:

$$\text{GM}(x_1, x_2, \dots, x_n) = \left(x_1 \cdot x_2 \cdots x_n\right)^{\frac{1}{n}}$$
4. Based on the above formula, our mean probability is computed as:

$$P_{\text{mean}} = \left(P_1 \cdot P_2 \cdot P_3 \cdots P_n\right)^{\frac{1}{n}}$$
5. We know that probability values lie between 0 and 1, with 0 being the worst and 1 being the best.
6. A larger confidence value implies the model is more confident in its predictions: for example, a larger Pmean implies a more confident model. Conversely, a larger confusion value implies the model is more confused: for example, a larger (1 - Pmean) implies a more confused model.
7. We can thus arrive at a formula for quantifying perplexity. The magnitude is amplified by taking the reciprocal of the value computed above. Thus perplexity = 1 / Pmean, or:

$$\text{Perplexity} = \frac{1}{P_{\text{mean}}} = \left(P_1 \cdot P_2 \cdots P_n\right)^{-\frac{1}{n}}$$
8. We will now use this formula to arrive at the final Perplexity formula as expressed in the “Computing Perplexity” section.
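As a sketch of that last step: rewriting 1 / Pmean with natural logarithms (which is what `np.log` and `np.exp` use) gives

$$\text{Perplexity} = \frac{1}{P_{\text{mean}}} = \left(\prod_{i=1}^{N} P(w_i \mid w_1, \dots, w_{i-1})\right)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, \dots, w_{i-1})\right)$$

which is exactly `np.exp(-np.mean(token_logprobs))`. A quick numerical check with hypothetical token probabilities:

```python
import numpy as np

# Hypothetical token probabilities, only to check that the two forms agree.
probs = np.array([0.9, 0.6, 0.8])

p_mean = np.prod(probs) ** (1 / len(probs))      # geometric mean
ppl_from_mean = 1 / p_mean                       # 1 / Pmean
ppl_from_logs = np.exp(-np.mean(np.log(probs)))  # exp(-mean(log P))

print(ppl_from_mean, ppl_from_logs)  # both ≈ 1.32
```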