Getting the Most Out of GPT-3-based Text Classifiers: Part Three

Label Probabilities and Multi-Label Output

Alex Browne
Edge Analytics
7 min read · Aug 25, 2021

This is part three of a series on how to get the most out of GPT-3 for text classification tasks (see part one on reducing out-of-bound predictions and part two about fine-tuning prompts with probability spectrums). This post will describe a technique for producing multi-label outputs and improving classification results.

At Edge Analytics, we’re using GPT-3 and other cutting-edge NLP technologies to create end-to-end solutions. Check out our recent work with inVibe using GPT-3 to improve the market research process!

What is GPT-3?

GPT-3 stands for “Generative Pre-trained Transformer 3”. It was created by OpenAI and, at the time of writing, is the largest model of its kind, consisting of 175 billion parameters. It was also pre-trained on one of the largest text corpora ever assembled, about 499 billion tokens (approximately 2 trillion characters), which includes a significant chunk of the text available on the internet.

[Image: an example GPT-3 completion. The text in bold is the “prompt” and the rest of the text is GPT-3’s “prediction”.]

GPT-3 uses a text-based interface. It accepts a sequence of text (i.e., the “prompt”) as an input and outputs a sequence of text that it predicts should come next (i.e., the “prediction” or “completion”). Through this surprisingly simple interface, GPT-3 is able to produce impressive results. The trick is designing the right prompt to extract the right knowledge encoded within GPT-3.
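
For readers who prefer code, here is a minimal sketch of what that interface looks like using the circa-2021 openai Python package (the engine name, prompt, and parameter values below are placeholders, not recommendations):

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    engine="davinci",  # placeholder engine name
    prompt="Q: What is the capital of France?\nA:",
    max_tokens=5,
    temperature=0,
)

# The "completion": the text GPT-3 predicts should come next.
print(response["choices"][0]["text"])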

At the time of writing, GPT-3 is in a private beta. You have to apply for access on the OpenAI website. We recommend watching this YouTube video for a good overview of the process and some tips for getting access.

Interpreting Label Probabilities

There’s one more technique that we use at Edge Analytics to get the most out of GPT-3, and it has to do with how we interpret the API output. The GPT-3 API doesn’t just return the predicted text; it also returns metadata about the prediction, such as the duration of the request and the exact version of the underlying model that was used. Of particular interest to us is a field called top_logprobs, which contains the natural log of the probability for the most likely candidate tokens at each position in the prediction. The format of the top_logprobs field makes it a little hard to interpret at first glance. However, with some additional processing steps, we can use it to improve accuracy and convert from single-label output to multi-label output.

Let’s look at a specific example. We’ll use the following prompt:

Classify each of the following foods as either a fruit or a vegetable.
Food: apple
Label:
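
To get the token-level probabilities programmatically, we can send this prompt through the API with the logprobs parameter set. The sketch below (again assuming the circa-2021 openai Python package; the engine name and parameter values are placeholders) is roughly what that request might look like, and it is what populates the top_logprobs field described above:

import openai

prompt = (
    "Classify each of the following foods as either a fruit or a vegetable.\n"
    "Food: apple\n"
    "Label:"
)

response = openai.Completion.create(
    engine="davinci",   # placeholder engine name
    prompt=prompt,
    max_tokens=3,       # enough room for a short label
    temperature=0,
    logprobs=10,        # return the 10 most likely tokens at each position
)

# One {token: log probability} dict per predicted token position.
top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"]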

If we just look at the probability spectrums in the playground, we see that GPT-3’s top prediction is “fruit” with a probability of ~33%. It might not seem like GPT-3 is very confident about the predicted label. However, there’s more to it than that. As we’ll soon see, the reason that this probability is relatively low has to do with formatting. GPT-3 is not sure whether the label should be capitalized, whether it should be preceded by a space, or whether it should be surrounded by quotation marks. In reality, “fruit”, “ fruit”, and “Fruit” have the same semantic meaning, just with minor formatting differences. We should try to account for similarities like these in order to extract the most accurate prediction possible.

Here are the first three elements in the top_logprobs array (while this is sufficient for our example, in some cases it may be necessary to look at more than just the first three elements):

"top_logprobs": [
{
"\n": -6.6079745,
" vegetable": -4.0431094,
" ": -6.2982974,
" Fruit": -1.9344363,
" Veget": -4.888889,
"Ve": -10.074649,
"ve": -9.217822,
"fruit": -6.2173643,
"F": -8.124118,
" fruit": -0.1926153
},
{
"\n": -0.0059370887,
" vegetable": -6.9535685,
" ": -5.4001474,
" Fruit": -11.35315,
" Veget": -8.579499,
"Ve": -12.559547,
"ve": -11.902375,
"fruit": -12.914117,
"F": -14.043984,
" fruit": -8.356797
},
{
"get": -6.502035,
"\n": -0.4551382,
" vegetable": -8.820861,
" ": -4.151076,
"Ve": -4.243354,
"ve": -5.310599,
"fruit": -5.189734,
"F": -1.1317282,
"able": -8.674933,
" fruit": -7.591307
}
]

First, let’s convert these numbers into a format that we’re more familiar with. The documentation doesn’t exactly spell it out, but as you might have guessed, the numbers in top_logprobs are simply the natural log of the probability. Converting them to a probability between 0 and 1 is straightforward:

import numpy as np

def logprob_to_prob(logprob: float) -> float:
    return np.exp(logprob)

>>> logprob_to_prob(-1.9344363)
0.14450570373415098
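
The same conversion can be applied to a whole entry of top_logprobs at once. Here is a small sketch, assuming top_logprobs has been pulled out of the API response as shown earlier:

# Convert every candidate token at the first predicted position
# into a plain probability between 0.0 and 1.0.
first_position = top_logprobs[0]
token_probs = {token: logprob_to_prob(lp) for token, lp in first_position.items()}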

Our approach gets a little more complicated from here. We developed a recursive algorithm for iterating through each element in top_logprobs and adding up the corresponding probabilities for each label. Let's start with a function which just tells us the probability for a single label based on the given logprobs.

from typing import Dict, List

def prob_for_label(label: str, logprobs: List[Dict[str, float]]) -> float:
    """
    Returns the predicted probability for the given label as
    a number between 0.0 and 1.0.
    """
    # Initialize probability for this label to zero.
    prob = 0.0
    # Look at the first entry in logprobs. This represents the
    # probabilities for the very next token.
    next_logprobs = logprobs[0]
    for s, logprob in next_logprobs.items():
        # We want labels to be matched case-insensitively and
        # ignoring surrounding whitespace. In other words:
        #
        #   prob_for_label("vegetable") =
        #       prob("vegetable") + prob("Vegetable") + prob(" vegetable")
        #
        s = s.lower().strip()
        if label.lower() == s:
            # If the prediction matches one of the labels, add
            # the probability to the total probability for that
            # label.
            prob += logprob_to_prob(logprob)
    return prob

Note that sometimes the tokens in logprobs are not exactly equal to the label, but rather they are prefixes of the label. This is where recursion comes in. For example, if the token in logprobs is “Vege”, we re-frame the problem by iterating through the rest of logprobs and searching for the rest of the label (in this case “table”).

from typing import Dict, List

def prob_for_label(label: str, logprobs: List[Dict[str, float]]) -> float:
    """
    Returns the predicted probability for the given label as
    a number between 0.0 and 1.0.
    """
    # If we have run out of token positions to look at, the
    # remaining probability is zero.
    if not logprobs:
        return 0.0
    # Initialize probability for this label to zero.
    prob = 0.0
    # Look at the first entry in logprobs. This represents the
    # probabilities for the very next token.
    next_logprobs = logprobs[0]
    for s, logprob in next_logprobs.items():
        # We want labels to be matched case-insensitively and
        # ignoring surrounding whitespace. In other words:
        #
        #   prob_for_label("vegetable") =
        #       prob("vegetable") + prob("Vegetable") + prob(" vegetable")
        #
        s = s.lower().strip()
        if label.lower() == s:
            # If the prediction matches one of the labels, add
            # the probability to the total probability for that
            # label.
            prob += logprob_to_prob(logprob)
        elif label.lower().startswith(s):
            # If the prediction is a prefix of one of the labels, we
            # need to recur. Multiply the probability of the prefix
            # by the probability of the remaining part of the label.
            # In other words:
            #
            #   prob_for_label("vegetable") =
            #       prob("vege") * prob("table")
            #
            rest_of_label = label[len(s):]
            remaining_logprobs = logprobs[1:]
            prob += logprob_to_prob(logprob) * prob_for_label(
                rest_of_label,
                remaining_logprobs,
            )
    return prob

At this point, all we need to do is call prob_for_label for each of our labels: “fruit” and “vegetable”. Any probability mass that isn’t assigned to one of those labels can be collected into a catch-all “other” bucket. What we get at the end of this process is a multi-label output, with a probability between 0.0 and 1.0 for each possible label:

{
"fruit": 0.78,
"vegetable": 0.05,
"other": 0.17
}
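
Putting it all together, the final dictionary can be assembled in a few lines. This is just a sketch; in particular, treating “other” as whatever probability mass is left over is one reasonable choice rather than something the API provides:

labels = ["fruit", "vegetable"]
label_probs = {label: prob_for_label(label, top_logprobs) for label in labels}
# Assign any remaining probability mass to a catch-all "other" bucket.
label_probs["other"] = max(0.0, 1.0 - sum(label_probs.values()))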

Summary

In this post, we described a technique for using logprobs to improve the accuracy of GPT-3-based text classifiers. In the example above, this technique gives us a more complete picture of the probability for the “fruit” label. The naive approach estimated the probability to be 33%, which seemed to indicate that GPT-3 was not confident about the prediction. However, our new technique shows that the probability is closer to 78%, which tells us GPT-3 is fairly certain about the answer.

The second major benefit is that this technique gives us multi-label outputs, which the GPT-3 API doesn’t otherwise provide directly. We can use multi-label outputs to gain insight into what GPT-3 is really thinking and further refine our predictions. For example, in sentiment analysis, if both “positive” and “negative” have relatively high probabilities, that may be a good indication that the most appropriate label is really “mixed”.
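
As a rough sketch of that idea (the helper function and the 0.3 threshold here are hypothetical choices, not part of the GPT-3 API):

from typing import Dict

def sentiment_label(label_probs: Dict[str, float], threshold: float = 0.3) -> str:
    """Collapse multi-label sentiment probabilities into a single label."""
    # If both "positive" and "negative" carry substantial probability,
    # treat the overall sentiment as "mixed".
    if (label_probs.get("positive", 0.0) >= threshold
            and label_probs.get("negative", 0.0) >= threshold):
        return "mixed"
    # Otherwise, fall back to the single most probable label.
    return max(label_probs, key=label_probs.get)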

Note that this example and the corresponding code have been simplified for demonstration purposes. In the real world, we have more edge cases and more complicated labels to deal with. This technique makes an even bigger difference for longer labels or labels that contain the same prefix (e.g. “somewhat positive”, “somewhat negative”, “mostly positive”, and “mostly negative”). In such cases, naive approaches to estimating label probability can sometimes produce answers that are flat out wrong, and the recursive technique we described is critical for ensuring accuracy.

GPT-3 at Edge Analytics

Edge Analytics has helped multiple companies build solutions that leverage GPT-3. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at info@edgeanalytics.io.

Getting the Most Out of GPT-3-based Text Classifiers: Part 1, Part 2, Part 3
