Getting the Most Out of GPT-3-based Text Classifiers: Part One
Reducing Out-of-Bounds Predictions
This is part one of a series on how to get the most out of GPT-3 for text classification tasks (Part 2, Part 3). In this post, we’ll talk about a common issue, “out-of-bounds” predictions, and how to reduce or even completely eliminate it.
At Edge Analytics, we’re using GPT-3 and other cutting-edge NLP technologies to create end-to-end solutions. Check out our recent work with inVibe using GPT-3 to improve the market research process!
What is GPT-3?
GPT-3 stands for “Generative Pre-trained Transformer 3”. It was created by OpenAI and, at the time of writing, is the largest model of its kind, consisting of 175 billion parameters. It was also pre-trained on one of the largest text corpora ever assembled, about 499 billion tokens (approximately 2 trillion characters), which includes a significant chunk of all the text available on the internet.
GPT-3 uses a text-based interface. It accepts a sequence of text (i.e., the “prompt”) as an input and outputs a sequence of text that it predicts should come next (i.e., the “prediction” or “completion”). Through this surprisingly simple interface, GPT-3 is able to produce impressive results. The trick is designing the right prompt to extract the right knowledge encoded within GPT-3.
At the time of writing, GPT-3 is in a private beta. You have to apply for access on the OpenAI website. We recommend watching this YouTube video for a good overview of the process and some tips for getting access.
What is an out-of-bounds prediction?
This post describes some techniques we use when leveraging GPT-3 for text classification tasks. One common issue is that GPT-3 can produce outputs that are not one of the intended classes. For example, when we designed a GPT-3-based sentiment classifier to label text as either “positive”, “mixed”, or “negative”, it would sometimes predict “unknown”. This label makes sense semantically but is not one of the three labels we expected. We call this kind of output from GPT-3 an “out-of-bounds” prediction.
The reason “out-of-bounds” predictions can occur has to do with how GPT-3 was trained and what it is designed to do. GPT-3 is not just a text classifier; it has no built-in rules for classification at all. GPT-3 was really only designed to do one thing: predict the sequence of text that is most likely to follow the prompt. It has to learn the rules of the classifier on the fly, based on the prompt and any similar patterns from its massive training corpus. Given this context, it’s understandable that it might sometimes get things wrong.
Luckily, the GPT-3 API gives us some knobs to tweak to almost entirely eliminate out-of-bounds predictions. In this blog post, we’ll cover two of them:
- Adjusting the `temperature`.
- Setting the `logit_bias`.
Adjusting the temperature
The `temperature` parameter is pretty straightforward. The GPT-3 documentation explains how `temperature` works: “Higher values means the model will take more risks. Try 0.9 for more creative applications, and 0 (argmax sampling) for ones with a well-defined answer.” For text classification tasks, you usually want to set this to 0.
While setting `temperature` to 0 goes a long way toward reducing out-of-bounds predictions, it doesn’t completely eliminate them. To take things one step further, you can use the `logit_bias` parameter (more on that in a bit).
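To build some intuition for what `temperature` does, here is a toy Python sketch of temperature-scaled sampling. This is an illustration, not GPT-3’s actual implementation, and the logits below are made up:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into a probability distribution.

    temperature > 1 flattens the distribution (more risk-taking);
    temperature -> 0 approaches argmax (always pick the top token).
    """
    if temperature == 0:
        # Argmax sampling: put all probability mass on the top logit.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels ["positive", "mixed", "negative", "unknown"]
logits = [2.0, 1.0, 0.5, 1.9]

print(softmax_with_temperature(logits, 0.9))  # "unknown" keeps real probability mass
print(softmax_with_temperature(logits, 0))    # argmax: "positive" is chosen every time
```

With `temperature=0` the runner-up label never gets sampled, which is exactly the deterministic behavior you want for classification.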
The logit_bias parameter
Tokenization: How GPT-3 sees text
In order to understand how `logit_bias` can be used to create a whitelist of labels, we first need to take a step back and explain how GPT-3 interprets text. GPT-3 doesn’t see text as a string of characters or words, but as a list of “token ids”. Each token id is a unique number that represents a short sequence of characters (on average, each token corresponds to about 4 characters of English text). The process for converting text to tokens is based on the frequency of certain combinations of characters. Common words like “cat” and “ball” are likely to be assigned their own unique tokens, whereas less common words like “establishment” are likely to be sliced up into more than one token (e.g., “estab”, “lish”, “ment”).
Why does GPT-3 use this tokenization approach? It has a few benefits. It provides a fairly dense encoding of text that computers can easily work with. And since every word can be constructed from a combination of tokens, no word is ever completely out-of-vocabulary (a problem that can plague other encoding techniques, like word embeddings). GPT-3 can accept words it has never seen before as input and, given enough context, can potentially understand them.
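To make the splitting behavior concrete, here is a toy sketch of greedy longest-match tokenization. The vocabulary and token ids below are invented for illustration; GPT-3’s real tokenizer uses byte-pair encoding with a learned vocabulary of roughly 50,000 entries:

```python
# Toy vocabulary mapping character sequences to made-up token ids.
# (These ids are invented for illustration only.)
TOY_VOCAB = {
    "cat": 1, "ball": 2, "estab": 3, "lish": 4, "ment": 5,
    "e": 6, "s": 7, "t": 8, "a": 9, "b": 10, "l": 11, "i": 12,
    "h": 13, "m": 14, "n": 15, "c": 16,
}

def toy_tokenize(word):
    """Greedy longest-match tokenization against TOY_VOCAB."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest possible match first, falling back to shorter pieces.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in TOY_VOCAB:
                tokens.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {word[i]!r}")
    return tokens

print(toy_tokenize("cat"))            # common word -> a single token
print(toy_tokenize("establishment"))  # rarer word -> several tokens
```

Because single characters always exist as a fallback, every word can be encoded somehow, which is why no input is ever out-of-vocabulary.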
Setting the logit_bias parameter
The `logit_bias` parameter is a bit more complicated than `temperature`. Effectively, it can be used to tweak the probability of certain tokens being included in GPT-3’s prediction. As the documentation elaborates, “values between -1 and 1 should decrease or increase likelihood of selection; values like -100 or 100 should result in a ban or exclusive selection of the relevant token”. For most text classification tasks, we want to restrict GPT-3’s output to a whitelist of labels that we define ahead of time. In other words, “exclusive selection” is exactly what we want.
So in order to use `logit_bias` to create a whitelist of labels, we need to first convert our labels to tokens. For this, we’ll need to import the transformers library.
```python
from typing import List

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize_labels(labels: List[str]) -> List[int]:
    """
    Converts a list of labels into a list of GPT-3 token ids.
    """
    tokens = []
    for label in labels:
        tokens += tokenizer.encode(label)
    return tokens
```
But there’s a catch! Because of how token encoding works, “ positive” (with a preceding space) is represented by a different token than “positive” (without the space). To make matters more complicated, it’s not always easy to tell which representation GPT-3 will choose, since it can depend on the preceding text. We’ve found the most reliable way to account for this difference is to add both “positive” and “ positive” to our whitelist. We do this for each label.
```python
def tokenize_labels(labels: List[str]) -> List[int]:
    """
    Converts a list of labels into a list of GPT-3 token ids.
    Adds preceding whitespace as needed in order to account for
    quirks in how GPT-3 handles tokenization.
    """
    # Start with whitespace tokens.
    tokens = tokenizer.encode(" ") + tokenizer.encode("\n")
    # Tokenize each label by itself *and* with a preceding space.
    for label in labels:
        tokens += tokenizer.encode(label)
        tokens += tokenizer.encode(" " + label)
    return tokens
```
The final step is to convert this list of tokens into the dictionary format expected by the `logit_bias` parameter, where each token has a weight (or bias) of 100. Effectively, we are telling GPT-3 to construct its prediction exclusively from this set of tokens.
```python
from typing import Dict

def get_logit_bias(labels: List[str]) -> Dict[str, float]:
    """
    Returns a logit_bias that can be used to constrain GPT-3
    predictions to a set of pre-determined character sequences
    (i.e., phrases or words). Intended to be used for classification
    problems.
    """
    tokens = tokenize_labels(labels)
    logit_bias: Dict[str, float] = {}
    for token in set(tokens):
        # Set the logit_bias for each token to 100, effectively
        # forcing GPT-3 to only choose from these tokens.
        logit_bias[str(token)] = 100
    return logit_bias
```
The reason this technique does not completely eliminate out-of-bounds responses is that, hypothetically, there is nothing stopping GPT-3 from combining the tokens in our whitelist in unexpected ways. For example, it might combine some of the tokens for “negative” and “mixed” to output “negamixed”. In practice it is unlikely that GPT-3 will predict a high probability for made up words, and these kinds of out-of-bounds predictions are exceedingly rare (though it does depend on the exact labels used).
Summary
In this post, we covered two ways to reduce out-of-bounds predictions when using GPT-3 as a text classifier. First, set the `temperature` parameter to 0. Second (and this is a bit more complicated), use the `logit_bias` parameter to force GPT-3 to exclusively build its prediction from a predefined set of tokens.
To put everything together, let’s look at a specific demonstration of how these techniques can be used to improve the performance of a GPT-3-based text classifier. In this simplified example, we want GPT-3 to classify foods as either a “fruit” or a “vegetable”.
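As a sketch of how these pieces might fit together for the fruit/vegetable example (the helper name, prompt wording, and request assembly below are illustrative assumptions, not the exact code we used; a stand-in `encode` function keeps the snippet self-contained, where in practice you would pass the real tokenizer’s `encode`):

```python
from typing import Callable, Dict, List

def build_classifier_request(
    text: str,
    labels: List[str],
    encode: Callable[[str], List[int]],
) -> Dict:
    """
    Assembles the kwargs for a GPT-3 completion request that classifies
    `text` into one of `labels`. `encode` converts a string into token
    ids, e.g. GPT2TokenizerFast.from_pretrained("gpt2").encode.
    """
    # Whitelist each label both with and without a preceding space.
    tokens: List[int] = []
    for label in labels:
        tokens += encode(label)
        tokens += encode(" " + label)
    return {
        "prompt": f"Classify the item as one of: {', '.join(labels)}.\n\nItem: {text}\nLabel:",
        "temperature": 0,  # argmax sampling
        "max_tokens": 2,   # labels are short
        "logit_bias": {str(t): 100 for t in set(tokens)},
    }

request = build_classifier_request(
    "apple", ["fruit", "vegetable"],
    encode=lambda s: [ord(c) for c in s],  # stand-in tokenizer for the demo
)
# With a real API key and the real tokenizer, this would be sent as:
#   import openai
#   response = openai.Completion.create(engine="davinci", **request)
```

With both knobs set this way, the completion can only be assembled from the whitelisted label tokens, so a prompt like the one above reliably returns “fruit” or “vegetable” rather than a free-form answer.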
GPT-3 at Edge Analytics
Edge Analytics has helped multiple companies build solutions that leverage GPT-3. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at info@edgeanalytics.io.
Getting the Most Out of GPT-3-based Text Classifiers: Part 1, Part 2, Part 3