Getting the Most Out of GPT-3-based Text Classifiers: Part Two

Fine-Tuning Prompts with Probability Spectrums

Alex Browne
Aug 4 · 6 min read
Photo by Markus Winkler on Unsplash

This is part two of a series on how to get the most out of GPT-3 for text classification tasks (Part 1, Part 3). In this post, we’ll talk about how to use a feature of the GPT-3 playground — probability spectrums — to fine-tune prompts and improve performance.

At Edge Analytics, we’re using GPT-3 and other cutting-edge NLP technologies to create end-to-end solutions. Check out our recent work with inVibe using GPT-3 to improve the market research process!

What is GPT-3?

GPT-3 stands for “Generative Pre-trained Transformer 3”. It was created by OpenAI and, at the time of writing, is the largest model of its kind, with 175 billion parameters. It was also pre-trained on one of the largest text corpora ever assembled, about 499 billion tokens (approximately 2 trillion characters), which includes a significant chunk of all text available on the internet.

In this example, the text in bold is the “prompt” and the rest of the text is GPT-3’s “prediction”.

GPT-3 uses a text-based interface. It accepts a sequence of text (i.e., the “prompt”) as an input and outputs a sequence of text that it predicts should come next (i.e., the “prediction” or “completion”). Through this surprisingly simple interface, GPT-3 is able to produce impressive results. The trick is designing the right prompt to extract the right knowledge encoded within GPT-3.

At the time of writing, GPT-3 is in a private beta. You have to apply for access on the OpenAI website. We recommend watching this YouTube video for a good overview of the process and some tips for getting access.

Using Probability Spectrums Effectively

For many tasks, good prompt design is imperative for getting good results from GPT-3. Designing a good prompt can sometimes feel like a guessing game, and it’s not always clear why changing the wording of the prompt can increase or decrease performance. One technique that we can use to help improve our prompts involves looking at the full probability spectrum in the GPT-3 playground.

Let’s consider the following GPT-3 prompt. This is a fairly minimal prompt for tweet sentiment classification and includes several examples.

Label the tweets as either "positive", "negative", "mixed", or "neutral":
Tweet: I can say that there isn't anything I would change.
Label: positive
Tweet: I'm not sure about this.
Label: neutral
Tweet: I liked some parts but I didn't like other parts.
Label: mixed
Tweet: I think the background image could have been better.
Label: negative
Tweet: I liked it.
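
When experimenting with which examples to keep, it can help to assemble the prompt programmatically rather than editing it by hand. The following is a minimal sketch that reproduces the prompt above from a list of (tweet, label) pairs. The names `INSTRUCTION`, `EXAMPLES`, and `build_prompt` are our own illustrative choices, not part of any OpenAI tooling.

```python
from typing import List, Tuple

# Instruction line copied from the prompt above.
INSTRUCTION = 'Label the tweets as either "positive", "negative", "mixed", or "neutral":'

# The four labeled examples from the prompt above.
EXAMPLES: List[Tuple[str, str]] = [
    ("I can say that there isn't anything I would change.", "positive"),
    ("I'm not sure about this.", "neutral"),
    ("I liked some parts but I didn't like other parts.", "mixed"),
    ("I think the background image could have been better.", "negative"),
]


def build_prompt(examples: List[Tuple[str, str]], new_tweet: str) -> str:
    """Assemble a few-shot prompt ending with the tweet to classify."""
    lines = [INSTRUCTION]
    for tweet, label in examples:
        lines.append(f"Tweet: {tweet}")
        lines.append(f"Label: {label}")
    # The prompt ends with the unlabeled tweet, as in the example above.
    lines.append(f"Tweet: {new_tweet}")
    return "\n".join(lines)


prompt = build_prompt(EXAMPLES, "I liked it.")
print(prompt)
```

Keeping the examples in a list makes it easy to drop or add one and regenerate the prompt for the next round of testing.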

Enabling probability spectrums in the GPT-3 playground is simple: use the drop-down menu to set the “Show Probabilities” option to “Full Spectrum”. This adds color-coded highlighting to all the text in both the prompt and the predicted completion. This can be interpreted as GPT-3’s answer to the question “If the prompt stopped here, what would be the next predicted token?” for each token in the prompt.

The “Show Probabilities” option can be found in the bottom right corner of the playground.

With this option enabled, the playground displays the probability of each token in the prompt by highlighting the text in different colors. Green highlighting means the token has a high probability, whereas red highlighting means the token has a low probability. The hues in between represent the full range of probability between 0% and 100%. We can also hover over each token to see the breakdown of token probabilities in more detail.
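
To make the color coding concrete, here is a small sketch of one way to map a probability onto a red-to-green scale and highlight a token in a terminal. This is an assumption for illustration only: we do not know the exact color mapping the playground uses, and `probability_to_rgb` and `highlight` are hypothetical helpers.

```python
from typing import Tuple


def probability_to_rgb(probability: float) -> Tuple[int, int, int]:
    """Linearly interpolate from red (0.0) to green (1.0).

    Assumption: a simple linear scale; the playground's real mapping may differ.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    red = round(255 * (1.0 - probability))
    green = round(255 * probability)
    return (red, green, 0)


def highlight(token: str, probability: float) -> str:
    """Wrap a token in a 24-bit ANSI background color escape sequence."""
    red, green, blue = probability_to_rgb(probability)
    return f"\x1b[48;2;{red};{green};{blue}m{token}\x1b[0m"


# A high-probability token renders mostly green, a low one mostly red.
print(highlight("positive", 0.98), highlight("neutral", 0.10))
```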

GPT-3 considers the label “positive” to have a probability of 58%.

For the first example, hovering over the word “positive” reveals what GPT-3 would predict if the prompt were to stop here. In other words, if we didn’t tell GPT-3 the answer, what would it guess? With a probability of ~58%, it would’ve predicted “positive”, which is correct. One additional note about this first example: we can see from the probability spectrum that at this point GPT-3 isn’t sure what capitalization to use and whether or not the predicted label should be in quotes. These uncertainties will go away in subsequent examples once a pattern is established.
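
Because the probability for a label can be split across capitalization and quoting variants, it can be useful to merge those variants before comparing labels. The sketch below assumes we have a mapping from candidate tokens to log probabilities (the kind of per-token breakdown shown on hover). The token strings and values are made up for illustration and are not playground output, and `normalize` and `merge_label_variants` are hypothetical helpers.

```python
import math
from collections import defaultdict
from typing import Dict

LABELS = {"positive", "negative", "mixed", "neutral"}


def normalize(token: str) -> str:
    """Strip whitespace and quotes, and lowercase the token."""
    return token.strip().strip("\"'").lower()


def merge_label_variants(top_logprobs: Dict[str, float]) -> Dict[str, float]:
    """Sum probabilities of tokens that normalize to the same label.

    Tokens that do not match a known label (such as a bare quote mark)
    are grouped under "other".
    """
    merged: Dict[str, float] = defaultdict(float)
    for token, logprob in top_logprobs.items():
        label = normalize(token)
        key = label if label in LABELS else "other"
        merged[key] += math.exp(logprob)  # convert log probability to probability
    return dict(merged)


# Illustrative values only (assumption, not real model output).
example = {
    " positive": math.log(0.40),
    " Positive": math.log(0.15),
    ' "': math.log(0.10),
    " negative": math.log(0.05),
}
merged = merge_label_variants(example)
print(merged)
```

In this made-up case the two spellings of “positive” together account for more probability than either does alone, which is the figure that matters when judging the example.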

Here, GPT-3 considers the label “neutral” to be less probable at ~34%.

GPT-3 is less certain about the second example, where it would have predicted “negative” with a probability of ~34%. It’s not clear that GPT-3 is objectively wrong to predict “negative” in this case (it is at least arguably correct). One way to interpret the result here is that the example is providing new information to GPT-3. By including this example in the prompt, we are homing in on what exactly we mean by “Label the tweets as either ‘positive’, ‘negative’, ‘mixed’, or ‘neutral’”. This specific example may be showing GPT-3 that tweets that indicate uncertainty should be considered “neutral”.

For the third example, GPT-3 would have predicted “mixed” with a probability of ~49%. While this is the label we were looking for, GPT-3 is not very certain. The probability for “negative” is ~45% which is very close to the probability for its top choice. We can infer that this example is again providing useful information to GPT-3. We are showing GPT-3 that tweets which contain both positive and negative sentiment should be considered “mixed” (or at least re-affirming it).

In the fourth and final example, GPT-3 would have correctly predicted “negative” with a probability of ~98% — much higher than for the other examples. We can infer that this last example is not really providing any new or useful information to GPT-3, since it already knew the correct label.
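
The reasoning across the four examples can be sketched as a simple triage rule: an example is informative if GPT-3 would have guessed a different label, or guessed the right label without much certainty, and it is a candidate for removal if GPT-3 was already confident. The readings below are the approximate values quoted above. The 0.9 threshold and the names `ExampleReading` and `triage` are our own assumptions.

```python
from dataclasses import dataclass


@dataclass
class ExampleReading:
    expected: str       # label we supplied in the prompt
    predicted: str      # label GPT-3 would have predicted
    probability: float  # probability of the predicted label


def triage(reading: ExampleReading, confident: float = 0.9) -> str:
    """Classify a prompt example by how much it seems to teach the model.

    Assumption: 0.9 is an arbitrary cutoff for "already confident".
    """
    if reading.predicted != reading.expected:
        return "informative: corrects the model"
    if reading.probability < confident:
        return "informative: reinforces an uncertain label"
    return "candidate for removal"


# Approximate readings taken from the walkthrough above.
readings = [
    ExampleReading("positive", "positive", 0.58),
    ExampleReading("neutral", "negative", 0.34),
    ExampleReading("mixed", "mixed", 0.49),
    ExampleReading("negative", "negative", 0.98),
]
results = [triage(r) for r in readings]
for reading, result in zip(readings, results):
    print(f"{reading.expected:>8}: {result}")
```

Only the fourth example is flagged as a candidate for removal, which matches the inference drawn from the spectrum by eye.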


When we apply GPT-3 to a new task, we often follow a process very similar to the one described above. The probability spectrums give us some hints as to which examples could potentially be removed and which should be elaborated on. For example, based on the results above, we might try removing the fourth example since it does not seem to be providing any new information. It might also be effective to include more examples of “neutral” and “mixed” tweets, since those seem to be the categories that GPT-3 was the least certain about.

Probability spectrums are a useful tool for fine-tuning prompts, but it’s important to keep in mind that this technique can only offer us hints about what to change. After making any changes, it is important to rigorously test the new prompt to confirm whether our inferences are correct and to measure the impact on accuracy and performance. Several cycles of prompt adjustment and testing are often needed to get the best results.
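
Testing a revised prompt can be as simple as measuring accuracy on a held-out set of labeled tweets. The sketch below is a minimal harness under stated assumptions: `classify` stands in for whatever function sends a tweet to GPT-3 with a given prompt and returns the predicted label, and here it is replaced by a hypothetical keyword-based stub so the code runs without API access. The test tweets are made up for illustration.

```python
from typing import Callable, List, Tuple


def accuracy(classify: Callable[[str], str], test_set: List[Tuple[str, str]]) -> float:
    """Return the fraction of tweets whose predicted label matches the expected one."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    correct = sum(
        1 for tweet, label in test_set
        if classify(tweet).strip().lower() == label
    )
    return correct / len(test_set)


def keyword_stub(tweet: str) -> str:
    """Hypothetical stand-in for a GPT-3 call; not a real classifier."""
    text = tweet.lower()
    if "love" in text:
        return "positive"
    if "hate" in text:
        return "negative"
    return "neutral"


# Made-up test tweets (assumption, for illustration only).
TEST_SET = [
    ("I love it", "positive"),
    ("I hate it", "negative"),
    ("Not sure what to think", "neutral"),
    ("I love the color but hate the font", "mixed"),
]

score = accuracy(keyword_stub, TEST_SET)
print(f"accuracy: {score:.2f}")
```

Running the same test set against each prompt variant gives a like-for-like comparison, so a change suggested by the probability spectrum is kept only if the measured accuracy supports it.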

GPT-3 at Edge Analytics

Edge Analytics has helped multiple companies build solutions that leverage GPT-3. More broadly, we specialize in data science, machine learning, and algorithm development both on the edge and in the cloud. We provide end-to-end support throughout a product’s lifecycle, from quick exploratory prototypes to production-level AI/ML algorithms. We partner with our clients, who range from Fortune 500 companies to innovative startups, to turn their ideas into reality. Have a hard problem in mind? Get in touch at

Getting the Most Out of GPT-3-based Text Classifiers: Part 1, Part 2, Part 3
