How to use GPT-4 and OpenAI’s functions for text classification

Kostas Stathoulopoulos · Discovery at Nesta · Aug 31, 2023

Nesta’s Discovery Hub has launched a project to investigate how generative AI can be used for social good. We’re now exploring the potential of LLMs for early-years education, and in a series of Medium blogs we discuss the technical aspects of our early prototypes.

In a previous post, we showed you how we built an application using OpenAI’s GPT-4 and Streamlit to generate personalised activities for young children that are anchored in the Early Years Foundation Stage (EYFS) statutory framework.

Continuing our exploration, we are now investigating whether appending examples of activities from trusted sources like BBC Tiny Happy People to the prompt improves the quality of the LLM’s suggestions. To do this, we first needed to map the activities on the Tiny Happy People website to the seven Areas of Learning described in EYFS.

Here, we share a technical guide on how we used OpenAI’s GPT-4 and function calling to achieve this. This approach is very general and can be used to classify texts from any trusted, third-party data source to any number of predefined categories.

LLMs for text classification

LLMs like GPT-4 have been trained on vast amounts of data. This enables them to perform well on a variety of tasks without any examples being provided in the prompt. This is called “zero-shot prompting”.

import openai

openai.api_key = "<OPENAI_API_KEY>"

content = """Classes: [`positive`, `negative`, `neutral`]
Text: Sunny weather makes me happy.

Classify the text into one of the above classes."""

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.6,
    messages=[
        {"role": "user", "content": content},
    ],
)

{
  "id": "chatcmpl-7qdB0YB9mMVkCb2NUcNJ63P0MyXSC",
  "object": "chat.completion",
  "created": 1692778006,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Class: positive"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 40,
    "completion_tokens": 1,
    "total_tokens": 41
  }
}

When zero-shot prompting doesn’t work, you can add a few examples to the prompt. This is called “few-shot prompting” and has been shown to improve the LLM’s performance on the task.

import openai

openai.api_key = "<OPENAI_API_KEY>"

content = """Classify the text into one of the classes.
Classes: [`positive`, `negative`, `neutral`]
Text: Sunny weather makes me happy.
Class: `positive`

Text: The food is terrible.
Class: `negative`

Text: I love popcorn.
Class: `positive`

Text: This book left a wonderful impression on me.
Class: """

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.6,
    messages=[
        {"role": "user", "content": content},
    ],
)

{
  "id": "chatcmpl-7qdGDlPbJdnoUCwIer5B0UFhQsWF2",
  "object": "chat.completion",
  "created": 1692778329,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "`positive`"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 80,
    "completion_tokens": 3,
    "total_tokens": 83
  }
}

As with any machine learning task, you should start with the simplest method and add complexity only if necessary. Remember to benchmark LLMs on your task just as you would any other model; for instance, score their predictions against a small hand-labelled sample, as in the sketch below.
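
A minimal benchmarking sketch, assuming hypothetical lists `gold_labels` (your own annotations for a small sample) and `llm_predictions` (the model’s answers for the same texts):

# Minimal benchmarking sketch: compare LLM predictions against a small
# hand-labelled sample using scikit-learn. The two lists are hypothetical
# placeholders for your own annotations and the collected model outputs.
from sklearn.metrics import classification_report

gold_labels = ["positive", "negative", "positive", "neutral"]
llm_predictions = ["positive", "negative", "neutral", "neutral"]

print(classification_report(gold_labels, llm_predictions))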

We found that LLMs work great for text classification when you do not have enough data to train a task-specific, supervised learning model and the amount of data you want to predict classes for is relatively small. For larger tasks, you could use an LLM to create a training set for a supervised learning model; researchers at the University of Zurich have shown that LLMs outperform human annotators on certain text annotation tasks.
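
A rough sketch of that last idea, with hypothetical data: label a sample of texts with the LLM, then train a lightweight supervised classifier on those labels and apply it to the rest of the corpus.

# Rough sketch: use LLM-generated labels as a training set for a simple
# supervised classifier. `sample_texts`, `llm_labels` and `remaining_texts`
# are hypothetical stand-ins for your own data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sample_texts = ["Sunny weather makes me happy.", "The food is terrible."]
llm_labels = ["positive", "negative"]  # produced by the LLM, as above
remaining_texts = ["I love popcorn."]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(sample_texts, llm_labels)
print(model.predict(remaining_texts))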

OpenAI’s function calling

A frequent problem when working with LLMs is that their responses are not always in a standardised format that downstream tasks can easily parse.

For example, in our previous prototype, although we provided formatting guidelines in the prompt, the format of the response varied, especially with high temperature values.

With the latest gpt-3.5-turbo and gpt-4 models, we can describe a JSON schema for the output and force the model to return an object containing all the required fields. This lets us reliably get structured data back from the model, which is exactly what we need for text classification (check out OpenAI’s documentation for all the use cases of function calling).

Let’s dive into how we classified the BBC Tiny Happy People activities into EYFS Areas of Learning using GPT-4, and how we used a function to standardise the format of the output.

Classifying texts to EYFS Areas of Learning

We collected text describing around 700 activities from the Tiny Happy People website. After cleaning up the data, we ended up with 620 activities, each with a URL, a title and a long description.
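
A hypothetical sketch of what a single cleaned record looked like (field names and the example title are illustrative, not our exact schema):

# Illustrative sketch of one cleaned activity record; field names are assumptions.
activity = {
    "url": "https://www.bbc.co.uk/tiny-happy-people/...",
    "title": "Blowing bubbles",
    "description": "A fun activity for babies aged 3-6 months to help development and language learning...",
}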

To use GPT-4 for text classification, we wrote a prompt to instruct the model and a function to structure its response.

Our prompt contains the areas of learning and their descriptions, and instructs the LLM to assign the given text to one or more of them.

areas_of_learning = "<TITLE_AND_DESCRIPTION_OF_EACH_AREA_OF_LEARNING>"
text = "<LONG_DESCRIPTION_OF_AN_ACTIVITY>"

{
    "role": "user",
    "content": f"###Areas of learning###\n{areas_of_learning}\n\n###Instructions###\nCategorise the following text to one or more areas of learning.\n{text}\n",
}

Functions have two required properties, `name` and `parameters`, as well as an optional one, `description`. `name` is how we refer to the function, while `description` is used by the model to decide when and how to call it. `parameters` is a nested object with three fields:

  • type: Currently this is always object.
  • properties: Defines the specific properties (or attributes) that the function’s arguments can have. Here we define a single property, prediction, which describes the desired output format: an array whose items are strings that must take one of the values listed in enum. The enum contains the seven EYFS Areas of Learning plus “None”, so that the LLM can filter out any irrelevant texts.
  • required: An array listing the properties that are mandatory.
{
  "name": "predict_area_of_learning",
  "description": "Predict the EYFS area of learning for a given text",
  "parameters": {
    "type": "object",
    "properties": {
      "prediction": {
        "type": "array",
        "items": {
          "type": "string",
          "enum": [
            "Communication and Language",
            "Personal, Social and Emotional Development",
            "Physical Development",
            "Literacy",
            "Mathematics",
            "Understanding the World",
            "Expressive Arts and Design",
            "None"
          ]
        },
        "description": "The predicted areas of learning."
      }
    },
    "required": [
      "prediction"
    ]
  }
}

Now, we can call GPT-4 with our prompt and function to classify the following text into one or more areas of learning. Here is the full example:

import openai

openai.api_key = "<OPENAI_API_KEY>"

areas_of_learning = "<TITLE_AND_DESCRIPTION_OF_EACH_AREA_OF_LEARNING>"

text = "A fun activity for babies aged 3-6 months to help development and language learning. Try blowing bubbles with your baby and see how they react. Talk to them about what they're seeing."

content = f"###Areas of learning###\n{areas_of_learning}\n\n###Instructions###\nCategorise the following text to one or more areas of learning.\n{text}\n"

function = {
    "name": "predict_area_of_learning",
    "description": "Predict the EYFS area of learning for a given text",
    "parameters": {
        "type": "object",
        "properties": {
            "prediction": {
                "type": "array",
                "items": {
                    "type": "string",
                    "enum": [
                        "Communication and Language",
                        "Personal, Social and Emotional Development",
                        "Physical Development",
                        "Literacy",
                        "Mathematics",
                        "Understanding the World",
                        "Expressive Arts and Design",
                        "None",
                    ],
                },
                "description": "The predicted areas of learning.",
            }
        },
        "required": ["prediction"],
    },
}

r = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0.0,
    messages=[{"role": "user", "content": content}],
    functions=[function],
    function_call={"name": "predict_area_of_learning"},
)

And the response:

{
  "id": "chatcmpl-7qiYqjBTRniyMboZtyG0gpNKjbv19",
  "object": "chat.completion",
  "created": 1692798704,
  "model": "gpt-4-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "function_call": {
          "name": "predict_area_of_learning",
          "arguments": "{\n \"prediction\": [\"Communication and Language\", \"Literacy\"]\n}"
        }
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 144,
    "completion_tokens": 15,
    "total_tokens": 159
  }
}

You can then parse the response to get the labels:

import json
json.loads(r["choices"][0]["message"]["function_call"]["arguments"])["prediction"]

# ['Communication and Language', 'Literacy']
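
To label all 620 activities, we wrapped this call in a loop. A simplified sketch, reusing the `function` definition above; `activities` and `build_prompt` are hypothetical stand-ins for our data and prompt-formatting code, not our exact pipeline:

# Simplified sketch: classify every activity, with basic error handling.
# `activities` is assumed to be a list of dicts with "url" and "description"
# keys, and `build_prompt` a hypothetical helper that formats the prompt.
import json
import time

def classify(text: str) -> list[str]:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0.0,
        messages=[{"role": "user", "content": build_prompt(text)}],
        functions=[function],
        function_call={"name": "predict_area_of_learning"},
    )
    arguments = response["choices"][0]["message"]["function_call"]["arguments"]
    return json.loads(arguments)["prediction"]

predictions = {}
for activity in activities:
    try:
        predictions[activity["url"]] = classify(activity["description"])
    except openai.error.OpenAIError:
        time.sleep(10)  # crude backoff; a real pipeline would retry the item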

What’s next?

LLMs can work pretty well for text classification, especially on tasks for which we don’t have enough training data for a supervised learning model. Paired with OpenAI’s function calling, we can reliably generate predictions in a structured format that can easily be consumed by downstream tasks.

In our prototype, we vectorised the text of each BBC Tiny Happy People activity and stored it in Pinecone, a managed vector database. We also stored the predicted areas of learning as metadata so that we could use them to filter the relevant category of activities before running a vector search.
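
A condensed sketch of that step, assuming the pre-1.0 openai SDK and the pinecone-client package; the index name, environment and metadata field names are illustrative, and `activity` and `predictions` come from the classification sketch above:

# Condensed sketch: embed an activity, upsert it to Pinecone with the
# predicted areas of learning as metadata, then filter on that metadata
# at query time. Index name, environment and field names are assumptions.
import openai
import pinecone

pinecone.init(api_key="<PINECONE_API_KEY>", environment="<PINECONE_ENV>")
index = pinecone.Index("eyfs-activities")

def embed(text: str) -> list[float]:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return response["data"][0]["embedding"]

# Upsert one activity with its predicted areas of learning as metadata.
index.upsert(
    vectors=[
        (
            activity["url"],                 # vector id
            embed(activity["description"]),  # embedding values
            {"areas_of_learning": predictions[activity["url"]]},  # metadata
        )
    ]
)

# At query time, search only within the relevant areas of learning.
results = index.query(
    vector=embed("bubble play ideas for babies"),
    top_k=5,
    filter={"areas_of_learning": {"$in": ["Communication and Language"]}},
    include_metadata=True,
)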

In this way, if an educator or caregiver generates a personalised activity idea using our web app, we can add real-world, relevant and trusted activity descriptions to the prompt, which we hope will improve the quality of the LLM’s output. We can also append the URLs of the Tiny Happy People activities to the output, so that the web app can direct the user to relevant, trusted content that is similar to their query.

In the next post, we will outline our work with LLMs and vector databases.
