Using OpenAI’s LLM (GPT-4) for natural language classification

Clinton Charles
Berylls Digital Ventures
5 min read · Jun 14, 2023

Perform multi-label classification without worrying about feature selection, model selection or hyperparameter tuning. All you need is an LLM and great prompting skills!

Gone are the days when a data scientist spent 90% of their time preprocessing, tokenising, selecting the right models and validating the results. The recent rise of LLMs has revolutionized the way natural language machine learning tasks can be performed. In this article, we will use the recent GPT-4 model released by OpenAI through their API to perform classification in a zero-shot fashion, with no labelled training data or model training.

Introduction to the problem

In this article, the goal is to classify a list of companies into a pre-defined set of categories. We have the name, the URL and the description of each company as input. At the time of writing, OpenAI’s models are unable to access the internet through the API; even so, including the company’s URL gives the model additional context. Once the internet browsing feature becomes available through the API, the URL will provide much more context still.
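To make the input concrete, here is a hypothetical input record using the same field names that appear in the prompt code later in this article (the company and its values are made up):

# A hypothetical input record; the company and its values are made up
company = {
    "org_name": "ChargeGrid",
    "org_homepage_url": "https://www.chargegrid.example",
    "org_short_description": "EV charging infrastructure for commercial fleets",
    "org_long_description": (
        "ChargeGrid builds and operates fast-charging stations for "
        "commercial electric vehicle fleets across Europe."
    ),
}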

OpenAI’s models, API keys and pricing

OpenAI has a plethora of models at our disposal. There used to be a time when a specific model performed best for a specific task: for example, davinci outperformed other models on natural language classification tasks, GPT-3.5 (ChatGPT) was the best conversational AI, and so on. GPT-4, however, easily outperforms all other models on most tasks.

The process of getting an API key from OpenAI remains straightforward. GPT-4 is enabled only for paid users, and there is usually a waitlist for GPT-4 API access. The price of the API depends on the model we use and the number of tokens we send to it. We can think of tokens as pieces of words; as a rule of thumb, 750 words equal roughly 1000 tokens. https://platform.openai.com/tokenizer can be used to calculate the number of tokens in a text.
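If you would rather count tokens in code than through the web tokenizer, OpenAI’s open-source tiktoken library does the same job; a minimal sketch:

import tiktoken  # pip install tiktoken

# Load the tokenizer used by GPT-4 and count the tokens in a sample text
encoding = tiktoken.encoding_for_model("gpt-4")
sample = "Perform multi-label classification without any model training."
print(len(encoding.encode(sample)))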

There are two GPT-4 models available (8K context and 32K context). The main difference is how many tokens can be used in a single request. The 8K-context GPT-4 can handle up to 8,192 tokens per request, and this is the model we will use in this article. It is priced at $0.03/1000 tokens for prompt tokens and $0.06/1000 tokens for completion tokens. Prompt tokens are counted from the text we send to the model; completion tokens are counted from the output the model returns.
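To make the pricing concrete, here is a back-of-the-envelope calculation for a single hypothetical request with a 2,000-token prompt and a 20-token completion, using the rates above:

# Hypothetical request: 2,000 prompt tokens, 20 completion tokens
prompt_tokens, completion_tokens = 2000, 20
cost = 0.03 * (prompt_tokens / 1000) + 0.06 * (completion_tokens / 1000)
print(f"${cost:.4f}")  # prints $0.0612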

Code block

The code used to execute the request is given below. The GPT-4 model requires the user to pass the messages variable as a list of dictionaries; this is not required for older models like davinci. “top_p” and “temperature” are the most important hyperparameters controlling the degree of randomness, or creativity, in the generated output. A lower temperature leads to a more deterministic output: the model almost always picks the most likely next token, so the same input yields (nearly) the same output. For classification tasks we don’t want the model to think creatively, so the temperature has been set to 0 (the lowest possible value).

The top_p parameter controls how many tokens the model considers before selecting the next token to output. A top_p value of 0.1 means the model will select the next token only from the smallest set of tokens whose cumulative probability reaches 10%. Choosing top_p = 1 in our application makes sure that the model considers all tokens before generating an output.

Max tokens is the maximum number of tokens the model is allowed to generate. In our case we restrict it to 20, to make sure the model does not produce additional unwanted tokens or explanations.

import openai

openai.api_key = "YOUR_API_KEY"  # set your OpenAI API key here

# GPT-4 (8K context) prices at the time of writing
PRICE_PER_1000_PROMPT_TOKENS = 0.03
PRICE_PER_1000_COMPLETION_TOKENS = 0.06


def execute_llm_request(prompt):
    # Chat models expect the conversation as a list of message dictionaries
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=0,  # deterministic output for classification
        max_tokens=20,  # keep the answer short: category numbers only
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    # Compute the cost of this request from the reported token usage
    prompt_tokens = response["usage"]["prompt_tokens"]
    completion_tokens = response["usage"]["completion_tokens"]
    price = PRICE_PER_1000_PROMPT_TOKENS * (
        prompt_tokens / 1000
    ) + PRICE_PER_1000_COMPLETION_TOKENS * (completion_tokens / 1000)

    return (response["choices"][0]["message"]["content"], price)
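Calling the function then returns both the model’s answer and the cost of the request. The prompt string below is only a placeholder; the real classification prompt is built in the next section:

# Placeholder prompt; the real one is constructed in the next section
answer, price = execute_llm_request("Classify the following company: ...")
print(answer, f"(cost: ${price:.4f})")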

Prompt engineering

The quality of the output heavily depends on the quality of the prompt input to the model. The two most important guiding principles for better prompt engineering are as follows:

  1. Be very clear: Use delimiters, ask for a structured output and give example outputs to the model
  2. Give the model time to think: Specify the steps required to complete a task by giving step-by-step instructions

An example of the prompt we used to classify the companies can be found below. By giving as much information as possible, clearly telling the model what needs to be done step-by-step, and specifying the output format explicitly, we abide by the two guiding principles above.

prompt = (
    f"{categories_and_descriptions_text}\n"
    "And the following information about the company:\n\n"
    f"Name: '{org_name}'\n"
    f"Website: '{org_homepage_url}'\n"
    f"Short description: '{org_short_description}'\n"
    f"Description: '{org_long_description}'\n\n"
    "Strictly choose the category numbers from the categories mentioned above.\n"
    "Select up to 3 categories only if multiple categories are relevant.\n"
    "If the company does not fit into any given categories, simply output the number 0.\n"
    "The answer should be exclusively the numbers of the categories separated by commas.\n"
)
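For completeness, categories_and_descriptions_text is simply the numbered category list rendered as plain text (it is built before the prompt above), and the model’s comma-separated answer can be mapped back to category names. A minimal sketch with made-up categories; the real taxonomy has 200 entries:

# Made-up categories for illustration; the real taxonomy has 200 entries
categories = {
    1: "Battery Electric Vehicles",
    2: "Charging Infrastructure",
    3: "Autonomous Driving",
}
categories_and_descriptions_text = "Given the following categories:\n" + "\n".join(
    f"{number}. {name}" for number, name in categories.items()
)

# After building the prompt as above, parse an answer such as "1,2"
answer, price = execute_llm_request(prompt)
selected = [int(n.strip()) for n in answer.split(",") if n.strip().isdigit()]
labels = [categories[n] for n in selected if n in categories]  # "0" yields no labels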

Conclusion

By using the logic above, we were able to classify thousands of companies successfully. The classification taxonomy contains 4 levels, and the result is that each company is classified down to a granular level according to what it does. For example, the description of a company X reads: “X designs and develops 100% battery-electric trucks for the consumer and commercial markets. The company’s primary goal is to develop technologies that will enable partners to switch from gas and diesel to electric automobiles by offering minimally compromising options.” This company was classified into Vehicles >> Battery Electric Vehicles >> Trucks.

In total, there are 200 predefined categories spread across the 4 levels into which each company can be classified. If this task had to be done manually, it would cost a fortune, considering we have close to 100K companies in the database. With GPT-4, the average cost of classification comes to roughly $0.08 per company, or about $8,000 for the whole database. The accuracy of the classification is also satisfactory.

Startup Radar

This module is being implemented inside a web app named “Startup Radar”, developed by Berylls Digital Ventures. Startup Radar will be a one-stop shop for insights into the mobility ecosystem of startups and investments, with curated and up-to-date data. Thanks to LLMs, classifying mobility companies into an expert-curated taxonomy lets us uncover a plethora of insights and possibilities in every granular area of the mobility sector. Feel free to reach out to us through LinkedIn (Clinton Charles, Malte Broxtermann, Florian Peter, Matthias Kempf, Johan Torssell) or through the Berylls Digital Ventures website.

[Image: a sneak peek into the “Startup Radar” app, developed by Berylls Digital Ventures]

Enjoyed this article? Feel free to give me a follow on LinkedIn for more content regarding data science and programming!

