Lab Notebook: Chatbots as text classifiers. How good are they really?

Austin Botelho
Cybersecurity for Democracy
Sep 15, 2023

Like many small teams, we at Cybersecurity for Democracy face labor constraints that limit our ability to label data. Despite this, we want to retain as much flexibility as possible to pursue different data investigations. Here, I mentioned generative models like LLMs (large language models) as a potential solution to this problem. They have previously exhibited strong zero- and few-shot performance (performance with few or no examples of the task) across many tasks, including online harm identification.

We evaluated this approach for several tasks in our workflow and found vast differences in performance between related text classification tasks; none of the models performed well enough for unsupervised deployment. Promisingly, we were able to distill the knowledge of a larger, more performant model like GPT-3.5 into a smaller model, achieving 98% of the performance at less than 1% of the cost.

The code for this analysis can be found here.

Task Setup — Gathering Content

In our work, Cybersecurity for Democracy handles large volumes of varied (organic and paid), multilingual, and multi-platform social media data. One of our data ingests collects all Facebook posts by US-based media organizations. There are many questions we could ask of this data. One with important consequences is what factors influence how much engagement posts get and when (and if) they get removed. Many factors that influence engagement and potential moderation are in the metadata that comes with the content (e.g. language, attached media, account verification status, etc.). However, there are others that we need to infer from the content of the post, such as topic and partisanship. So we need classifiers for these additional important features.

Topic

We manually reviewed a small sample of 100 posts and inductively grouped them into topic categories. These categories are used to structure the task and evaluate performance. The groupings are government/politics, sports/fitness, business/economics, arts/culture/entertainment, crime/public safety, school/education, and a miscellaneous category to catch posts that do not fit into any of the others.

Partisanship

Our previous research has shown big differences in user engagement based on the partisanship of the source. We previously acquired news annotation data from Media Bias Fact Check and NewsGuard, whose teams of journalists rate the information reliability of websites. After consolidating the ratings across the two sources and merging them onto Facebook IDs where given or discernible, we have 888 Facebook accounts in the data with partisanship ratings. We reduced the original ratings down to left, center, and right to simplify the task. These annotations were determined at the publisher level from a holistic content review. To emulate this, we feed the model a concatenation of each Facebook account’s three most-engaged posts during the time period. This shows the model a sample of the content the account produces.
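As a rough sketch of that preprocessing step (the dataframe and column names here are hypothetical, not our actual schema), the per-account input could be built like this:

```python
import pandas as pd

# Hypothetical schema: one row per post, with the account it came from,
# the post text, and a total engagement count.
posts = pd.DataFrame({
    "account_id": ["a", "a", "a", "a", "b", "b", "b"],
    "text": ["p1", "p2", "p3", "p4", "q1", "q2", "q3"],
    "engagement": [10, 50, 5, 30, 7, 2, 9],
})

# For each account, keep the three most-engaged posts and join their text
# into a single input for the partisanship prompt.
top3 = (
    posts.sort_values("engagement", ascending=False)
    .groupby("account_id")
    .head(3)
)
account_inputs = top3.groupby("account_id")["text"].apply("\n\n".join)
print(account_inputs)
```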

Prompt Format

The structure of prompts, the text presented to the model, greatly impacts the models’ outputs. We tested several different constructions to see which yielded the best results and settled on the general task descriptions in the table below. Each description was then reformatted and given role tags matching the prompt structure of the model being tested. Given the large search space and nascence of the field, much remains unknown about prompt engineering, and more experimentation is needed. You can learn more about prompting through this course and try it out here.

Table 1: Task Prompts
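Table 1 appears as an image in the original post. Purely as an illustration (the wording below is hypothetical, not our exact prompt), a topic prompt for a chat-style model could be assembled like this, with optional few-shot examples attached as extra turns:

```python
TOPICS = [
    "government/politics", "sports/fitness", "business/economics",
    "arts/culture/entertainment", "crime/public safety",
    "school/education", "miscellaneous",
]

def build_topic_messages(post_text, examples=()):
    """Build chat-style messages with role tags.

    `examples` holds optional few-shot (post, label) pairs; leave it empty
    for zero-shot prompting.
    """
    messages = [{
        "role": "system",
        "content": (
            "You classify Facebook posts by topic. "
            f"Answer with exactly one of: {', '.join(TOPICS)}."
        ),
    }]
    for post, label in examples:
        messages.append({"role": "user", "content": post})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": post_text})
    return messages
```

For models that do not use chat roles, the same task description can be flattened into a single plain-text prompt following that model’s expected format.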

Picking a Model

Generalized performance benchmarking of LLMs is difficult because performance varies widely based on prompt structure, number of examples, tasks etc. The obvious choice given the media buzz might be a model in OpenAI’s GPT line (or one of its commercial competitors like Google’s Bard and Anthropic’s Claude). As of now, commercial models outperform open-sourced ones across many tasks, but the proliferation of open-sourced models in the wake of the GPT-4 announcement might change that.

That said, there are several reasons why an open-sourced model may be preferred, including data security and model transparency. In an update on March 1, 2023, OpenAI said it gives users the option to opt their API calls out of model training data, but it keeps records of them for 30 days to monitor misuse. For highly sensitive data, even that retention window might be too high a risk.

While the GPT-4 API allows for some level of customization via fine-tuning or hyper-parameter specification (e.g. temperature), researchers have no ability to inspect model weights or training data. It operates as a black box. OpenAI said as much in its Technical Report:

Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.

Open-sourced models mitigate these concerns, as they can be fully inspected, tweaked, and self-hosted. The HuggingFace and GPT4All leaderboards track the best-performing open-sourced models. From these two lists, we selected the best-performing small (<13b parameters) models with sequence classification heads to compare against gpt-3.5-turbo: NousResearch’s Nous-Hermes-13b, CarperAI’s Stable Vicuna, Wizard Vicuna, and Nomic AI’s gpt4all-13b-snoozy.

Model Setup

A few of these models are based on the LLaMa architecture released by Facebook. The official setup process for those is:

  1. Fill out Meta’s LLaMa download form
  2. Convert the LLaMa weights to HuggingFace format
  3. Download additional weights
  4. Run accompanying weight delta script

Versions where this has already been done, including with 4-bit post-training quantization (GPTQ), which reduces the needed VRAM by 4x, are available on the HuggingFace Model Hub. Gradio can be used to serve the models with a user interface.
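A minimal sketch of that self-hosting path, assuming the HF-format weights published at NousResearch/Nous-Hermes-13b (pre-quantized variants are published on the Hub under similar names and load in much the same way):

```python
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NousResearch/Nous-Hermes-13b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def classify(prompt: str) -> str:
    # Generate a short completion and return only the newly generated text.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Serve the model behind a simple web UI.
gr.Interface(fn=classify, inputs="text", outputs="text").launch()
```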

Results

We compared the performance of the five models across the two tasks after both zero- and few-shot prompting where the few-shot prompt contained three examples. The code for reproducing the experiment is available here.

Original Task

OpenAI’s gpt-3.5-turbo outperformed all alternatives in both tasks, driven primarily by a huge performance gap on the topic task as measured by a weighted-average F1 score. The F1 score combines a classifier’s precision (how often it is right when it assigns a category) and recall (how many examples of each category it identifies).
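For reference, the weighted-average F1 used throughout can be computed with scikit-learn; the labels below are made up:

```python
from sklearn.metrics import f1_score

y_true = ["sports/fitness", "government/politics", "government/politics", "miscellaneous"]
y_pred = ["sports/fitness", "government/politics", "miscellaneous", "miscellaneous"]

# Weighted average: per-class F1 scores weighted by each class's support.
print(f1_score(y_true, y_pred, average="weighted"))
```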

Pre-trained model performance by task and training examples

Several open-sourced models (Wizard-Vicuna-13B-Uncensored-HF, gpt4all-13b-snoozy, Nous-Hermes-13b) achieved higher scores than gpt-3.5-turbo with zero examples on the partisanship task. That said, all models struggled on that task, likely because evaluating a media publisher’s partisan lean requires far more input data than its three most viral posts. Perhaps chain-of-thought prompting, which encourages LLMs to explain their outputs, would have performed better, but it shifts the work of parsing explanations into classifications onto the researcher.

Adding examples had a mixed effect. For some models on some tasks, like gpt-3.5-turbo on the partisanship task and gpt4all-13b-snoozy on the topic task, it provided a big performance boost. For others, like Wizard-Vicuna-13B-Uncensored-HF on the partisanship task, it decreased performance.

Knowledge Distillation

Fine-tuning is a strategy for teaching a model to perform a specific task through numerous examples, and it yields stronger performance. As previously mentioned, we are constrained by the human-labeled data we have. A workaround that has shown success is a bootstrapping method referred to as Constitutional AI, in which an AI, rather than a human, supervises the training of a different AI system.

As data annotation can be challenging, many processes include quality provisions such as annotator agreement to ensure valid ground-truth labels. Self-supervision is especially prone to error. To safeguard against this, we introduce Confident Learning. Confident Learning prunes noise in training data by using a simple model to predict out-of-sample probabilities for each class and removing examples whose predicted probability falls below the per-class threshold.

We apply this to the topic task because the best results were achieved on it. The complete process looks like this:

First, for the Constitutional AI phase, we run 1,000 posts through gpt-3.5-turbo to get its topic predictions. Then, we use Confident Learning with a logistic regression classifier and Instructor embeddings to filter out labels with a high likelihood of being incorrect. Last, we fine-tune distilbert, a much smaller language model (and ernie, to see if we could squeeze out more performance), using the filtered labels. This has the dual benefit of improved performance and faster runtime.
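A condensed sketch of the filtering step, using toy stand-in posts and labels in place of the real 1,000 gpt-3.5-turbo predictions; the Instructor instruction string and the cross-validation settings are assumptions, not our exact configuration:

```python
from InstructorEmbedding import INSTRUCTOR
from cleanlab.filter import find_label_issues
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the posts sent to gpt-3.5-turbo and the labels it returned.
texts = [
    "The mayor proposed a new city budget today.",
    "The senate passed the infrastructure bill.",
    "Local elections are scheduled for November.",
    "The home team won the championship game.",
    "Star striker signs a record-breaking contract.",
    "Marathon registration opens next week.",
]
gpt_labels = [
    "government/politics", "government/politics", "government/politics",
    "sports/fitness", "sports/fitness", "sports/fitness",
]

# Embed the posts with Instructor; the instruction string is a hypothetical example.
encoder = INSTRUCTOR("hkunlp/instructor-large")
X = encoder.encode(
    [["Represent the news post for topic classification:", t] for t in texts]
)
y = LabelEncoder().fit_transform(gpt_labels)

# Out-of-sample class probabilities from a simple logistic regression.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=3, method="predict_proba"
)

# Confident Learning: flag labels whose predicted probability falls below the
# per-class threshold, then keep only the clean examples for fine-tuning.
issues = find_label_issues(labels=y, pred_probs=pred_probs)
clean_texts = [t for t, bad in zip(texts, issues) if not bad]
clean_labels = [l for l, bad in zip(gpt_labels, issues) if not bad]
```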

The distilbert and ernie models were fine-tuned using the same training regimen: layerwise learning rate decay, a linear scheduler with warmup, and early stopping, similar to the process described in training the Ad Type classifier here.
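Continuing the sketch above (and reusing its toy clean_texts/clean_labels), the regimen might look roughly like this; the base learning rate, decay factor, and patience are assumptions rather than our exact settings:

```python
from datasets import Dataset
from torch.optim import AdamW
from transformers import (
    AutoModelForSequenceClassification, AutoTokenizer, EarlyStoppingCallback,
    Trainer, TrainingArguments,
)

model_name = "distilbert-base-uncased"
label_names = sorted(set(clean_labels))
label2id = {l: i for i, l in enumerate(label_names)}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(label_names)
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

ds = Dataset.from_dict(
    {"text": clean_texts, "label": [label2id[l] for l in clean_labels]}
)
ds = ds.map(tokenize, batched=True).train_test_split(test_size=0.2)

# Layerwise learning rate decay: the last transformer layer gets the base rate,
# earlier layers and the embeddings get progressively smaller rates.
base_lr, decay = 5e-5, 0.9
groups = []
for name, param in model.named_parameters():
    layer = next((i for i in range(6) if f"layer.{i}." in name), None)
    if layer is not None:
        scale = decay ** (5 - layer)
    elif "embeddings" in name:
        scale = decay ** 6
    else:
        scale = 1.0  # classifier head
    groups.append({"params": [param], "lr": base_lr * scale})
optimizer = AdamW(groups)

args = TrainingArguments(
    output_dir="topic-distilbert",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    warmup_ratio=0.1,                # linear scheduler with warmup
    lr_scheduler_type="linear",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    optimizers=(optimizer, None),    # Trainer builds the warmup scheduler
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```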

Knowledge distillation results

The fine-tuned distilbert model outperformed all of the open-sourced models originally tested, and adding confident learning improved its performance by five F1 points. That said, performance still was not great, so we repeated the process using a slightly bigger model, ernie, which we have previously observed to do well on text classification tasks. This made a huge difference; the F1 score increased by nearly 20 points, to nearly the same score as gpt-3.5-turbo! Strangely, confident learning slightly decreased its performance.

Cost

While scaling language model size leads to stronger performance, it comes with a tradeoff: speed, which translates into cost. For this benchmark, we created an example task where the model receives a 250-character input (140 GPT tokens according to tiktoken) and produces 100 classifications, from which we compute the average time per generated token.
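A sketch of that measurement; the classify function and the run count are placeholders, with token counts from tiktoken and timing from a simple loop:

```python
import time
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt = "x" * 250  # 250-character stand-in; real post text averaged ~140 tokens
print(len(enc.encode(prompt)))

# Hypothetical helper: `classify` is whichever model call is being benchmarked.
def time_per_token(classify, prompt: str, n_runs: int = 100) -> float:
    start = time.perf_counter()
    tokens_out = sum(len(enc.encode(classify(prompt))) for _ in range(n_runs))
    return (time.perf_counter() - start) / tokens_out
```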

We ran the experiments using GCP’s Compute Engine on an NVIDIA A100 with 40GB of VRAM. These instances cost $3.93 per hour to run; a spot instance costs a significantly reduced $1.25 per hour. Spot instances are cheaper because they run the risk of being terminated early to accommodate demand from other users and are not always available. This is tenable for one-off tasks like prototyping, but not for ongoing data processing systems.

We can use this to estimate the cost per 1,000 token generations, as advertised by OpenAI, and the cost of a sample job. Returning to the US media ingestion, we collected 2.6 million Facebook posts in April. The cost of processing that is shown in the April columns.¹
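Converting a measured generation speed into an OpenAI-style price per 1,000 tokens is simple arithmetic; the seconds-per-token value below is a placeholder, not our benchmark result:

```python
# Cost per 1,000 generated tokens for a self-hosted model.
seconds_per_token = 0.05   # hypothetical; substitute the benchmarked value
hourly_rate = 3.93         # A100 on-demand, USD/hour ($1.25 for spot)

cost_per_1k_tokens = seconds_per_token * 1000 / 3600 * hourly_rate
print(f"${cost_per_1k_tokens:.4f} per 1,000 generated tokens")
```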

Model cost comparison

gpt-3.5-turbo, at $0.002/1k completion tokens and $0.0015/1k prompt tokens, was 3x cheaper than the open-sourced, 13 billion parameter models using regular instances and comparably priced when using spot instances. If we were to run the open-sourced models with GPTQ, we could fit them on an NVIDIA L4 instance. At $0.56004023 an hour, or 1/7th the price, this would bring the April run cost of the cheapest model, gpt4all-13b-snoozy, to roughly $227, less than half the gpt-3.5-turbo cost.

The biggest cost savings of all come from the knowledge-distilled models, distilbert-base-uncased and nghuyong_ernie-2.0-base-en. Not only are they much cheaper because they are 50x faster, but their smaller memory footprint allows them to be run on cheaper hardware with less VRAM, like the NVIDIA T4. Google prices these instances at $0.35 per hour, or $0.1155 at spot prices. This brings the total April data processing cost to under $4!³

Conclusion

Depending on the accuracy required for the task, few-shot large generative text models may be equipped to complete it. OpenAI’s gpt-3.5-turbo remains far better than its open-source counterparts. Yet, for cost-sensitive organizations, knowledge distillation is a more attractive option, as it achieves similar performance at a fraction of the cost.

About NYU Cybersecurity for Democracy

Cybersecurity for Democracy is a research-based, nonpartisan, and independent effort to expose online threats to our social fabric — and recommend how to counter them. It is a part of the Center for Cybersecurity at the NYU Tandon School of Engineering.

Would you like more information on our work? Visit Cybersecurity for Democracy online and see how tools, data, investigations, and analysis are fueling efforts toward platform accountability.

Footnotes

  1. The analysis in this article was initially run in June 2023. Leaderboard rankings have changed since then.
  2. This is an underestimate of the cost. The average post length is 121 tokens, plus an additional 43–296 tokens for the prompt instruction, which would increase the time per token. For example, the true cost of running gpt-3.5-turbo is likely closer to $650.
  3. The benchmarks were conducted on an NVIDIA A100 rather than the L4, so the true cost is likely higher, as runtime would be slower. The same is true for the NVIDIA T4 estimates.
