Low-cost text AI is here

Sergey Gubanov
May 5, 2022


According to a 2019 Gartner survey, 37% of organizations use some kind of AI. And while AI has many forms and varieties, one thing remains consistent across different applications: creating (and operating) AI is an expensive endeavor.

Over the course of my career, I have worked as a team lead on various NLP systems: machine translation, speech recognition, summarization, et cetera. And what always bugged me is that despite all the effort, scientific progress, and accumulated skills, projects never took less than three months; often they took much longer. And even three months is a lot of time.

Among many different kinds of AI, the text kind is one of the most ubiquitous

But recent breakthroughs in language modelling give us hope that the situation might change very soon, at least for text applications.

Since the seminal GPT-3 paper came out, we have been working on a low-cost system of our own, with emphasis on practicality and ease-of-use. We’d like to share some thoughts and experience, discuss what it takes to go from a paper to a usable tool, and invite you to try our system.

Why AI is expensive

If you outsource AI, the development can cost you from $20K up to $1M (source). Doing AI in-house is not a solution either: in some cases it might turn out to be even more expensive.

There are a few reasons that the process costs so much.

  • Need for customization. Despite the advent of open-source libraries, frameworks, and pre-trained models, things almost never work out-of-the-box. Most custom tasks require several iterations of experimenting with neural network architecture, hyperparameters, and even input data.
  • GPUs & infrastructure. GPUs are notoriously expensive, but that’s not the whole story. Running experiments is a complex task (many opportunities to shoot yourself in the foot), so unless you invest in data warehousing, experiment tracking, and systems for building reproducible pipelines, you are at risk of wasting your time struggling with a model that should absolutely work, but, for some unknown reason, doesn’t.
  • Data. Garbage in, garbage out; your AI is only as good as the data you train it on. And if you don’t have a dataset ready, you are out of luck: you have to set up data collection through crowdsourcing platforms, which is a long, complex and costly process on its own.

The data problem is my personal favorite. Neural networks need data in large quantities: the general wisdom is that you need millions of training examples for vanilla transformers, and 10K–100K examples for BERT. But even that is, most likely, more than you can handle with in-house labeling.

So you have to employ a crowdsourcing platform like MTurk or Toloka; and that’s where the second data problem comes in: task specificity.

Consider (open-domain) question answering. What might initially seem like a well-defined problem in practice has a continuum of variations:

Two (of many) possible styles of doing question answering
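For instance, for a simple factual question:

Q: Who wrote “War and Peace”?
A (style one): Leo Tolstoy
A (style two): “War and Peace” was written by the Russian novelist Leo Tolstoy and first published in the 1860s.

Both answers are correct; they just differ in style.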

Depending on the product you are building, either style might be preferable. And in order to crowdsource, you need to explain to the labellers what style they should use when labelling the data. And explaining that properly is much harder than it sounds.

It’s not just a matter of well-written instructions. You also have to prepare tutorials, exams, honeypots, measure quality through held-out validation, tune settings (the simplest of which are cost and overlap), decompose hard tasks into several simpler ones, and generally go through several iterations; all that before you even get your first batch of usable data! Creating a high-quality labelling task is a project in itself, not dissimilar to training the core model, but, ironically, one hated by data scientists.

And data scientists are not the only people who iterate on their work. Product managers often refine their requirements over the course of development — having seen more data, more edge cases, and more outputs of the current prototype system. A good thing, ultimately, but it further contributes to the data problem.

All that leads to the fact that just setting up data collection takes 2–6 months of non-recurring engineering (NRE). Given that, 3 months minimum for a complete project is not surprising at all.

Enter language models

It has already happened with websites. You can go full custom: hire developers, develop the frontend and backend, do your own devops, and so on. Or you can sacrifice some flexibility and use a website constructor like Squarespace or Wix, thus radically cutting down the costs.

It is happening right now with mobile apps. You can do it yourself, or — if the app you’re trying to build is not too complex — you can use a no code app builder like Adalo or Glide.

And it is going to happen with AI very soon. Not for all possible cases, probably, but certainly for a large share of them. The secret sauce is a technology called language models, which has seen a lot of attention in recent years:

  • GPT-2 (OpenAI, 2019, 1.5B parameters)
  • GPT-3 (OpenAI, 2020, 175B parameters)
  • Gopher (DeepMind, 2021, 280B parameters)
  • GLaM (Google, 2021, 1.2T parameters)
  • Wu Dao 2.0 (BAAI, 2021, 1.75T parameters)
  • Megatron-Turing NLG (Microsoft & NVIDIA, 2022, 530B parameters)
  • PaLM (Google, 2022, 540B parameters)

What language models do is they continue text. You input a prefix, the model generates a suffix. That’s it. Doesn’t sound very useful, does it? Maybe so, but only until you discover a little trick called prompt programming.

Here’s how the trick works. Consider a specifically-constructed text:
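Something along these lines (the exact words are just an illustration):

House: noun
Run: verb
Quickly: adverb
Beautiful: adjective
Cat: noun
Slowly: adverb
Brave: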

What would be the most logical way to complete it? As humans, here is what we would most likely do:

  • Consider the structure of the text we are to complete: it consists of several lines, each containing an English word, e.g. [House], followed by its part of speech, e.g. [noun]. The text ends with the word [Brave], so its part of speech should come next.
  • Consider our knowledge of natural language: [Brave] is an adjective.

So, the most sensible way to complete the text is this:
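Brave: adjective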

And the good news is that if the language model is smart enough, it can mimic our human logic and decide to complete the text in the very same way! So, by putting new words at the end of our prompt, we can trick the model into outputting their respective parts of speech for us:
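For example (again, purely an illustration):

Strong: adjective
Table: noun
Swim: verb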

(This is just a toy, though: the real POS tagging is all about context)

And we achieved that with just 6 examples, no additional training, and with no tuning or experiments.

Remember those problems that make AI expensive? If we can make this approach work at scale, those problems are gone.

Bridging the gap between theory and practice

When we trained our own language model and tried to apply prompt programming in practice, we quickly realized that the resulting system was far from practical.

Here is some stuff we had to incorporate into Tune the Model before it began to get traction among our users.

Idea #1. Drop prompt programming

Oh no! The very trick that makes it all work!

Unfortunately, prompt programming possesses several undesirable properties, which are very hard to get rid of:

  • Tuning is unintuitive. Performance of your system varies widely with the prompt, so you must tune the prompt. But how would you do that? Add a task description? Change the few examples you have? Add more examples? Fiddle with punctuation? Chain-of-thought?
  • A lack of robustness. Well, if you are done optimizing your prompt manually, I have some bad news for you: if you re-train the base language model (or switch to a larger model), you have to go through the whole process again, because the prompt you got won’t work nearly as well.
  • Inability to accommodate more data. Given a maximum of 2048 tokens, there is a hard limit on how much data you can build into your prompt. Often that limit is too low.

Instead of prompt programming, we decided to implement prompt tuning: a gradient descent algorithm that finds the optimal prompt for you. You still have to supply a few examples, but you no longer have to concern yourself with anything else.
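To make that concrete, here is a minimal sketch of the idea (not our actual implementation): a small public GPT-2 from Hugging Face transformers stands in for the base model, the base model is frozen, and gradient descent updates only a handful of prepended “soft prompt” vectors.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# A frozen base language model; only the soft prompt below gets trained.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False

n_prompt_tokens = 20
soft_prompt = torch.nn.Parameter(
    torch.randn(n_prompt_tokens, model.config.n_embd) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def training_loss(prefix, target):
    # Embed the real tokens and prepend the tunable prompt vectors.
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt").input_ids
    token_embeds = model.transformer.wte(
        torch.cat([prefix_ids, target_ids], dim=1))
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    # -100 marks positions ignored by the loss: only the target is predicted.
    ignore = torch.full((1, n_prompt_tokens + prefix_ids.shape[1]), -100)
    labels = torch.cat([ignore, target_ids], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=labels).loss

# One toy gradient step on a single labelled example.
loss = training_loss("Tweet: I love waiting in line for hours. Label:", " irony")
loss.backward()
optimizer.step()

The base model’s weights never change; the only thing being optimized is, effectively, the prompt itself, which is why a relatively small number of labelled examples is enough.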

Not just ease of use: better quality too!

There is one downside: gradient-based prompt tuning is a bit more data-hungry than prompt programming; you need about 100 examples instead of about 10. But is that really a problem?

Idea #2. Few-shot → low-resource

We believe that the idea is not to solve a task with as little data as possible, but rather to avoid the aforementioned problem of costly data-collection NRE. And in that regard, collecting 10 examples is almost the same as collecting 100 examples: you still don’t have to do crowdsourcing.

Therefore, as the metric for our project we chose quality (task-dependent) on the amount of training data a single person can label in a day by themselves: 100 examples for generation tasks, and 600 examples for classification tasks.

And if the person doing the labelling is the team’s product manager, they can transfer their wishes directly into the model, rather than going through an additional level of indirection: a crowdsourcing platform.

Idea #3. Measure and minimize variance

In addition to the actual quality, another important metric we track while improving our system is the variance of quality. It is closely associated with ease of use, i.e. the number of experiments you have to run before you get the quality you need.

We found that the main source of variance is operator skill, i.e. how well the operator of the system tunes the available hyperparameters. Prompt programming had the prompt as its main hyperparameter, and a lot of things could go wrong there. Prompt tuning doesn’t have the prompt, but it still has several hyperparameters of a more classical kind (e.g. the number of tunable vectors). Still not good enough.

In order to minimize operator-induced variance, we chose the most straightforward way possible: reduce the number of hyperparameters to zero! Counterintuitively, for proper hyperparameter tuning, the specific task matters somewhat less than the quality of the measurement setup (test set size, for instance), so as long as we control that ourselves, we can offer nearly-optimal hyperparameters automatically in most cases, AutoML-style.

With ±7 points of quality, prompt programming is a lottery!

The second-largest source of variance is data variance, i.e. which examples you put into your training dataset. The upcoming idea helps tremendously with that.

Idea #4. Task-to-task transfer learning

Transfer learning, in a nutshell, splits the training into two stages:

  • First, make the neural network learn how language works in general, without learning to do any specific task
  • Then, further teach that pre-trained model to solve the task you want (e.g. text summarization).

The first stage might take billions of examples to train, but the trade-off is that you only have to do pre-training once, and that the second stage now requires much less data.

So, you apply the general knowledge of language constructs in your natural language task. Makes sense.
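In code, the two stages look roughly like this (a sketch only, with bert-base-uncased from Hugging Face transformers standing in for whatever pre-trained model you actually use, and a couple of tweets standing in for the task data):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stage 1 has already been done for us: "bert-base-uncased" was pre-trained
# on a huge amount of general text.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Stage 2: fine-tune on the task we actually care about, with far less data.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tokenizer(
    ["seeing ppl walking w/ crutches makes me really excited",
     "nice weather today"],
    padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = irony, 0 = non-irony
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()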

But you can go further than that. There is a plethora of datasets available for a range of specific natural language tasks: classification, retrieval, summarization, question answering, named entity recognition, and so on. The recent T0 approach explores a way to incorporate those supervised datasets into the pre-training stage, in order to achieve task-to-task transfer.

The overwhelming urge to train on as much data as you can… Every NLP practitioner can relate.

The human equivalent of doing task-to-task transfer (vs. doing simple transfer) would be to hire a person with a PhD to do a text labelling job, instead of hiring a person with just a basic knowledge of the language.

Better transfer allows the network to transition from learning the task, to identifying the task:

This looks like question answering, with some notes of medium-length text summarization, and also a hint of nonsense detection.

— Our neural network (probably)

Can you learn a language with 100 examples? Certainly not. But if you already know a language, can you learn to perform a task in it, given 100 examples? Probably; it depends on the task. And if, judging by those 100 examples, the task closely resembles something you have already spent a long time learning to do properly? Then, most likely, yes.

When you look at it like that, it becomes less like magic. Which is good, since in the case of neural networks, it is quite rare that we really understand how something works.

Note: when a system is pre-trained on a lot of supervised datasets, a fair comparison with other systems becomes complicated, especially in a low-resource setup. Even if you split your datasets into meta-train and meta-test, you still have to account for similarity between tasks, which is a messy thing to do.

Nevertheless, outside of an academic context, it is obviously a good idea to pre-train on as much data as possible.

If you run Tune the Model on a public benchmark, take the results with a grain of salt. We’re good, but probably not that good :)

Let’s try Tune the Model on something! We are going to fine-tune a model that can classify whether a tweet contains irony.

https://tunethemodel.com/docs/classifier.html

1. Install Python module
pip install --upgrade tune-the-model

2. Set up your API key

In case you don’t have a key yet, please follow this guide.

export TTM_API_KEY=<insert your API key here>

3. Load TweetEval irony detection dataset

The dataset consists of lines with the following fields:

  • text: a string feature containing the tweet.
  • label: an int classification label with the following mapping: 0 is non_irony, 1 is irony.

The text in the dataset is in English, as spoken by Twitter users.

Example:

{'label': 1, 'text': 'seeing ppl walking w/ crutches makes me really excited for the next 3 weeks of my life'}

Let’s load the dataset:

from datasets import load_dataset


dataset = load_dataset("tweet_eval", "irony")
train = dataset["train"]
validation = dataset["validation"]
test = dataset["test"]

4. Fine-tune a model

# the tune-the-model package is imported as tune_the_model
import tune_the_model as ttm

model = ttm.tune_classifier(
    "tweet_eval-irony.json",
    train["text"],
    train["label"],
    validation["text"],
    validation["label"],
)
model.wait_for_training_finish()

Training takes between 30 minutes and 5 hours, depending on the dataset size.

For this particular dataset, training finished in just 30 minutes.

After the model is trained, we automatically get the inference microservice for the model!

5. Infer

res_validation = [
    model.classify(input=text)[0]
    for text in validation["text"]
]

res = [
    model.classify(input=text)[0]
    for text in test["text"]
]

6. Find the best threshold

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
import numpy as np

precision, recall, thresholds = precision_recall_curve(
    dataset["validation"]["label"], res_validation)

f1_scores = 2 * recall * precision / (recall + precision)
# precision_recall_curve returns one more precision/recall point than
# thresholds, so drop the last point when picking the best threshold.
best_idx = np.argmax(f1_scores[:-1])
threshold = thresholds[best_idx]
print("Best threshold: ", threshold)
print("Best F1-Score: ", f1_scores[best_idx])

7. Apply threshold

y_pred = 1 * (np.array(res) > threshold)
print(classification_report(dataset["test"]["label"], y_pred))

Results look like this:

              precision    recall  f1-score   support

           0       0.87      0.61      0.72       473
           1       0.60      0.86      0.71       311

    accuracy                           0.71       784
   macro avg       0.73      0.74      0.71       784
weighted avg       0.76      0.71      0.71       784

According to TweetEval GitHub leaderboard, 0.71 is the second best result in irony detection (remember the disclaimer about task-to-task-transfer-based system measurements, though). All in all, not too bad for a system with zero tuning.
