Choosing the best architecture for your text classification task

Viacheslav Zhukov
Published in Toloka Tech
Feb 15, 2023 · 7 min read

A review of models based on real-world applications

Modern large language models (LLMs) have demonstrated the best performance for many NLP tasks from text classification to text generation. But are they really a “silver bullet” or a one-stop-shop solution? Can they be applied across the board? Toloka’s ML team faces these kinds of tasks all the time, and our answer so far is a resounding “No.” Performance is not the only factor you should be concerned about when developing a model for a real use case. And you probably don’t want to spend your entire department’s budget on it either.

We’ve created a practical guide on ways to solve text classification problems — depending on how much data you have, the type of data, time and budget constraints, and other factors.

Approaches to text classification

Let’s start off with a brief overview of potential models and solutions you could use.

Old-school tf-idf models

Models in this category are founded on basic statistics like word counts and co-occurrences. The reduced feature space is usually passed to one of the classic ML models like an SVM, an MLP, or Naive Bayes. This method is easy to implement and does not require any specific libraries or accelerators: you're good to go with one of the classic solutions like sklearn or NLTK. Moreover, these models can easily handle both short and long texts. Given their relatively small size, they're highly efficient when it comes to training, deployment, and inference.
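As a rough illustration, a minimal tf-idf baseline with sklearn might look like the sketch below. The toy texts and labels stand in for your own dataset, and an MLP or Naive Bayes classifier could be swapped in for the linear SVM.

```python
# A minimal tf-idf baseline sketch with scikit-learn.
# The toy texts and labels below are placeholders for a real dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

texts = [
    "great product, works as expected",
    "absolutely love it, would buy again",
    "terrible, broke after a day",
    "waste of money, very disappointed",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = Pipeline([
    # Word counts re-weighted by inverse document frequency, with unigrams and bigrams.
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=1)),
    # Any classic classifier fits here: LinearSVC, MLPClassifier, MultinomialNB, ...
    ("model", LinearSVC()),
])

clf.fit(texts, labels)
print(clf.predict(["not bad at all", "broke immediately"]))
```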

Nevertheless, this approach has several drawbacks, the most important one being performance. Compared to other approaches, tf-idf models rank the lowest. Additionally, you’ll have to carry out extensive preprocessing of your texts (misspellings, stopwords, punctuation, lemmatization, and more). You also need a lot of data to produce a robust model — and your data should resemble academic texts, not social media, which can contain numerous misspellings as well as slang.

First embeddings and pre-trains

Word embeddings are excellent for text classification since each word (or sequence of characters) is represented by a vector of numbers containing useful information about the context, use, and semantics.

Take, for example, Word2Vec, its variations, and their implementation in the fastText library, which ships as a single binary you can run straight from the console and which handles multiprocessing out of the box. While you still need a large dataset to produce solid word embeddings and to train the classifier head, with proper configuration you can significantly reduce your preprocessing effort.

On average, it takes anywhere from 10 minutes to an hour to train a fastText model of 300 MB to 2 GB. Such a model can handle texts of any length, and inference is extremely fast, since all it does is construct a text embedding and pass it through an MLP. The availability of pre-trained word embeddings for a variety of languages makes fastText a natural baseline for almost any text classification problem.
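For reference, a supervised fastText classifier takes only a few lines to train. The training file path and hyperparameters below are placeholders, and the fasttext Python package is assumed to be installed.

```python
# A minimal fastText sketch. "train.txt" is a placeholder file in fastText's
# supervised format, one example per line, e.g.:
#   __label__positive great product, works as expected
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    epoch=10,
    lr=0.5,
    wordNgrams=2,  # add word bigrams on top of unigrams
    dim=100,       # embedding dimension; must match any pre-trained vectors you pass
)

# Predict the top label (and its probability) for a new text.
labels, probs = model.predict("broke after a day, very disappointed")
print(labels[0], probs[0])

# Persist the model; quantization can shrink it further if size matters.
model.save_model("classifier.bin")
```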

Small transformers

This category includes transformer-based language models such as BERT and RoBERTa, currently considered state-of-the-art in NLP. The line between small and large transformers is blurry, even if it seems obvious that a model with 110 million parameters is "small" and one with 175 billion parameters is "large". Whatever the label, several key advantages make transformers a great option: they're resistant to misspellings and usually require little preprocessing compared to other models.

Since you probably won't be training your own BERT and will likely use a pre-trained model from a library or hub (like Hugging Face) instead, you can create decent models with comparatively small datasets. If you have a common task and your domain is similar to one that already has a tuned version, you may only need a few hundred or a few thousand samples to tune the model slightly and achieve great results. The model size usually ranges from 600 MB to several gigabytes. It's also a good idea to have access to GPUs, as training may take some time to complete.

However, there are also some disadvantages to consider. The resulting model is much slower than Word2Vec, so if you require real-time inference you'll need to either use a GPU device or invest in model optimization (graph optimizations, ONNX, DeepSpeed, and others). Additionally, the length of text a model can handle is limited by its architecture, usually to about 512 tokens (~380 words). For longer texts, a more straightforward approach such as keeping only the first 192 tokens often works well in practice.
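Putting this together, a minimal fine-tuning sketch with the Hugging Face Trainer might look like the following. The checkpoint name, file paths, and hyperparameters are placeholders, and the transformers and datasets packages are assumed to be installed; note the truncation to the first 192 tokens mentioned above.

```python
# A minimal fine-tuning sketch with Hugging Face Transformers.
# "train.csv" and "val.csv" are placeholder files with "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # any small pre-trained transformer works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    # Keep only the first 192 tokens, as discussed above.
    return tokenizer(batch["text"], truncation=True, max_length=192)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)

trainer.train()
```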

LLMs

It's likely that you don't have your own LLM, and for good reason: these models are really big. The largest downloadable version of T5 alone is about 40 GB. You'll have to deploy such a model somehow, and inference may take time. That means you'll either need an expensive computational cluster or a service that provides an API, like OpenAI with its GPT-3 model.

One benefit is that LLMs require little data for tuning, and you don't need to worry about preprocessing. As a side note, zero-shot and few-shot approaches don't work well for text classification problems; you'll need to either fine-tune or p-tune (prompt-tune) your model. Working with APIs brings its own set of concerns: internet access, data security, SLAs, pricing, and more. The biggest plus, however, is the achievable performance.
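For illustration, calling such a fine-tuned model through the OpenAI Python client (the 0.x version of the openai package, current at the time of writing) might look roughly like the sketch below. The API key, model id, prompt format, and separator are hypothetical and have to match whatever format was used when preparing the fine-tuning data.

```python
# A rough sketch of classifying a text with a fine-tuned completion model via the
# OpenAI API (openai Python package v0.x). The key, model id, and prompt
# separator are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.Completion.create(
    model="ada:ft-your-org-2023-01-01-00-00-00",  # hypothetical fine-tuned model id
    prompt="Customer review: broke after a day, very disappointed\n\n###\n\n",
    max_tokens=1,    # the completion should be just the class label
    temperature=0,   # deterministic output for classification
)

print(response["choices"][0]["text"].strip())
```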

Choosing by scenario

As expected, all these approaches have their own pros and cons. Consider a variety of architectures to find a good fit for a specific real-world task. We recommend basing your decision on the actual requirements you have for your text classification problem.

Let’s go over some of the most common cases we’ve encountered — as well as our recommended approaches. Your text classification task will likely fall under one of these scenarios.

Your goal is to create a high-performing model

If performance really matters, choose a transformer. And spend some time searching for the optimal architecture and pre-trained weights, expanding your dataset, optimizing your pipeline and parameters, and so on. Also, try tuning an LLM, either your own or via an API. Just know that it will take time.

You have little data available

Go for LLMs or a small tuned transformer. If your task is general enough, you can leverage extensive model catalogs that are available across various hubs.

You have a lot of data available

Start with fastText and establish a baseline. This may be enough if your performance requirements aren't that strict. If they are, fine-tune one of the small pre-trained transformers. If an API is an option and you have a dedicated budget, you can try tuning an LLM too.

You have privacy or security concerns about your data

You don't want your data to leave a specific environment or be logged by a third-party service. An API is not an option until you have clarified logging and security guarantees with the provider. Choose local models that you can deploy yourself according to your hardware and software setup.

You have a common task and domain

Someone has probably already solved the task for you, and you can apply their solution. Simply look for applicable tuned transformers. LLMs will likely work too, but we've noticed that previously tuned transformers can outperform LLMs when the dataset is extremely small (a couple hundred samples). The difference, though, is minute.
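For example, picking up an already-tuned checkpoint from the Hugging Face Hub takes only a few lines; the sentiment model below is just one illustration of what a Hub search might surface.

```python
# A minimal sketch: reusing an already-tuned transformer from the Hugging Face Hub
# for a common task (here, English sentiment analysis). The checkpoint name is just
# one example; search the Hub for a model tuned on a domain close to yours.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier(["great product, works as expected",
                  "broke after a day, very disappointed"]))
```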

Your task or data is very specific

In this case, LLMs have the best performance compared to other approaches. Training an adequate small transformer is a challenge under these circumstances, and other architectures usually perform much worse.

Your model will be used for online inference (a latency budget of a few hundred milliseconds or less)

Try fastText because of its speed. If you’re not satisfied with the quality, you can try using a small transformer. However, you’ll most likely have to use an optimization mechanism or deploy your model with access to a GPU. There are lots of ways to speed up inference with BERT-like models. LLMs are usually not an option here unless they’ve been optimized.
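As one sketch of such an optimization, the model can be exported to ONNX and served with ONNX Runtime. The checkpoint name and file paths below are placeholders, and the torch, transformers, and onnxruntime packages are assumed to be installed.

```python
# A minimal ONNX export and inference sketch for a BERT-like classifier.
# Assumptions: a (fine-tuned) sequence classification checkpoint and the
# torch / transformers / onnxruntime packages.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.config.return_dict = False  # export a plain tuple of outputs
model.eval()

# Dummy batch used only to trace the graph during export.
dummy = tokenizer("benchmark sentence", return_tensors="pt",
                  padding="max_length", truncation=True, max_length=192)

torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "classifier.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"},
                  "logits": {0: "batch"}},
    opset_version=14,
)

# Inference with ONNX Runtime (CPU here; a GPU execution provider can be used instead).
import onnxruntime as ort

session = ort.InferenceSession("classifier.onnx", providers=["CPUExecutionProvider"])
enc = tokenizer("a new text to classify", return_tensors="np",
                padding="max_length", truncation=True, max_length=192)
logits = session.run(["logits"], {"input_ids": enc["input_ids"],
                                  "attention_mask": enc["attention_mask"]})[0]
print(logits.argmax(axis=-1))
```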

Your model will be used for batch processing only

Opt for a large model. While it seems straightforward at first glance, you still need an understanding of timing so that your batches don’t stack up.

You’re concerned about scalability

For example, your model will be widely used, and you expect high RPS on its endpoint. You’ll probably apply an orchestration mechanism like Kubernetes, assuming that your pods can be deployed and destroyed quickly, in which case there may be restrictions on model size (namely, image size). Therefore, fastText and small transformers are common options.

You have access to a computational cluster with modern GPUs

If so, you’re lucky! You can play with different types of transformers, even large ones, but the real question is, can you use a node pool with GPUs for inference? If yes, you can choose whatever you want, even LLMs. If not, you’ll probably find yourself optimizing a small transformer.

You have no access to modern hardware accelerators

That’s unfortunate to hear! Start with basic approaches and train a fastText model. You can also train a transformer model in this setup, but it will require a deeper understanding of optimization mechanisms. Another option is to move from classic libraries into something more specific like FasterTransformer.

You have a lot of time to build a model

Try any architecture, experiment with different pre-trained weights and hyperparameters, and enjoy some "loss-watching".

You have almost no time to build a model

In this case, fastText and API-accessible LLMs are good options. If your task is popular, you can tune an appropriate small transformer with a default set of hyperparameters. Still, API-accessible LLMs are usually the best choice.

As you can see, there’s no “silver bullet” or “one-size-fits-all” approach. As a key takeaway, try to avoid looking at your text classification challenge with only performance in mind. There are other factors that are worth considering like inference and training time, budget, scalability, privacy, data type, and so on.

Connect with us

Please feel free to reach out to us via Slack if you have any questions. We’re here to help.
