Easy text classification using minimal code and AI

Published in Sinch Blog · 10 min read · Jan 12, 2024


Hi, I’m Lucas Cléopas, and I’m a Machine Learning Engineer at Sinch.

Nowadays, kickstarting a project that incorporates Artificial Intelligence (AI) for natural language processing (NLP) is easier than ever. Recent progress in language models and their practical applications shows that you can create a proof of concept for a language task with only a few lines of code.

For instance, text classification is a fundamental NLP task in which you assign a label to a document, message, or any other textual content. It is common and versatile: it applies to a wide variety of situations, helps improve system efficiency, and saves a lot of time.

If you have some knowledge of AI, you are probably familiar with the challenges of obtaining proper data for training text classification models. Perhaps you only have the target data that needs to be classified, and using that for training is a major mistake. So, what options do you have?

Thankfully, we can take advantage of recent advances in AI and classify texts without any prior training on the task. Classifying text without using any data samples to train the language model is called zero-shot text classification.

Specifically, the models used for this task are trained on Natural Language Inference (NLI) [1]. In NLI, the goal is to determine the logical relationship between two sentences: whether one sentence entails, contradicts, or is neutral with respect to the other. If you want to dig deeper into this topic, consider reading this article [2].
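
To make the NLI framing concrete, here is a minimal sketch of how a classification problem can be turned into premise/hypothesis pairs. The template string mirrors the default hypothesis template of the Hugging Face zero-shot pipeline; the helper name `build_nli_pairs` is hypothetical and exists only for illustration.

```python
def build_nli_pairs(text, candidate_labels, template="This example is {}."):
    """Pair the text (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in candidate_labels]


pairs = build_nli_pairs(
    "Microsoft will delay the release of its SP2 update.",
    ["World", "Business", "Sports", "Sci/Tech"],
)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} -> hypothesis: {hypothesis!r}")
```

An NLI model then scores each pair, and the label whose hypothesis is most strongly entailed by the text is taken as the prediction.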

But don’t worry, we won’t implement all of that because it has already been done! Let’s say you need to classify news as World, Business, Sports, or Science/Technology. We will see in practice how a Large Language Model (LLM) can be useful for text classification without a labeled dataset to train a model. All you need is programming logic and some prior experience with Python.

Roll up your sleeves

First, you need a Python environment to start programming. I recommend using notebooks within Google Colab, since it gives you access to free resources such as a GPU (graphics processing unit). If you don’t know how to start using Google Colab, don’t worry, they have introductory content here (https://colab.research.google.com/?utm_source=scs-index).

Then, you need to install the necessary libraries for running and evaluating the models and datasets:

!pip install transformers==4.35.2 datasets==2.15.0 scikit-learn==1.2.2 
  • `transformers`: The transformers library, developed by Hugging Face, provides state-of-the-art NLP models like BERT, BART, and GPT-2. These models are pretrained on large text corpora and can be fine-tuned for specific NLP tasks.
  • `datasets`: This is another library from Hugging Face, which provides an efficient, user-friendly, and scalable repository of datasets for Machine Learning and NLP. It offers features to load, share, and manipulate datasets easily.
  • `scikit-learn`: It’s an open-source machine learning library in Python. It comes with various classification, regression, and clustering algorithms, along with tools for data preprocessing, model selection, and evaluation.

After installing the libraries, the next block of code includes two import lines. The first imports the `pipeline` function from the transformers library, and the second imports the `torch` library.

from transformers import pipeline 
import torch

The `pipeline` function of the transformers library simplifies the process of using pre-trained models for specific NLP tasks. It provides a simple and convenient interface for running inference with models like those available on Hugging Face, abstracting much of the underlying code complexity. This abstraction makes it easier to use sophisticated models for tasks such as text classification, translation, text generation, and more, without the need to write a large amount of code!

The torch library is used here only to check whether a GPU is available in the environment for faster processing.

In the next block of code, the variable `device` is set to `"cuda"` if a GPU (CUDA-compatible graphics card) is available, and to `"cpu"` otherwise. Then, the `pipeline` function creates a zero-shot classifier from the task name `"zero-shot-classification"`. In this example, we use the `"facebook/bart-large-mnli"` model. The device (CPU or GPU) is specified through the `device` variable.

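# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli", device=device)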

The bart-large-mnli model is a text classification model trained by Facebook researchers on the MultiNLI (MNLI) dataset, based on the bart-large model [3]. It will be used here to classify text sequences into any specified class.

Now the model is ready to make inferences! You will need the text you want to classify and the labels you want the model to assign to the sentence (candidate labels). The model will give the probabilities of the sequence belonging to each candidate label.

In this case, we have the sentence “Microsoft Pushes Off SP2 Release Microsoft will delay the release of its SP2 update for another week to fix software glitches.”, and we want to classify it as World, Business, Sports, or Sci/Tech. We know that the sentence is about Sci/Tech.

To perform the inference, all you need to do is:

candidate_labels = ["World", "Business", "Sports", "Sci/Tech"] 
sequences = ["Microsoft Pushes Off SP2 Release Microsoft will delay the release of its SP2 update for another week to fix software glitches."]

results = classifier(sequences, candidate_labels)

# Output:
# {
# "sequence":"Microsoft Pushes Off SP2 Release Microsoft will delay the release of its SP2 update for another week to fix software glitches.",
# "labels":[
# "Sci/Tech",
# "World",
# "Business",
# "Sports"
# ],
# "scores":[
# 0.44881758093833923,
# 0.22197122871875763,
# 0.19588175415992737,
# 0.13332943618297577
# ]
# }

According to the model’s prediction, the text is most likely related to the Sci/Tech label, with a score of 0.4488, which is correct! Remember that these scores are probabilities that add up to 1.0, indicating the model’s estimated likelihood for each label. Here, Sci/Tech gets the highest score, but more than half of the probability is spread across the other three labels, so the model is not strongly confident. This shows that the model can get confused when classes are similar, although in this case it got it right.
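
Here is a sketch of why the scores sum to 1.0: in the pipeline’s default single-label mode, the entailment score for each candidate label is normalized with a softmax across the labels. The logits below are hypothetical values chosen for illustration, not outputs of bart-large-mnli.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1.0."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["World", "Business", "Sports", "Sci/Tech"]
# Hypothetical entailment logits, one per candidate label (illustrative only).
entailment_logits = [0.3, 0.2, -0.2, 1.0]

scores = softmax(entailment_logits)
ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

If a text may belong to several classes at once, the pipeline also accepts `multi_label=True`, in which case each label is scored independently and the scores no longer sum to 1.0.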

But, how good is this model?

We can evaluate this model (as well as others) using a text classification dataset, such as ag_news [4], which includes over 127,000 news articles spread across the same classes I mentioned earlier: World, Business, Sports, and Sci/Tech. In this evaluation, since the model has already been trained, we will only use the test set, which consists of 7,600 samples balanced over the labels.

First, let’s import the evaluation and visualization tools. `load_dataset` is a function from the datasets library that we installed earlier; it loads a dataset available in the Hugging Face repository. The `classification_report` function calculates the classification results for us, covering the main metrics: accuracy, precision, recall, and F1-score. We can compute and plot the confusion matrix with the `ConfusionMatrixDisplay` class. Consider reading this article [5] to better understand classification model evaluation.

from datasets import load_dataset 
from sklearn.metrics import classification_report, ConfusionMatrixDisplay

Now, let’s load the `ag_news` dataset and prepare data for evaluation. `dataset_labels` is a dictionary that maps the label name to the corresponding number in the dataset. `candidate_labels` is a list containing the names of the labels that we will pass to the model for reference. `sequences` is a list of the texts in the dataset that will serve as input for the model. Finally, `true_labels` is a list that stores the true labels for comparison with the predicted labels later.

dataset = load_dataset('ag_news', split='test') 
dataset_labels = {
    "World": 0,
    "Sports": 1,
    "Business": 2,
    "Sci/Tech": 3,
}
candidate_labels = list(dataset_labels.keys())
sequences = dataset["text"]
true_labels = dataset["label"]

Based on this extracted information, we can reuse the instance of the previously created model to classify the texts in the test set of ag_news. We just need to pass the lists of texts and labels to the `classifier`. If you’re using the CPU, this execution might take a couple of hours.

results = classifier(sequences, candidate_labels)

Once we have the stored results, we iterate through them and obtain the corresponding number for each predicted class. For instance, if the model predicted `"Business"`, we map this prediction to 2 based on the previously created dictionary. This makes it easier to compare the predicted values with the actual values. The comparison is then performed by passing the lists `true_labels` and `pred_labels` to `classification_report`.

pred_labels = [dataset_labels[result['labels'][0]] for result in results] 

evaluation = classification_report(y_true=true_labels, y_pred=pred_labels, target_names=candidate_labels)
print(evaluation)

# Output:
#               precision    recall  f1-score   support
#
#        World       0.58      0.77      0.66      1900
#       Sports       0.95      0.92      0.93      1900
#     Business       0.62      0.66      0.64      1900
#     Sci/Tech       0.64      0.40      0.49      1900
#
#     accuracy                           0.69      7600
#    macro avg       0.70      0.69      0.68      7600
# weighted avg       0.70      0.69      0.68      7600

Looking at the results, the model appears to do a good job in classifying the Sports category, showing high precision, recall, and F1-score. However, it doesn’t perform as well for other categories. The overall accuracy stands at 69%.
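
As a quick check on how the columns of the report relate to each other, F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean over the classes. The sketch below recomputes both from the rounded precision and recall values in the report, so small rounding differences are possible in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the classification report.
report = {
    "World": (0.58, 0.77),
    "Sports": (0.95, 0.92),
    "Business": (0.62, 0.66),
    "Sci/Tech": (0.64, 0.40),
}

f1_per_class = {name: f1_score(p, r) for name, (p, r) in report.items()}
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

for name, value in f1_per_class.items():
    print(f"{name}: {value:.2f}")
print(f"macro avg: {macro_f1:.2f}")
```

The low F1-score for Sci/Tech is driven by its recall of 0.40: the model misses most Sci/Tech articles, even though it is right 64% of the time when it does predict that label.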

By looking at the confusion matrix (Figure 1), we can get a clearer view of the relationship between the true labels and the predicted labels. This matrix helps us understand why the model predicts wrong labels; you can see that it often confused Sci/Tech with the World and Business labels.

Figure 1: Confusion matrix. The darker the color, the more accurate the model is, and the lighter the color, the less accurate it is.
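
For readers who want to see what such a matrix contains, here is a minimal sketch that counts (true, predicted) pairs by hand. The two label lists are small hypothetical samples, not results from ag_news.

```python
from collections import Counter

label_names = ["World", "Sports", "Business", "Sci/Tech"]

# Hypothetical label ids for illustration only (not real ag_news results).
toy_true = [0, 1, 2, 3, 3, 3, 2, 0]
toy_pred = [0, 1, 2, 2, 0, 3, 2, 3]

# Count how often each (true, predicted) pair occurs.
pair_counts = Counter(zip(toy_true, toy_pred))

# Rows are true labels, columns are predicted labels.
matrix = [
    [pair_counts[(t, p)] for p in range(len(label_names))]
    for t in range(len(label_names))
]

for name, row in zip(label_names, matrix):
    print(f"{name:>8}: {row}")
```

With scikit-learn, `ConfusionMatrixDisplay.from_predictions(true_labels, pred_labels, display_labels=candidate_labels)` builds and plots the same kind of matrix from the full evaluation lists.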

Let’s check a sample from Sci/Tech that was predicted as Business: “IBM to hire even more new workers By the end of the year, the computing giant plans to have its biggest headcount since 1991”. The text is about the business side of a technology company: the dataset annotator labeled it as Sci/Tech and our model as Business, but both categories could be considered correct. This is an ambiguity, and you should avoid it whenever possible in your dataset to reduce the risk of degrading your model’s performance.

However, since the dataset is balanced, we can still say that the model classifies almost 70% of the texts correctly without much effort, that is, without collecting training data, training models, or fine-tuning them.

If 70% accuracy is enough to meet your needs, this model can be used in your application/service. If it’s not, now you have an acceptable baseline so you can invest more effort to surpass it. For example, you can improve the results by:

  • Setting distinct labels that reflect the sentences to mitigate ambiguity in classification.
  • Providing 3 or 5 labeled samples in a prompt to LLMs like GPT-3 to perform few-shot inference [6].
  • Fine-tuning a language model like DistilBERT with more samples, enabling the model to adapt and specialize for the specific task capturing intricate patterns [7].
  • Adding counterfactual samples into the training set to enhance the model’s ability to discriminate between labels [8].
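
As an illustration of the few-shot option in the list above, the following sketch assembles a prompt from a handful of labeled examples. The helper name, the prompt layout, and the example texts are all hypothetical; the resulting string would be sent to whichever LLM you use, which is not shown here.

```python
def build_few_shot_prompt(examples, labels, text):
    """Assemble a classification prompt from labeled examples (hypothetical layout)."""
    lines = [f"Classify the news text as one of: {', '.join(labels)}.", ""]
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Label: {example_label}", ""]
    # Leave the final label empty so the model completes it.
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)


labels = ["World", "Business", "Sports", "Sci/Tech"]
# Made-up examples for illustration only.
examples = [
    ("Central bank raises interest rates again.", "Business"),
    ("Home team wins the championship final.", "Sports"),
    ("New chip doubles smartphone battery life.", "Sci/Tech"),
]

prompt = build_few_shot_prompt(examples, labels, "Leaders meet for peace talks.")
print(prompt)
```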

Things to keep in mind

You can find various language models based on NLI in the Hugging Face repository. Some are multilingual, while others are specific to a particular language. Keep in mind that the more specific a model is to a certain task or language, the better it tends to perform. For example, if you want to classify texts in a language other than English, first search for a model specific to your language; if you don’t find one, search for a multilingual model.

Be mindful of the labels you define for classification. Language models can have biases associated with their training data. This can be a concern if you add “controversial” classes for the model to classify texts, such as classes related to gender, politics, and race. It could lead your applications to misclassify certain content based on the biases the model holds. In [9], you can read more about this topic.


You just created a proof of concept for news topic classification! Zero-shot text classification models can be an interesting choice for building proofs of concept, as they do not require much effort or time to obtain something that works. Even if the zero-shot approach might not be ready for production, we can still use it to label data and save time on this time-consuming task.
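
One way to use zero-shot predictions for labeling, sketched below, is to keep only predictions whose top score clears a confidence threshold and send the rest to human review. The threshold of 0.6 and the sample results are hypothetical; the dictionary layout follows the pipeline output shown earlier.

```python
def split_by_confidence(results, threshold=0.6):
    """Separate confident predictions from those needing human review."""
    auto_labeled, needs_review = [], []
    for result in results:
        # The pipeline sorts labels by descending score, so index 0 is the top one.
        top_label, top_score = result["labels"][0], result["scores"][0]
        if top_score >= threshold:
            auto_labeled.append((result["sequence"], top_label))
        else:
            needs_review.append(result["sequence"])
    return auto_labeled, needs_review


# Hypothetical results for illustration only.
sample_results = [
    {"sequence": "Team wins cup final",
     "labels": ["Sports", "World"], "scores": [0.91, 0.09]},
    {"sequence": "IBM to hire more workers",
     "labels": ["Business", "Sci/Tech"], "scores": [0.52, 0.48]},
]

auto_labeled, needs_review = split_by_confidence(sample_results)
print(auto_labeled)
print(needs_review)
```

A higher threshold yields fewer but cleaner automatic labels, so the value is worth tuning against a small hand-checked sample.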

At Sinch, we applied zero-shot text classification to set up a baseline for classifying S.H.A.F.T. (Sex, Hate, Alcohol, Firearms, and Tobacco) content. The outcomes enabled us to quickly find a solution for this proof of concept (using AI to filter unsolicited content in texts), and we could then improve the results by fine-tuning other language models.

Interested in learning more about Sinch and perhaps becoming a part of our team? Check out our Careers page!


  1. MacCartney, Bill, and Christopher D. Manning. “Modeling semantic containment and exclusion in natural language inference”. Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 2008.
  2. Davison, Joe. “Zero-Shot Learning in Modern NLP”. Joe Davison Blog. Available at https://joeddav.github.io/blog/2020/05/29/ZSL.html. Last accessed on December 13th, 2023.
  3. Lewis, Mike, et al. “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.” arXiv preprint arXiv:1910.13461 (2019).
  4. Zhang, Xiang, Junbo Zhao, and Yann LeCun. “Character-level convolutional networks for text classification.” Advances in neural information processing systems 28 (2015).
  5. Harikrishnan. “Confusion Matrix, Accuracy, Precision, Recall, F1 Score”. Medium. Available at https://medium.com/analytics-vidhya/confusion-matrix-accuracy-precision-recall-f1-score-ade299cf63cd. Last accessed on December 27th, 2023.
  6. Lin, Xi Victoria, et al. “Few-shot learning with multilingual language models.” arXiv preprint arXiv:2112.10668 (2021).
  7. Banjara, Babina. “A Comprehensive Guide to Fine-Tuning Large Language Models”. Analytics Vidhya. Available at https://www.analyticsvidhya.com/blog/2023/08/fine-tuning-large-language-models/. Last accessed on December 15th, 2023.
  8. Sinch. “Robustifying Conversational AI with Counterfactuals”. Medium. Available at https://medium.com/wearesinch/robustifying-conversational-ai-with-counterfactuals-55c73f7e54a6. Last accessed on January 10th, 2024.
  9. Draelos, Rachel. “Bias, Toxicity, and Jailbreaking Large Language Models (LLMs)”. Medium. Available at https://medium.com/p/37cd71a3048f. Last accessed on December 12th, 2023.


