Two minutes NLP — Beginner intro to Hugging Face main classes and functions

Pipeline, Datasets, Metrics, and AutoClasses

Fabio Chiusano
NLPlanet
5 min read · Feb 23, 2022


Hello fellow NLP enthusiasts! Today we go through an introductory tutorial on a very popular NLP library: Hugging Face. This article contains an overview of its main classes and functions, along with some code examples. Enjoy! 😄

Hugging Face is an open-source library for building, training, and deploying state-of-the-art machine learning models, especially for NLP. Let’s dive right into the code!

Hugging Face provides two main libraries, transformers for models and datasets for datasets (of course 🙂). You can install them using pip as usual.
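
For example, both can be installed in one line:

```
pip install transformers datasets
```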

Pipeline

Using pipeline from the transformers library is the quickest and easiest way to start experimenting with the models: feed the pipeline object with the name of a task, and a suitable model is automatically downloaded from the Hugging Face model repository, ready to use!

There are several tasks already managed by the library, for example:

- Sentiment analysis
- Text generation
- Question answering
- Summarization
- Translation
- Named entity recognition

… and many others. You can find the complete list here; there are Computer Vision and Audio tasks as well.

In this article, we use a pipeline for the sentiment analysis task. To predict the sentiment of a sentence, just pass the sentence to the model.
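
Here is a minimal sketch of what this looks like (the example sentence is illustrative, not taken from the article):

```python
from transformers import pipeline

# Download a suitable default model for the task and build the pipeline
classifier = pipeline("sentiment-analysis")

# Predict the sentiment of a single sentence
print(classifier("I love this movie!"))
# [{'label': 'POSITIVE', 'score': 0.99...}]
```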

The model output is a list of dictionaries, where each dictionary has a label (for this specific example, with values “POSITIVE” or “NEGATIVE”) and a score (i.e. the score of the predicted label).

You can feed the classifier with multiple sentences and get all the results in one function call.
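
For instance, with a couple of illustrative sentences:

```python
print(classifier([
    "I love this movie!",
    "This was a waste of my time.",
]))
# [{'label': 'POSITIVE', 'score': 0.99...},
#  {'label': 'NEGATIVE', 'score': 0.99...}]
```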

What if we want to use a different model from the model repository for sentiment analysis? We can do that by specifying the model and tokenizer arguments of the pipeline with the name of the model to use.
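
As a sketch, here is how that looks. The article does not name the model it uses at this point, so treat the model name below as an assumption; nlptown/bert-base-multilingual-uncased-sentiment is a plausible choice that matches the five-star output format described later in the AutoClasses section.

```python
# Assumed model name, chosen for illustration
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

classifier = pipeline(
    "sentiment-analysis",
    model=model_name,
    tokenizer=model_name,
)

print(classifier("I love this movie!"))
# [{'label': '5 stars', 'score': 0.9...}]
```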

The model page provides useful info about the specific model for a correct interpretation of its results, such as how its output is formatted.

Dataset

With the datasets library, we can easily download some of the most common benchmarks used in NLP.

Let’s try loading the Stanford Sentiment Treebank (SST2), which consists of sentences from movie reviews and human annotations of their sentiment. It uses the two-way (positive and negative) class split, with only sentence-level labels. We can find the SST2 dataset under the datasets library, stored as a subset of the GLUE dataset. We load the dataset using the load_dataset function.
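
In code:

```python
from datasets import load_dataset

# SST2 is stored as a subset of the GLUE benchmark
dataset = load_dataset("glue", "sst2")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['sentence', 'label', 'idx'], ...})
#     validation: Dataset({...})
#     test: Dataset({...})
# })
```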

The dataset comes already split into train, validation, and test sets.

We can call the load_dataset function with the split argument to directly get the split of the dataset we are interested in.
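
For example, to load only the training split:

```python
train_dataset = load_dataset("glue", "sst2", split="train")
```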

If we want to explore the dataset using Pandas, we can easily create a dataframe using the dataset object directly.
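
One way to do it (Dataset objects expose a to_pandas method; the article's exact code is not shown):

```python
# Convert the Dataset object into a Pandas dataframe
df = train_dataset.to_pandas()
df.head()
```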

Dataframe showing sample sentences and labels from the Stanford Sentiment Treebank (SST2) dataset. Image by the author.

Pipeline on GPU

Now that we have loaded a dataset about sentiment analysis, let’s try using a sentiment analysis model with it.

To extract the list of sentences in the dataset, we can access its data attribute. Let’s predict the sentiment of 500 sentences and measure how much time it takes.
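
A sketch of one possible version of this experiment. For simplicity, we index the dataset by column name rather than going through the data attribute, and we go back to the default sentiment analysis model; exact timings will vary with your hardware.

```python
import time

from transformers import pipeline

# Default sentiment analysis model, running on CPU
classifier = pipeline("sentiment-analysis")

# Take the first 500 sentences from the training set
sentences = train_dataset["sentence"][:500]

start = time.time()
predictions = classifier(sentences)
print(f"Elapsed: {time.time() - start:.1f} seconds")
```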

It took 21.8 seconds to predict the sentiment of 500 sentences, with a mean of 23 sentences per second. Not bad, but we can do better by leveraging a GPU.

To make our classifier use the GPU, we must create it with pipeline, passing device=0: by doing so, we ask the library to run the model on the corresponding CUDA device. Each id starting from zero maps to a CUDA device, while the value -1 is reserved for the CPU.
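
For example:

```python
# device=0 selects the first CUDA device; device=-1 would run on the CPU
classifier = pipeline("sentiment-analysis", device=0)
```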

This time, predicting the sentiment of 500 sentences took only 4.1 seconds, with a mean of 122 sentences per second, improving the speed by roughly six times!

Metrics

What if we want to test the quality of our sentiment classifier on the SST2 dataset? Which metric should we use?

In Hugging Face, metrics and datasets are paired together in the datasets library. In order to retrieve the correct metric, we can call the load_metric function with the same arguments that we used with the load_dataset function.
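
In code:

```python
from datasets import load_metric

# Same arguments as load_dataset
metric = load_metric("glue", "sst2")
```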

Then, we call the compute function of the metric object using as arguments the predictions made by the model and the references taken directly from the dataset. For the SST2 dataset specifically, the metric is accuracy.
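
A sketch, assuming the default pipeline model whose labels are “POSITIVE” and “NEGATIVE”; these string labels must be mapped to SST2’s integer labels before calling compute:

```python
# Map the pipeline's string labels to SST2's integer labels (0 = negative, 1 = positive)
label_map = {"NEGATIVE": 0, "POSITIVE": 1}
preds = [label_map[p["label"]] for p in predictions]

# References come directly from the dataset
references = train_dataset["label"][:500]

print(metric.compute(predictions=preds, references=references))
# {'accuracy': ...}
```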

AutoClasses

Under the hood, the pipeline is powered by AutoModel and AutoTokenizer classes. An AutoClass (i.e. a general class like AutoModel and AutoTokenizer) is a shortcut that automatically retrieves the architecture of a pre-trained model (or tokenizer) from its name or path. You only need to select the appropriate AutoModel for your task and its associated tokenizer with AutoTokenizer: in our example, since we are classifying text, the correct AutoModel is AutoModelForSequenceClassification.

Let’s return to our example and see how you can use the AutoModelForSequenceClassification and AutoTokenizer to replicate the results of the pipeline.

We create a tokenizer object using the AutoTokenizer class and a model object using the AutoModelForSequenceClassification class. In both cases, all we need to do is pass the name of the model, and the library manages everything else.
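
A sketch, again assuming the nlptown/bert-base-multilingual-uncased-sentiment model mentioned earlier:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed model name, consistent with the five-star output described below
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
```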

Next, let’s see how to tokenize sentences using the tokenizer. The tokenizer output is a dictionary composed of input_ids (i.e. the id of each token detected in the input sentences, taken from the tokenizer vocabulary), token_type_ids (used in models where two texts are needed for each prediction; we can ignore them for now), and attention_mask (showing where padding was applied during tokenization).
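
For example (the sentences are illustrative):

```python
sentences = ["I love this movie!", "This was a waste of my time."]

# padding=True pads shorter sentences; return_tensors="pt" returns PyTorch tensors
tokens = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
print(tokens.keys())
# dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
```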

The tokenized sentences are then passed to the model, which outputs the predictions. This specific model outputs five scores, one for each possible star rating of a review (from one to five stars).

The model outputs the final activations in the logits attribute. Apply the softmax function to the logits to retrieve the probabilities of each label.
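
Putting the last two steps together:

```python
import torch

with torch.no_grad():
    outputs = model(**tokens)

# Turn the raw logits into probabilities over the five star ratings
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(probs)
```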

Save and load models locally

Last, we see how to save models locally. This can be done using the save_pretrained function of tokenizers and models.
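
For example (the directory name is just an illustrative choice):

```python
save_directory = "my_sentiment_model"  # hypothetical local path

tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
```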

If you want to load a model that you saved before, you can load it using the from_pretrained function of the right AutoModel class.
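
Continuing the example above:

```python
tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
```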

Conclusion

In this article, we saw the main classes and functions of the Hugging Face library. We learned about the transformers and datasets libraries, and how to use pipeline to load models in a few lines of code, running them on either CPU or GPU. We saw how to load benchmark datasets directly from the libraries and how to compute metrics. Finally, we peeked into AutoModel and AutoTokenizer, ending with how to save and load models locally.
