Using a Pre-Trained Transformer Model and Tokenizer in Hugging Face to Classify Text

Yuan An, PhD
3 min read · Sep 26, 2023

This is a series of short tutorials about using Hugging Face. The table of contents is here.

In this lesson, we will learn how to classify text by using models in Hugging Face.

Text classification is the problem of assigning a pre-defined label to a piece of text. For example, if somebody wrote a movie review such as "I like the movie very much. It is fantastic!", we want to label the review as POSITIVE or NEGATIVE. This kind of text classification is called sentiment analysis.

To work with Hugging Face, we typically create an instance of pipeline(), which takes care of the rest of the classification process. The pipeline() function can do many things. For sentiment analysis, we create an instance by passing 'sentiment-analysis' as a parameter.

pipeline() needs a tool called a 'tokenizer' for pre-processing text and a model to perform the classification.
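To build intuition for what the tokenizer's pre-processing step does, here is a toy sketch. This is not the real WordPiece algorithm that BERT uses, and the vocabulary below is invented for illustration; the real tokenizer handles subwords, special tokens, and a 30,000+ entry vocabulary.

```python
# Toy illustration of tokenization (NOT the real WordPiece algorithm):
# split text into lowercase words and map each word to an integer ID.
# The vocabulary here is made up for the example.
vocab = {"[UNK]": 0, "i": 1, "like": 2, "the": 3, "movie": 4, "very": 5, "much": 6}

def toy_tokenize(text):
    words = text.lower().replace("!", "").replace(".", "").split()
    # Unknown words fall back to the [UNK] ID, as real tokenizers do.
    return [vocab.get(w, vocab["[UNK]"]) for w in words]

print(toy_tokenize("I like the movie very much"))  # [1, 2, 3, 4, 5, 6]
```

The model then consumes these integer IDs rather than raw text, which is why the tokenizer and model must agree on the vocabulary.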

First of all, install the transformers library from Hugging Face:

! pip install transformers

Create an Instance of Pipeline() with the Default Model

If we don’t give a model and tokenizer to pipeline(), it will use a default model and tokenizer.

Let us import pipeline() from transformers and create an instance of pipeline():

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

Since we didn’t provide a model to pipeline(), it will use a default model. In this case, it is distilbert-base-uncased-finetuned-sst-2-english.

Classify a Piece of Text by the Default Model

Let us prepare a piece of text:

text = "Wing sauce is like water. Pretty much a lot of butter and some hot sauce \
(franks red hot maybe). The whole wings are good size and crispy, but for $1 a wing \
the sauce could be better. The hot and extra hot are about the same flavor/heat. \
The fish sandwich is good and is a large portion, sides are decent."

Now, we can classify the text as POSITIVE or NEGATIVE by using the classifier:


classifier(text)

We got the result:

[{'label': 'POSITIVE', 'score': 0.9995321035385132}]
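The pipeline returns a list with one dict per input text, so the label and score are easy to pull out. A small sketch using the result shown above as example data:

```python
# The pipeline returns a list with one dict per input text.
# Using the result shown above as example data:
result = [{"label": "POSITIVE", "score": 0.9995321035385132}]

prediction = result[0]
print(prediction["label"])            # POSITIVE
print(round(prediction["score"], 4))  # 0.9995
```

Passing a list of strings to the classifier returns one such dict per string, in the same order.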

Create an Instance of Pipeline() with the bert-base-uncased BERT Model and Tokenizer

If we want to use a specific pre-trained model for classification, we need to supply both a model and a tokenizer. To do this, we will need to create instances of AutoTokenizer and AutoModelForSequenceClassification.

Let us import AutoTokenizer and AutoModelForSequenceClassification from transformers:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

We first create an instance of tokenizer from a pre-trained model. Here we will use a generic model, bert-base-uncased.

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
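If you want to peek at what this tokenizer produces (assuming the bert-base-uncased checkpoint can be downloaded from the Hub), you can encode a sentence and inspect the result:

```python
from transformers import AutoTokenizer

# Assumes the bert-base-uncased tokenizer files can be downloaded.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("It is fantastic!")
print(list(encoding.keys()))  # e.g. ['input_ids', 'token_type_ids', 'attention_mask']
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```

The input_ids are the integer IDs the model consumes; BERT also adds special [CLS] and [SEP] tokens around the sentence.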

Next, we will create an instance of the pre-trained model:

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

Remember, both the model and the tokenizer should be instantiated from the same pre-trained model.

Classify a Piece of Text by the BERT Model

Finally, we create an instance of pipeline() again, but this time we supply the model and tokenizer:

classifier_bert = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Now, we can classify the text by using the specific model:

classifier_bert(text)

We got the following result:


[{'label': 'LABEL_1', 'score': 0.526148796081543}]

Note: here we read LABEL_1 as positive and LABEL_0 as negative. The pipeline reports these generic names because the bert-base-uncased config does not define human-readable labels. Also note that bert-base-uncased has not been fine-tuned for sentiment analysis, so its classification head is newly initialized; the score near 0.5 reflects that the untrained head has little confidence either way.
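If you want readable names instead of the generic LABEL_0/LABEL_1 strings, one option (assuming the mapping in the note above, LABEL_1 = positive) is to translate them after the fact:

```python
# Map the pipeline's generic labels to readable names, assuming
# LABEL_1 = positive and LABEL_0 = negative as noted above.
id2label = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}

result = [{"label": "LABEL_1", "score": 0.526148796081543}]  # output shown above
readable = [{"label": id2label[r["label"]], "score": r["score"]} for r in result]
print(readable)  # [{'label': 'POSITIVE', 'score': 0.526148796081543}]
```

Alternatively, transformers lets you pass an id2label mapping to from_pretrained so the pipeline reports these names directly.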

The colab notebook is available here:


Yuan An, PhD

Faculty member in the College of Computing and Informatics at Drexel University; Doing research in NLP, Machine Learning, Ontology, Knowledge Graph, Embeddings