Getting Started with Transformers & HuggingFace

Yash Kushwaha
tiket.com
6 min read · Jul 25, 2022

Have you ever wondered how neural machine translation (Google Translate) works, or how ViT (Vision Transformer) overtook the long-established ResNet as the go-to image classification model?

What both of these use cases have in common is the Transformer.

Photo by Aditya Vyas on Unsplash

No, not this one!
Transformers are designed to process sequential input data, such as natural language, for tasks like translation and text summarization. Unlike RNNs, transformers process the entire input all at once (i.e. in parallel), which makes them significantly faster than RNNs and LSTMs.

Transformers are built on the concept of attention, introduced in the paper Attention Is All You Need. Jay Alammar summarizes the architecture wonderfully in this post.

To summarize, a transformer (taking a language transformer as an example) follows an Encoder-Decoder design: the Encoder consumes and processes the entire input at once, and the Decoder then generates the result.

Transformers in a nutshell

An Encoder consists of 2 core elements:
a. Self-Attention
b. Feed-Forward Neural Network
Whereas a Decoder consists of 3 important elements:
a. Self-Attention
b. Encoder-Decoder Attention
c. Feed-Forward Neural Network

The process breaks down into the following steps:

  1. Convert the input sentence into word embeddings, then generate a Query (q1…qn), Key (k1…kn), and Value (v1…vn) vector for each word. These vectors are created by multiplying each embedding by three weight matrices (WQ, WK, WV) that are randomly initialized and then iteratively improved with back-propagation during training. In the original paper, each of these vectors has a dimension of 64.
Pictorial representation of the weight matrices

2. Generate a score for each word by taking the dot product of its query vector with every key vector (q1·k1 … q1·kn, and likewise for each query).

3. Divide each score by 8 (the square root of the key dimension, √64) and then apply the softmax function to obtain attention weights.

The attention formula used: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

4. Multiply each value vector by its softmax score (v1·s1 … vn·sn) and sum the results to produce an output vector z for that position. One set of Q/K/V weight matrices produces one such output, called an attention head; with 8 heads we get Z0…Z7.

5. Because transformers use "multi-headed" attention, each head can focus on different words of relevance. We concatenate the head matrices (Z0…Z7) and multiply them by an additional weight matrix WO, which is trained jointly with the model. The resulting matrix Z condenses the information from all heads and is passed on to the Feed-Forward Neural Network. (A runnable sketch of steps 1-5 follows this walkthrough.)

Pictorial representation of the concatenation.

6. Coming back to the word embeddings: on their own, they carry no information about word order. To overcome that, we add "positional encoding vectors" to the "embedding vectors", giving the model a sense of sequence order; words that are closer together in the sentence end up with encodings that are closer in distance (loosely similar to how Word2vec places related words near each other).

7. The output then passes through the Feed-Forward Neural Network (FFNN); each sub-layer is followed by an "Add & Normalize" step (a residual connection plus layer normalization).

8. Let's move to the Decoder. It consists of the 3 elements discussed above, and most of its processing mirrors the Encoder. The main differences are the Encoder-Decoder Attention layer and the inputs: the Encoder's output is fed into the Encoder-Decoder Attention of each Decoder layer (followed by the FFNN), and the token generated at each step is passed back in as input until the end-of-sequence (EOS) token is produced. This can be further understood from the diagram below.

The complete architecture of transformers.

Note: the Encoder processes its input in parallel, all at once, but the Decoder generates its output one token at a time.
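To make steps 1-5 concrete, here is a minimal, self-contained sketch of single-head scaled dot-product self-attention in PyTorch. The toy dimensions and random weights are illustrative placeholders; in a real transformer the weight matrices are learned, and the embeddings would already include the positional encodings from step 6.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 512, 64     # 4 tokens, embedding size 512, head size 64

x = torch.randn(seq_len, d_model)      # word embeddings (+ positional encodings)
W_q = torch.randn(d_model, d_k)        # step 1: the three weight matrices,
W_k = torch.randn(d_model, d_k)        # randomly initialized here but learned
W_v = torch.randn(d_model, d_k)        # via back-propagation in practice

Q, K, V = x @ W_q, x @ W_k, x @ W_v    # query/key/value vectors for every token
scores = Q @ K.T                       # step 2: dot-product scores
weights = F.softmax(scores / d_k**0.5, dim=-1)  # step 3: scale by sqrt(64)=8, softmax
Z = weights @ V                        # step 4: weighted sum of the value vectors
print(Z.shape)                         # one 64-dim output vector per token: (4, 64)

Step 5 would repeat this with 8 independent sets of weight matrices and concatenate the resulting Z0…Z7 before multiplying by WO.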

Comparison between different transformers

Several types of transformers have been developed and improved over time; a simple comparison can help determine the best transformer for your use case.

Source: Comparison of different BERT versions.

With this basic understanding of Transformers in place, let us implement them with the Hugging Face library as our secret weapon:

About HuggingFace

Hugging Face is an open-source machine learning ecosystem used to build, train, and deploy state-of-the-art models.

Hugging Face provides broad task categories to choose from, e.g. text classification, token classification, translation, summarization, and image segmentation.

Getting Started

!pip install transformers

Head over to the Hugging Face models page: https://huggingface.co/models

Feel free to play around with different models; for starters, let us choose a simple task, e.g. text classification:

There are 2 ways to work with the different models:

1. Use a built-in pipeline directly (a lot simpler)

from transformers import pipeline
pipe = pipeline("text-classification")
print(pipe("Alive is awesome"))

The output looks something like this (the default model may change over time, and the exact score will vary):
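[{'label': 'POSITIVE', 'score': 0.9998}]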

There are a total of 16 direct-use pipelines (link). However, if you want to use a specific model from the model hub, you can pass it in as well, along with a tokenizer different from the default one.
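For example, a pipeline can be pointed at an explicit checkpoint (the model named below is just an illustration; any compatible text-classification model from the hub works):

from transformers import pipeline

# Same pipeline task, but with an explicitly chosen hub checkpoint
pipe = pipeline("text-classification",
                model="distilbert-base-uncased-finetuned-sst-2-english")
print(pipe("Alive is awesome"))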

2. Use the transformers classes directly

This method provides a more robust way to declare both the model and the tokenizer according to your needs.

Example:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load a NER model fine-tuned on CoNLL-2003 and a matching tokenizer
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = ("The central bank is likely to signal an exit from negative interest rate policy. "
            "That won't dent the appeal of investing in fiat alternatives like cryptocurrencies, "
            "one observer said.")

inputs = tokenizer(sequence, return_tensors="pt")
tokens = inputs.tokens()

# Pick the highest-scoring entity label for every token
outputs = model(**inputs).logits
predictions = torch.argmax(outputs, dim=2)

Let’s see the predictions:

for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

Appendix for predictions:

O, Outside of a named entity
B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC, Miscellaneous entity
B-PER, Beginning of a person's name right after another person's name
I-PER, Person's name
B-ORG, Beginning of an organisation right after another organisation
I-ORG, Organisation
B-LOC, Beginning of a location right after another location
I-LOC, Location

Image segmentation using Transformers

Transformers can be used for image segmentation as well; you can access it from here. The model used here is "facebook/detr-resnet-50-panoptic".

A sample image used for image segmentation
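As a rough sketch, the segmentation pipeline can be tried as follows; the image path below is a placeholder, and DETR additionally requires the timm package to be installed:

from transformers import pipeline

segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")

# Accepts a local path or an image URL; returns one dict per detected segment
for segment in segmenter("sample.jpg"):
    print(segment["label"], segment["score"])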

Fine Tuning Pre-trained Model On Custom Dataset

We can also fine-tune a pre-trained model on our own dataset, whether in TensorFlow or PyTorch (link).
For fine-tuning your own model, please refer to the notebook example [here].
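For a flavour of what PyTorch fine-tuning looks like with the Trainer API, here is a minimal sketch; the dataset, checkpoint, and hyperparameters are placeholder assumptions rather than the notebook's actual setup:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # placeholder dataset with 'text'/'label' columns
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased",
                                                           num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=8),
    # Train on a small subset just to keep the sketch quick to run
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(1000)),
)
trainer.train()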

Conclusion

  1. We started with the basics of transformers: how self-attention is calculated from the q, k, v vectors, and how inputs are pre-processed for the Encoder and Decoder stacks.
  2. Compared different transformers.
  3. Got hands-on with transformers for NLP as well as image segmentation.
  4. Had a look at fine-tuning a pre-trained model on our own data.

References & Recommended Reads

  1. Attention Is All You Need, research paper by Ashish Vaswani et al.
  2. Jay Alammar's article on transformers.
  3. https://towardsdatascience.com/transformers-89034557de14
  4. https://github.com/huggingface/notebooks/blob/main/transformers_doc/task_summary.ipynb
