Transformers and Working Principles

Hüseyin Çayırlı
Published in Huawei Developers
8 min read · Jan 15, 2024
Transformer Architecture

Introduction

Hello everyone! 👋
In this article, I will talk about what Transformer models are, how they work, what they are used for, and where they are used.
The Transformer architecture was proposed in the paper “Attention Is All You Need”, published in 2017, and it proved revolutionary. This architecture, which paved the way for the chatbots that are so popular today, changed the course of Natural Language Processing (NLP) research. So how did this happen? Let’s take a more detailed look.

Natural Language Processing (NLP)

To understand the impact of Transformers, it is important to first understand what natural language processing is. Natural language processing is a field that focuses on understanding human language using machine learning methods. Instead of interpreting words one by one, it tries to grasp the semantic context of the entire sentence or paragraph. Example application areas of NLP models are:

  • Sentence Classification
  • Classifying Words in Sentences
  • Creating Text Content
  • Creating Answers Based on a Question Given in a Text
  • Language Translation

To handle the tasks listed above, Seq2Seq architectures built with LSTM or RNN layers were proposed before the Transformer architecture entered the literature. Seq2Seq is a machine learning model that transforms one sequence of inputs into another sequence. Seq2Seq models consist of two structures called the encoder and the decoder. The encoder extracts and encodes the context contained in the input; the decoder then converts the encoded information coming from the encoder into the desired output. In Seq2Seq models, LSTM or RNN layers can be used within both the encoder and the decoder. However, LSTM-based Seq2Seq models have been shown to lose certain context information, especially when faced with long sentences and paragraphs. For example, consider the sentence:

“Don’t go to the beach when it is snowy”

While the model can understand the context of each word in this sentence at a certain level, in some cases the information carried by the word “Don’t” may be lost, and the context extracted by the model may change as follows:

“Go to the beach when it is snowy.”

This was one of the important problems encountered in the development of NLP models.

Seq2Seq Model Architecture
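
To make the encoder-decoder split described above concrete, here is a minimal sketch in PyTorch with toy dimensions and random token ids; it is only an illustration of the structure, not the exact architecture from any particular paper.

```python
import torch
import torch.nn as nn

# A minimal Seq2Seq sketch: the encoder compresses the input sequence into
# hidden states, and the decoder generates the output sequence from them.
class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):
        embedded = self.embedding(src_ids)            # (batch, src_len, emb_dim)
        outputs, (hidden, cell) = self.lstm(embedded)
        return hidden, cell                           # context passed to the decoder

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_ids, hidden, cell):
        embedded = self.embedding(tgt_ids)            # (batch, tgt_len, emb_dim)
        outputs, _ = self.lstm(embedded, (hidden, cell))
        return self.out(outputs)                      # logits over the target vocabulary

encoder, decoder = Encoder(vocab_size=1000), Decoder(vocab_size=1000)
src = torch.randint(0, 1000, (2, 7))                  # two toy source sentences
tgt = torch.randint(0, 1000, (2, 5))                  # two toy target sentences
hidden, cell = encoder(src)
logits = decoder(tgt, hidden, cell)                   # (2, 5, 1000)
```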

Transformer

The Transformer model offers an effective solution to the context-loss problem encountered in natural language processing applications and in Seq2Seq models built with LSTM and RNN layers. It uses the attention mechanism to overcome this problem.

Attention Mechanism

The attention mechanism is a neural network component that allows a deep learning model to focus on the specific, relevant parts of its input data. This helps the model better understand the input and produce an appropriate output.

Looking at the example in the image above: in a model that translates English into French, the attention map of each word tells the model that it should produce the sentence “Comment se passe ta journée” after the sentence “How was your day” passes through it. Attention blocks check the context of each word against the other words and build the attention map. By consulting the attention map it has produced, the model can generate a more appropriate output without losing context.
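
As a small sketch of how one row of such an attention map can be read (the scores below are made up for illustration), each target word distributes its attention over the source words, and after a softmax the weights sum to 1:

```python
import numpy as np

# Hypothetical raw attention scores for the French word "journée"
# against the English source words; the values are illustrative only.
source_tokens = ["How", "was", "your", "day"]
scores = np.array([0.5, 0.3, 0.9, 4.0])

weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the source words
for token, w in zip(source_tokens, weights):
    print(f"{token:>4}: {w:.2f}")                 # "day" receives most of the attention
```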

So, how does the Transformer do this? What stages does the input data go through? Let’s examine them together.

Encoder Block

The main purpose of the encoder block is to encode the given input data and provide the information needed to produce the desired output. This information covers both the words in the given sentence and their context. To perform this encoding, the encoder block uses word embedding, positional encoding, multi-head attention, normalization, and a series of linear layers. Let’s go through the word embedding, positional encoding, and multi-head attention stages.

Word Embedding

Word Embedding

Word embedding is the conversion of words into vectors that reflect their semantic similarities. In this way, the relationships and meanings of words are better captured and can be used in natural language processing (NLP) applications. After the inputs pass through the word embedding layer, they become meaningful vectors ready for the next step. This process translates each word into a mathematically workable form while preserving its semantic properties, allowing the model to better grasp its meaning.
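
As a brief sketch, assuming PyTorch and a toy vocabulary, an embedding layer is essentially a lookup table that maps each token id to a learnable vector:

```python
import torch
import torch.nn as nn

# A toy vocabulary and an embedding layer that maps each token id to a 4-dimensional vector.
vocab = {"don't": 0, "go": 1, "to": 2, "the": 3, "beach": 4}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)

token_ids = torch.tensor([vocab["go"], vocab["to"], vocab["the"], vocab["beach"]])
vectors = embedding(token_ids)        # shape: (4, 4) — one learnable vector per token
print(vectors.shape)
```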

Positional Encoding

Formula of Positional Encoding

Positional encoding converts the word vectors produced by the word embedding layer into numerical vectors that also carry the order information of the words. In this way, models can better understand the relationships and meanings of words. The formula above is applied according to the position of each word, and the resulting values are added to the vectors coming out of the embedding layer. The input data thus carries position information, which contributes to a better understanding by the model.

Positional Encoding

Data passing through the positional encoding layer becomes ready to be processed in the encoder section of the transformer model.
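
A minimal NumPy sketch of the sinusoidal positional encoding used in the original paper follows; the sequence length and model dimension below are arbitrary toy values:

```python
import numpy as np

# Sinusoidal positional encoding from "Attention Is All You Need":
#   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even indices use sine
    pe[:, 1::2] = np.cos(angles)                            # odd indices use cosine
    return pe

# The encoding is simply added to the word embedding vectors.
embeddings = np.random.randn(10, 512)                       # toy embeddings: 10 tokens, d_model = 512
embeddings_with_position = embeddings + positional_encoding(10, 512)
```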

Multi-Head Attention

Multi-Head Attention Architecture

The vectors coming out of the positional encoding layer enter the Multi-Head Attention layer. The input data is fed through three different channels: query, key, and value, and each channel passes through its own linear layer. The concepts are as follows: in Transformer models, the query represents the information being sought, the key is the context or reference it is compared against, and the value is the content that is retrieved. The attention map is obtained by multiplying the queries and keys together, and these scores, after a softmax, are used to compute a weighted sum of the value vectors. This weighted sum then passes through normalization and another linear layer, completing the attention block. The reason these operations are performed on the same data in the encoder is to better understand the context of the given input. For example, consider the sentence:

I am eating pizza and it is delicious.

Here, the attention mechanism helps the model understand that the words “pizza” and “it” in the sentence refer to the same object.
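
A compact sketch of the computation just described, assuming PyTorch and random toy tensors; in the encoder, query, key, and value are all projections of the same input:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    d_k = query.size(-1)
    # Attention map: similarity of every query with every key, scaled and normalized.
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5      # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the value vectors, guided by the attention map.
    return weights @ value, weights

seq_len, d_model = 8, 64                                      # toy sizes
x = torch.randn(seq_len, d_model)                             # encoder input (after positional encoding)
w_q, w_k, w_v = (torch.nn.Linear(d_model, d_model) for _ in range(3))
output, attention_map = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))
print(attention_map.shape)                                    # (8, 8): each word attends to every word
```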

The features coming from the positional encoding layer pass through more than one attention head (usually 8) in parallel, and the outputs of the heads are concatenated; this is why the structure is called Multi-Head Attention. Several such encoder blocks are then stacked end to end a specified number of times, and the output of the last block becomes the output of the encoder. The encoder output contains the information about our input data, and thanks to it, the decoder will be able to produce the desired output. For example, if we assume a model that translates English to French, the encoder block encodes the incoming English sentence, and thanks to the features in the encoder output, the decoder presents the desired French translation as output. So how does the decoder do this? Let’s examine it.
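
If you want to experiment with multi-head attention without writing the heads by hand, PyTorch ships a ready-made layer; the sizes below are arbitrary toy values:

```python
import torch
import torch.nn as nn

seq_len, batch, d_model, num_heads = 10, 2, 512, 8
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(batch, seq_len, d_model)      # encoder input after positional encoding
# Self-attention: query, key, and value all come from the same sequence.
output, attention_weights = mha(x, x, x)
print(output.shape)                           # (2, 10, 512)
print(attention_weights.shape)                # (2, 10, 10): attention map, averaged over the 8 heads
```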

Decoder

Decoder Block

The decoder block processes the data coming from the encoder block and performs the conversion into the desired output. In a model that translates English to French, the encoder processes the English sentence and sends the encoded data to the decoder. The decoder block then produces the desired French output by processing the encoded English sentence found in the encoder output. Now let’s examine what is done here differently from the encoder block.

Just like the encoder, the decoder has word embedding and positional encoding layers. However, while the entire input sentence is given to the encoder at once, the decoder initially receives only the start-of-sentence token. This token first enters a self-attention block over the decoder’s own input, and its output then enters a second attention block together with the encoder output; this is where the encoded data coming from the encoder is processed. After the processed data passes through normalization, linear, and softmax layers, the predicted token is appended to the decoder input and fed back in as the next input. The new input goes through the same stages, and this cycle repeats until the decoder outputs the end-of-sentence token. The tokens produced between the start-of-sentence token and the end-of-sentence token become the output of the model.
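
The generation loop described above can be sketched roughly as follows; `model` here stands for a hypothetical trained Transformer that returns next-token logits, and the special-token ids are placeholders:

```python
import torch

# Rough sketch of greedy autoregressive decoding; `model` is a hypothetical
# trained Transformer taking the encoder input ids and the tokens generated so far.
BOS, EOS, MAX_LEN = 1, 2, 50                        # placeholder special-token ids

def greedy_decode(model, src_ids):
    generated = [BOS]                               # start with the start-of-sentence token
    for _ in range(MAX_LEN):
        tgt = torch.tensor([generated])
        logits = model(src_ids, tgt)                # (1, len(generated), vocab_size)
        next_token = logits[0, -1].argmax().item()  # most likely next word
        generated.append(next_token)
        if next_token == EOS:                       # stop at the end-of-sentence token
            break
    return [t for t in generated[1:] if t != EOS]   # tokens between BOS and EOS form the output
```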

The logic of Transformer models is roughly like this. These models can be trained on very large datasets, and the input size can be scaled up from sentences to paragraphs. Thanks to this advantage, they have brought brand-new capabilities to natural language processing. The Transformer model forms the basis of the large language models that have become very popular today, such as ChatGPT, Bard, Bing AI, and Claude, and of many chatbots built on them.
