Transformers in NLP: Revolutionizing Language Processing

Aram Abdalgani
8 min read · Feb 5, 2023


Transformers in NLP have improved the state-of-the-art results for various NLP tasks by overcoming limitations of previous NLP models like RNNs and CNNs. The transformer architecture, introduced by Vaswani et al. in 2017, uses multi-head attention to process input sequences, resulting in improved performance in NLP tasks such as machine translation, text classification, and question answering. It has become the standard for NLP models.

In this article, we will explore the architecture and impact of transformers in NLP. We will see how transformers work and how they are changing the way NLP models are built and used.

The transformer architecture consists of an encoder and a decoder. The encoder processes the input sequence and creates a representation that captures the relationships and dependencies between the tokens. The decoder takes the representation from the encoder and generates the final predictions.

Not every transformer model requires both an encoder and a decoder! It depends on the task and the model architecture. For language generation tasks, only a decoder is needed, while for tasks like machine translation or question answering, both an encoder and a decoder are necessary. The encoder processes the input to create a representation capturing dependencies between tokens, which is then used by the decoder to generate predictions.

Let's look at each of these components in turn: first the encoder, then the decoder, and finally the full encoder-decoder model.

First: Implementing the Encoder

The MultiHeadAttention layer implements the attention mechanism in transformers. It calculates the attention scores between the queries, keys, and values to determine the importance of each value in producing the final output. The queries, keys, and values are matrices that are projected from the input embeddings and then split into multiple heads. The attention scores are computed for each head, and then the results are concatenated and projected back to a single vector to obtain the final output. The final output is a weighted sum of the values, where the weights are determined by the similarity between the query and key for each head.
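To make this concrete, here is a minimal sketch of such a multi-head attention layer in PyTorch. The class layout, dimensions, and defaults below are illustrative assumptions rather than the exact implementation behind this article:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Minimal multi-head attention: project q/k/v, split into heads,
    compute scaled dot-product attention, then merge the heads again."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly into heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Linear projections for queries, keys, values, and the final output.
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, q, k, v):
        batch, seq_len, embed_dim = q.shape

        # Project and reshape to (batch, heads, seq_len, head_dim).
        def split_heads(x):
            return x.view(batch, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(q))
        k = split_heads(self.k_proj(k))
        v = split_heads(self.v_proj(v))

        # Attention scores: similarity between each query and each key,
        # scaled by sqrt(head_dim) to keep the softmax well behaved.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = F.softmax(scores, dim=-1)

        # Weighted sum of the values, then merge the heads back together.
        out = (weights @ v).transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
        return self.out_proj(out)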

The TransformerBlock combines a MultiHeadAttention layer with normalization, feedforward layers, and dropout.

After the MultiHeadAttention layer, the output is normalized through Layer Normalization, which is a technique used to stabilize the training of deep neural networks.

The TransformerBlock also includes feedforward layers, which are fully connected neural networks used to learn non-linear relationships in the data. The feedforward layers receive the input and produce their own output, which is then added to the normalized output of the MultiHeadAttention layer.

Finally, dropout is used to prevent overfitting in the network by randomly dropping out some neurons during training. The output of the TransformerBlock is the normalized sum of the attention sub-layer output and the feedforward sub-layer output, with dropout applied to each sub-layer during training.
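A minimal sketch of such a block, reusing the MultiHeadAttention class from the previous sketch (names and defaults are again assumptions for illustration):

import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: self-attention, residual + layer norm,
    feedforward network, residual + layer norm, with dropout.
    Relies on the MultiHeadAttention class sketched above."""

    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(embed_dim, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Self-attention sub-layer with residual connection and normalization.
        attn_out = self.dropout(self.attn(x, x, x))
        x = self.norm1(x + attn_out)
        # Feedforward sub-layer with residual connection and normalization.
        ffn_out = self.dropout(self.ffn(x))
        return self.norm2(x + ffn_out)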

The PositionalEncoding layer adds information about the position of each word in the sequence to the input. This allows the transformer to take into account the order of the words in the input sequence.
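The article does not specify whether this encoding is learned or fixed, so the sketch below assumes the standard sinusoidal encoding from Vaswani et al. (2017):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Adds a fixed sinusoidal position signal to the token embeddings
    so the model can make use of word order."""

    def __init__(self, embed_dim: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)  # (max_len, 1)
        div_term = torch.exp(torch.arange(0, embed_dim, 2) * (-math.log(10000.0) / embed_dim))
        pe = torch.zeros(max_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))    # (1, max_len, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim); add the encoding for each position.
        return x + self.pe[:, : x.size(1)]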

The Encoder is the overall encoder network, which includes an Embedding layer that converts the input sequence into dense vectors, a PositionalEncoding layer, and several TransformerBlock layers. The output of the Encoder is then fed into a classifier to predict the target classes.
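Putting the pieces together, a minimal sketch of such an encoder for classification could look like the following. It reuses the PositionalEncoding and TransformerBlock sketches above; mean-pooling before the classifier is an assumption made for illustration:

import torch.nn as nn

class Encoder(nn.Module):
    """Embedding -> positional encoding -> stack of TransformerBlocks,
    followed by a small classification head over the pooled sequence."""

    def __init__(self, vocab_size: int, embed_dim: int, num_heads: int,
                 ff_dim: int, num_layers: int, num_classes: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos_enc = PositionalEncoding(embed_dim)
        self.blocks = nn.ModuleList(
            [TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        x = self.pos_enc(self.embed(token_ids))   # (batch, seq_len, embed_dim)
        for block in self.blocks:
            x = block(x)
        return self.classifier(x.mean(dim=1))     # pool over the sequence, then classify

For example, Encoder(vocab_size=20000, embed_dim=128, num_heads=4, ff_dim=256, num_layers=2, num_classes=2) would give a small two-class text classifier.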

Second: Implementing the Decoder

CausalSelfAttention: This class implements the self-attention mechanism of the transformer. It takes three inputs: q, k, and v, which represent the query, key, and value vectors, respectively. These vectors are first passed through linear layers (a multiplication by the layer's weights plus a bias term) that project them into a common space so they can be compared; this projection lets the model learn a representation of the input that is useful for the task at hand. The dot product of the query and key vectors is then taken and scaled by the square root of the dimension of the key vectors to obtain the attention scores, which represent the similarity between the query and each position in the sequence of key vectors.

The dot product of two vectors represents the similarity between them. In the context of the query and sequence of key vectors, it means that the dot product is being used to calculate the similarity between the query vector and each individual vector in the sequence of key vectors. The resulting value of the dot product will be a scalar, and the higher the value, the more similar the two vectors are to each other. This information can be used to identify which positions in the sequence of key vectors are most similar to the query vector. A mask is applied to these scores to make sure that the attention is only applied to positions that come before the current position in the sequence, which is called causal masking.

Causal masking is a technique used in attention mechanisms to ensure that attention is only applied to positions that come before the current position in the sequence. This is done by applying a mask to the attention scores, which sets the scores for positions after the current position to a large negative value, so that their weights become effectively zero after the softmax. This is important in situations where the order of the input sequence matters, such as in language modeling or machine translation. Finally, the attention scores are passed through a softmax function to obtain the attention weights, which are then used to weight the value vectors and produce the final output.
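Here is a minimal sketch of such a causal self-attention layer in PyTorch; the class layout, the maximum sequence length, and other details are assumptions for illustration:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Self-attention with a causal (lower-triangular) mask so that each
    position can only attend to itself and to earlier positions."""

    def __init__(self, embed_dim: int, num_heads: int, max_len: int = 512):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # Lower-triangular mask: entry (i, j) is True only when j <= i.
        self.register_buffer("causal_mask",
                             torch.tril(torch.ones(max_len, max_len, dtype=torch.bool)))

    def forward(self, q, k, v):
        batch, seq_len, embed_dim = q.shape
        split = lambda x: x.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))

        # Scaled dot-product similarity between queries and keys.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Future positions get -inf, so the softmax gives them weight 0.
        scores = scores.masked_fill(~self.causal_mask[:seq_len, :seq_len], float("-inf"))
        weights = F.softmax(scores, dim=-1)

        out = (weights @ v).transpose(1, 2).contiguous().view(batch, seq_len, embed_dim)
        return self.out_proj(out)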

The Transformer model applies a multi-step process to the input before passing it through the self-attention mechanism. First, it applies layer normalization to the input. This is a technique used to normalize the inputs to a neural network, in order to improve the stability and performance of the model.

Next, the input is passed through the self-attention mechanism, which is used to weigh the importance of different parts of the input for a given task. The self-attention mechanism calculates the dot product of the query and key vectors, scaled by the square root of the dimension of the key vectors, to obtain the attention scores, and the causal mask described above is applied so that each position only attends to positions before it in the sequence.

After that, the model applies another layer normalization to the output of the self-attention mechanism. Then, it passes the output through a feed-forward neural network. This network applies a linear transformation to the input, followed by a non-linear activation function, in order to extract higher-level features from the input. Finally, the model applies dropout, which is a regularization technique that randomly sets a certain percentage of the input units to zero during training, in order to reduce overfitting.
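A minimal sketch of one such block, following the ordering described above (layer norm, causal self-attention, layer norm, feed-forward network, dropout) and reusing the CausalSelfAttention sketch; the class name, activation, and defaults are assumptions:

import torch.nn as nn

class DecoderTransformerBlock(nn.Module):
    """One decoder block: layer norm, causal self-attention, layer norm,
    feed-forward network, and dropout, with residual connections."""

    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Normalize, attend causally, then add the result back (residual connection).
        h = self.norm1(x)
        x = x + self.dropout(self.attn(h, h, h))
        # Normalize again, run the feed-forward network, and add back with dropout.
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x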

PositionalEncoding: This class implements the positional encoding that is added to the input vectors to indicate their position in the sequence. It is a learned representation of the position of the words in the input.
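Since this version is described as learned (unlike the fixed sinusoidal encoding sketched in the encoder section), a minimal version could simply use an embedding table over positions; the class name and maximum length here are assumptions:

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    """Learned position embeddings: one trainable vector per position,
    added to the token embeddings."""

    def __init__(self, embed_dim: int, max_len: int = 512):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_embed(positions)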

Decoder: This class takes the output of the last transformer block and uses it to generate the final output of the model. It includes a linear layer to project the output to the vocabulary space, and a final softmax function to obtain the probability distribution over the vocabulary for each position in the input sequence.
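A minimal sketch of that output head (the class layout is an assumption):

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Output head of the decoder: project the final hidden states to the
    vocabulary and turn them into a probability distribution per position."""

    def __init__(self, embed_dim: int, vocab_size: int):
        super().__init__()
        self.to_vocab = nn.Linear(embed_dim, vocab_size)

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, embed_dim) from the last transformer block.
        logits = self.to_vocab(hidden_states)
        return torch.softmax(logits, dim=-1)   # probabilities over the vocabulary

In practice you would usually keep the raw logits and let the loss function (e.g. cross-entropy) apply the softmax, but the explicit softmax matches the description above.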

Transformer: This is the main class of the transformer model, which ties all the other classes together. It includes an encoder and a decoder, takes an input sequence, and outputs the probability of each token in the output sequence.
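As a rough illustration of how the pieces from this section might be tied together for language modeling, here is a sketch in which the embedding and positional-encoding layers play the encoding role and the Decoder head produces the per-token probabilities; the composition and hyperparameters are assumptions:

import torch.nn as nn

class Transformer(nn.Module):
    """Ties the pieces together: token embedding, learned positional encoding,
    a stack of decoder blocks, and the Decoder output head sketched above."""

    def __init__(self, vocab_size: int, embed_dim: int, num_heads: int,
                 ff_dim: int, num_layers: int, max_len: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.pos = LearnedPositionalEncoding(embed_dim, max_len)
        self.blocks = nn.ModuleList(
            [DecoderTransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_layers)]
        )
        self.head = Decoder(embed_dim, vocab_size)

    def forward(self, token_ids):
        x = self.pos(self.embed(token_ids))
        for block in self.blocks:
            x = block(x)
        return self.head(x)   # (batch, seq_len, vocab_size) probabilities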

Third: Implementing the Encoder-Decoder

The MultiHeadAttention class is a multi-head self-attention mechanism that takes in 3 inputs: q, k and v, and an optional pad mask. It performs attention on the q and k matrices, and applies the attention weights to the v matrix to obtain the output. The class also has a causal option, which when enabled, applies a triangular mask to the attention scores to prevent the decoder from attending to future tokens.
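A minimal sketch of this more general attention layer; it extends the simpler version from the encoder section with an optional padding mask and a causal flag, and the shapes and defaults are assumptions:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Multi-head attention with an optional padding mask and an optional
    causal flag that blocks attention to future tokens."""

    def __init__(self, embed_dim: int, num_heads: int, causal: bool = False):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.causal = causal
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, q, k, v, pad_mask=None):
        batch, q_len, embed_dim = q.shape
        k_len = k.size(1)

        def split(x):
            return x.view(batch, -1, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)  # (batch, heads, q_len, k_len)

        if pad_mask is not None:
            # pad_mask: (batch, k_len), True where the key token is padding.
            scores = scores.masked_fill(pad_mask[:, None, None, :], float("-inf"))
        if self.causal:
            # Triangular mask: block attention to future positions.
            future = torch.triu(torch.ones(q_len, k_len, dtype=torch.bool, device=q.device), diagonal=1)
            scores = scores.masked_fill(future, float("-inf"))

        weights = F.softmax(scores, dim=-1)
        out = (weights @ v).transpose(1, 2).contiguous().view(batch, q_len, embed_dim)
        return self.out_proj(out)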

The EncoderBlock class is one block of the encoder; it contains layer normalization, a MultiHeadAttention layer, and a feed-forward neural network with dropout (both block types are sketched after the next paragraph).

The DecoderBlock class is one block of the decoder; it contains three layer normalizations, two MultiHeadAttention layers (one of which is causal), and a feed-forward neural network with dropout.
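Here is a minimal sketch of both block types, reusing the MultiHeadAttention sketch above and following the residual-then-LayerNorm pattern described in the next paragraph; names and defaults are assumptions:

import torch.nn as nn

class EncoderBlock(nn.Module):
    """Encoder block: self-attention, add & LayerNorm, feed-forward network,
    add & LayerNorm, with dropout on each sub-layer."""

    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.attn = MultiHeadAttention(embed_dim, num_heads)            # non-causal self-attention
        self.ffn = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, pad_mask=None):
        # Residual connection around attention, then LayerNorm.
        x = self.norm1(x + self.dropout(self.attn(x, x, x, pad_mask=pad_mask)))
        # Residual connection around the feed-forward network, then LayerNorm.
        return self.norm2(x + self.dropout(self.ffn(x)))


class DecoderBlock(nn.Module):
    """Decoder block: causal self-attention over the target, cross-attention
    over the encoder output, and a feed-forward network, each followed by a
    residual connection and one of the three LayerNorms."""

    def __init__(self, embed_dim: int, num_heads: int, ff_dim: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(embed_dim, num_heads, causal=True)
        self.cross_attn = MultiHeadAttention(embed_dim, num_heads)      # attends to encoder output
        self.ffn = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.ReLU(),
                                 nn.Linear(ff_dim, embed_dim))
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.norm3 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, src_pad_mask=None):
        # Causal self-attention over the tokens generated so far.
        x = self.norm1(x + self.dropout(self.self_attn(x, x, x)))
        # Cross-attention: queries from the decoder, keys/values from the encoder.
        x = self.norm2(x + self.dropout(self.cross_attn(x, enc_out, enc_out, pad_mask=src_pad_mask)))
        # Position-wise feed-forward network.
        return self.norm3(x + self.dropout(self.ffn(x)))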

In both the Encoder and the Decoder, MultiHeadAttention is applied to the input x, its output is added back to the input (a residual connection), and the result is passed through a LayerNorm before being fed into the feed-forward neural network.

Impact:

Transformers have had a major impact on NLP, leading to state-of-the-art results in many NLP tasks. They have improved the quality of machine translation, allowing for more accurate and fluent translations. In text classification, transformers have led to improved accuracy and faster training times. And in question answering, transformers have made it possible to answer complex questions with high accuracy.

Conclusion:

In conclusion, transformers have revolutionized NLP, leading to major improvements in many NLP tasks. The transformer architecture has become the standard for NLP models, and it is likely that transformers will continue to play a key role in their development. As NLP models continue to advance, we can expect even more impressive results and new applications for transformers in the future.

Contributors : Aram Abdalgani and Zaid Sallam

https://www.linkedin.com/in/zaid-sallam

https://www.linkedin.com/in/aram-abdalgani-25b2a4244
