Neural Machine Translation: Demystifying Transformer Architecture in 6 min!

Santhosh Kumar R
Published in The Startup · 6 min read · Jul 31, 2020

What came before: the pre-Transformer era

Recurrent neural network (RNN)

A basic Seq2seq model consists of an encoder and a decoder. The encoder takes an input sentence with T tokens and encodes it one word at a time, outputting a hidden state at every step that stores the sentence context up to that point and is passed on for encoding the next word. So the final hidden state (E[T]) at the end of the sentence stores the context of the entire sentence.

This final hidden state becomes the input for a decoder that produces the translated sentence word by word. At each step, the decoder outputs a word and a hidden state (D[t]) which will be used for generating the next word.

RNN workflow: English to Korean
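
As a rough picture of what the encoder loop does, here is a minimal NumPy sketch of a vanilla RNN encoder; the embeddings and the weight matrices Wx, Wh, b are hypothetical random stand-ins, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_emb, d_hid = 5, 8, 16            # tokens in the sentence, embedding size, hidden size
x = rng.normal(size=(T, d_emb))       # embedded input sentence
Wx = rng.normal(size=(d_emb, d_hid))
Wh = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

h = np.zeros(d_hid)                   # initial hidden state
encoder_states = []
for t in range(T):                    # step t has to wait for step t-1
    h = np.tanh(x[t] @ Wx + h @ Wh + b)   # E[t]: context of the sentence up to word t
    encoder_states.append(h)

context = encoder_states[-1]          # final hidden state E[T]: input for the decoder
```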

But RNNs suffer from the problem of vanishing gradients, making them ineffective at learning the context of long sequences.

RNN-based translation with Attention

An RNN model with Attention differs in the following ways:

  • Instead of only the last hidden state, all the hidden states (E[0], E[1]…, E[T]) from every step, along with the final context vector (E[T]), are passed into the decoder. The idea here is that each hidden state is mostly associated with a certain word in the input sentence, so using all the hidden states gives a better translation.
  • At every time step in the decoding phase, a score is computed for every hidden state (E[t]) that measures how relevant that particular hidden state is for predicting the word at the current step (t). In this way, more importance is given to the hidden states that are relevant for predicting the current word.

ex: when predicting the 5th word, more importance should be given to the 4th, 5th, or 6th input hidden states (depending on the structure of the language being translated); the sketch after the figure below illustrates this scoring for one decoder step.

RNN with Attention
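
Here is a minimal NumPy sketch of this scoring for a single decoder step, assuming simple dot-product scoring (real models often use a small learned scoring network); all tensors are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_hid = 5, 16
E = rng.normal(size=(T, d_hid))        # encoder hidden states E[0] ... E[T]
d_prev = rng.normal(size=d_hid)        # decoder hidden state from the previous step

scores = E @ d_prev                    # relevance of each encoder state for the current step
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax: more weight on the relevant positions
context = weights @ E                  # weighted sum of encoder states used to predict the next word
```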

This method is a significant improvement over the traditional RNN. But RNNs lack parallelization capability (an RNN has to wait for the completion of the first t-1 steps before processing the t-th step), which makes them computationally inefficient, especially when dealing with a huge corpus of text.

Since the RNN's sequential nature does not allow for parallelization, is it possible to drop the RNN and switch to a more advanced architecture? The answer is the Transformer.

Transformer Theory

Transformer architecture

The architecture looks complicated, but do not worry, because it is not; it is just different from the previous ones. It can be parallelized, unlike RNNs with or without Attention, as it doesn't wait until all previous words are processed or encoded in the context vector.

Positional Encoding

The Transformer architecture does not process data sequentially, so this layer is used to incorporate the relative position information of the words in the sentence. Each position has a unique positional vector, which is predetermined rather than learned.

A real example of positional encoding for 20 words (rows) with an embedding size of 512 (columns). You can see that it appears to split in half down the center. That’s because the values of the left half are generated by one function (which uses sine), and the right half is generated by another function (which uses cosine). They’re then concatenated to form each of the positional encoding vectors.

If you observe, the color scheme is different for different positions and is itself a function of the position. Hence these vectors can be used to represent word positions.
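
Here is a minimal NumPy sketch of the sinusoidal encoding from the paper; it interleaves sine (even indices) and cosine (odd indices), whereas the figure above shows the sine and cosine halves side by side, but the idea is the same: one fixed, predetermined vector per position.

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]            # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                         # cosine on odd dimensions
    return pe                                            # one fixed vector per position

pe = positional_encoding(max_len=20, d_model=512)        # the 20 x 512 example from the figure
```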

Attention Unit

In the Transformer, there is no such concept as a hidden state. The Transformer instead uses something called Self-attention, which captures the association of a certain word with the other words in the input sentence.

To explain in simple words: in the figure above, the word ‘it’ is associated with the words ‘The’ and ‘animal’ more than with the other words, because the model has learned that ‘it’ refers to ‘animal’ in this context. It is easy for us to tell this because of our linguistic understanding (we have been trained for a long time). But the Transformer has to learn to put more focus on the word ‘animal’, and Self-attention does exactly that.

OK, how does it do that?

This is achieved by three vectors, Query, Key, and Value, which are obtained by multiplying the input word embedding with the weight matrices Wq, Wk, and Wv (learned during training).

computation of q, k, and v vectors

Then, using the q, k, and v matrices, attention scores are computed for each word with respect to the other words using the formula:
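
For reference, the scaled dot-product attention from the paper is softmax(Q·Kᵀ / √d_k)·V, where d_k is the dimension of the key vectors; dividing by √d_k keeps the dot products from growing too large before the softmax.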

After the computation, the attention scores would look like this:

Here the Score column is the result of the dot product of the query and key vectors. So another way of interpreting this is that a query word is looking for similar words (not strictly, though, since query and key are not the same). Therefore, words with high scores have a high association. Softmax pushes the score values between 0 and 1 (think of them as weights). So the final column is the result of the weighted average of the value vectors. We can see how attention screens out non-relevant inputs when encoding each word.

If you notice, the computations are independent of each other, hence they can be parallelized.
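
Putting the pieces together, here is a minimal single-head self-attention sketch in NumPy; Wq, Wk, and Wv are random stand-ins for the learned weights, and every word's output is computed in one matrix product, with no dependence on previous steps.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, d_k = 4, 512, 64                 # words in the sentence, embedding size, head size
X = rng.normal(size=(T, d_model))            # input embeddings (plus positional encoding)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values for all words at once
scores = Q @ K.T / np.sqrt(d_k)              # (T, T): association of each word with every other word
weights = softmax(scores, axis=-1)           # each row sums to 1
Z = weights @ V                              # attention output for all words, computed in parallel
```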

Till now we have seen single-head attention. We can also use multiple sets of q, k, and v for each word to compute individual attention scores, providing greater flexibility in understanding context; finally, the resulting matrices are concatenated as shown below. This is called Multi-head attention.

Multi-head attention unit with 3 sets

The attention scores (vectors) are fed into a feed-forward network with a weight matrix (Wo) that brings the attention output dimension back to the input embedding dimension.
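
A rough multi-head sketch with 3 heads, as in the figure above: each head has its own Wq, Wk, Wv, the head outputs are concatenated, and Wo projects the result back to the input embedding dimension (all matrices here are random stand-ins).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, d_model, n_heads, d_k = 4, 512, 3, 64
X = rng.normal(size=(T, d_model))

heads = []
for _ in range(n_heads):                         # one q/k/v set per head
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

Wo = rng.normal(size=(n_heads * d_k, d_model))   # output projection
out = np.concatenate(heads, axis=-1) @ Wo        # back to the input embedding dimension
```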

Note: as we pass the input embedding through many layers, the positional information may decay. To make up for this, we add the input matrix to the attention output (add & norm) to retain position-related information.
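
A minimal sketch of the add & norm step: a residual connection followed by layer normalisation (shown here without the learned gain and bias).

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    y = x + sublayer_out                          # residual: the input (with its positional info) is added back
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)               # normalise each position's vector
```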

The final output from an encoder unit has the same dimension as its input. These units can therefore be stacked in series, which makes the model more robust. Finally, the output from the last encoder unit becomes the input for the decoder.
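
Because the output dimension matches the input dimension, chaining encoder units is just function composition, as in this sketch (encoder_unit is a hypothetical stand-in for multi-head attention + add & norm + feed-forward).

```python
def run_encoder_stack(x, encoder_units):
    # Each unit keeps the (T, d_model) shape, so its output feeds the next unit.
    for encoder_unit in encoder_units:
        x = encoder_unit(x)
    return x                                      # output of the last unit: input for the decoder
```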

Decoder

Similarly, multiple decoders can be stacked in series, where the first decoder uses the output of the final encoder. The last decoder outputs predictions for the words to be translated.

The encoder component needs to look at each word in the input sentence to understand the context, but while decoding, say when predicting the i-th word, the decoder should only be allowed to look at the previous i-1 words. So, the inputs to the decoder are passed through Masked Multi-head Attention, which prevents future words from being part of the attention.
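
A minimal sketch of the causal mask used for this: positions after the current one get a score of minus infinity before the softmax, so those future words receive zero attention weight.

```python
import numpy as np

T = 5                                             # words in the target sequence
rows, cols = np.arange(T)[:, None], np.arange(T)[None, :]
mask = np.where(cols > rows, -np.inf, 0.0)        # (T, T): 0 for allowed positions, -inf for future ones
# masked_scores = Q @ K.T / np.sqrt(d_k) + mask   # then softmax as in the encoder
```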

The decoder has to rely on the encoder output to understand the context of the complete sentence. This is achieved by allowing the decoder to query the encoded embeddings (keys and values), which store both positional and contextual information. Since the query changes at every step, so does the attention, telling the decoder which input words to focus on to predict the current word.
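
A minimal sketch of this encoder-decoder attention: the queries come from the decoder side, while the keys and values come from the encoder output (all tensors here are random stand-ins).

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, T_tgt, d_model, d_k = 6, 4, 512, 64
enc_out = rng.normal(size=(T_src, d_model))       # output of the last encoder unit
dec_x = rng.normal(size=(T_tgt, d_model))         # decoder states after masked self-attention
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = dec_x @ Wq                                    # queries change at every decoding step
K, V = enc_out @ Wk, enc_out @ Wv                 # keys and values come from the encoder
scores = Q @ K.T / np.sqrt(d_k)                   # (T_tgt, T_src): which input words to focus on
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the input positions
context = weights @ V                             # context used to predict the current word
```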

The outputs are passed through fully connected layers just like in the encoder. Finally, a Linear layer expands the output dimension to the vocabulary size, and softmax converts the values to probabilities (between 0 and 1). The word corresponding to the index of the maximum probability will be the output.
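
A minimal sketch of this final step; the vocabulary size and the projection matrix are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 8000
decoder_out = rng.normal(size=d_model)            # decoder output for the current position
W_vocab = rng.normal(size=(d_model, vocab_size))  # Linear layer expanding to vocabulary size

logits = decoder_out @ W_vocab                    # one score per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: probabilities between 0 and 1
predicted_id = int(np.argmax(probs))              # index of the most probable word
```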

I hope this short article helped you in understanding Transformer architecture. Thank you for reading.

Note: the images here are screenshots taken from a YouTube video titled Transformer (Attention is all you need) by Minsuk Heo 허민석, and from The Illustrated Transformer by Jay Alammar.

References:

  1. Attention Is All You Need (paper, arXiv)
  2. Transformer (Attention is all you need) by Minsuk Heo 허민석 (YouTube)
  3. The Annotated Transformer (harvardnlp)
