Implementing Transformer Paper (Google T5 Transformer from Scratch and using it to create a Chatbot)

Arnav Gupta · Published in Analytics Vidhya · Aug 20, 2021 · 8 min read

This article comprehensively discusses how to implement Google's T5 Transformer from scratch using TensorFlow.

To directly use it on the fly, refer to my article:
3 Steps to have a running chatbot using T5 — a three-step tutorial on how to use my library to create a contextual chatbot (no deep learning required because I did it for you) and deploy it on Reddit/Telegram/mobile applications.

Brief Intro about T5

From T5 Transformers paper

T5 Transformer’s Paper Link

What is a Transformer?

A Transformer is a language model built from position-aware feed-forward neural networks combined with attention.

The exact inner workings of attention are irrelevant as long as you understand that attention computes similarity between vectors.

They “read” the whole sentence at once. Each word gets represented given its own position and all the other words in the sentence and their positions.

Note that the default implementation assumes a maximum sequence length (unlike RNNs). For infinite/very long sequences, a different architecture (Transformer-XL) is needed.

Overall, the Transformer also knows the order of the words: order is encoded in a separate positional vector. The positional vector is merged with the vector that represents the similarities between each word, and the result is passed to a feed-forward layer in the encoder.

In this article I won't go into much detail about its inner workings and nuances; you can follow http://www.peterbloem.nl/blog/transformers, easily the best explanation I've found.

What does T5 Transformer do?

Recent years have seen a plethora of pre-trained models such as ULMFiT, BERT, and GPT being open-sourced to the NLP community.
One of the latest and state-of-the-art ones is T5, the Text-to-Text Transfer Transformer, which was open-sourced in late 2019.

The T5 transformer can fit multiple task types because it reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings.

The model is pre-trained on a roughly 700 GB cleaned version of the Common Crawl dataset (C4).

Like BERT, T5 is also trained as a masked language model.

But the key difference between BERT and T5 is:
— BERT replaces a single token with a single mask token
— T5 replaces a contiguous span of multiple tokens with a single mask (sentinel) token

So in the case of T5 we expect a sequence of tokens as output rather than a single token.
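For example, a span-corruption training pair (adapted from the illustration in the T5 paper) looks roughly like this:

Original text: Thank you for inviting me to your party last week.
Model input:  Thank you <X> me to your party <Y> week.
Model target: <X> for inviting <Y> last <Z>

Each sentinel (<X>, <Y>, …) stands in for a dropped-out span, and the model has to reproduce the spans in order.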

The T5 transformer can be fine-tuned for any of the tasks like chatbots, translation, text summarisation, sentence similarity, etc.

Transformer NMT Chatbot

Attention Layer

In attention, we basically take two word embeddings (x and y), pass one through a Query transformation matrix (Q) and the second through a Key transformation matrix (K), and compare how similar the resulting query and key vectors are by their dot product.

We start by creating a Multi-Head Attention (MHA) layer.

https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf

Creating MultiHeadAttention class using Tensorflow Keras

The V, K and Q here denote Value, Key and Query.

Q is from the target sequence and K,V are from the source sequence

All three parameters are similar in structure, with each word in the sequence represented by a vector.

As per the image, we take in the V, K and Q input tokens and project each into a d_model-sized vector (512 in our case).

We split q, k and v into num_heads heads, each of depth d_model // num_heads.

The split function splits the last dimension into (num_heads, depth) and transposes the result such that the shape is (batch_size, num_heads, seq_len, depth), as sketched below.
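A minimal sketch of that split, written as a standalone function (following the standard TensorFlow implementation; inside the layer, num_heads and depth = d_model // num_heads are attributes):

import tensorflow as tf

def split_heads(x, batch_size, num_heads, depth):
    # Split the last dimension (d_model) into (num_heads, depth)...
    x = tf.reshape(x, (batch_size, -1, num_heads, depth))
    # ...and transpose so the shape is (batch_size, num_heads, seq_len, depth).
    return tf.transpose(x, perm=[0, 2, 1, 3])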

Scaled Dot Product Attention

Here we calculate the attention weights

q, k, v must have matching leading dimensions.

k, v must have matching penultimate dimensions, i.e.: seq_len_k = seq_len_v. The mask has different shapes depending on its type (padding or look-ahead), but it must be broadcastable for addition.

We get this by applying the scaled dot-product attention formula:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

dₖ in the formula is the depth of the key vectors, i.e. the last dimension of k.
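A minimal sketch of the function, along the lines of the standard TensorFlow implementation (the code embedded in this article may differ slightly in details):

def scaled_dot_product_attention(q, k, v, mask=None):
    # q: (..., seq_len_q, depth), k: (..., seq_len_k, depth), v: (..., seq_len_v, depth_v)
    matmul_qk = tf.matmul(q, k, transpose_b=True)          # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)              # depth of the keys
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)           # masked positions get ~0 weight after softmax
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
    output = tf.matmul(attention_weights, v)               # (..., seq_len_q, depth_v)
    return output, attention_weights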

So, continuing the Multi-Head Attention's build, we call scaled_dot_product_attention on q, k and v to get scaled_attention and attention_weights.

We further transpose and reshape it to concatenate the heads into a single attention vector.

And pass it through a linear layer which gives us the attention output vector representation of the sequence.

Final implementation of MHA
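For reference, a minimal sketch of the whole layer (assumes import tensorflow as tf and the scaled_dot_product_attention function above):

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        # scaled_dot_product_attention as defined in the previous sketch
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)               # (batch, seq_len_q, d_model)
        return output, attention_weights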

Defining the Encoder Layer

Now we build an Encoder Layer for T5 Transformer.

For the Encoder Layer,

We take in the encoded representation of our text and pass it through MHA to get the attention output as devised above.

Then we apply dropout, normalize the sum of the attention output and the input vector, pass the result through a feed-forward network (two sequential linear layers), and repeat the dropout-and-normalization step over that.
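A minimal sketch of the encoder layer just described (assumes the MultiHeadAttention layer above):

def point_wise_feed_forward_network(d_model, dff):
    # Two dense layers: expansion to dff with ReLU, then projection back to d_model.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),
        tf.keras.layers.Dense(d_model),
    ])

class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)                   # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)                    # residual + norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)                  # residual + norm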

Positional Encoding

So this thing always confused me.

The puzzle with positional embeddings is that it is preferable to add them to the word embeddings instead of concatenating them. We already know the dimensions of the word embeddings are related to semantics. So why embed position into the semantic space instead of adding additional dimensions to represent position?

Suppose we have a list of words and we select any two, x and y. Let the positions of x and y be e and f respectively.

Questions:
1) How much attention should we pay to word x given word y?

2) How much attention should we pay to word x given the position f of word y?

3) How much attention should we pay to y given the position e of word x?

4) How much attention should we pay to the position e of word x given the position f of word y?

After reading about it on forums, this is what I gathered.

The learned transformation matrix Q’K with positional encodings has to do all four of these tasks simultaneously. This is the part that may appear inefficient, since intuitively, there should be a trade-off in the ability of Q’K to do four tasks simultaneously and well.

Concatenation would ensure that the positional dimensions are orthogonal to the word dimensions, but my guess is that, because these embedding spaces are so high dimensional, you can get approximate orthogonality for free even when adding, without the costs of concatenation (many more parameters to learn). Adding layers would only help with this, by allowing for nonlinearities.

We also ultimately want e and f to behave in some nice ways, so that there’s some kind of “closeness” in the vector representation with respect to small changes in positions. The sin and cos representation is nice since nearby positions have high similarity in their positional encodings, which may make it easier to learn transformations that “preserve” this desired closeness.

TLDR:

It is intuitively possible that, in high dimensions, the word vectors form a smaller dimensional subspace within the full embedding space, and the positional vectors form a different smaller dimensional subspace approximately orthogonal to the one spanned by word vectors. Thus despite vector addition, the two subspaces can be manipulated essentially independently of each other by some single learned transformation. Thus, concatenation doesn’t add much, but greatly increases cost in terms of parameters to learn.
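A minimal sketch of the sin/cos positional encoding discussed above:

import numpy as np
import tensorflow as tf

def positional_encoding(position, d_model):
    # Sinusoidal positional encoding: sin on even indices, cos on odd indices.
    angles = np.arange(position)[:, np.newaxis] / np.power(
        10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))
    angles[:, 0::2] = np.sin(angles[:, 0::2])
    angles[:, 1::2] = np.cos(angles[:, 1::2])
    return tf.cast(angles[np.newaxis, ...], dtype=tf.float32)   # (1, position, d_model)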

Encoder

Now using the Positional Embeddings and Token Embeddings, we pass it through multiple Encoder Layers.

We finally get our encoded representation of the sequence.
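A minimal sketch of the encoder stack (assumes the EncoderLayer and positional_encoding defined above):

class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, rate=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]
        x = self.embedding(x)                                   # token embeddings
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))    # scale embeddings
        x += self.pos_encoding[:, :seq_len, :]                  # add positional encoding
        x = self.dropout(x, training=training)
        for layer in self.enc_layers:
            x = layer(x, training, mask)
        return x                                                # (batch, seq_len, d_model)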

Decoder Layer

For the decoder layer,

Similar to the Encoder, we take in the target sequence and feed it to the Output Embedding and Positional Encoding, which produces an encoded representation.

Changes here are:

We need a look-ahead mask because, while generating target sequences at the decoder, the Transformer's self-attention would otherwise attend to all the words in the decoder input, including future ones.

But, practically, this is incorrect: only the words preceding the current word may contribute to the generation of the next word. The look-ahead mask applied inside MHA ensures this, as sketched below.
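A minimal sketch of the two mask helpers (padding mask and look-ahead mask):

def create_padding_mask(seq):
    # 1 where the token id is 0 (padding); broadcastable to the attention logits.
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]          # (batch, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Upper-triangular matrix of ones: position i may not attend to positions > i.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)   # (size, size)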

In the decoder layer we have 2 MHA layers.

The first takes the target vector with the look-ahead mask (masked self-attention), and the second takes the normalised output of the first together with the encoded output vector from the encoder (encoder-decoder attention).
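A minimal sketch of the decoder layer with its two MHA blocks (assumes the MultiHeadAttention and point_wise_feed_forward_network defined above):

class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super().__init__()
        self.mha1 = MultiHeadAttention(d_model, num_heads)   # masked self-attention
        self.mha2 = MultiHeadAttention(d_model, num_heads)   # encoder-decoder attention
        self.ffn = point_wise_feed_forward_network(d_model, dff)
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        attn1, w1 = self.mha1(x, x, x, look_ahead_mask)      # self-attention on target
        out1 = self.layernorm1(self.dropout1(attn1, training=training) + x)
        attn2, w2 = self.mha2(enc_output, enc_output, out1, padding_mask)  # attend to encoder
        out2 = self.layernorm2(self.dropout2(attn2, training=training) + out1)
        ffn_output = self.ffn(out2)
        out3 = self.layernorm3(self.dropout3(ffn_output, training=training) + out2)
        return out3, w1, w2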

Decoder

Now using the Positional Embeddings and Token Embeddings of Target, we pass it through multiple Decoder Layers along with Look Ahead Mask.
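A minimal sketch of the decoder stack (assumes the DecoderLayer and positional_encoding above):

class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 target_vocab_size, maximum_position_encoding, rate=0.1):
        super().__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}
        x = self.embedding(x) * tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)
        for i, layer in enumerate(self.dec_layers):
            x, w1, w2 = layer(x, enc_output, training, look_ahead_mask, padding_mask)
            attention_weights[f'decoder_layer{i+1}_block1'] = w1
            attention_weights[f'decoder_layer{i+1}_block2'] = w2
        return x, attention_weights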

Transformer

Now we define our Transformer, which basically stacks up our Encoder and Decoder.

Over the top we add a Dense layer which outputs the sequence needed.
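A minimal sketch of the full model (assumes the Encoder and Decoder above):

class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size, pe_input, pe_target, rate=0.1):
        super().__init__()
        self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                               input_vocab_size, pe_input, rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                               target_vocab_size, pe_target, rate)
        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)
        final_output = self.final_layer(dec_output)   # (batch, tar_seq_len, target_vocab_size)
        return final_output, attention_weights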

Load/Create Tokenizer:

For tokenizers, we load the TensorFlow subword encoder (SubwordTextEncoder).

The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.

We build our input and output tokenizers from the training data we give in.
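Roughly like this (the questions/answers lists here are hypothetical placeholders for the real training pairs, and the module path of SubwordTextEncoder depends on your tensorflow_datasets version):

import tensorflow_datasets as tfds

# Hypothetical question/answer pairs standing in for the chatbot training data.
questions = ["how are you?", "what is your name?"]
answers = ["i am fine.", "i am a bot."]

# Build subword tokenizers from the corpus
# (in recent tfds releases this lives under tfds.deprecated.text).
tokenizer_q = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    questions, target_vocab_size=2**13)
tokenizer_a = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    answers, target_vocab_size=2**13)

print(tokenizer_q.encode("how are you?"))   # list of subword ids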

Custom Scheduler

We use a custom learning rate scheduler for our task. It takes the warmup steps into account: the learning rate is d_model^-0.5 times the minimum of 1/√step and step × warmup_steps^-1.5, as below.
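A minimal sketch of the scheduler:

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                    # 1 / sqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)     # linear warmup term
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)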

Config loader

For training parameters we make a YAML file as below and specify the params. We put all the configurable parameters here.

Config file:
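Something along these lines (the parameter names here are illustrative, not necessarily the exact keys from my repo):

import yaml

CONFIG_YML = """
num_layers: 4
d_model: 512
num_heads: 8
dff: 2048
dropout_rate: 0.1
epochs: 30
batch_size: 64
max_length: 40
target_vocab_size: 8192
"""

config = yaml.safe_load(CONFIG_YML)   # in practice, yaml.safe_load(open("config.yml"))
print(config["d_model"])              # 512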

Settle down, we have built our own Transformer.

Training Driver

Now that we are done with model building, we take in everything and wrap it in a training driver.

We set our learning rate based on the CustomScheduler we created.

Optimizer => In my experience, on Transformer-style models, Adam indeed does seem to train faster than SGD with momentum, even after tuning momentum well. See the 4th panel of Figure 2 in https://arxiv.org/abs/1910.05446.

We use tf.keras.metrics.Mean for measuring train_loss.

We use tf.keras.metrics.SparseCategoricalAccuracy for measuring train_accuracy.
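Putting those pieces together (a sketch; the beta/epsilon values follow the Adam settings from the original Transformer paper):

learning_rate = CustomSchedule(d_model=512)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)

train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')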

The @tf.function trace-compiles train_step into a TF graph for faster execution. The function specializes to the precise shape of the argument tensors. To avoid re-tracing due to the variable sequence lengths or variable batch sizes (the last batch is smaller), use input_signature to specify more generic shapes.
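A sketch of the training step (it assumes the transformer, optimizer and metrics above, plus a create_masks helper that builds the encoder padding mask, the combined look-ahead mask, and the decoder padding mask):

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))   # ignore padding tokens
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_sum(loss_ * mask) / tf.reduce_sum(mask)

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
    tar_inp, tar_real = tar[:, :-1], tar[:, 1:]           # teacher forcing
    # create_masks is assumed to return (padding, combined look-ahead, padding) masks.
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)
    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, tar_inp, True,
                                     enc_padding_mask, combined_mask, dec_padding_mask)
        loss = loss_function(tar_real, predictions)
    gradients = tape.gradient(loss, transformer.trainable_variables)
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))
    train_loss(loss)
    train_accuracy(tar_real, predictions)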

For doubts or suggestions, reach out to me.

My LinkedIn
