NLP Transformer Testing
In machine learning it is hard to visualize what is going on or to test things on a small scale.
In NLP the domain is natural language, which makes this even harder.
So I created a very small sample dataset to test the Transformer in the simplest possible way. I took the base code from a tutorial and changed the paths and variable names to suit my debugging intentions.
I got the base code from here, but I changed so much that you probably cannot match the lines anymore. The code is on GitHub; use nbviewer.com to view it if you want to see the colors.
Code link in nbviewer
Code link in github
I will focus on testing the learning phase of the Transformer.
If you do not know the topic, please read a good tutorial about the Transformer first. This post is just a visualization of the Transformer architecture for better understanding. In the following two posts you will see self-attention and Q, K, V in the Transformer logic.
Part 2 : Self-attention
Part 3 : Q,K,V
You should also know encoder-decoder architectures, and encoder-decoder with ATTENTION architectures, to understand the Transformer. Throughout the post I usually use the word vector instead of tensor.
The dataset contains English to German sentences. It is a balanced set: every word appears with the same frequency within its group (subject, object, verb). There is a "want to" pattern, which leads to translations of different lengths. The "can" and "want" patterns also lead to a different word order in the translation, so the network must learn the positions as well. Even simple patterns such as "i eat apple" produce a different verb + object order in German. Even if you do not know German, the words are very simple; you can check the table below to see which word corresponds to which English word. The sentences are also heavily simplified and not grammatically correct (missing articles, etc.). You can in fact replace the second column with your own language and try it (you must also change the German words in the notebook to words in your language; I will do a sample for Japanese).
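To make the shape of the data concrete, here are two illustrative English -> German pairs in the style of the dataset (the real file is in the notebook; these two are taken from the example sentences used later in this post):

# Illustrative pairs only; the real dataset file is in the notebook.
pairs = [
    ("i can eat apple", "ich konnen apfel essen"),   # "eat apple" -> "apfel essen": verb/object order flips
    ("i can eat bread", "ich konnen brot essen"),
]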
What makes the Transformer different from other network architectures is self-attention. In a neural network we have an input domain and an output domain, and we train the network to learn a mapping between them.
In an NLP task the inputs are sentences, and we usually have an Embedding layer to create vectors from them. In PyTorch, torch.nn.Embedding is a trainable lookup table. Trainable means that through training it will model the relations among words better and better.
If we wanted to capture every aspect of a word, we would need a table like the one above with thousands of columns to create a perfect vector for every word. An Embedding layer does this with a low dimension (64, 128, 512, ...) by training a network: it updates its vectors at every epoch according to the loss.
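As a minimal sketch of that idea (the vocabulary size and dimension below are illustrative, not the notebook's values):

import torch
import torch.nn as nn

# The embedding layer is a trainable lookup table: one row per word id.
vocab_size, emb_dim = 10, 4
embedding = nn.Embedding(vocab_size, emb_dim)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

token_ids = torch.tensor([[2, 5, 7]])   # ids of a toy sentence like "i eat apple"
vectors = embedding(token_ids)          # shape (1, 3, 4): one vector per word

# During training the loss is backpropagated into the table, so the rows
# of embedding.weight move a little at every update.
loss = vectors.sum()                    # stand-in for the real translation loss
loss.backward()
optimizer.step()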
Imagine I put all the English words into columns. Take the word "Venus": has anyone in history ever said "I eat apple in Venus"? Normally not, so under the column "Venus" there must be 0, or some very low value kept for smoothing. Training a language model is in fact trying to learn these relations, to be used later as a starting point.
In the Transformer we still have an Embedding layer, but we also use another method for creating word vectors (Q, K, V: we use the embedding as the starting vector and try to learn better ones). If we generate vectors according to context, we capture a more human-like understanding of the solution space. That is what humans do, right? When we see "apple" in a sentence, we can expect that sentence to contain the verb "eat". So when you train a network with lots of sentences, it will model the relationships among words.
The Transformer architecture has self-attention, which creates vectors from the sentence itself. So in self-attention the input and output domains are the same. (This is fully true for encoder self-attention and decoder self-attention, but the encoder-decoder attention layer is a bit different; check the MultiHeadAttentionLayer code.)
In the self-attention layer we try to learn 3 weight matrices (Linear layers): Q, K and V.
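Here is a minimal single-head sketch of these three layers, with illustrative dimensions (the notebook's MultiHeadAttentionLayer splits the work across several heads, but the idea per head is the same):

import math
import torch
import torch.nn as nn

hid_dim = 8
W_q = nn.Linear(hid_dim, hid_dim)   # the three trainable weights: Q, K, V
W_k = nn.Linear(hid_dim, hid_dim)
W_v = nn.Linear(hid_dim, hid_dim)

x = torch.randn(1, 3, hid_dim)      # embeddings of a 3-word sentence, e.g. "we eat apple"
Q, K, V = W_q(x), W_k(x), W_v(x)

# Scaled dot-product attention: Softmax(Q @ K^T / scale) @ V
scores = Q @ K.transpose(-2, -1) / math.sqrt(hid_dim)
attention = torch.softmax(scores, dim=-1)   # how much each word attends to every other word
context = attention @ V                     # context-aware vector for every word
print(context.shape)                        # (1, 3, 8)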
We train these layers by feeding data and try to get a good representation for each word based on its sentence. Let's say we have 2 sentences:
-we eat apple
-we eat bread
We will generate vectors for [we, eat, apple, bread] and then vectors for the sentences. But "eating bread" must be a different action than "eating apple", so the vectors generated by the network at those steps must be slightly different. Basically we are trying to learn good vector representations according to context.
Vector[we,eat,apple] = Vector[we],Vector[eat],Vector[apple]
Vector[we,eat,bread] = Vector[we],Vector[eat],Vector[bread]
Think of the word "eat": even though it is the same word, it must have different, separable vectors in different sentences.
Another way to think about it: in my dataset "can" appears with lots of words, but "apple" only appears with "eat", so "eat" and "apple" must contribute more to each other's representation than "can" does. This is in fact how humans think: we capture the meaning of a word in a sentence by checking the other words in the sentence. If I ask you to fill in the sentence "i want to ___ apple", you will not say "i want to kick apple" or "i want to read apple", because you have never seen samples like that. That is exactly what we are doing here with vector mathematics.
How does the Transformer learn?
To show the learning process, I created a utility class that logs vector values at every epoch.
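The actual utility class is in the notebook; a rough sketch of the idea (the names here are hypothetical) is simply to store a detached copy of each vector of interest, keyed by word, vector name and epoch:

# Hypothetical sketch, not the notebook's exact class.
class VectorLogger:
    def __init__(self):
        self.records = {}                          # e.g. "we@X1@30" -> tensor

    def log(self, word, vector_name, epoch, tensor):
        key = f"{word}@{vector_name}@{epoch}"
        self.records[key] = tensor.detach().cpu().clone()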
If you check the MultiHeadAttentionLayer class, there are 4 important vectors: Q, K, V and X1.
X1 is the final vector of MultiHeadAttentionLayer. (Please check the code; explaining everything here would take too long.)
So I log the important vectors Q, K, V, X1, ... at every intermediate step of training. Training has 30 epochs (1, 2, ..., 29, 30). The picture below shows these vectors over time.
These vectors are in fact 64-dimensional; I reduce them to 2 or 3 dimensions for visualization. The image below shows how they move toward better positions in space. As you can see, during training the network changes the vectors slightly to learn better. As I said, training means learning weights, i.e. learning how to convert inputs into the best vectors for minimum loss.
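For the reduction itself, any standard dimensionality-reduction technique works; a PCA sketch (the notebook's plotting code may use a different method) looks like this:

import numpy as np
from sklearn.decomposition import PCA

vectors_64d = np.random.randn(10, 64)          # stand-in for the logged 64-dim vectors
points_2d = PCA(n_components=2).fit_transform(vectors_64d)
print(points_2d.shape)                          # (10, 2), ready to scatter-plot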
Each sentence has a different color. You will see items like the one below in the graph:
we@X1@30
means:
Word: we
Vector: X1
At epoch: 30
I take epochs 1, 15 and 30 (just try more to see others).
Check how we@X1@1, we@X1@15 and we@X1@30 change position. The image below shows "eat".
Check the Q, K, V and X1 clusters. Can you see that the clusters are near each other? Open the images in a new tab; you need to see the full picture. You can also play with the code to see more steps or all of them (search for the section #VECTORS CHANGING OVER TIME).
Above you can see that the network makes small changes to Q, K and V to make X1 different for different contexts. Although the first-epoch values are random, the final results do not change too much for Q, K and V, because we make small updates with the learning rate; but X1 can change a lot, since it is calculated from a combination of these values: Softmax((Q · Kᵀ) / scale) · V.
Let me show you all of them in one graph.
Now let's play with a different vector: src_final. In the Encoder layer, src_final is the last vector (check my code for the naming); you can think of it as the final output of the layer, i.e. how you summarize the input sentence. "i can eat apple" is for humans; computers express this with vectors, where every dimension represents some feature.
test_sentences = [
"i can eat apple",
"i can eat bread",
"i can eat book",
"i can eat newspaper",
"i can eat apple book"]
I created 5 test sentences. As you see, the 1st and 2nd sentences are valid. The 3rd ("eat book") and 4th ("eat newspaper") are nonsense sentences, and in the 5th both "apple" and "book" appear. The translations of these 5 sentences are below (the 3rd, 4th and 5th translations are not correct):
translation ['ich', 'konnen', 'apfel', 'essen', '<eos>']
translation ['ich', 'konnen', 'brot', 'essen', '<eos>']
translation ['ich', 'konnen', 'apfel', 'essen', '<eos>']
translation ['ich', 'konnen', 'brot', 'essen', '<eos>']
translation ['ich', 'konnen', 'apfel', 'essen', '<eos>']
These are the translations they give. As you see, "eat" is dominant in the bad sentences, so they still produce sentences with "essen". (The network could also have generated "read book", but the verb is more dominant.)
Now let's check how the network generated vectors for these.
As you can see, the sentences with proper words (1 and 2) and sentence 5 (which contains the same words) have similar vectors. Observe that the "can" vectors sit in similar positions. So even though we generate vectors for the same word "eat", they land in different parts of the space. Also, the "can" vectors form a more compact cluster.
HOW THE NETWORK LEARNS WITH TRAINING
Now let's try to visualize the training process for attention. I will dump the attention scores through the training steps. At training step 1 you will see that the attentions are random; over time they get better, and at the last step the attention is on the appropriate places. In the code below you can see how I collect the attention vectors for epochs (0, 10, 20, 30); if you want to see more, or different attention vectors at different epochs, play with line 5.
In the image below you can see how the attention vector changes across the translation steps.
Columns [sos, we, can, eat, apple, eos, pad] are the source words.
Row indexes [1) sos->wir, 2) wir->konnen, 3) konnen->apfel, 4) apfel->essen, 5) essen->eos] show the word translated at each step.
You should read each one as: last word generated -> word generated at this step.
"1) sos->wir" means translation step 1, going from "sos" to "wir": at the initial step we only have "sos" and we predict "wir", the 1st word of the translation.
At epoch 1 you can see that the attention is on "sos". If you check epoch 30 (the last epoch), the attention is on "we". Check the other epochs too.
At the last epoch, after "essen", "sos" seems to have more attention. But never forget that attention is not the only vector we use to generate the final output. So let me dump the translation process after training and print the translation info.
I have a method called translate_info which shows the logits and attentions of a translation. Logits are the raw (non-normalized) scores of a classification model before Softmax; softmax then turns these scores into probabilities that sum to 1.
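For example (illustrative numbers):

import torch

logits = torch.tensor([2.0, 0.5, -1.0])   # raw scores, any range
probs = torch.softmax(logits, dim=0)       # roughly [0.79, 0.18, 0.04], sums to 1
print(probs, probs.sum())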
Translation is the process of generating the next word, starting from the input "sos", until "eos" is generated (or until the maximum length we allow). So when you see an image like the one below, it shows the logits and attentions at each translation step. If the input sentence is "i can eat apple", the translation process has 5 steps: "ich konnen apfel essen <eos>".
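The loop behind this is a simple greedy decoder; here is a sketch with a hypothetical decoder_step helper (the notebook's translation code works on the same principle, but with the real model):

import torch

# decoder_step is a stand-in for one forward pass of the trained decoder:
# given the tokens generated so far, it returns logits over the target vocabulary.
def decoder_step(generated_ids, vocab_size=6):
    return torch.randn(vocab_size)            # random logits, just so the sketch runs

sos_id, eos_id, max_len = 0, 5, 10
generated = [sos_id]                           # start from "<sos>"
for _ in range(max_len):
    logits = decoder_step(generated)           # raw scores for the next word
    next_id = int(torch.argmax(logits))        # greedy choice (softmax would not change the argmax)
    generated.append(next_id)
    if next_id == eos_id:                      # stop once "<eos>" is produced
        break
print(generated)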
As you can see below, at the last step, although the attention is more on "sos", the logits give "eos" the highest score. The table shows the logits at each step, and the heatmap shows the attention at each step of the translation process.
IMPORTANCE OF SELF-ATTENTION
As stated before, the Transformer is an improvement over the encoder-decoder architecture. The gist of the Transformer is using the encoder states in an optimal way at each step. As you saw in the heatmap above, it changes the attention at every step of the translation process. Now I will try to visualize the importance of the attention mechanism.
I will translate all sentences in the normal way, using the default attention mechanism (code below, line 25).
Then I will translate using equal attention over all encoder states (we ignore the vectors we learned through training and treat all words of the source sentence as equally important) (code below, line 27). You can see the details of this logic in the 2nd part of this tutorial.
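"Equal attention" simply means replacing the learned softmax scores with a uniform distribution over the source positions, so the context vector becomes the plain mean of the encoder states. A small sketch with illustrative shapes:

import torch

src_len, hid_dim = 6, 8
encoder_states = torch.randn(1, src_len, hid_dim)                # outputs of the encoder

uniform_attention = torch.full((1, 1, src_len), 1.0 / src_len)   # every source word gets the same weight
context = uniform_attention @ encoder_states                     # same as encoder_states.mean(dim=1, keepdim=True)
print(context.shape)                                             # (1, 1, 8)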
The vector I am using is the "trg4" vector at the end of DecoderLayer. Also note that I collect all intermediate steps of a translation process: for a translation [ich, mochten, apfel, essen, eos], 5 vectors are created. I also check the length of the generated sentence (line 14).
But when I do not apply attention properly, the sentence order can change. If you check the method below, I pass filter words: at translation step 1 we expect the network to generate "ich" or "wir", so among the sentences translated with invalid attention I keep only the ones that generated "ich" or "wir". Here I try to show that even if the network generates the same word, its encoding is different. I do this so that I am not "comparing apples and oranges": the vectors I compare are of the same type and symbolize the same thing (I only compare the [ich, wir] valid vectors (with self-attention) against the [ich, wir] invalid vectors (equal attention)).
Also, among all the sentences, I mark invalid sentences with pink, and the invalid sentences end with a suffix as below:
eat: "!!!"
drink: "]]]"
read: "<<<"
The 1st image below is for translation step 1. What do we normally generate at translation step 1? Either "ich" or "wir". You can see the left part is "ich" (i) and the right part is "wir" (we); the network nicely separated these vectors. (I use the source sentence as the label, which gives a better idea.)
In the image below, when we include the invalid ones, you can see they are positioned in the middle, not separated enough. Even though attention was not applied properly, the network still generated "ich" or "wir" at the 1st step, but as you see their representation is very different from the others.
Let's state the image below verbally:
Without attention,
even if we create
the same vectors [ich, wir]
at the same step (step 1 of translation),
those vectors [ich, wir] are different from the same vectors [ich, wir]
created with attention.
At translation step 2 we have ["konnen", "mochten"], and the image below shows how the network learned to separate them. You can see "can" on the right and "want" on the left.
Below we see that when we did not apply self-attention, only a few sentences generated [konnen, mochten], and as you see they are far from the normal clusters.
Without attention,
even if we create
the same vectors [konnen, mochten]
at the same step (step 2 of translation),
those vectors [konnen, mochten] are different from the same vectors [konnen, mochten]
created with attention.
At translation step 3 we have ["buch", "zeitung", "apfel", "brot", "wasser", "beer"], and the image below shows how the network learned to separate them. You can see the object clusters.
In the image below, when the invalid ones are added, you can see how the pink sentences are scattered.
At step 4 we generate verbs. As you see, they are separated in the image below; you can even see that the verbs are separated by verb + object.
When we add the invalid ones, you can see the pink points are scattered all over the image.
As you see, when we do not apply the Transformer self-attention (i.e. we distribute attention equally over all words at all translation steps), the vectors representing the same words get corrupted. Because of the lack of good attention, they are not separated enough; they did not catch the context.
In this post I tried to show what the Transformer architecture is learning and how we can get a better idea about it. Read the posts about self-attention and Q, K, V to understand it in more detail.