Decoding Transformers: The Self-Attention Saga

Himanshu Kale
6 min read · Feb 21, 2024


Photo by Jared Rice on Unsplash

Are you also curious about how your favorite AI chatbot seems to understand you better each day? It’s all thanks to the Self-Attention Mechanism in Transformers. But fear not, you don’t need to be a tech wizard to grasp its magic. Let’s embark on a journey through this groundbreaking technology and see how it’s reshaping our digital world, one conversation at a time.

Let’s start from scratch!!

Computers and our algorithms are primarily designed to work with numbers. So when we take on an NLP task, the first thing we do is convert the words into embeddings. In simple terms, embeddings are numerical representations of words, actually n-dimensional vectors. Traditionally we used One-Hot Encoding, Bag-of-Words, or embedding layers to get these vectors. But have you ever wondered why we needed to move from these embedding vectors to self-attention? The answer is “static embeddings”. The vectors generated by those methods are static in nature: even if a word has different meanings in different contexts, its embedding vector stays the same. Let us try to understand this with an example.

Contextually, the word ‘bank’ means something different in the money sense and in the river sense, yet the embedding vector for both is the same. So a static semantic embedding is not a good option to move forward with; in fact, we need contextual embeddings that capture the contextual meaning of the word in the vector. That is exactly what the Self-Attention block in the Transformer does.
[ Semantic Static Embedding → Dynamic Contextual Embedding ]
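To make the problem concrete, here is a tiny sketch of the static case (the words, vector values, and lookup-table setup are purely illustrative, not from any particular library):

```python
import numpy as np

# A hypothetical static embedding table: every word maps to exactly one vector,
# no matter where it appears.
static_embeddings = {
    "money": np.array([0.9, 0.1, 0.0]),
    "river": np.array([0.0, 0.2, 0.9]),
    "bank":  np.array([0.5, 0.5, 0.1]),  # one vector shared by both senses
}

# "money bank" and "river bank" retrieve the exact same vector for "bank",
# even though the intended meanings are completely different.
print(static_embeddings["bank"])  # identical output in both contexts
```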

Now that we have understood the why of self-attention, let’s continue with the how: how are these conversions made, and what is the math behind them?

First Principles Approach
For the discussion ahead, let us consider two small sentences:
1. money bank grows
2. river bank flows
In both sentences the word ‘bank’ holds a different meaning, so, as discussed, it must get a different embedding vector in each. If you observe closely, the meaning of ‘bank’ depends on the context of the words around it. So, using the principle of linear combination, we can write the new embeddings as below.
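For “money bank grows”, each new embedding is written as a weighted sum of the static ones. In plain notation it looks roughly like this (only 0.7 and 0.2 are taken from the discussion below; the remaining weights are made up for illustration):

e_money_new = 0.7 · e_money + 0.2 · e_bank + 0.1 · e_grows
e_bank_new  = 0.2 · e_money + 0.6 · e_bank + 0.2 · e_grows
e_grows_new = 0.1 · e_money + 0.2 · e_bank + 0.7 · e_grows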

From these equations we get the new contextual embeddings for the words. Now you might be wondering: what are these numbers 0.7, 0.2, …?
They are numerical representations of the similarity between the static embeddings of the words: 0.7 is the similarity of the embedding of ‘money’ with itself, 0.2 is the similarity of the embeddings of ‘money’ and ‘bank’, and so on. But why similarity, and why are the numbers less than 1? Similarity captures how much the embeddings depend on each other, and hence how much each word contributes to the dynamic embedding; and the numbers stay below 1 simply because we AI enthusiasts know the power of normalization!! Sounds good? Let’s move on!!

The similarity between two vectors can be easily calculated as the dot product of the static embeddings of the two words. These similarities are then normalized to get the percentage contribution of each word to the new embedding. Since the raw scores can also be negative, they are passed through a softmax, as you can see below, to obtain the final dynamic contextual embedding.
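Here is a minimal NumPy sketch of that computation for the word ‘bank’ in “money bank grows” (the embedding values are made up for illustration):

```python
import numpy as np

def softmax(x):
    x = x - x.max()              # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Illustrative static embeddings for "money bank grows"
e_money = np.array([0.9, 0.1, 0.0])
e_bank  = np.array([0.5, 0.5, 0.1])
e_grows = np.array([0.1, 0.8, 0.3])

# Dot-product similarities of "bank" with every word in the sentence
scores = np.array([e_bank @ e_money, e_bank @ e_bank, e_bank @ e_grows])

# Softmax turns the (possibly negative) scores into normalized weights
weights = softmax(scores)

# The new contextual embedding of "bank" is the weighted sum of all embeddings
e_bank_contextual = weights[0] * e_money + weights[1] * e_bank + weights[2] * e_grows
print(e_bank_contextual)
```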

Every calculation in this process is independent of the others, so there is no reason not to run them in parallel. This, I feel, is the biggest advantage when it comes to computation-heavy algorithms. Take a look below to get a better feel for these parallel operations.
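In matrix form, all the dot products, the softmax, and the weighted sums happen at once. A hedged sketch, again with made-up numbers:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Rows = static embeddings of "money", "bank", "grows"
E = np.array([[0.9, 0.1, 0.0],
              [0.5, 0.5, 0.1],
              [0.1, 0.8, 0.3]])

scores  = E @ E.T               # all pairwise similarities in one shot
weights = softmax(scores, axis=1)
E_contextual = weights @ E      # every contextual embedding computed in parallel
print(E_contextual)
```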

But one issue arises!! Because we process the embeddings in parallel, the sequential information of the data is lost. Also, the whole process involves no learnable parameters. But is there really a need for trainable parameters here, and why can’t we keep things simple?

If you consider a machine translation task, you will notice that even when the context of the words is the same, the meaning that should come out of it can differ, and it depends on the data. So generating data-driven embeddings, that is, task-specific contextual embeddings, is actually a good idea.

If you look at the algorithm we used to calculate the contextual embeddings, you can see that the embedding of each word (money, bank, grows) performs three different tasks while computing a contextual embedding. Doesn’t that sound cool? Philosophically, we humans tend to show different facets of ourselves when facing different tasks, and all those facets are part of our whole character. If we can apply the same concept to our embedding vectors, that would be really cool!!
So our algorithm gets modified as follows.

The three embeddings derived from a single embedding are called
1. Key
2. Query
3. Value
all three serving different tasks, as the small sketch below illustrates.
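At this stage all three roles are still played by the same static vector; a tiny illustrative sketch of the roles (variable names and values are mine):

```python
import numpy as np

# Illustrative static embeddings for "money bank grows"
e_money = np.array([0.9, 0.1, 0.0])
e_bank  = np.array([0.5, 0.5, 0.1])
e_grows = np.array([0.1, 0.8, 0.3])

# Before any learning, query, key and value are all just the static vector:
query  = e_bank                                  # "bank" asking about its context
keys   = np.stack([e_money, e_bank, e_grows])    # what each word offers for matching
values = np.stack([e_money, e_bank, e_grows])    # what each word actually contributes

scores  = keys @ query                           # similarity of the query with every key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax-normalized contributions
contextual_bank = weights @ values               # weighted mix of the values
print(contextual_bank)
```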

But how do we get these three vectors from a single embedding?
I see two options:
1. Scaling the original vector
2. Linear transformation using matrices
I would definitely not go for the first one, because the direction of the vector would remain the same. So linear transformation is what we go with, and this also brings trainable parameters into the picture.
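A quick sanity check of that argument (the vector and matrix below are arbitrary): scaling a vector never changes its direction, while multiplying by a matrix can point it somewhere entirely new.

```python
import numpy as np

v = np.array([1.0, 2.0])

scaled = 3.0 * v                      # same direction, just longer
W = np.array([[0.0, -1.0],
              [1.0,  0.0]])           # an arbitrary (here: rotation) matrix
transformed = W @ v                   # points somewhere new

# cosine similarity: 1.0 for the scaled copy, 0.0 for the rotated one
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(v, scaled), cos(v, transformed))
```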

The matrices Wq, Wk and Wv are weight matrices used as the linear transformations. The numbers in these matrices are learned during the training of the model, which gives us exactly the data-dependent embedding generation we wanted.
The same matrices are shared across words: when computing the dynamic embeddings of ‘bank’ and ‘grows’ as well, the same Wq, Wk and Wv are used. Now we have figured out everything!! Is the algorithm still parallelizable? Yes it is; take a look at the complete algorithm sketched below.
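Putting it all together, here is a compact sketch of the complete single-head self-attention computation; the weight matrices below are random stand-ins for parameters that would be learned during training, and the division by the square root of the dimension follows the original Transformer paper:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 4                                    # embedding dimension (illustrative)

# Static embeddings for "money bank grows", one row per word
E = rng.normal(size=(3, d))

# Trainable weight matrices (random here, learned in a real model)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = E @ Wq, E @ Wk, E @ Wv         # every word gets a query, a key and a value

scores  = Q @ K.T / np.sqrt(d)           # scaled dot-product similarities
weights = softmax(scores, axis=1)        # each row sums to 1
contextual = weights @ V                 # dynamic, task-driven contextual embeddings
print(contextual.shape)                  # (3, d): one contextual vector per word
```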

And that’s it: we have found a way to convert static embeddings into dynamic ones, and those dynamic embeddings into task-dependent contextual embeddings.
The Self-Attention Mechanism in Transformers isn’t just about algorithms; it’s about unlocking the potential for machines to truly comprehend language.
That’s all for this post !!

References
1. Attention Is All You Need, by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
2. Attention Mechanism from Scratch, by Jason Brownlee @ Machine Learning Mastery
