100 NLP Questions & Answers | Attention Part | Attention Mechanism Interview

Milana Shkhanukova
10 min read · Sep 16, 2024


Today we are going to cover probably one of the most important parts of an interview: the attention mechanism.

32. How do you compute attention? (additional: for what task was it proposed, and why?)

The attention mechanism is a method that determines the relative importance of each component in a sequence relative to the other components in that sequence. It was originally proposed for neural machine translation (Bahdanau et al.), so that the decoder could look at all encoder hidden states instead of relying on a single fixed-length sentence vector.

The phrase to remember is “you compare the query with the keys to get the scores/weights for the values”.

  • key — the tokens we compare against when computing the scores (the source text)
  • query — the tokens for which we compute the attention (the target text)
  • value — the tokens whose representations are weighted and summed to produce the update (the source text)

We want the resulting vectors to carry only the important information, taken either from the sequence itself (self-attention) or from the other sequence, e.g. the source sentence in translation (cross-attention).

The exact formula used to calculate attention can differ. The important idea is that we need some scores with which to update our values.

  1. dot product — the most commonly used formula
    It is extended to scaled dot-product attention, where the scores are divided by the square root of the key dimension before the softmax. This scaling is essential for stable training when the dimensionality is high.
  2. additive attention — also called concat attention, suggested by Bahdanau, where the score is calculated by a small feedforward network.
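For reference, the two standard score formulations can be written as follows (scaled dot-product as in Vaswani et al., additive as in Bahdanau et al.; here d_k is the key dimension, s_{t-1} the decoder state, and h_i an encoder hidden state):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{score}_{\text{additive}}(s_{t-1}, h_i) = v_a^{\top} \tanh\left(W_a s_{t-1} + U_a h_i\right)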

33. What is attention complexity? Compare it to RNN complexity.

Self-attention complexity is O(n² · d), where n is the sequence length and d is the representation dimensionality.

RNN complexity is O(n · d²).

Per layer, an RNN is therefore more computationally efficient than self-attention for longer sequences, i.e. when n grows larger than d.

https://arxiv.org/pdf/1706.03762.pdf

In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length n is smaller than the representation dimensionality d, which is most often the case with sentence representations used by state-of-the-art models in machine translations, such as word-piece [38] and byte-pair [31] representations.
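To make the comparison concrete, here is a small back-of-the-envelope calculation (constants and hardware effects are ignored; d = 512 is an arbitrary illustrative choice):

def self_attention_ops(n, d):
    # per-layer cost of self-attention, up to constants: O(n^2 * d)
    return n * n * d

def rnn_ops(n, d):
    # per-layer cost of a recurrent layer, up to constants: O(n * d^2)
    return n * d * d

d = 512
for n in (128, 512, 4096):
    print(n, self_attention_ops(n, d), rnn_ops(n, d))
# n < d: self-attention is cheaper; n = d: equal; n > d: the RNN is cheaper per layer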

34. Compare RNN and attention. In what cases are you going to use RNN and in what cases do you prefer attention?

RNN:

  • Sequential processing: sentences must be processed word by word → no parallel computation
  • Past information is retained only through hidden states: sequence-to-sequence RNNs effectively follow the Markov property, each state depending only on the previously seen state → vanishing gradient problem

Attention:

  • Non sequential: sentences are processed as a whole rather than word by word → Parallel computation

It is not trivial to say which architecture should always be preferred: RNNs can still make sense when sequences are extremely long or memory is tight, since self-attention cost grows quadratically with length. In practice, however, attention-based networks are now clearly superior to RNNs for most NLP tasks.

35. Write attention formula in Python.

import math
import torch
import torch.nn.functional as F

def attention(query, keys, values):
    # query: (batch, n_q, d), keys: (batch, n_k, d), values: (batch, n_k, d_v)
    d = keys.size(-1)

    # scaled dot-product scores, shape (batch, n_q, n_k)
    scores = torch.bmm(query, keys.transpose(1, 2)) / math.sqrt(d)
    weights = F.softmax(scores, dim=-1)

    # weighted sum of values, shape (batch, n_q, d_v)
    values_emb = torch.bmm(weights, values)
    return values_emb

36. Explain masking in attention.

In short, the encoder part of a transformer uses only self-attention, since each token just needs to gather information from the rest of the input. The decoder first attends to its own tokens with self-attention and then pulls in information from the encoder with cross-attention.

With this in mind, two kinds of attention masking matter: causal (future-token) masking and padding masking. Causal masking is used in the decoder self-attention so that a token cannot attend to positions that come after it. Random token masking (as in BERT's masked language modelling) is a training objective applied to the input rather than an attention mask.

It is also important to remember padding: whether in self-attention or cross-attention, padded positions have to be excluded inside the masked softmax.
Here is one of the implementations.

interactive visualization of masking for BERT and GPT embeddings
interpretation paper
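For intuition, here is a minimal sketch of a masked softmax that combines a causal mask with a padding mask (my own illustration with assumed tensor shapes, not the implementation linked above):

import torch
import torch.nn.functional as F

def masked_softmax(scores, pad_mask=None, causal=False):
    # scores: (batch, n_q, n_k) raw attention scores
    n_q, n_k = scores.shape[-2], scores.shape[-1]
    if causal:
        # decoder self-attention: forbid attending to future positions
        future = torch.triu(torch.ones(n_q, n_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    if pad_mask is not None:
        # pad_mask: (batch, n_k), True where the key position is padding
        scores = scores.masked_fill(pad_mask[:, None, :], float("-inf"))
    return F.softmax(scores, dim=-1)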

37. What is attention matrix size?

The attention matrix depends on the number of input elements: its size is (seq_length, seq_length) per head, where seq_length is the length of the sequence we apply attention to. For cross-attention it is (target_length, source_length).

38. What is the difference between BERT and GPT when we are speaking about attention calculations?

  1. BERT uses bidirectional self-attention: each token's hidden state is computed by attending to all other tokens in the sequence, and its pre-training objective randomly masks input tokens. In addition, variants such as span masking, or masking co-occurring tokens together, can be used.
  2. GPT uses causal language modelling, where all future tokens are masked when computing token n, so each token attends only to previous positions.

39. What is the dimensionality of an embedding layer in transformers?

vocab size * embedding dim
where:
1. vocab size — the number of unique tokens the model understands; these tokens come from the tokenization procedure

2. embedding dimension — the number of features in the hidden representation of each token. This dimension is usually fixed throughout the transformer model.

Important note: one of the reasons we use subword tokenizers is that we cannot maintain and update an embedding matrix of size “all words in the world” × embedding dim; subword units keep the vocabulary bounded.
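As a quick sanity check of these dimensions (the numbers below are roughly BERT-base-like and purely illustrative):

import torch.nn as nn

vocab_size, embedding_dim = 30522, 768   # illustrative, roughly BERT-base
token_emb = nn.Embedding(vocab_size, embedding_dim)

# the weight matrix has shape (vocab_size, embedding_dim)
print(token_emb.weight.shape)    # torch.Size([30522, 768])
print(token_emb.weight.numel())  # 23,440,896 parameters in this layer alone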

40. Why do you call embeddings contextual? How does it work?

Let’s take the word mouse. Depending on the context, it can mean different things: Mickey Mouse, the animal, or a computer mouse.
When we used word2vec, we would either have to store three separate meanings of the word, or hope that all the contexts are covered during training, or accept that a single embedding blends all the meanings at once.

In attention-based models, however, every token is updated according to the context around it. Given “Mickey Mouse is my favourite Disney character,” the embedding of the token mouse is updated by comparing it to all the other tokens in the sentence. This happens during inference too, although at inference we no longer update the weights. Moreover, the exact same context does not need to have appeared during training: if the contextual representations are well trained, the model can still infer the intended meaning.

In addition, in terms of the different types of embeddings we need to distinguish between:

  • segment embeddings — usually used for sentence-pair tasks such as text similarity, where we need to tell the two sentences apart. They were notably used in the BERT pre-training stage.
  • token embeddings — the embeddings we have discussed so far; they carry the meaning of each token we pass in.
  • positional embeddings — embeddings that encode the position of each token.
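A minimal sketch of how these three embedding types are combined in a BERT-style model (sizes and token ids below are illustrative assumptions, not taken from any specific checkpoint):

import torch
import torch.nn as nn

vocab_size, max_len, num_segments, d_model = 30522, 512, 2, 768

token_emb = nn.Embedding(vocab_size, d_model)
position_emb = nn.Embedding(max_len, d_model)
segment_emb = nn.Embedding(num_segments, d_model)

token_ids = torch.tensor([[101, 7592, 2088, 102]])        # (batch=1, seq_len=4), illustrative ids
segment_ids = torch.zeros_like(token_ids)                  # all tokens from sentence A
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # positions 0, 1, 2, 3

# BERT-style input representation: the three embeddings are simply summed
x = token_emb(token_ids) + position_emb(positions) + segment_emb(segment_ids)
print(x.shape)   # torch.Size([1, 4, 768])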

41. What do you use in Transformers layer norm or batch norm and why?

First of all, let’s define why we even need any kind of normalisation. The problem is internal covariate shift.

Internal covariate shift is the change in the distribution of network activations due to the change in network parameters during training. Since activation functions are applied to inputs that are transformed by constantly changing weights, we need those inputs to stay in a range the network can learn from. If the values drift too high or too low, the activations saturate and the derivatives get close to 0, which eventually leads to slower convergence.

You can read more about it in this post.

To prevent internal covariate shift during training, we can apply batch normalization and layer normalization.

  • Batch normalization — an extra layer that normalizes activations using the mean and variance of the current mini-batch, and then scales and shifts the result with the learnable parameters gamma and beta.

The authors added gamma and beta to “save the layer knowledge”, i.e. to preserve each layer’s representational capacity so it can keep solving its task while the distributions stay close, but not identical. You can learn more about it in this paper, including the reasons why it works and its pros and cons.

Layer normalization is the same kind of normalisation with a mean and variance, but the statistics are computed over the features of each individual sample (the whole layer), not over the batch.
a video about it

As follows from this definition, layer normalization has no dependence on the batch size. In addition, layer normalization can be used in RNNs between time steps as-is, while batch normalization would have to be modified. Here is a discussion about it.

One more reason for the preference towards layer normalization is how easily its computations can be parallelized; with batch normalization parallelization is harder because of the dependence between the elements of a batch. Here is a discussion.

Moreover, in batch normalization there is often a discrepancy between how the statistics are computed during training (over a batch) and at inference (using moving averages), and in NLP batch sizes at inference are usually small. In addition, layer normalization handles the padding that is common in NLP inputs better.
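A small sketch of which axes the statistics are computed over (shapes are illustrative):

import torch
import torch.nn as nn

batch, seq_len, d_model = 4, 10, 16
x = torch.randn(batch, seq_len, d_model)

# LayerNorm: statistics over the features of each token, independent of the batch
ln = nn.LayerNorm(d_model)
y_ln = ln(x)                                   # (4, 10, 16)

# BatchNorm1d expects (batch, channels, seq) and averages each feature over the
# batch and sequence positions, so its statistics depend on the other samples
bn = nn.BatchNorm1d(d_model)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)   # (4, 10, 16)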

42. What is the difference between PreNorm and PostNorm?

Answer according to the 2020 article and the 2022 blog post

Previously, it was suggested that the Pre-LN Transformer outperforms the Post-LN Transformer when the number of layers increases.

PreLN and PostLN differ in the following ways:

  1. PostLN places the LayerNorm between the residual blocks (after the residual addition), while PreLN places the LayerNorm inside the residual blocks (before each sublayer); see the sketch after this answer.
  2. PreLN, in addition, places a final layer normalization right before the prediction layer.
  3. The gradient scale of pre-norm depends on the layer number, while for post-norm it is independent of the layer.
  4. The same holds for the scale of the hidden states: in Pre-LN it depends on the layer number, while in Post-LN it does not.
  5. The same behaviour can be seen in the expected gradient of the weights; Pre-LN shows more stability across the layers.

Importantly, the gradient scale in PostLN can make training with a learning-rate schedule more difficult. Since the gradients of certain layers are very large, using a large learning rate without warm-up is unstable. This can be seen in the figure from the paper: after warm-up, the gradient scale for PostLN becomes small and the model can tolerate a higher learning rate. For PreNorm, on the other hand, warm-up is far less important, which also lets the model converge faster.
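Schematically, the two placements look like this (a rough sketch; "sublayer" stands in for the attention or feedforward block):

import torch.nn as nn

class PostLNBlock(nn.Module):
    # Post-LN: sublayer -> residual addition -> LayerNorm (norm between residual blocks)
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    # Pre-LN: LayerNorm -> sublayer -> residual addition (norm inside the residual block)
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))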

43. Explain the difference between soft and hard (local and global) attention

Classical attention (as used in the 2017 Transformer) has one main advantage: each token can look at all the others to update its own representation. However, it can be computationally expensive for very long sequences. There are two axes along which it has been varied and optimized:

  • soft vs hard — a split suggested for image captioning: soft attention is a differentiable weighted average over all positions, while hard attention picks a single patch and has to be optimized with variance-reduction or reinforcement-learning techniques.
  • global vs local — in global attention we compare against and aggregate over all states, while in local attention we first predict a position and only look at a window around it. In NLP the global approach is usually preferred.

report about global and local

44. Explain multi-head attention

In attention, we can split our entire embedding and pass each part through different matrices — basically, this is multi-head attention, where a head is precisely that split. The advantages we get from this are parallelism and diverse representations. It is believed that each head can learn separate important information. The results of these independent attention mechanisms are then concatenated and linearly transformed into the required dimension.
report about multihead attention
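A compact sketch of the idea (a self-attention variant with assumed shapes, not a drop-in replacement for any library class):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # fused Q, K, V projections
        self.out = nn.Linear(d_model, d_model)       # final linear after concatenation

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # split the embedding into heads: (batch, heads, seq, d_head)
        def split(t):
            return t.reshape(b, n, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        context = F.softmax(scores, dim=-1) @ v
        # concatenate the heads back and project to d_model
        context = context.transpose(1, 2).reshape(b, n, self.num_heads * self.d_head)
        return self.out(context)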

45. What other types of attention do you know?

Each decoding step in autoregressive models like Transformers requires loading decoder weights along with all attention keys and values. This process is not only computationally intensive but also memory bandwidth-intensive. As model sizes grow, this overhead also increases, making scaling up an increasingly arduous task.

One branch of attention variants focuses on optimization:

  1. memory-bandwidth problem — in Multi-Head Attention every attention head computes its own unique set of query, key, and value vectors, all of which must be loaded at each decoding step (see the sketch after this list).
    a nice explanation
  • Multi-Query Attention — each head keeps its own query projection, while a single key and value projection is shared across all the heads.
  • Grouped Query Attention — a middle ground: in Multi-Head Attention the number of unique key and value heads equals the number of attention heads, in Multi-Query Attention it equals 1, and in Grouped Query Attention the heads are split into groups that each share one key/value head.

2. Sparsity — some tokens do not need to communicate with each other: introduce fixed attention patterns, sliding-window attention, or routing mechanisms.

3. Length complexity — first project the tokens to a low-rank representation (Linformer).

4. Compute power — Flash Attention is based on the idea of optimizing how computations are split between HBM and SRAM. HBM is used to store tensors (e.g., feature maps/activations), while SRAM is used to perform compute operations on those tensors.
The standard attention mechanism uses High Bandwidth Memory (HBM) to store, read and write keys, queries and values. HBM is large but slow, while SRAM is much smaller but faster. In the standard implementation, the cost of loading and writing keys, queries, and values from HBM is high: they are loaded from HBM into GPU on-chip SRAM, a single step of the attention mechanism is performed, the result is written back to HBM, and this is repeated for every attention step. Flash Attention instead loads keys, queries, and values once, fuses the operations of the attention mechanism, and only then writes the result back.
https://huggingface.co/docs/text-generation-inference/conceptual/flash_attention
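Here is the sketch mentioned above, showing only the shapes of the projections (hypothetical sizes; real implementations differ in details such as the KV cache layout):

import torch
import torch.nn as nn

d_model, num_heads, num_kv_heads = 512, 8, 2     # GQA with 2 key/value groups
d_head = d_model // num_heads

q_proj = nn.Linear(d_model, num_heads * d_head)      # one query head per attention head
k_proj = nn.Linear(d_model, num_kv_heads * d_head)   # fewer key heads
v_proj = nn.Linear(d_model, num_kv_heads * d_head)   # fewer value heads

# MHA: num_kv_heads == num_heads; MQA: num_kv_heads == 1; GQA: in between.
# Each group of num_heads // num_kv_heads query heads shares one key/value head,
# which shrinks the KV cache that must be read at every decoding step.
x = torch.randn(1, 16, d_model)
print(q_proj(x).shape, k_proj(x).shape)   # torch.Size([1, 16, 512]) torch.Size([1, 16, 128])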

46. How much more complex will multihead-attention become when increasing the number of heads?

According to the standard implementation, the number of heads does not increase the number of parameters.

A model of dimensionality d with a single attention head would project embeddings to a single triplet of d-dimensional query, key and value tensors (each projection counting d² parameters, excluding biases, for a total of 3d²). A model of the same dimensionality with k attention heads would project embeddings to k triplets of d/k-dimensional query, key and value tensors (each projection counting d × d/k = d²/k parameters, excluding biases, for a total of 3k · d²/k = 3d²).
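A quick numeric check of this argument (assuming d = 512 and ignoring biases):

d = 512
for k in (1, 4, 8, 16):
    d_head = d // k
    # k heads, each with three projections of shape (d, d_head)
    params = 3 * k * d * d_head
    print(k, params)   # always 3 * d**2 = 786,432, independent of k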

Thank you for reading! Write in the comments which answers you disagree with and which questions you would like to see added.
