Sliding Window Attention
Before we jump into sliding window attention, let’s review.
- Self-Attention
- Receptive Field
Self-Attention
Self-Attention allows the model to relate words to each other. Imagine we have the following sentence: “The cat is on a chair”
Here I show the product of the Q and K matrices before we apply the softmax.
After applying the causal mask, we apply the softmax, which rescales the remaining values in each row so that the row sums up to 1.
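As a concrete illustration, here is a minimal NumPy sketch of the computation just described: the Q·Kᵀ scores, the causal mask, and the row-wise softmax. The 6 × 8 toy embeddings and the use of the same vectors for Q, K and V are simplifications for readability, not part of the original explanation.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scores = Q @ K^T / sqrt(d), causal mask, row-wise softmax, weighted sum of V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (seq_len, seq_len)
    # Causal mask: a token may only attend to itself and the tokens before it.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    # Softmax row by row: each row becomes a probability distribution summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# "The cat is on a chair" -> 6 tokens with toy 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
out = causal_self_attention(x, x, x)   # using the embeddings directly as Q, K, V
print(out.shape)                       # (6, 8): one output vector per token
```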
Now, let’s look at sliding window attention.
Let’s apply sliding window attention to the same sentence, with a sliding window size of 3.
After applying the softmax, all the -Infinity values have become 0, while the other values in each row are rescaled so that they sum up to one. The output of the softmax can be thought of as a probability distribution.
The output of the Self-Attention is a matrix of the same shape as the input sequence, but where each token now captures information about other tokens according to the mask applied. In our case, the last token of the output captures information about itself and the two preceding tokens.
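Here is a minimal sketch of the corresponding mask, assuming the same 6-token sentence and W = 3: token i may attend only to itself and the two tokens before it, and every other position would receive -Infinity before the softmax, exactly as with the causal mask.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """True where attention is allowed: token i sees tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Window size W = 3 on the 6-token sentence "The cat is on a chair".
print(sliding_window_causal_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```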
With a sliding window size W = 3, every layer adds information about (W - 1) = 2 more tokens.
This means that after N layers, information has flowed across a span on the order of W × N tokens.
Sliding Window Attention: details
- Reduces the number of dot products to perform, which improves performance during training and inference.
- Sliding window attention may lead to a degradation in the quality of the model, as some “interactions” between tokens will not be captured.
- The model mostly focuses on the local context, which, depending on the size of the window, is enough for most cases.
This makes sense if you think about a book: the words in a paragraph in chapter 5 depend on the paragraphs in the same chapter but may be totally unrelated to the words used in chapter 1.
- Sliding window attention can still allow one token to attend to tokens outside the window, by a reasoning similar to the receptive field in convolutional neural networks.
Receptive field in CNNs
This feature depends directly on 9 features of the previous layer, but indirectly on all the features of the initial layer, since each feature of the intermediate layer depends on 9 features of the layer before it. This means that a change in any of the features of layer 1 will influence this feature as well.
The information flow in sliding window attention is very similar to the receptive field of a CNN.
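The same reasoning gives a simple rule of thumb, sketched below under the assumption described above (each layer lets information travel W - 1 positions further back): after N layers, a token can be influenced by tokens up to (W - 1) × N positions away.

```python
def earliest_visible_token(position, window, n_layers):
    """Index of the earliest token whose information can reach `position`
    after n_layers of sliding window attention with the given window size."""
    return max(0, position - (window - 1) * n_layers)

# With W = 3 and N = 4 layers, token 20 can be influenced by tokens as far back as index 12.
print(earliest_visible_token(position=20, window=3, n_layers=4))  # 20 - 2 * 4 = 12
```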
Self-Attention during Next Token Prediction Task
Let’s break it down into steps.
Self-Attention with KV-Cache
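The idea behind the KV-Cache can be sketched as follows: the Keys and Values of the tokens seen so far are kept in the cache, so at each generation step only the newly produced token acts as the Query. The single-head setup and the identity “projections” below are simplifications for illustration, not the actual model code.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
k_cache, v_cache = [], []              # KV-Cache: Keys/Values of all tokens seen so far

def generate_step(x_new):
    """One generation step: the new token is first added to the KV-Cache,
    then attention is computed with the new token as the only Query."""
    q, k, v = x_new, x_new, x_new      # identity "projections", purely for illustration
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)              # (cached_len, d)
    V = np.stack(v_cache)              # (cached_len, d)
    scores = K @ q / np.sqrt(d)        # (cached_len,) -- one row instead of an N x N matrix
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token only

for _ in range(5):                     # "generate" 5 tokens one by one
    out = generate_step(rng.normal(size=d))
print(out.shape)                       # (8,)
```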
Rolling Buffer Cache
Since we are using Sliding Window Attention (with size W), we don’t need to keep all the previous tokens in the KV-Cache, but we can limit it to the latest W tokens.
Rolling Buffer Cache: how it works
Let’s add the sentence “The cat is on a chair” one token at a time: for each new token, we add it to the cache and move the pointer forward.
When we reach the word ‘a’: since the window size is 4 and ‘a’ is the fifth token, it overwrites the position of ‘The’.
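A minimal sketch of the rolling buffer, assuming a window of W = 4 and storing the words themselves instead of their Key/Value vectors for readability:

```python
class RollingBufferCache:
    """The i-th token is stored at position i % window, overwriting the oldest entry."""
    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0                                  # total tokens seen so far

    def add(self, token):
        self.slots[self.count % self.window] = token    # the "pointer" is count % window
        self.count += 1

cache = RollingBufferCache(window=4)
for word in ["The", "cat", "is", "on", "a", "chair"]:
    cache.add(word)
    print(cache.slots)
# ...
# ['The', 'cat', 'is', 'on']     <- buffer is full after the 4th token
# ['a', 'cat', 'is', 'on']       <- the 5th token 'a' overwrites 'The'
# ['a', 'chair', 'is', 'on']     <- 'chair' overwrites 'cat'
```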
Pre-fill and chunking
When generating text using a Language Model, we use a prompt and then generate tokens one by one using the previous tokens. When dealing with a KV-Cache, we first need to add all the prompt tokens to the KV-Cache so that we can then exploit it to generate the next tokens.
Since the prompt is known in advance (we don’t need to generate it), we can prefill the KV-Cache using the tokens of the prompt. But what if the prompt is very big? We could add one token at a time, but this would be time-consuming; or we could add all the tokens of the prompt at once, but then the attention matrix (which is N × N) may be very big and not fit in memory.
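For instance, with a hypothetical prompt of 8,000 tokens, adding all of them at once would mean materializing an 8,000 × 8,000 attention matrix, i.e. 64 million scores per head, per layer.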
The solution is to use pre-filling and chunking: we divide the prompt into chunks of a fixed size equal to W (except possibly the last one), where W is the size of the sliding window of the attention.
Imagine we have a large prompt, with a sliding window size of W = 4.
For simplicity, let’s pretend that each word is a token.
Prompt: “Can you tell me who is the richest man in history”
Pre-fill and chunking: first chunk
At every step, we calculate the attention using the tokens of the KV-Cache plus the tokens of the current chunk as Keys and Values, while using only the tokens of the incoming chunk as Queries. During the first step of pre-fill, the KV-Cache is initially empty.
After calculating the attention, we add the tokens of the current chunk to the KV-Cache. This is different from token generation, in which we first add the previously generated token to the KV-Cache and then calculate the attention. We will see later why.
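A minimal sketch of the chunking step, assuming (as above) that each word is a token and W = 4:

```python
def split_into_chunks(prompt_tokens, window):
    """Split the prompt into chunks of size `window`; the last chunk may be shorter."""
    return [prompt_tokens[i:i + window] for i in range(0, len(prompt_tokens), window)]

prompt = "Can you tell me who is the richest man in history".split()
for chunk in split_into_chunks(prompt, window=4):
    # At each pre-fill step:
    #   Queries     = the tokens of the current chunk
    #   Keys/Values = the tokens already in the KV-Cache + the tokens of the current chunk
    # Only after computing the attention is the chunk added to the KV-Cache.
    print(chunk)
# ['Can', 'you', 'tell', 'me']
# ['who', 'is', 'the', 'richest']
# ['man', 'in', 'history']
```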
Pre-fill and chunking: second chunk
Prompt: “Can you tell me who is the richest man in history”