Sliding Window Attention
Before we jump into sliding window attention, let’s review.
- Self-Attention
- Receptive Field
Self-Attention
Self-Attention allows the model to relate words to each other. Imagine we have the following sentence: “The cat is on a chair”
Here I show the product of the Q and K matrices before we apply the softmax.
After applying the causal mask, we apply the softmax, which rescales the remaining values in each row so that the row sums up to 1.
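As a concrete illustration, here is a minimal NumPy sketch of the computation just described: the Q·Kᵀ scores, the causal mask, and the row-wise softmax. The 6 × 8 toy embeddings and the use of the same vectors for Q, K and V are simplifications for readability, not part of the original explanation.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Scores = Q @ K^T / sqrt(d), causal mask, row-wise softmax, weighted sum of V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # (seq_len, seq_len)
    # Causal mask: a token may only attend to itself and the tokens before it.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -np.inf
    # Softmax row by row: each row becomes a probability distribution summing to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# "The cat is on a chair" -> 6 tokens with toy 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
out = causal_self_attention(x, x, x)   # using the embeddings directly as Q, K, V
print(out.shape)                       # (6, 8): one output vector per token
```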
Now, let’s look at sliding window attention.
Let’s apply sliding window attention to the same sentence, with a sliding window size of 3.
After applying the softmax, all the -Infinity values have become 0, while the other values in each row are rescaled so that they sum up to one. The output of the softmax can be thought of as a probability distribution.
The output of the Self-Attention is a matrix of the same shape as the input sequence, but where each token now captures information about other tokens according to the mask applied. In our case, the last token of the output captures information about itself and the two preceding tokens.
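Here is a minimal sketch of the corresponding mask, assuming the same 6-token sentence and W = 3: token i may attend only to itself and the two tokens before it, and every other position would receive -Infinity before the softmax, exactly as with the causal mask.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """True where attention is allowed: token i sees tokens j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Window size W = 3 on the 6-token sentence "The cat is on a chair".
print(sliding_window_causal_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```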
With a sliding window size W = 3, every layer adds information about (W - 1) = 2 more tokens.
This means that after N layers, information has flowed across a span on the order of W × N tokens.
Sliding Window Attention: details
- Reduces the number of dot products to perform, which improves performance during training and inference.
- Sliding window attention may lead to a degradation in the quality of the model, as some “interactions” between tokens will not be captured.
- The model mostly focuses on the local context, which, depending on the size of the window, is enough for most cases.
This makes sense if you think about a book: the words in a paragraph in chapter 5 depend on the paragraphs in the same chapter but may be totally unrelated to the words used in chapter 1.
- Sliding window attention can still allow one token to attend to tokens outside the window, by a reasoning similar to the receptive field in convolutional neural networks.
Receptive field in CNNs
This feature depends directly on 9 features of the previous layer, but indirectly on all the features of the initial layer, since each feature of the intermediate layer depends on 9 features of the layer before it. This means that a change in any of the features of layer 1 will influence this feature as well.
The information flow in sliding window attention is very similar to the receptive field of a CNN.
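The same reasoning gives a simple rule of thumb, sketched below under the assumption described above (each layer lets information travel W - 1 positions further back): after N layers, a token can be influenced by tokens up to (W - 1) × N positions away.

```python
def earliest_visible_token(position, window, n_layers):
    """Index of the earliest token whose information can reach `position`
    after n_layers of sliding window attention with the given window size."""
    return max(0, position - (window - 1) * n_layers)

# With W = 3 and N = 4 layers, token 20 can be influenced by tokens as far back as index 12.
print(earliest_visible_token(position=20, window=3, n_layers=4))  # 20 - 2 * 4 = 12
```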
Self-Attention during Next Token Prediction Task
Let’s break it down into steps.
Self-Attention with KV-Cache
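The idea behind the KV-Cache can be sketched as follows: the Keys and Values of the tokens seen so far are kept in the cache, so at each generation step only the newly produced token acts as the Query. The single-head setup and the identity “projections” below are simplifications for illustration, not the actual model code.

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
k_cache, v_cache = [], []              # KV-Cache: Keys/Values of all tokens seen so far

def generate_step(x_new):
    """One generation step: the new token is first added to the KV-Cache,
    then attention is computed with the new token as the only Query."""
    q, k, v = x_new, x_new, x_new      # identity "projections", purely for illustration
    k_cache.append(k)
    v_cache.append(v)
    K = np.stack(k_cache)              # (cached_len, d)
    V = np.stack(v_cache)              # (cached_len, d)
    scores = K @ q / np.sqrt(d)        # (cached_len,) -- one row instead of an N x N matrix
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                 # attention output for the new token only

for _ in range(5):                     # "generate" 5 tokens one by one
    out = generate_step(rng.normal(size=d))
print(out.shape)                       # (8,)
```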
Rolling Buffer Cache
Since we are using Sliding Window Attention (with size W), we don’t need to keep all the previous tokens in the KV-Cache, but we can limit it to the latest W tokens.
Rolling Buffer Cache: how it works
Let’s add the sentence “The cat is on a chair” one token at a time: for each new token, we add it to the cache and move the pointer forward.
When we reach the word ‘a’: since the window size is 4 and ‘a’ is the fifth token, it overwrites the position of ‘The’.
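A minimal sketch of the rolling buffer, assuming a window of W = 4 and storing the words themselves instead of their Key/Value vectors for readability:

```python
class RollingBufferCache:
    """The i-th token is stored at position i % window, overwriting the oldest entry."""
    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0                                  # total tokens seen so far

    def add(self, token):
        self.slots[self.count % self.window] = token    # the "pointer" is count % window
        self.count += 1

cache = RollingBufferCache(window=4)
for word in ["The", "cat", "is", "on", "a", "chair"]:
    cache.add(word)
    print(cache.slots)
# ...
# ['The', 'cat', 'is', 'on']     <- buffer is full after the 4th token
# ['a', 'cat', 'is', 'on']       <- the 5th token 'a' overwrites 'The'
# ['a', 'chair', 'is', 'on']     <- 'chair' overwrites 'cat'
```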
Pre-fill and chunking
When generating text using a Language Model, we use a prompt and then generate tokens one by one using the previous tokens. When dealing with a KV-Cache, we first need to add all the prompt tokens to the KV-Cache so that we can then exploit it to generate the next tokens.
Since the prompt is known in advance (we don’t need to generate it), we can prefill the KV-Cache using the tokens of the prompt. But what if the prompt is very big? We could add one token at a time, but this would be time-consuming; or we could add all the tokens of the prompt at once, but then the attention matrix (which is N × N) may be very big and not fit in memory.
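For instance, with a hypothetical prompt of 8,000 tokens, adding all of them at once would mean materializing an 8,000 × 8,000 attention matrix, i.e. 64 million scores per head, per layer.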
The solution is to use pre-filling and chunking: we divide the prompt into chunks of a fixed size equal to W (except possibly the last one), where W is the size of the sliding window of the attention.
Imagine we have a large prompt, with a sliding window size of W = 4.
For simplicity, let’s pretend that each word is a token.
Prompt: “Can you tell me who is the richest man in history”
Pre-fill and chunking: first chunk
At every step, we calculate the attention using the tokens of the KV-Cache plus the tokens of the current chunk as Keys and Values, while using only the tokens of the incoming chunk as Queries. During the first step of pre-fill, the KV-Cache is initially empty.
After calculating the attention, we add the tokens of the current chunk to the KV-Cache. This is different from token generation, in which we first add the previously generated token to the KV-Cache and then calculate the attention. We will see later why.
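A minimal sketch of the chunking step, assuming (as above) that each word is a token and W = 4:

```python
def split_into_chunks(prompt_tokens, window):
    """Split the prompt into chunks of size `window`; the last chunk may be shorter."""
    return [prompt_tokens[i:i + window] for i in range(0, len(prompt_tokens), window)]

prompt = "Can you tell me who is the richest man in history".split()
for chunk in split_into_chunks(prompt, window=4):
    # At each pre-fill step:
    #   Queries     = the tokens of the current chunk
    #   Keys/Values = the tokens already in the KV-Cache + the tokens of the current chunk
    # Only after computing the attention is the chunk added to the KV-Cache.
    print(chunk)
# ['Can', 'you', 'tell', 'me']
# ['who', 'is', 'the', 'richest']
# ['man', 'in', 'history']
```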
Pre-fill and chunking: second chunk
Prompt: “Can you tell me who is the richest man in history”