LLMs May Not Need Dense Self Attention

Sink Tokens and the Sparsity of Attention Scores in Transformer Models

Building Blocks
13 min read · Nov 11, 2023
Image created using Bing

Introduction

Dense self-attention has been a mainstay in Transformer models ranging from BERT to the latest Llama-2 models. There have been many attempts to replace it with alternative mechanisms such as Sparse Attention and Hyena convolutions, but none have gained widespread adoption.

Due to its quadratic time complexity in sequence length, self-attention is a speed bottleneck for Transformer models. As LLMs become ubiquitous, highly efficient models that scale at low cost are paramount to making AI applications fiscally sensible.

Given that none of the alternatives to dense self-attention took off, I assumed that self-attention was irreplaceable and perfect the way it is. However, a month or so ago, a paper titled Efficient Streaming Language Models with Attention Sinks had us questioning the nature of self-attention.

Sink Tokens and Self-Attention

The paper tackles the problem of keeping LLMs coherent over long streams of text under the constraint of a finite context window. For a more comprehensive summary, please refer to my LinkedIn post below. One of their key insights was that LLMs assign very high attention scores to the first 4 tokens in the sequence, and this pattern is common across all layers barring the lower-most ones.

We visualize attention maps from all layers and heads of the Llama-2-7B model in Figure 2. We find that, beyond the bottom two layers, the model consistently focuses on the initial tokens across all layers and heads.

They dubbed these tokens “sink tokens” since they act as a sink for attention scores. Sink tokens may be a clear indication that, more often than not, a token does not need to attend to most of the other tokens in the sequence.

The origin of sink tokens can probably be attributed to the Softmax operation used to compute self-attention. One of Softmax’s limitations is that it forces all the attention scores to sum to 1, even when a token does not need to attend strongly to any other token. As a result, the model has to learn how to offload the excess attention score, and sink tokens are its means of doing so.
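To see why, here is a minimal PyTorch sketch (our own illustration, not from the paper): Softmax always produces weights that sum to 1, so even a query that is barely related to any key must park its attention mass somewhere.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

d = 16
query = torch.randn(1, d) * 0.01    # a query that is only weakly related to every key
keys = torch.randn(10, d)           # 10 arbitrary key vectors

logits = query @ keys.T / d ** 0.5  # scaled dot-product attention scores
weights = F.softmax(logits, dim=-1)

print(weights.sum())                # tensor(1.) -- Softmax cannot express "attend to nothing"
print(weights)                      # the mass has to land somewhere; in trained models it
                                    # tends to land on a designated sink token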

While the paper focused only on large decoder-only LLMs like Llama, MPT, etc., we want to build on it and validate the following hypotheses:

  1. Sink tokens are common in both encoder models like BERT and decoder models like GPT.
  2. Sink tokens exist in models ranging from a few million to many billions of parameters.
  3. Special tokens like [CLS], <eos>, etc. that are often added by a model’s tokenizer act as sink tokens.
  4. In the absence of sink tokens, the first token of the sequence is used as a sink token.

All of the code used to explore the above is made public here, in the form of a Jupyter Notebook.

Methodology

  • We use HuggingFace’s Transformers library to load open-source models and their corresponding tokenizers.
  • For Encoder models we load models from the class AutoModelForMaskedLM. We add a mask token at the end of each document for the model to predict.
  • For Decoder models we load models from the class AutoModelForCausalLM. We then use the generate function to generate one token using greedy sampling. We set the padding_side to left to ensure that we can generate text on a batch of size 8.
  • For both types of models, we enable the output_attentions flag to obtain the attention scores for each document in the batch and at each layer (a code sketch of this setup appears at the end of this section).
  • Attention scores are averaged across all heads to make it easier to visualize.
  • We keep track of the attention scores of all special tokens excluding the pad token. If a tokenizer does not add special tokens we treat the token at position 0 of the sequence as a special token.
  • Our input is a list of 8 documents of increasing sequence length. The documents contain around 1,000 words in total, and the maximum sequence length is between 430 and 490 tokens (depending on the tokenizer).
  • Each string starts with the letter “Z”. More on this later.
  • Each string is in English and sourced from a different place to ensure diversity. Our sources are:
  1. Short custom string.
  2. Twitter Post.
  3. Reddit Post about the NBA.
  4. Excerpt from a Nature Article.
  5. A short fictional story.
  6. Excerpt from a technical Pytorch doc.
  7. News article about the stock market.
  8. Wikipedia article about Taylor Swift.
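To make the setup concrete, here is a minimal sketch of the encoder-side pipeline with HuggingFace Transformers (the decoder-side version swaps in AutoModelForCausalLM and generate; the exact batching and plotting code lives in the linked notebook):

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name, output_attentions=True)

docs = ["Z This is a short custom string."]  # in practice, the batch of 8 documents
inputs = tokenizer(
    [doc + f" {tokenizer.mask_token}" for doc in docs],  # mask token appended for MLM
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = torch.stack(outputs.attentions)  # (layers, batch, heads, query, key)
attentions = attentions.mean(dim=2)           # average over heads -> (layers, batch, query, key)
print(attentions.shape)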

Encoder-Only Models

We’ll experiment with the following models:

  • Distilbert: 66M (million) parameters
  • Bert-base: 110M parameters
  • Roberta-Large: 355M parameters

Attention Heat Maps

We’ll visualize heat maps of these models for one of the documents in the batch. To make for easy viewing, we choose a document that’s not overly long. We observed common patterns across the heat maps of all three models and all documents in the batch. We encourage folks to use the notebook linked above to visualize the heat maps of the other docs not presented in this article.

DistilBert-Base-Uncased

Let’s look at the attention scores of the DistilBert model which has 6 layers.

Attention Heat Map of DistilBert-base (Image by Authors)

We see that:

  1. The first two layers show a strong distribution of attention scores along the diagonal, but the attention window is narrow: it doesn’t seem to span more than about 5 tokens, meaning that each token focuses only on its nearest neighbors.
  2. The [CLS] token, a special token used to indicate the start of a sequence, receives a high attention score in the first two layers.
  3. However, as we move to the higher layers, the attention scores along the diagonal start to wane and most of the scores become concentrated at the [SEP] token, another special token, which indicates the end of a sequence.
  4. Based on the color bar (legend), we also observe that the distribution of attention scores between the [CLS] token and the diagonal is more uniform in the lower layers than in the later ones. The darkest shade evolves as [0.3, 0.5, 0.8, 0.8, 0.6, 0.7] across the 6 layers, showing that the intermediate and later layers offload a large proportion of their attention scores to the [SEP] token.

Tokens with the highest attention

The example shown above is for a single document. Let’s validate that this pattern is common across all 8 documents in the batch. A histogram of how often each token receives the highest attention score can help with this.

If our hypothesis that special tokens act as sink tokens is accurate, we should observe that the [CLS] and [SEP] tokens receive the highest attention a disproportionate number of times.

The image should be read as: {token} receives the highest attention score {frequency} times.

Histogram of Tokens that received the highest attention in DistilBert

Exactly what we anticipated! To make sense of the numbers, remember that the total number of “highest attention” counts equals the sum, over all documents in the batch, of

Number of Tokens per Document * Number of Layers

For example, a 430-token document passed through DistilBert’s 6 layers contributes 430 * 6 = 2,580 counts. We also exclude [PAD] tokens from the frequency counts since they aren’t involved in the self-attention operation.
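As a rough sketch of how such counts can be computed from the head-averaged attention maps above (variable names carry over from the earlier snippet and are ours, not necessarily the notebook’s):

from collections import Counter

counts = Counter()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

attn_doc = attentions[:, 0]               # (layers, seq_len, seq_len) for document 0
for layer_attn in attn_doc:
    top_keys = layer_attn.argmax(dim=-1)  # for every query token, the key with the highest score
    for key_idx in top_keys.tolist():
        if tokens[key_idx] != tokenizer.pad_token:   # [PAD] is excluded from the counts
            counts[tokens[key_idx]] += 1

print(counts.most_common(5))              # special tokens like [SEP]/[CLS] dominate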

Magnitude of Attention Scores for Special Tokens

To understand the magnitude of attention being redirected to special tokens, we can create a histogram. For each layer, we calculate the average attention score that the tokens in a sequence assign to the special tokens. Since we have 8 documents in our batch, our frequencies should sum up to 8.
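Continuing with the same illustrative variables, the per-layer average attention flowing into [SEP] can be computed roughly like this:

sep_position = (inputs["input_ids"][0] == tokenizer.sep_token_id).nonzero()[0].item()

# For each layer: mean over all query tokens of the attention paid to the [SEP] column.
avg_attention_to_sep = attn_doc[:, :, sep_position].mean(dim=-1)  # shape: (layers,)
for layer, score in enumerate(avg_attention_to_sep.tolist()):
    print(f"layer {layer}: {score:.2f}")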

Average Attention Score Assigned to [SEP] at each layer in DistilBert

We see that the [SEP] token receives a negligible amount of attention at first, but the amount (x-axis) gradually increases before taking a massive dive at the last layer. The [SEP] token drains away between 40–60% of the attention scores in the intermediate layers!

Average Attention Score Assigned to [CLS] at each layer in DistilBert

For [CLS] we see the opposite: the first two layers assign it a decent amount of attention, and it goes down from there. Both patterns are reflected in our heat map.

What’s happening in the last layer?

We see that neither the [CLS] nor the [SEP] token gets a lot of attention in the last layer. We’ll focus on document 5; documents 2–8 all show a similar pattern in terms of where the attention is being allotted.

The Period Token Acts as a Sink in Distilbert

You might need to zoom in on this image, but what you’ll see is that a large portion of the attention scores is allocated to the period (“.”) token. Now we know why the period token occupied the third position in our previous histogram.

What’s interesting is that the period character doesn’t carry much information besides marking the end of a sentence. One can argue that this is precisely why it’s primed to act as a sink.

Bert-base-uncased

Let’s now view similar visuals for a Bert Model with 12 layers.

Attention Heat Maps

Attention Heat Map of Bert (Image by Authors)

Not much to add here; the attention pattern is similar to that of DistilBert.

Tokens with the highest attention

Histogram of Tokens that received the highest attention in Bert

Once again the same as DistilBert.

Period Token Acting as a Sink

The Period Token acts as a Sink in Bert

Roberta-large

Let’s examine the same visuals for Roberta, where we observe a slightly different pattern. Roberta prefers offloading most of the attention to the period token at almost all layers, if one exists. In the absence of a period token, Roberta offloads the attention to the <s> token, which is analogous to the [CLS] token.

For example, let’s look at the heat maps of document 3. Take a deep breath, this is going to be a long image of 24 layers 😅. You’ll see 4 vertical stripes corresponding to the “.” token starting from the intermediate layers. The <s> token is prominent in the first few layers.

Attention Heat Map of Roberta

Tokens with Highest Attention

Histogram of Tokens that received the highest attention in Roberta

Decoder Only Models

Generative models like GPT, Llama, etc. are decoder-only transformer models. We’ll examine similar visuals for these models too.

OPT-125M

Let’s first look at the OPT 125-million-parameter model created by Meta. Unlike encoder models like Roberta, OPT doesn’t have a special token to signify the end of a document, but it does have one to signify the start of a document.

Attention Heat Maps

Attention Heat Map of OPT
  • Across all layers, a majority of the attention scores are offloaded to the special token </s>, which indicates the start of the sequence.
  • The diagonal region, with a short context window of 2–3 tokens, is the second most attention-dense region.

Histogram of Tokens with Highest Attention

Again, overwhelming evidence of </s> being used as a sink for attention scores. Moreover, this model doesn’t seem to use the period token as a sink token.

Histogram of Tokens that received the highest attention for OPT

Magnitude of Attention Scores for Special Tokens

Let’s examine the average attention score the </s> token receives in OPT.

Average Attention Score assigned to </s> in OPT

The </s> token drains away anywhere between 55–80% of the attention scores in layers 3–11 😮! Suppose our sequence is 20 tokens long and the sink token gets 80% of the attention score; if the remaining 20% were spread uniformly, each of the other 19 tokens would get roughly 1% (0.20 / 19 ≈ 0.0105).

If the remaining attention scores were distributed with a skew rather than uniformly, some tokens would receive even less than 1% of the attention score, strengthening the case for sparse attention.

GPT-2 Large

GPT-2 Large has 774 million parameters and was developed by OpenAI. GPT-2’s tokenizer is interesting because it doesn’t add any special start- or end-of-sequence tokens to the input.

However, remember that we started all of our documents with the letter Z followed by a space. The letter Z (as well as z 😄) doesn’t occur anywhere else in any of the 8 documents. You might be able to predict what’s coming next by now.

GPT-2 Large has 36 layers; to simplify the visual, we’ll plot the attention scores of every third layer.

Attention Heat Maps

Attention Heat Map of GPT2

One major difference is that the prominence of the diagonal remains strong well into the intermediate layers. However, most of the attention goes to our starting token, Z, as we’ll observe in the image below.

Histogram of Tokens with Highest Attention

Histogram of Tokens that received the highest attention for GPT2

Falcon-1B

Falcon-1B was created by the Technology Innovation Institute (TII) and has 1 billion parameters. Similar to GPT-2, Falcon’s tokenizer doesn’t add any start/end-of-sequence special tokens either. Since Falcon has 24 layers, we’ll visualize every alternate layer.

Attention Heat Map of Falcon

We once again observe that Z acts as a sink token starting from layer 5.

Histogram of Tokens with Highest Attention

Histogram of Tokens that received the highest attention for Falcon

Wrap Up

The efficient streaming paper we linked at the start of the article highlights similar patterns in larger models like Llama, MPT, etc. As a reminder, though we highlighted the attention maps of only the 2nd document in our batch, all the other documents show similar patterns.

Let’s try to draw some conclusions from all of these visuals.

  1. Special tokens tend to act as sink tokens in both encoder and decoder-only transformer models.
  2. In the absence of special tokens, decoder models tend to use the first token in the sequence as a sink token.
  3. Encoder models also use tokens corresponding to punctuation marks as sink tokens.
  4. The diagonal, which corresponds to the 2–3 tokens surrounding a token, is the second most attention-dense region.
  5. The lower-most layers have their attention scores more spread out. The rest of the layers offload most of their attention scores to sink tokens.
  6. This phenomenon remains consistent regardless of model size.
  7. The patterns persist across documents of different lengths, styles, and sources.
  8. The average magnitude of attention scores allotted to sink tokens is sizeable across all models, meaning it’s quite likely that the other tokens receive an attention score close to 0.

Implications

A subtlety that can be missed is that the sink tokens are static, i.e., irrespective of the document and the token for which we’re computing attention, the same set of tokens acts as sinks.

An argument could be made for dense self-attention if each token assigned a unique token to act as its sink; for example, the 5th token might pick the 2nd token as its sink, the 6th token the 5th, and so on. On the contrary, we’ve seen all tokens assign the same token as their sink.

From our viewpoint, this implies that either dense self-attention isn’t useful or that our current training practices don’t allow models to learn attention scores that are more spread out.

In the case of the former, it means that we can drastically reduce the run-time and computational complexity of Transformer models since we won’t be held to the O(N²) complexity of dense self-attention.

In the case of the latter, it means that more effective training practices can potentially lead to more powerful models without having to increase their size. What if we’ve only been using 10% of the potential of GPT-2, GPT-3, LLama, etc. until now? Just imagine the possibilities!

Next Steps

Some promising next steps to further this area of study are:

  1. Evaluate existing models by creating an attention mask, where only the sink token(s) can be attended to. We might find that sink tokens are all we need 😄. Our initial hunch is that the drop in performance of models might not be as large as one might expect.
  2. Train models with dynamic attention masks in all layers barring the lower-most. In this setting, the only tokens that can be attended to are the 1st token of the sequence, all special tokens, and the tokens within a distance of 2–3 positions of the token we’re computing attention for (see the sketch after this list). Each token would then attend to only a small number of tokens, drastically reducing computational cost.
  3. Throw in random sparse attention to the above.
  4. Experiment with alternative methods to compute attention scores without using the Softmax operation. Such mechanisms should allow all attention scores to be 0.
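As a purely illustrative sketch of the mask described in step 2 (the function name and window size are our own choices), each query is allowed to attend only to the first token, the special tokens, and a small local neighborhood:

import torch

def sink_plus_local_mask(seq_len, special_positions, window=3):
    """Boolean mask where True means 'query may attend to key'. Purely illustrative."""
    positions = torch.arange(seq_len)
    # local window of +/- `window` tokens around each query position
    mask = (positions[None, :] - positions[:, None]).abs() <= window
    mask[:, 0] = True                 # the first token always stays visible (sink)
    for pos in special_positions:     # special tokens also stay visible
        mask[:, pos] = True
    return mask

mask = sink_plus_local_mask(seq_len=10, special_positions=[9], window=2)
print(mask.int())
# Each token now attends to O(window + num_special_tokens) positions instead of all N,
# pushing the attention cost from O(N^2) toward O(N).

For decoder models, this mask would additionally be combined with the usual causal mask.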

That’s all for this article! We hope that you learned something new. Thanks for reading, if you have any thoughts, questions, or ideas please drop a comment or reach out to the LinkedIn profile linked above. Until the next time, take care and be kind.

Update: Here’s Part 2 of this series where we show that sparse sliding window attention is nearly on par with dense self attention.

Please cite this article as:

Pramodith (2023, November 11). LLMs May Not Need Dense Self Attention. Medium. [https://medium.com/@buildingblocks/llms-may-not-need-dense-self-attention-1fa3bf47522e]

