LongNet: To 1 Billion Tokens and Beyond

Kourosh Sharifi
7 min read · Aug 1, 2023


LongNet is a new variant of the Transformer model that enables the modeling of extremely long sequences of text, with a scalable sequence length of more than 1 billion tokens. It achieves this by employing sparse attention, dilated attention, and a linearly increasing number of attention heads. At the same time, it maintains performance on shorter sequences, with linear computation complexity and a logarithmic dependency between any two tokens in a sequence.

LongNet compared to previous LLMs — from the paper

This article is a continuation of kAi Sabanci’s post on LongNet, where I briefly wrote about the model.

In the upcoming sections, you will get a better understanding of all the concepts mentioned above, along with related links for more information.

What is LongNet?

LongNet, developed by Microsoft Research (Ding et al.), is a ground-breaking feat in the fields of Natural Language Processing (NLP) and Machine Learning (ML) because it addresses the need for transformer models that can handle longer sequences of input text.

Existing methods for modeling long sequences have either been computationally complex or limited in their expressiveness. LongNet addresses this challenge by introducing dilated attention, which expands the attentive field exponentially as the distance grows. This allows LongNet to scale the sequence length to more than 1 billion tokens without sacrificing performance on shorter sequences.

Complexity comparison — from paper

LongNet also uses a fixed pattern of sparse attention, which restricts each query’s access to a subset of keys and values. This reduces computational overhead and helps LongNet model long sequences more efficiently.

LongNet has linear computation complexity and a logarithmic dependency between any two tokens in a sequence. Additionally, LongNet can be used as a distributed trainer that parallelizes the training of a sequence across multiple GPU devices.
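
To get a feel for what “linear instead of quadratic” means at this scale, here is some back-of-the-envelope arithmetic. The segment length below is my own illustrative choice, not a value from the paper:

```python
# Rough comparison of attention score counts (illustrative numbers only)
N = 1_000_000_000   # sequence length: 1 billion tokens
w = 4096            # hypothetical average attention span per query under dilated attention

dense_scores   = N * N      # vanilla attention: every token attends to every other token
dilated_scores = N * w      # dilated attention: roughly w keys per query

print(f"dense:   {dense_scores:.1e} score computations")   # 1.0e+18
print(f"dilated: {dilated_scores:.1e} score computations")  # 4.1e+12
```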

What is a Transformer?

The transformer model is a type of neural network architecture that is used for NLP tasks. It is based on the attention mechanism, which allows the model to learn long-range dependencies between words in a sequence.

The model was popularized by the paper Attention is All You Need by Vaswani et al. (2017), which demonstrated its potential for a wide range of NLP tasks. The researchers showed that their new Transformer model, based solely on attention mechanisms, outperformed the previous sequence models of the time.

The Transformer Architecture — from Lil’Log

For those who wish to learn more about the transformer model, check out this article by Jay Alammar called The Illustrated Transformer, where he explains everything in greater detail.

The Narrated Transformer Language Model — Jay Alammar

LongNet vs. Other Transformers

One might ask:

“How is LongNet different from the rest of the Transformer models? How does it outperform the rest?”

To answer this question, we can take a look at the two types of attention mechanism that the researchers incorporated into this system:

  1. Dilated Attention
  2. Sparse Attention

Other Transformer models, such as GPT, the Sparse Transformer, the Reformer, and the Recurrent Memory Transformer (RMT), have limitations in terms of their ability to scale sequence length or their computational complexity. For example, GPT has a maximum sequence length of 2048 tokens, while Sparse Transformer and Reformer have a maximum sequence length of 16,384 tokens. RMT has a maximum sequence length of 262,144 tokens, which is larger than the other models, but still significantly smaller than LongNet.

Comparison between LongNet and dense Transformers — from the paper

With these concepts introduced, let’s understand what Attention is, and see how it is implemented in LongNet.

What is Attention?

Attention is a mechanism that allows a neural network to focus on specific parts of an input sequence. This is important for many tasks, since the meaning of a sentence can depend on the relationships between words that are far apart.

Attention mechanisms have been shown to be very effective for a variety of NLP tasks, including machine translation, text summarization, and question answering. Some of the most influential papers in this area are Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014), which introduced attention for translation, and Attention is All You Need (Vaswani et al., 2017), which introduced multi-head self-attention, among many others.
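
To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the basic building block all of these models share. It is my own simplified illustration, not code from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Plain (dense) attention: every query can look at every key."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                              # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                          # weighted sum of the values

# Toy usage: 5 tokens, each represented by a 4-dimensional vector
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)
```

Because every query scores every key, the cost of this dense version grows quadratically with the sequence length, which is exactly the bottleneck LongNet attacks.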

In his video Attention Mechanism in a Nutshell, Mohammad Namvarpour explains the inner workings of this mechanism when it comes to deep learning.

Attention Mechanism In a nutshell — Halfling Wizard

Attention in LongNet

As mentioned earlier, LongNet uses dilated attention as its core, which reduces the computation complexity from quadratic to linear. Dilated attention allows the model to expand the attentive field exponentially as the distance grows, making it a great option for modeling very long sequences.

Splitting the inputs into segments and sparsification of said segments — from the paper

Dilated attention is a type of attention mechanism that allows a model to focus on different parts of an input sequence, even if those parts are far apart. The idea is borrowed from dilated convolutions, which skip over some of the input elements at a fixed interval. There are three main steps in the process (a minimal sketch follows the list):

  1. The queries, keys, and values are split into segments of equal length along the sequence dimension.
  2. Each segment is sparsified by keeping only every r-th row, where r is the dilation rate.
  3. Attention is computed within each sparsified segment in parallel, and the results are scattered back to their original positions and concatenated into the output.
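
Here is a rough NumPy sketch of one dilated-attention pass for a single (segment length w, dilation rate r) configuration. The function and helper names are my own, and this is a simplification: the paper mixes several (w, r) pairs and multiple heads so that every position is covered, and applies causal masking for language modeling.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_attention(Q, K, V, w=8, r=2):
    """Minimal single-head dilated attention sketch (not the official implementation).

    Q, K, V: arrays of shape (N, d). Assumes N is divisible by w, and w by r.
    """
    N, d = Q.shape
    O = np.zeros_like(V)
    for start in range(0, N, w):                 # 1. split the sequence into segments of length w
        idx = np.arange(start, start + w, r)     # 2. sparsify: keep every r-th position in the segment
        q, k, v = Q[idx], K[idx], V[idx]
        scores = q @ k.T / np.sqrt(d)            # 3. standard attention inside the sparsified segment
        O[idx] = softmax(scores) @ v             #    scatter the results back to their positions
    return O

# Toy usage: 16 tokens, model dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(16, 8)) for _ in range(3))
print(dilated_attention(Q, K, V, w=8, r=2).shape)  # (16, 8)
```

In this single-configuration sketch, positions skipped by the dilation receive no output; in LongNet the outputs of multiple configurations with geometrically increasing w and r are combined, which is what makes the attentive field grow exponentially with distance.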

In LongNet, dilation expands the attentive field exponentially as the distance between tokens grows, which reduces the computation complexity from quadratic to linear. Dilated attention is also a drop-in replacement for standard attention: it can be transformed into dense attention, so it seamlessly supports off-the-shelf Transformer optimizations such as kernel fusion, quantization, and distributed training.

Comparison between Attention Models — from the paper

What are Tokens?

A token is a unit of text that is used to represent a word, phrase, or other meaningful element in NLP. Tokens are used to break down text into smaller, more manageable units that a language model can process. A simple way to think about tokens is as the building blocks of a natural language, such as English or Turkish. Just as words are the building blocks of sentences, tokens are the building blocks of text.
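
As a quick illustration, this is what tokenization looks like with the GPT-2 tokenizer from the Hugging Face transformers library (assuming the package is installed and the tokenizer files can be downloaded; the exact splits depend on the tokenizer you choose):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "LongNet scales to one billion tokens."

tokens = tokenizer.tokenize(text)   # subword pieces
ids = tokenizer.encode(text)        # the integer IDs a model actually consumes

print(tokens)  # roughly ['Long', 'Net', 'Ġscales', 'Ġto', 'Ġone', 'Ġbillion', 'Ġtokens', '.']
print(ids)
```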

Talking Tokenization by Let’s Talk Text

But 1 Billion Tokens?!

LongNet uses a combination of techniques such as gradient checkpointing, mixed precision training, and model parallelism to reduce the memory footprint of the model and enable training on multiple GPUs. Parallelizing the training helps with the memory limits of GPUs, since a single device could never scale up to millions of tokens on its own. The authors also describe the distributed algorithm they used for training, which partitions the sequence dimension across devices (a conceptual sketch follows). The combination of all of these methods has enabled the team to reach a spectacular number of 1 billion tokens (and more) for the input sequence.
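
The sketch below simulates the core idea of this kind of sequence parallelism on a single machine. It is only a conceptual illustration under my own simplifications: in real training each shard would live on its own GPU and the gather step would be a collective operation (e.g. an all-gather), and LongNet only gathers keys and values for the attention spans that actually cross device boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, n_devices = 16, 8, 2
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

# 1. Split the sequence along its length, one shard per "device".
q_shards = np.split(Q, n_devices)
k_shards = np.split(K, n_devices)
v_shards = np.split(V, n_devices)

# 2. When the attention span is longer than a shard, keys and values are gathered
#    across devices, while queries stay local to their device.
all_K = np.concatenate(k_shards)
all_V = np.concatenate(v_shards)

# 3. Each device computes attention for its local queries against the gathered keys/values.
outputs = []
for q in q_shards:
    scores = q @ all_K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    outputs.append(weights @ all_V)

O = np.concatenate(outputs)  # the concatenated local outputs cover the full sequence
print(O.shape)               # (16, 8)
```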

Distributed training of LongNet on 2 GPUs — from the paper

Why is LongNet Important?

What this model has attained is extremely impressive compared to other LLMs: it expands the length of its input sequence to more than 1 billion tokens with low computation and memory complexity, a limit far beyond what older models could reach. Its main distinguishing factors are:

  1. Dilated Attention
  2. Linear Computation Complexity
  3. Distributed Training
  4. Drop-in Replacement
  5. Applicability for Many Tasks

This paper plays a crucial role in future advancements of language models because it addresses a critical demand for scaling sequence length in Large Language Models (LLMs). Existing methods struggle with either computational complexity or model expressivity, resulting in limited sequence lengths or mediocre outputs. If applied correctly, many industry-level applications could be built on LongNet’s exceptional ability to handle lengthy inputs, such as processing thousands of documents simultaneously in real time, more accurate weather forecasting from numerous streams of data, and condensing large amounts of data into digestible information with low latency, among many others.

In his video, David Shapiro explains why LongNet has the potential to be a stepping stone for creating Artificial General Intelligence (AGI) in the upcoming years, as it has a promising future. Feel free to check out his video for some interesting takes on the future of LLMs.

Microsoft LongNet: One BILLION Tokens LLM — David Shapiro

Conclusion

In this article, we went over what LongNet is, the foundations of its architecture (transformer, attention, token), and its outstanding results.

What do you think about this paper? Will we be able to input the content of the whole internet as a single prompt one day? And if so, which company would be able to get there first? If Microsoft achieves this, then maybe SkyNet would be a better name for its next paper.


You can find me on LinkedIn or GitHub.
