Microsoft’s LongNet Scales Transformer to One Billion Tokens

Synced · Published in SyncedReview · 3 min read · Jul 11, 2023

Scaling sequence length is of paramount importance for large language models, as it brings significant benefits. These include a larger memory and receptive field for more effective communication with humans, more intricate causal and reasoning pathways for leveraging training data, and the potential to overcome the limitations of in-context learning.

In their recent paper LongNet: Scaling Transformers to 1,000,000,000 Tokens, a Microsoft research team introduces LongNet, a Transformer variant that successfully scales sequence length to more than one billion tokens while maintaining strong performance and linear computational complexity.

The challenge in scaling up sequence length is to strike a balance between computational complexity and model expressivity. This work's solution is LongNet, which replaces the attention of vanilla Transformers with dilated attention, a novel component that splits the input query-key-value pairs into equal segments of a given segment length. Each segment is then sparsified along the sequence dimension, and the sparsified segments are fed into the attention computation in parallel…
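The segment-and-sparsify idea can be illustrated with a minimal sketch. The snippet below assumes single-head attention, a sequence length divisible by the segment length, and a sequential loop over segments; the function name `dilated_attention` and the parameter names `segment_len` and `dilation_rate` are illustrative, not taken from the authors' implementation.

```python
# Minimal sketch of dilated attention (illustrative, not LongNet's code).
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_len, dilation_rate):
    """q, k, v: tensors of shape (seq_len, dim). Returns a (seq_len, dim) output."""
    seq_len, dim = q.shape
    out = torch.zeros_like(q)
    # Split the sequence into equal segments of length `segment_len`.
    for start in range(0, seq_len, segment_len):
        idx = torch.arange(start, min(start + segment_len, seq_len))
        # Sparsify the segment along the sequence dimension by keeping
        # every `dilation_rate`-th position.
        idx = idx[::dilation_rate]
        qs, ks, vs = q[idx], k[idx], v[idx]
        # Standard scaled dot-product attention on the sparsified segment.
        scores = qs @ ks.T / dim ** 0.5
        seg_out = F.softmax(scores, dim=-1) @ vs
        # Scatter the results back to their original positions; positions
        # dropped by the sparsification stay zero in this simplified sketch.
        out[idx] = seg_out
    return out
```

In LongNet itself, multiple segment-length and dilation-rate configurations are mixed so that every position is covered, and the sparsified segments are computed in parallel, which is what keeps the overall cost linear in sequence length; the loop above is sequential purely for readability.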


Synced | AI Technology & Industry Review: syncedreview.com | Newsletter: http://bit.ly/2IYL6Y2 | Share My Research: http://bit.ly/2TrUPMI | Twitter: @Synced_Global