What Is RetNet? Could It Challenge the Dominance of the Transformer?

Microsoft recently published an article introducing a new architecture for large language models (LLMs) called the Retentive Network. The article claims that this new architecture will be the successor to the Transformer (see here). Will this new architecture revolutionize AI the same way the Transformer did?

Hugo Fernandez
Publicis Sapient France
5 min read · Nov 26, 2023


TLDR

The main advantage of this technology is that it solves the “impossible triangle” of AI, whose three vertices are low inference cost, high performance, and parallelized training. One of the major drawbacks of Transformers is their enormous cost during inference, which makes deploying them at scale significantly more complex and expensive.

The “impossible triangle” of AI from the original article

The Retentive Network, because the same computation can be expressed both by a recursive formula and by a formula that can be run in parallel, retains the best of both worlds: high performance thanks to parallelized training, and a much lower inference cost thanks to its recursive form.

The return of RNNs?

The idea here is to combine the advantages of both worlds, RNNs and Transformers.

Dual form of the same mathematical transformation

The core concept of this mechanism is a dual (or even triple) form of the same calculation to achieve a representation that is:

  • Parallelizable: a form of the calculation that is parallelizable for training and is very similar to the classical self-attention mechanism.
Here, Q corresponds to the query matrix, K to the key matrix, and V to the value matrix. We will cover the D matrix in the next section.
  • Recursive: the same calculation, but in a recursive form that allows for fast inference and is similar to RNNs.
Here, |x| corresponds to the input sequence length, Qn, Kn, and Vn are the same matrices as above taken at the n-th timestep, and γ is the decay factor from which the D matrix is built (see next section). A minimal sketch of these two forms is given right after this list.

  • Chunkwise: a representation that divides the input sequence into several chunks, encoded in parallel within each chunk and recurrently between chunks, in order to handle very long input sequences. This form is mainly used to accelerate training, so I won’t go into detail on it.
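
To make this concrete, here is a minimal NumPy sketch (the names, shapes, and values are mine, and it deliberately omits the rotary position factors, gating, and normalization steps used in the paper) showing that the parallel form and the recurrent form produce the same result:

```python
import numpy as np

def decay_matrix(seq_len, gamma):
    # D[n, m] = gamma ** (n - m) for n >= m, and 0 above the diagonal (lower triangular)
    n = np.arange(seq_len)[:, None]
    m = np.arange(seq_len)[None, :]
    return np.where(n >= m, gamma ** (n - m), 0.0)

def retention_parallel(Q, K, V, gamma):
    # Parallel (training) form: (Q @ K.T, element-wise scaled by D) @ V
    D = decay_matrix(Q.shape[0], gamma)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    # Recurrent (inference) form: S_n = gamma * S_{n-1} + K_n^T V_n, then o_n = Q_n S_n
    S = np.zeros((K.shape[1], V.shape[1]))  # fixed-size state, independent of sequence length
    outputs = []
    for q_n, k_n, v_n in zip(Q, K, V):
        S = gamma * S + np.outer(k_n, v_n)
        outputs.append(q_n @ S)
    return np.stack(outputs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
print(np.allclose(retention_parallel(Q, K, V, 0.97),
                  retention_recurrent(Q, K, V, 0.97)))  # True: same result for the same input
```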

In short, the goal is to express the same mathematical transformation (same result for the same input) in different formats while staying close to the powerful concept of self-attention.

How is this dual representation achieved? What are the differences with Transformers?

To do this, it is necessary to abandon the softmax function inside the attention head, since it is what prevents a recurrent formulation. This step is replaced by two sub-steps.

First, a Hadamard product (element-wise matrix multiplication) with a new matrix D, followed by a Group Norm operation. The peculiarity of this matrix D is that it can be obtained recursively. D is lower triangular, indexed by the token positions n (rows) and m (columns), and its entry at position (n, m) is gamma raised to the power of n - m (the further down and to the left, the higher the power). This gamma is a number close to but less than one, causing an exponential decay in the matrix.

Illustration of the D matrix
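
To give a concrete picture, here is a quick NumPy sketch of what D looks like (γ = 0.9 and a sequence length of 5 are my own illustrative choices):

```python
import numpy as np

gamma, seq_len = 0.9, 5
n = np.arange(seq_len)[:, None]
m = np.arange(seq_len)[None, :]
D = np.where(n >= m, gamma ** (n - m), 0.0)
print(np.round(D, 3))
# Lower triangular, with powers of gamma decaying away from the diagonal:
# [[1.    0.    0.    0.    0.   ]
#  [0.9   1.    0.    0.    0.   ]
#  [0.81  0.9   1.    0.    0.   ]
#  [0.729 0.81  0.9   1.    0.   ]
#  [0.656 0.729 0.81  0.9   1.   ]]
```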

This matrix is applied to the result of QKᵀ in place of the softmax function. Then, a Group Norm normalization is applied. The intuition here is to apply the normalization at the group level rather than at the full-sequence level (for a more detailed explanation, see the original article).

High level schema of the transformation

In a complete architecture with multiple heads and layers, each head gets a different gamma, larger or smaller but always close to 1. The closer it is to 1, the less the matrix D penalizes distant tokens, which specializes the heads for near or far context. To avoid exploding gradients and to add non-linearity, a normalization step called GroupNorm is added, along with three normalization factors (on the matrix QKᵀ, on the matrix D, and on the result of the Hadamard product). The rest of the architecture is quite similar to Transformers.
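
As a rough illustration, here is how a per-head decay could be set up (the exact schedule below, γ = 1 - 2^(-5 - head_index), is the one commonly cited for the paper’s setup, so treat the specific values as an assumption):

```python
import numpy as np

num_heads = 8
# One decay rate per head, all close to but below 1: the closer gamma is to 1,
# the less D penalizes distant tokens, so that head keeps more long-range context.
gammas = 1.0 - 2.0 ** (-5.0 - np.arange(num_heads))
print(np.round(gammas, 6))         # [0.96875, 0.984375, ..., 0.999756]
print(np.round(1 / (1 - gammas)))  # rough effective window per head: 32, 64, ..., 4096 tokens
```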

However, is this architecture effective? Is it a good approximation of the softmax function?

Performance analysis

Empirically, according to the paper, the performance of the RetNet architecture is quite satisfactory. For models of similar size (7B) and an 8k sequence length, RetNet’s inference speed is 8.4 times faster, GPU memory requirements are reduced by a factor of 3, and it is 7 times faster than Transformer models during training. Even when applying FlashAttention (a technique that improves Transformer speed by making better use of GPUs, as described here), RetNet retains an advantage, albeit a much smaller one.

During inference, the RetNet model’s memory usage and per-token computation do not depend on the input sequence length (O(1)), making it increasingly more efficient than the classic Transformer as the sequence length grows.
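
As a back-of-the-envelope comparison (the head dimensions below are my own assumption, purely for illustration, not figures from the paper), the recurrent state carried between decoding steps has a fixed size, whereas a Transformer’s KV cache grows with every past token:

```python
d_k = d_v = 64  # assumed per-head dimensions, for illustration only

# RetNet recurrent inference: one (d_k x d_v) state matrix per head, whatever the sequence length
retnet_state = d_k * d_v  # 4,096 values, constant

# Transformer inference: cached keys and values for every past token
def kv_cache(seq_len):
    return seq_len * (d_k + d_v)  # grows linearly with the sequence length

print(retnet_state, kv_cache(1_000), kv_cache(100_000))  # 4096 vs 128000 vs 12800000
```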

Conclusion

In this brief article, we have explored the unique aspects of RetNet and its performance. However, the success of this challenger to the Transformer will largely depend on its implementation in larger-scale models. If the performance of much larger models is promising enough to compete with OpenAI’s models at a significantly lower running cost, the industry may adopt this architecture more widely. For now, it seems that FlashAttention on Transformers (especially version 2, as described here) will concentrate most of the research effort, because it stays close to the Transformer architecture, which is simpler than the Retentive Network one, and because it is easy to implement and has been rapidly adopted in the sector.

Nevertheless, the speed and performance that RetNet achieves in its “raw” form, compared to highly optimized variants of the Transformer, make it a technology worth following. As the field of AI continues to evolve, RetNet may prove to be a valuable addition to the range of architectures available for tackling complex language tasks.

Thank you for reading and thanks to Romain Benassi for the careful proofreading.

