Trust me, you don’t need attention because all you need is Mamba

Pakhapoom Sarapat
SCB DataX
Jan 31, 2024

Are you still using Transformers?

Transformers have become the most effective backbone for training AI models, achieving state-of-the-art performance on virtually any task these days. However, that reputation comes with a heavy trade-off: under the hood, attention needs access to the associations between every pair of pieces of each input. The issue becomes more severe as the context grows, because the computational complexity scales quadratically with the input's length.
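To make that quadratic cost concrete, here is a tiny PyTorch sketch (with made-up sizes) showing that self-attention materializes an L × L score matrix, so doubling the context length quadruples the number of entries to compute and store.

```python
import torch

# Toy illustration of the quadratic cost of attention (sizes are arbitrary):
# the score matrix has shape (L, L), so its size grows with the square of L.
for L in (1_024, 2_048, 4_096):
    q = k = torch.randn(L, 64)   # L tokens, 64-dim heads (made-up numbers)
    scores = q @ k.T             # (L, L) attention scores
    print(L, scores.numel())     # 1_048_576, 4_194_304, 16_777_216
```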

The issue has always been masked by recent advancements in computing resources: people enjoy the transformers' end products without worrying about the quadratic term in the computational complexity. Then one day, some researchers asked the world: why don't we just develop something else that is more computationally efficient while still delivering state-of-the-art results?

Why do we need attention?

Different flavors of attention-based mechanisms have been proposed, and some of them have theoretically been improved to scale linearly with the input window, yet none of them has become a serious competitor to the transformers in practice.

Perhaps it is time for researchers to step a little outside the box. Fortunately, a group of researchers has recently revealed that there actually is an architecture that rivals or even beats the transformers on all fronts. Its name is Mamba.

Is Mamba for real?

Before diving into the Mamba architecture, let's take in Mamba's state-of-the-art performance across several evaluations, including synthetic tasks, DNA sequences, audio waveforms, and language modeling.

Synthetic tasks

The synthetic tasks are must-have evaluations for sequence modeling: they test whether the model remembers relevant information and correctly ignores irrelevant information. In the paper, there are two such tasks: selective copying and induction heads. The former asks the model to copy the colored input tokens while ignoring the white (noise) ones, as illustrated in Figure 1. The other task, induction heads, asks the model to generalize patterns it has recognized in the sequence. For example, once the model has seen that a blue box follows a black one, we expect it to return the blue box whenever it sees the black box as the input.

Figure 1: Visualization of the two synthetic tasks used to measure the performance of mamba, which are selective copying and induction heads. Reprinted from (Gu and Dao, 2023).
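To make selective copying more concrete, below is a minimal Python sketch of how such a toy example could be generated; the vocabulary, sequence length, and the use of 0 as the "white" noise token are assumptions for illustration, not the exact setup of the paper.

```python
import random

# Toy generator for a selective-copying example: a few "colored" content
# tokens are scattered among "white" noise tokens, and the target is the
# content tokens in their original order.
VOCAB = list(range(1, 9))   # colored content tokens (assumed IDs)
NOISE = 0                   # the white token the model must ignore

def make_selective_copying_example(seq_len=16, num_content=4):
    content = [random.choice(VOCAB) for _ in range(num_content)]
    positions = sorted(random.sample(range(seq_len), num_content))
    inputs = [NOISE] * seq_len
    for pos, tok in zip(positions, content):
        inputs[pos] = tok
    return inputs, content  # targets = content tokens, noise skipped

inputs, targets = make_selective_copying_example()
print(inputs)   # e.g. [0, 3, 0, 0, 7, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0, 0]
print(targets)  # e.g. [3, 7, 2, 5]
```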

Surprisingly, Mamba surpasses all existing mechanisms on both selective copying (Figure 2(a)) and induction heads (Figure 2(b)). Moreover, Mamba is the only survivor in the arena of lengthy input sequences, maintaining near-perfect accuracy, as shown by the brown line in Figure 2(b).

Figure 2: Performance on the synthetic tasks, (a) selective copying and (b) induction heads. Reprinted from (Gu and Dao, 2023).

DNA and audio modalities

The long input sequences in the synthetic tasks may not be enough to prove Mamba's capabilities, so the researchers present further evidence from a DNA modality. A DNA sequence is written with only a few letters, such as A, T, C, and G, yet it carries very long-range dependencies, which makes it technically challenging to model.

The evaluation is set up as species classification between humans and our closest biological relatives, namely chimpanzee, gorilla, orangutan, and bonobo, which are said to share roughly 99% of their DNA. Figure 3(a) shows the accuracy of several models, and the largest Mamba model scores the best in the evaluation.

Figure 3: Performance on the dense modalities, (a) genomics and (b) audio modalities. Reprinted from (Gu and Dao, 2023).

For the audio modality, Mamba again shows that it can handle long input sequences efficiently; the evaluation tests sequence lengths from roughly 2¹³ to 2²⁰. Comparing against the previous state-of-the-art model, SaShiMi (S4+FFN), we can see from Figure 3(b) that bits per byte decreases as the sequence length increases, with Mamba staying below the baseline. Bits per byte on the vertical axis is an error-like metric, essentially negative log-likelihood per byte, so lower is better, which means that Mamba outperforms the baseline.
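As a side note, bits per byte can be derived directly from the usual cross-entropy loss: it is the negative log-likelihood per byte expressed in base 2. A tiny sketch (with a made-up loss value) is shown below.

```python
import math

# Bits per byte = cross-entropy in nats per byte, converted to base 2.
def bits_per_byte(nll_nats_per_byte: float) -> float:
    return nll_nats_per_byte / math.log(2)

print(bits_per_byte(0.72))  # ~1.04 bits per byte; lower means a better model
```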

Language modeling

Given how popular LLMs are these days, the researchers naturally include evaluations on the text modality as well, scaling Mamba from 130M to 2.8B parameters and testing it on several popular benchmarks, as shown in Figure 4. The bold-face numbers, which mark the best result on each benchmark, consistently fall on the rows of the Mamba models.

Figure 4: Different benchmarks for evaluating model performance in the text modality. Reprinted from (Gu and Dao, 2023).

Speed and memory

So far, we have discussed the potential of the Mamba architecture from a performance perspective only. One might suspect that, in terms of computational efficiency, Mamba takes longer, perhaps much longer, to produce an output. However, the researchers propose a novel scan algorithm that is claimed to be faster than the best attention implementation and than the standard scan implemented in PyTorch, without any out-of-memory (OOM) issues, as presented in Figure 5(a). On top of that, Mamba also provides higher inference throughput than transformers of similar model size, as shown in Figure 5(b).

Figure 5: Computation efficiency on (a) scan mechanism, and (b) throughput. Reprinted from (Gu and Dao, 2023).
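If you want to sanity-check the throughput claim on your own hardware, one rough approach is to time autoregressive generation and divide by the number of tokens produced. The sketch below uses a generic `generate_step` placeholder and is not the official Mamba benchmarking script.

```python
import time
import torch

# Rough tokens-per-second measurement for any autoregressive step function.
# `generate_step` is a placeholder that appends one new token to `tokens`.
def measure_throughput(generate_step, tokens, new_tokens=256):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(new_tokens):
        tokens = generate_step(tokens)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return new_tokens / (time.time() - start)  # generated tokens per second

# Dummy usage: a fake "model" that just appends a random token id.
dummy_step = lambda t: torch.cat([t, torch.randint(0, 100, (1,))])
print(measure_throughput(dummy_step, torch.randint(0, 100, (8,))))
```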

It may seem too good to be true, but it is true.

What is Mamba, exactly?

The origin of Mamba lies in the state-space models (SSMs) introduced in 1960 by Kalman,

h′(t) = A h(t) + B x(t)
y(t) = C h(t) + D x(t)

where x, h, and y are the input, hidden, and output sequences, and A, B, C, D are coefficient matrices.

To match AI use cases, the input should be discrete rather than continuous. Therefore, some math work is required to discretize the ordinary differential equation (the h′ equation) into a discrete form, namely

h_t = Ā h_{t-1} + B̄ x_t
y_t = C h_t

where Ā and B̄ (A bar and B bar) are discretized versions of A and B, obtained from A, B, and a step size Δ through a discretization rule; for now, you can treat them as just some matrices.
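To see what this recurrence does in practice, here is a minimal NumPy sketch of the discrete SSM written as a plain sequential scan. The matrix sizes and random values are toy assumptions; a real implementation relies on the efficient scan algorithm mentioned in the speed section.

```python
import numpy as np

# Minimal sequential scan of the discrete SSM recurrence above.
rng = np.random.default_rng(0)
N, T = 4, 10                      # state size and sequence length (toy values)
A_bar = rng.normal(size=(N, N)) * 0.1
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
x = rng.normal(size=(T, 1))       # scalar input sequence

h = np.zeros((N, 1))
ys = []
for t in range(T):
    h = A_bar @ h + B_bar * x[t]  # h_t = Ā h_{t-1} + B̄ x_t
    ys.append((C @ h).item())     # y_t = C h_t
print(ys)
```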

Here, the creativity comes into play. Some researchers figured that it might work to replace linear attention with the discrete SSM. In addition, they introduce a key ingredient: a shift operation that delays the components of the sequence by one step, which enables the model to relate nearby tokens and handle long-range input context. For your information, this design is known as the H3 architecture, as presented in Figure 6.

Figure 6: Hungry Hungry Hippos (H3) layer. Reprinted from (Fu et al., 2023).
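To give a feeling for the shift operation mentioned above, the toy snippet below delays a sequence by one position, which is what lets a layer compare each token with its predecessor. This is a heavy simplification of H3's shift SSM.

```python
import torch
import torch.nn.functional as F

# Toy one-step shift: pad one zero on the left and drop the last element.
x = torch.arange(1, 7).float().view(1, 1, -1)  # toy sequence [1, 2, 3, 4, 5, 6]
shifted = F.pad(x, (1, 0))[..., :-1]           # becomes [0, 1, 2, 3, 4, 5]
print(x)
print(shifted)
```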

However, this is not enough for Mamba. The researchers build on top of this architecture by applying a selective scan mechanism, in which the SSM parameters depend on the input; technically, this is a generalized version of the gating mechanism in RNNs. Figure 7 presents a high-level view of the Mamba architecture.

Figure 7: High-level architecture of mamba. Reprinted from (Gu and Dao, 2023).
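The snippet below is a toy PyTorch sketch of the "selective" idea only: the matrices B and C and the step size Δ are computed from the input rather than being fixed. The layer names and sizes are illustrative assumptions and are far from the optimized mamba-ssm implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy selective SSM: B, C, and the step size delta depend on the input x.
# Shapes and layer names are illustrative, not the official implementation.
class ToySelectiveSSM(nn.Module):
    def __init__(self, d_model=16, d_state=4):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_state))  # negative diagonal A for stability
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.to_delta = nn.Linear(d_model, 1)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        B, C = self.to_B(x), self.to_C(x)      # input-dependent B_t and C_t
        delta = F.softplus(self.to_delta(x))   # positive step size per token
        A_bar = torch.exp(delta * self.A)      # discretized A, per time step
        h = x.new_zeros(x.size(0), self.A.numel())
        ys = []
        for t in range(x.size(1)):             # sequential scan for clarity
            h = A_bar[:, t] * h + delta[:, t] * B[:, t] * x[:, t, :1]
            ys.append((C[:, t] * h).sum(-1, keepdim=True))
        return torch.stack(ys, dim=1)          # (batch, seq_len, 1)

print(ToySelectiveSSM()(torch.randn(2, 8, 16)).shape)  # torch.Size([2, 8, 1])
```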

Adding so many components to the existing architecture could hurt computational efficiency, so the researchers introduce a hardware-aware technique that moves data back and forth within the GPU memory hierarchy, keeping the expanded state in fast on-chip memory (SRAM) rather than in main GPU memory (HBM), to optimize the computation at each step during both training and inference, as illustrated in Figure 8.

Figure 8: Overview of the hardware-aware mechanism. Reprinted from (Gu and Dao, 2023).

And that’s it for the overview of the Mamba architecture. For those who want to see more of the math, please refer to this YouTube playlist. Although it is in Thai, you may still benefit from the content in the slides. Happy Mamba-ing!

Acknowledgement

This article is part of the work of the DataX AI research team. I would like to thank the editorial board for reviewing my article and for all the support throughout the publication process. I would also like to thank you for reading this far. Please send me comments or anything else at pakhapoom.sarapat@data-x.ai.

References

Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.

Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., & Ré, C. (2023). Hungry Hungry Hippos: Towards Language Modeling with State Space Models. arXiv:2212.14052.