Is Attention All You Need?

An Intuitive Approach to Understanding the Mamba Models

Manthandeshpande
Accredian
8 min read · Feb 26, 2024


Introduction

Time complexity is an important measure of any algorithm's efficiency. In the world of Large Language Models, Transformers have greatly improved long-range dependency modeling, but they remain computationally expensive.

Quadratic time complexity is the computational cost associated with the self-attention mechanism, the key component of the Transformer architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al.

As the input size grows, the cost of self-attention grows quadratically: every new token must attend to every token already in the sequence, so doubling the sequence length roughly quadruples the attention computation.
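To make this concrete, here is a minimal NumPy sketch of single-head self-attention (the learned query/key/value projections are omitted, so it is an illustration rather than a faithful Transformer layer). The score matrix has shape n × n, which is where the quadratic cost comes from:

```python
import numpy as np

def self_attention(X):
    """Naive single-head self-attention over n token embeddings of size d."""
    n, d = X.shape
    # For simplicity, reuse X as queries, keys, and values; a real layer
    # would first apply learned projections W_Q, W_K, W_V.
    scores = X @ X.T / np.sqrt(d)                     # (n, n) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ X                                # (n, d) outputs

for n in (512, 1024, 2048):
    self_attention(np.random.randn(n, 64))
    print(f"{n} tokens -> {n * n} attention scores")
```

Doubling the number of tokens quadruples the number of attention scores that must be computed and stored.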

Many architectures have since been proposed to address this issue, notably Linear Attention, Gated Convolutions, Recurrent Networks, and State Space Models (SSMs).

What we're dealing with today is Mamba, a member of the SSM family that performs sequence modeling in linear time.

Mamba, 2023

Setting up the base

If we trace the origins of Mamba, we can say that it builds on RNNs, which in turn gave rise to State Space Models.

A State Space Model is, in essence, a version of an RNN

Recurrent Neural Networks iterate through sequences of data while maintaining a hidden state representing information from previous time steps. At each time step, the RNN receives an input vector and combines it with the previous hidden state to produce an output and update its hidden state.

RNNs
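As a quick illustration (a toy sketch, not a production RNN), a single vanilla RNN step just mixes the previous hidden state with the current input and squashes the result:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: mix the previous hidden state with the input."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

hidden_size, input_size = 8, 4
W_h = np.random.randn(hidden_size, hidden_size) * 0.1
W_x = np.random.randn(hidden_size, input_size) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                    # initial hidden state
for x_t in np.random.randn(10, input_size):  # iterate over the sequence
    h = rnn_step(h, x_t, W_h, W_x, b)
# h now has to summarize the entire sequence in one fixed-size vector,
# which is exactly where long-range information gets squeezed out.
```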

The drawback is that all of this information gets compressed into a single hidden state. RNNs also suffer from vanishing and exploding gradients, which limit their ability to capture long-range dependencies and cause them to forget information over longer sequences.

To deal with this limitation, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells add gating mechanisms that decide what exactly to remember. But since the latent state has a fixed size, there is only so much context it can hold.

State Space Models (SSM)

What an SSM does is pass the input through a dynamical system: it takes a 1D input signal, maps it into a latent state through its mathematical framework, and then projects that state back into a 1D output.

If we generalize a bit, we can say that the SSM is to Mamba what attention is to Transformer-based models.

Mathematically, we’ll have a hidden state equation h’(t) and an observation equation y(t):

State Equation: describes how the internal state of the system evolves over time

h'(t) = Ah(t) + Bx(t)

Observation Equation: Relates the observation to the hidden state

y(t) = Ch(t)

where A, B, and C are parameters that can be learned through algorithms such as gradient descent.

The idea is that by solving these equations, we can uncover the statistical principles that allow us to forecast a system's state from observable data (the input sequence and the prior state).

The objective is to find the state representation h(t) so that we can convert an input sequence into an output sequence.

These two equations are the core of the State Space Model.
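To get a feel for these equations, here is a small sketch that integrates them numerically with a simple Euler step. The matrices are random toy values and the step size dt is chosen purely for illustration:

```python
import numpy as np

N = 4                                 # size of the hidden state h(t)
A = np.random.randn(N, N) * 0.1       # state matrix
B = np.random.randn(N, 1)             # input matrix
C = np.random.randn(1, N)             # output matrix

def simulate(x, dt=0.01):
    """Integrate h'(t) = A h(t) + B x(t) and read out y(t) = C h(t)."""
    h = np.zeros((N, 1))
    ys = []
    for x_t in x:                         # x is a sampled 1D input signal
        h = h + dt * (A @ h + B * x_t)    # Euler step on the state equation
        ys.append((C @ h).item())         # observation equation
    return np.array(ys)

y = simulate(np.sin(np.linspace(0, 10, 1000)))
```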

To recap: an SSM forecasts a system's state from observed data using two main equations, the state equation and the output equation, and the learning process is driven by the parameters in matrices A, B, C, and D.

The architecture that results from visualizing these two equations is as follows:

In the state equation, the latent state h(t) represents the internal dynamics of the system and evolves with the input x(t) through matrices A and B. The output equation then translates the state into the output using matrices C and D, where D acts as a skip connection from input to output. The SSM can be simplified by dropping this skip connection and concentrating only on matrices A, B, and C.

These matrices are updated during training to maximize the prediction accuracy of the model. In general, the SSM offers a framework for understanding and forecasting system states from the underlying dynamics and observed data.

Together, these two formulas forecast a system's state from observed data. Since the input is assumed to be continuous, this continuous-time form is the primary representation of the SSM, which becomes a problem when the input is discrete, as it is with text.

Discretization

Discretization adapts the State Space Model to discrete inputs, such as textual sequences, which the continuous formulation cannot handle directly. The zero-order hold technique treats each discrete value as if it were held constant until the next value arrives, turning the discrete signal into a continuous one that the SSM can process. How long each value is held is determined by the step size (∆), which is treated as a learnable parameter.

Mathematically, the zero-order hold transforms the continuous SSM into a discrete version, expressed as a sequence-to-sequence map (xₖ → yₖ), where the discretized model parameters are the matrices Ā and B̄ (the discrete counterparts of A and B) and the index k denotes discretized time steps.
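For reference, the standard zero-order-hold formulas used in S4-style SSMs are Ā = exp(∆A) and B̄ = (∆A)⁻¹(exp(∆A) − I)·∆B. Here is a small sketch, with toy matrices chosen only for illustration and ∆ fixed rather than learned:

```python
import numpy as np
from scipy.linalg import expm          # matrix exponential

def zoh_discretize(A, B, delta):
    """Zero-order hold: turn continuous (A, B) into discrete (A_bar, B_bar)."""
    dA = delta * A
    A_bar = expm(dA)                                              # exp(delta * A)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

# delta is the (learnable) step size; here we just fix it to a toy value.
A = -np.eye(4) + np.diag(np.ones(3), k=-1)   # a small, stable state matrix
B = np.ones((4, 1))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
```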

During training, matrix A is kept in its continuous form, and discretization is carried out on top of it so that the continuous representation is preserved.

To sum up, discretization lets a continuous State Space Model work with discrete inputs and produce discrete outputs from its continuous representation.

The Convolution Representation

A different representation of the SSM borrows the idea of convolutions from image recognition and applies it in a text-friendly, 1-dimensional form. The SSM formulation itself defines the kernel, which then slides over groups of tokens to compute the output, with padding handling the tokens at the start of the sequence.

One benefit of modeling the SSM as a convolution is that, like convolutional neural networks (CNNs), it can be trained in parallel. However, because of the fixed kernel size, inference is not as fast and unbounded as with Recurrent Neural Networks (RNNs).
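Sketching this view: unrolling the recurrence gives a kernel K = (C·B̄, C·Ā·B̄, C·Ā²·B̄, …), and the whole output sequence is a single causal convolution of the input with that kernel. This sketch reuses the Ā and B̄ from the discretization example above together with an illustrative C:

```python
import numpy as np

def ssm_kernel(A_bar, B_bar, C, length):
    """Unroll the discretized SSM into a 1-D convolution kernel:
    K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...)."""
    K, v = [], B_bar
    for _ in range(length):
        K.append((C @ v).item())
        v = A_bar @ v
    return np.array(K)

def ssm_as_convolution(x, A_bar, B_bar, C):
    """Compute every output at once with one causal convolution over the input."""
    K = ssm_kernel(A_bar, B_bar, C, len(x))
    return np.convolve(x, K)[: len(x)]   # keep only the causal part
```

Because the kernel is precomputed, all output positions can be computed at once, which is what makes parallel training possible.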

This demonstrates the trade-offs that various neural network topologies have between inference speed and parallel training efficiency.

The Recurrent Representation

The discretized State Space Model (SSM) can also be run recurrently, much like a Recurrent Neural Network (RNN), by stepping through discrete timesteps. At each timestep, the output is predicted by computing the effect of the current input on the previous state.

The formulation can be unrolled over time just as an RNN unfolds sequentially, and the resulting structure looks very much like an RNN. As a result, it inherits the same benefits and drawbacks: fast inference, but slower, sequential training.
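Written as an explicit loop (one step per token), the same discretized system reads like an RNN cell and produces the same outputs as the convolution above, up to floating-point error:

```python
import numpy as np

def ssm_recurrent(x, A_bar, B_bar, C):
    """Run the discretized SSM step by step, like unrolling an RNN:
        h_k = A_bar h_{k-1} + B_bar x_k
        y_k = C h_k
    """
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_k in x:
        h = A_bar @ h + B_bar * x_k    # update the hidden state
        ys.append((C @ h).item())      # read out the current output
    return np.array(ys)
```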

To sum up, the discretized SSM, when run over discrete timesteps, behaves like an RNN and offers the same trade-off: efficient inference at the cost of sequential training.

Mamba structure

The Mamba architecture, used here end-to-end with only its decoder component, consists of a stack of Mamba blocks. Each block first expands the input embeddings with a linear projection, applies a convolution, and then runs the result through a Selective State Space Model (SSM). The Selective SSM has several characteristics: it is a discretization-based recurrent SSM; it uses a hardware-aware algorithm for computational efficiency; it captures long-range dependencies through HiPPO initialization of matrix A; and it uses a selective scan technique for information compression.
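To make the selectivity idea a bit more concrete, here is a rough, illustrative sketch of a selective scan in plain NumPy. The parameter names (W_B, W_C, W_delta) and the simplified discretization of B are my own shorthand rather than the paper's exact parameterization, and the real Mamba implementation replaces this Python loop with a hardware-aware parallel scan:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(x, A, W_B, W_C, W_delta):
    """Simplified selective scan. x: (L, D) sequence; A: (D, N) diagonal-style
    state matrix. B, C, and the step size delta are computed from the input at
    every timestep, so the recurrence can choose what to keep or ignore."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                        # one hidden state per channel
    ys = np.zeros((L, D))
    for k in range(L):
        B_k = x[k] @ W_B                        # (N,)  input-dependent B
        C_k = x[k] @ W_C                        # (N,)  input-dependent C
        delta = softplus(x[k] @ W_delta)        # (D,)  input-dependent step size
        A_bar = np.exp(delta[:, None] * A)      # zero-order hold on A
        B_bar = delta[:, None] * B_k[None, :]   # simplified discretization of B
        h = A_bar * h + B_bar * x[k][:, None]   # selective recurrence
        ys[k] = h @ C_k                         # read out per channel
    return ys

# Toy shapes: sequence length 16, model width D = 8, state size N = 4.
L, D, N = 16, 8, 4
x = np.random.randn(L, D)
A = -np.abs(np.random.randn(D, N))              # negative values keep the state stable
y = selective_ssm(x, A, np.random.randn(D, N), np.random.randn(D, N),
                  np.random.randn(D, D) * 0.1)
```

Because B, C, and ∆ depend on the current token, the model can effectively decide, token by token, how much of the past to retain in its state.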

The full implementation adds components such as normalization layers and a softmax over the vocabulary for output token selection. Put together, the architecture offers fast inference as well as training with unbounded context. According to the authors, Mamba performs on par with or better than Transformers of comparable size, which makes it a viable and efficient alternative to Transformer models across a range of tasks.

Speed: Inference

As for inference, Mamba achieved 4–5x higher throughput than a Transformer of a similar size. The improved throughput enables Mamba to handle larger workloads, making it suitable for tasks that require processing vast amounts of data, such as chatbots, language translation, and speech recognition. Ultimately, higher throughput enhances the overall user experience by delivering faster and more responsive AI systems.

Evaluation Tasks

On downstream zero-shot evaluation tasks, Mamba outperformed the most popular open-source models of comparable size. Zero-shot evaluation means assessing a model on tasks or domains that were not part of its training data. Mamba's success on these tasks shows that it can generalize effectively and predict accurately even in unfamiliar settings, which illustrates its adaptability and promise for practical uses where models must handle heterogeneous and dynamic data.

Conclusion

In conclusion, Mamba has shown remarkable performance across a variety of domains, demonstrating that it can match or even outperform state-of-the-art Transformer models in some situations. Even though Mamba effectively sidesteps the drawbacks of conventional Transformers, it is worth remembering that the architecture is still very young. We can expect further innovations as researchers and practitioners explore its potential, which makes Mamba an attractive prospect for the future of sequence modeling.
