Microsoft’s Samba: A Hybrid Architecture Revolutionizing Language Modeling

June 18, 2024

Introduction

Context language modeling has been an area of intense activity, with Large Language Models (LLMs) continually demonstrating striking capabilities in natural language processing tasks and beyond. These advances span architectural innovations, better training strategies, longer context lengths, fine-tuning, multimodal LLMs for areas such as robotics, datasets, benchmarking, efficiency, and more. Yet efficiently modeling sequences of unlimited context length has remained the pressing need: the main weakness of previous approaches lies either in quadratic computational complexity or in a limited ability to generalize beyond the training length.

Samba is a model developed by researchers at Microsoft and the University of Illinois at Urbana–Champaign. Their ambition was to bring a simple yet powerful hybrid model to context language modeling, one able to handle an unlimited context length.

What is Samba?

Samba is a simple hybrid State Space Model designed specifically for efficient unlimited-context language modeling. It is a layer-wise blend of two components: Mamba, a selective State Space Model (SSM), and Sliding Window Attention (SWA). This hybrid architecture allows Samba to handle language modeling tasks with unlimited context efficiently.

Key Features of Samba

Samba comes with a set of unique features that set it apart:

  • It is engineered to manage an unlimited context length.
  • The largest model, Samba-3.8B, has been trained on a massive dataset of 3.2 trillion tokens from the Phi-3 dataset.
  • Samba exhibits impressive length extrapolation. Trained on sequences of 4K length, it can be efficiently extrapolated to a context length of 256K while maintaining perfect memory recall.
  • Samba achieves 3.64× higher decoding throughput than the state-of-the-art Llama-3 architecture, with better token predictions up to a context length of 1M tokens on the Proof-Pile test set.
Token prediction performance on the Proof-Pile test set
source — https://arxiv.org/pdf/2406.07522

Potential Use Cases of Samba

The Samba model applies to many real-world use cases thanks to its unique features and capabilities:

  • Language Translation: The model’s perfect memory recall would be handy for language translation, particularly for languages with very long sentences where context is crucial to producing the correct translation.
  • Content Recommendation Systems: Because Samba can learn from and predict over a tremendous amount of context, it can power sophisticated content recommendation systems that serve highly relevant suggestions to users.
  • Data Mining and Information Extraction: With unlimited context length and improved token prediction, Samba can mine and extract information from large text databases.

These are only a few potential applications; the possibilities for Samba are limited only by the imagination and inventiveness of its users.

Architecture of Samba

The architecture of Samba is a blend of components, each serving a distinct purpose in the overall model. At its heart is the Mamba layer, a recently proposed State Space Model (SSM) with selective state spaces. This layer applies input-dependent gating to both the recurrent states and the input representation, allowing a soft selection of the input sequence elements. That selectivity lets the model focus on relevant inputs and memorize important information over the long term.
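To make the idea of input-dependent gating concrete, here is a minimal, illustrative sketch of a selective recurrence in PyTorch. It is not the official Mamba implementation (which uses a hardware-aware parallel scan and a richer state-space parameterization); the class name `SelectiveSSMSketch` and the dimensions `d_model` and `d_state` are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class SelectiveSSMSketch(nn.Module):
    """Illustrative, simplified selective recurrence (not the official Mamba kernel).

    The decay (forget) gate and the state update are both computed from the current
    token, so the recurrent state evolves in an input-dependent way -- the "soft
    selection" of sequence elements described above.
    """

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_state)    # token -> state-space update
        self.gate_proj = nn.Linear(d_model, d_state)  # token -> input-dependent decay gate
        self.out_proj = nn.Linear(d_state, d_model)   # state -> model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        state = x.new_zeros(batch, self.in_proj.out_features)
        outputs = []
        for t in range(seq_len):
            xt = x[:, t, :]
            decay = torch.sigmoid(self.gate_proj(xt))       # how much past state to keep
            update = torch.tanh(self.in_proj(xt))           # what to write for this token
            state = decay * state + (1.0 - decay) * update  # input-dependent recurrence
            outputs.append(self.out_proj(state))
        return torch.stack(outputs, dim=1)  # (batch, seq_len, d_model)


# Tiny usage example
if __name__ == "__main__":
    layer = SelectiveSSMSketch(d_model=64)
    y = layer(torch.randn(2, 10, 64))
    print(y.shape)  # torch.Size([2, 10, 64])
```

Because the recurrence carries only a fixed-size state from token to token, this style of layer needs constant memory per generated token at inference time, which is what makes SSM-style layers attractive for very long contexts.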

Layer-wise integration of Mamba with various configurations of Multi-Layer Perceptrons (MLPs) and Sliding Window Attention (SWA)
source — https://arxiv.org/pdf/2406.07522

One of the key features of the Mamba layer is its ability to selectively compress a given sequence into recurrent hidden states. This selective compression is crucial because it lets the model manage the complexity of the input sequence effectively. Despite this compression, the overall architecture retains the ability to precisely recall recent memories through the attention mechanism. This balance between compression and recall is what makes the design particularly powerful.

Complementing the Mamba layer are the Sliding Window Attention (SWA) layer and the Multi-Layer Perceptron (MLP) layers. The SWA layer addresses the Mamba layer’s limitations in capturing non-Markovian dependencies in sequences. It operates on a fixed-size window that slides over the input sequence, keeping the computational complexity linear with respect to sequence length. The MLP layers serve as the architecture’s primary mechanism for nonlinear transformation and recall of factual knowledge.
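The sketch below illustrates, under simplifying assumptions, why the sliding window keeps cost linear: each query position attends to at most `window` key positions. The function name `sliding_window_attention` and the single-head, dense-mask formulation are illustrative choices, not the blocked kernel a production implementation would use.

```python
import torch

def sliding_window_attention(q, k, v, window: int = 2048):
    """Minimal sliding-window attention sketch (single head, causal).

    Each query position attends only to the `window` most recent key positions
    (including itself), so per-token cost is bounded by `window` and total cost
    grows linearly with sequence length. Real implementations use blocked kernels;
    this dense-mask version is for illustration only.
    """
    seq_len = q.shape[-2]
    pos = torch.arange(seq_len, device=q.device)
    # Allowed: key position j satisfies i - window < j <= i for query position i.
    dist = pos[:, None] - pos[None, :]
    mask = (dist >= 0) & (dist < window)            # (seq_len, seq_len) boolean mask

    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Tiny usage example
if __name__ == "__main__":
    q = k = v = torch.randn(1, 16, 32)               # (batch, seq_len, head_dim)
    out = sliding_window_attention(q, k, v, window=4)
    print(out.shape)                                  # torch.Size([1, 16, 32])
```

With a fixed window w, the attention work per token is O(w), so the total cost scales as O(n·w) rather than O(n²) for a sequence of length n.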

Three layer-wise hybridization strategies are explored: Samba, Mamba-SWA-MLP, and Mamba-MLP. Each model has approximately 1.7B parameters, with the number of layers set to 48 for Samba, Mamba-MLP, and Mamba, and 54 for Mamba-SWA-MLP. The goal of these hybridization strategies is to harmonize these distinct functional blocks and arrive at an efficient architecture for language modeling with unlimited-length extrapolation ability.
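One way to picture the three strategies is as repeating block templates tiled to the stated depths. The sketch below is illustrative only: the exact block ordering within each template is an assumption based on the paper’s figures, and `build_layer_stack` is a hypothetical helper, not code from the Samba repository.

```python
# Illustrative sketch of the three layer-wise hybridization patterns discussed above.
# The block ordering is an assumption based on the paper's figures; the layer counts
# follow the article (48 layers for Samba and Mamba-MLP, 54 for Mamba-SWA-MLP).

PATTERNS = {
    "Samba":         ["Mamba", "MLP", "SWA", "MLP"],  # repeated 12x -> 48 layers
    "Mamba-SWA-MLP": ["Mamba", "SWA", "MLP"],         # repeated 18x -> 54 layers
    "Mamba-MLP":     ["Mamba", "MLP"],                # repeated 24x -> 48 layers
}

def build_layer_stack(name: str, total_layers: int) -> list[str]:
    """Tile a block pattern until the requested depth is reached (hypothetical helper)."""
    pattern = PATTERNS[name]
    assert total_layers % len(pattern) == 0, "depth must be a multiple of the pattern length"
    return pattern * (total_layers // len(pattern))

if __name__ == "__main__":
    for name, depth in [("Samba", 48), ("Mamba-SWA-MLP", 54), ("Mamba-MLP", 48)]:
        stack = build_layer_stack(name, depth)
        print(f"{name}: {len(stack)} layers, one block = {PATTERNS[name]}")
```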

Performance Evaluation with Other Models

The performance of Samba, specifically the Samba-3.8B model, has been extensively evaluated against other available pre-trained base language models on a wide variety of benchmarks.

Comprehensive Evaluations on a diverse subset of the benchmarks to assess SAMBA’s performance.
source — https://arxiv.org/pdf/2406.07522

As summarized in the table above, Samba-3.8B, trained on 3.2 trillion tokens from the Phi-3 dataset, achieves the best overall results among the compared models, which include Llama 2, Mistral, Mamba, Gemma, R-Gemma, Llama 3, and TFM++. Notably, Samba achieves strong performance on the GSM8K benchmark, with an absolute 18.1% higher accuracy than TFM++ trained on the same dataset. This suggests a somewhat surprising complementary effect arising from the fusion of the SSM and the attention mechanism.

Performance Comparison of instruction-tuned Samba 3.8B and Phi-3-mini-4K for both long-context and short-context tasks.
source — https://arxiv.org/pdf/2406.07522

Additional insights into Samba’s performance appear in the table above, which compares the downstream performance of instruction-tuned Samba-3.8B and Phi-3-mini-4K on both long-context and short-context tasks. Samba considerably outperforms Phi-3-mini-4k-instruct on the short-context tasks (MMLU, GSM8K, HumanEval) and on the long-context GovReport task. This reinforces the finding that Samba-3.8B, trained on 3.2T tokens of Phi-3 data, surpasses Phi-3-mini by considerable margins on the major benchmarks, and underscores Samba’s efficacy for language modeling.

Samba’s Leadership in Context Modeling

Among current AI models, Samba, Mamba, Mistral, and Llama 3 each stand out for particular qualities when studied in depth. Samba deserves special mention for its hybrid architecture, which combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA) to handle very long sequences, with perfect memory recall up to 256K context length and improved token predictions up to 1M. It compares favorably with models pre-trained on long sequences and offers higher throughput and speedup than Transformers with grouped-query attention. Mamba and Mistral also feature efficient inference mechanisms, but they do not claim the same ability to handle sequences of this length.

Mamba, a selective State Space Model (SSM), differs from many other SSMs in that it achieves fast inference with computation linear in sequence length, and experiments on real data show performance improvements on million-length sequences. Mistral is a 7B language model that combines efficiency with strong performance, surpassing Llama 2 13B across the evaluated benchmarks and Llama 1 34B in reasoning, mathematics, and code generation. Llama 3 is the most up-to-date open-source Large Language Model, representing current advances in artificial intelligence aimed at understanding and generating human-like text.

Thus, Samba’s specific combination of Mamba’s selective SSM and SWA lets it work very effectively with extremely long sequences, which distinguishes it from Mamba, Mistral, and Llama 3. Its excellent memory recall and high throughput make it a good fit for tasks that require both understanding and generating long text sequences. While each model has its strengths, Samba’s unique capabilities position it as a leading model in context language modeling.

How to Use and Access the Model?

Samba’s official implementation is available in the GitHub repository. The repository provides the complete script for training the Samba models on SlimPajama (Samba-421M and Samba-1.3B), along with scripts for preparing the SlimPajama dataset. After preparing the dataset, you can launch a job to train the model.

If you would like to read more details about this AI model, the sources are all listed in the ‘Source’ section at the end of this article.

Limitations

While Samba shows very encouraging retrieval performance after instruction tuning, its base model trained purely through pre-training does not show the same behavior and often fails to surpass the SWA-based model. There is therefore still room to improve Samba’s retrieval capability while maintaining its efficiency and generalization. Furthermore, Samba’s hybridization strategy is not better than the alternatives on every task; for example, the Mamba-SWA-MLP model performs better on tasks including WinoGrande, SIQA, and GSM8K.

Conclusion

Samba represents a significant stride in context language modeling. It tackles the long-standing challenge of efficiently modeling sequences with effectively unlimited context length and beats state-of-the-art models on a comprehensive suite of benchmarks. Samba’s success also points toward promising future directions in this fascinating field of AI.

Source
Research paper: https://arxiv.org/abs/2406.07522
Research paper (PDF): https://arxiv.org/pdf/2406.07522
GitHub repo: https://github.com/microsoft/Samba

Originally published at https://socialviews81.blogspot.com.
