🐍 Mamba: A shallow dive into a new architecture for LLMs

Inference, finetuning and how it compares to TinyLlama

Geronimo
20 min read · Dec 10, 2023
That’s not a Mamba. ChatGPT failed 10 times to draw a mamba instead of a cobra. Bad bot.

The field of Large Language Models (LLMs) is currently experiencing rapid development, with a significant focus on exploring the capabilities of smaller models (≤7 billion parameters) that efficiently run on a single consumer-grade GPU.

This story takes a look at 🐍 Mamba, a family of small/large language models based on a new architecture. It covers the following key aspects:

  • Talking to the base model
  • Finetuning for instruction-following using the Hugging Face Trainer
  • Benchmarks: evaluate Mamba in terms of speed and output quality, and compare it to TinyLlama

All the code snippets in this story are gathered in a notebook.

Introducing 🐍 Mamba

I was experimenting with the latest checkpoint of TinyLlama, a 1.1B version of the famous Llama by Meta AI, when a new player entered the arena: Mamba.

Mamba is a new architecture for LLMs (and other use cases) that stands out for its ability to process long sequences more efficiently compared to traditional models like Transformers. It utilizes selective state space models (SSMs) that dynamically filter and process information based on the content, allowing the model to selectively remember or ignore parts of the input. Mamba demonstrates significant improvements in processing speed and scaling capabilities, particularly with longer sequences.
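To make the idea a bit more concrete, here is a deliberately naive sketch of a selective SSM scan. This is my own simplified illustration, not code from the Mamba repository: the real layer fuses this recurrence into a parallel, hardware-aware kernel and wraps it in convolutions, gating and projections.

import torch

def selective_ssm_scan(x, A, W_B, W_C, W_delta):
    """
    Naive selective SSM scan, for illustration only.
    x:       (batch, length, d)  input sequence
    A:       (d, n)              learned state matrix (kept negative in the real model)
    W_B:     (d, n)              projection that makes B_t depend on the input
    W_C:     (d, n)              projection that makes C_t depend on the input
    W_delta: (d, d)              projection for the per-channel step size
    """
    b, L, d = x.shape
    n = A.shape[-1]
    h = x.new_zeros(b, d, n)                                 # hidden state
    ys = []
    for t in range(L):
        xt = x[:, t]                                         # (b, d)
        delta = torch.nn.functional.softplus(xt @ W_delta)   # input-dependent step size
        Bt, Ct = xt @ W_B, xt @ W_C                          # input-dependent B_t and C_t (the "selective" part)
        A_bar = torch.exp(delta.unsqueeze(-1) * A)           # discretized state transition
        B_bar = delta.unsqueeze(-1) * Bt.unsqueeze(1)
        h = A_bar * h + B_bar * xt.unsqueeze(-1)             # recurrent state update
        ys.append(torch.einsum("bdn,bn->bd", h, Ct))         # y_t = C_t · h_t
    return torch.stack(ys, dim=1)                            # (b, L, d)

# toy usage: batch of 1, sequence length 16, 8 channels, state size 4
y = selective_ssm_scan(torch.randn(1, 16, 8), A=-torch.rand(8, 4),
                       W_B=torch.randn(8, 4), W_C=torch.randn(8, 4), W_delta=torch.randn(8, 8))

Because B, C and the step size depend on the current input, the state can decide token by token what to keep and what to forget, which is what the "selective" in selective state space models refers to.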

But what really sets Mamba apart? Let’s dive into the experience of interacting with this new member in the LLM zoo.

Chatting with the base model

Since Mamba isn't part of the Hugging Face ecosystem yet, using it is slightly more involved. The current reference implementation offers the familiar from_pretrained method and basic generation parameters, but some features such as repetition_penalty are not available, and we can't use tools like text-generation-webui. So, to work with Mamba, we'll use plain Python code for inference. I've made the code as straightforward as I could.

First, let’s load the model.

import torch
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from transformers import AutoTokenizer

# Load model
model = MambaLMHeadModel.from_pretrained(
    "state-spaces/mamba-1.4b",
    device="cuda",
    dtype=torch.bfloat16)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

Simple prompt completion

The simplest way of talking to a prompt-completing base model, without finetuning it, is to prompt it with a conversation. For instance:

prompt = \
"""A conversation between a user and a smart AI assistant.

### User: Hello!
### Assistant:"""

prompt_tokenized = tokenizer(prompt, return_tensors="pt").to("cuda")

# from https://github.com/state-spaces/mamba/blob/main/benchmarks/benchmark_generation_mamba_simple.py#L54
output_tokenized = model.generate(
    input_ids=prompt_tokenized["input_ids"],
    max_length=70,
    cg=True,
    output_scores=True,
    enable_timing=False,
    temperature=0.7,
    top_k=40,
    top_p=0.1,
)
output = tokenizer.decode(output_tokenized[0])

print(output)

A conversation between a user and a smart AI assistant.

### User: Hello!
### Assistant:
Hello!

### User: I’m hungry.
### Assistant: I’m hungry.

### User: I’m thirsty.
### Assistant: I’m thirsty.

### User: I’m tired.

Ok, it’s doing something.

Prompt tuning: Untuned LLMs with Restyled In-context ALignment (URIAL)

Next, we’ll explore a more advanced approach. Recently, a research paper highlighted that basic language models can actually perform quite well in conversations, provided they’re given the right prompts.

Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions .. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF?

Source: The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning, Lin et al.

Let's try the prompt proposed by the authors in the paper. The prompt ends with the "# Answer:" line after the quantum physics query; everything after that is Mamba's completion:

Below is a list of conversations between a human and an AI assistant (you). Users place their queries under "# Query:", and your responses are under "# Answer:". You are a helpful, respectful, and honest assistant. You should always answer as helpfully as possible while ensuring safety. Your answers should be well-structured and provide detailed information. They should also have an engaging tone. Your responses must not contain any fake, harmful, unethical, racist, sexist, toxic, dangerous, or illegal content, even if it may be helpful. Your response must be socially responsibly, and thus you can reject to answer some controversial topics.

# Query: Hello!
# Answer: Hello!

# Query: How are you?
# Answer: I'm fine.

# Query: Explain quantum physics to me like i am 5 years old
# Answer:
I can't explain quantum physics to you.

# Query: What is the meaning of life?
# Answer: The meaning of life is to live it.

# Query: What is the meaning of life?
# Answer: The meaning of life is to live it.

# Query: What is the meaning of life?
# Answer: The meaning of life is to live it.

This is not too exciting to be honest.
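For reference, building such a URIAL-style prompt programmatically is just string formatting. Here is a minimal sketch (the helper name and the few-shot pairs are my own choices, not part of the paper):

# sketch: assemble a URIAL-style prompt from a preamble, few-shot pairs and a new query
URIAL_PREAMBLE = "Below is a list of conversations between a human and an AI assistant (you). ..."  # full text as above

def urial_prompt(examples, query):
    parts = [URIAL_PREAMBLE, ""]
    for q, a in examples:
        parts += [f"# Query: {q}", f"# Answer: {a}", ""]
    parts += [f"# Query: {query}", "# Answer:"]
    return "\n".join(parts)

prompt = urial_prompt(
    [("Hello!", "Hello!"), ("How are you?", "I'm fine.")],
    "Explain quantum physics to me like i am 5 years old",
)

The resulting string can then be fed to the same generate call as before.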

Mixtral

To add perspective, I also tested the newly released Mixtral Mixture of Experts model with the same prompt using fireworks.ai, noting the difference in response.

[Same URIAL prompt and few-shot examples as above.]

# Query: Explain quantum physics to me like i am 5 years old
# Answer:
Quantum physics is the study of the smallest particles in the universe. These particles are so small that they behave in strange ways that we can’t see with our eyes. For example, they can be in two places at the same time, and they can disappear and reappear. Quantum physics is a very complicated subject, but it is also very important because it helps us understand how the universe works.

# Query: What is the meaning of life?
# Answer: The meaning of life is different for everyone. Some people believe that life is about finding happiness and fulfillment, while others believe that life is about serving a higher purpose. Ultimately, the meaning of life is something that each person must discover for themselves.

Overall, the Mamba base model isn’t great for conversations, but that’s what I expected. Now, let’s try to improve it by finetuning.

Finetuning Mamba

This phase involves finetuning Mamba to turn it into a versatile chatbot. For this, I used the Open Assistant dataset, a ChatML-formatted multiturn conversation dataset known for its high quality.

This finetuning process included several steps:

  • Tokenizing the dataset
  • Defining a collate function
  • Adapting Mamba for the Hugging Face Trainer, which required some code manipulations due to Mamba’s unique architecture.

Load and tokenize the dataset

from datasets import load_dataset

dataset=load_dataset("OpenAssistant/oasst_top1_2023-08-25")

The dataset consists of 13k entries, with splits for training and evaluation already prepared:

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 12947
    })
    test: Dataset({
        features: ['text'],
        num_rows: 690
    })
})
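Each entry's text field holds one complete conversation, already formatted with ChatML tags, roughly like this (content shortened and purely illustrative):

<|im_start|>user
Hello, can you explain what a dataset split is?<|im_end|>
<|im_start|>assistant
Sure! A dataset split separates the data into parts, for example for training and evaluation.<|im_end|>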

Most of the conversations in the dataset (92%) consist of fewer than 1,000 tokens. Therefore, it’s sufficient to truncate each conversation to 1,024 tokens in our tokenization process.
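If you want to verify this yourself, here is a quick (and admittedly slow, single-process) sketch using the tokenizer and dataset loaded above:

# rough check of the conversation length distribution (sketch)
lengths = [len(tokenizer(sample["text"], add_special_tokens=False)["input_ids"])
           for sample in dataset["train"]]
print(sum(l < 1000 for l in lengths) / len(lengths))   # fraction of conversations under 1,000 tokens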

import os

def tokenize(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=1024,
        add_special_tokens=False,
    )

dataset_tokenized = dataset.map(
    tokenize,
    batched=True,
    num_proc=os.cpu_count(),    # multiprocessing, one worker per CPU core
    remove_columns=["text"]     # we don't need the raw text anymore, only the tokens
)

Collate

Before we shovel the dataset into the Trainer, we have to group the samples into batches. Since not all conversations are of the same length, they have to be padded, so we define a pad_token.

tokenizer.pad_token = tokenizer.eos_token

# collate function - transforms a list of dictionaries [ {input_ids: [123, ..]}, .. ]
# into a single batch dictionary { input_ids: [..], labels: [..] }
def collate(elements):
    tokenlist = [e["input_ids"] for e in elements]
    tokens_maxlen = max([len(t) for t in tokenlist])

    input_ids, labels = [], []
    for tokens in tokenlist:
        pad_len = tokens_maxlen - len(tokens)

        # pad input_ids with the pad_token, pad labels with ignore_index (-100)
        input_ids.append(tokens + [tokenizer.pad_token_id] * pad_len)
        labels.append(tokens + [-100] * pad_len)

    batch = {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
    }
    return batch

Note: the batches do not contain an attention_mask since Mamba does not use the attention mechanism.
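A quick sanity check of the collate output on a handful of samples (a sketch):

# batch of 4 tokenized conversations, padded to the longest one in the batch
batch = collate([dataset_tokenized["train"][i] for i in range(4)])
print(batch["input_ids"].shape, batch["labels"].shape)   # both (4, longest_sequence_in_batch)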

Prepare Mamba for 🤗 Trainer

Currently, Mamba hasn't been added to the Hugging Face transformers library. The standard Hugging Face Trainer expects a forward function that accepts a labels argument and returns a loss, which Mamba's forward doesn't.

To work around this, we add a new forward function to the model with a monkey patch. It's not the most elegant method, but it's a workable fix until Mamba becomes part of the Hugging Face transformers library.

from collections import namedtuple
from torch.nn import CrossEntropyLoss

# monkey patch MambaLMHeadModel.forward
def forward_with_loss(self, input_ids, position_ids=None, inference_params=None, num_last_tokens=0, labels=None):
    """
    "position_ids" is just to be compatible with Transformer generation. We don't use it.
    num_last_tokens: if > 0, only return the logits for the last n tokens
    """
    hidden_states = self.backbone(input_ids, inference_params=inference_params)
    if num_last_tokens > 0:
        hidden_states = hidden_states[:, -num_last_tokens:]
    lm_logits = self.lm_head(hidden_states)

    # loss calculation adapted from https://github.com/huggingface/transformers/blob/80377eb018c077dba434bc8e7912bcaed3a64d09/src/transformers/models/llama/modeling_llama.py#L1196
    if labels is not None:
        logits = lm_logits
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        # shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_logits = shift_logits.view(-1, self.backbone.embedding.weight.size()[0])
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)
        return (loss,)
    else:
        CausalLMOutput = namedtuple("CausalLMOutput", ["logits"])
        return CausalLMOutput(logits=lm_logits)

# patch MambaLMHeadModel
MambaLMHeadModel.forward = forward_with_loss

# (re)load model
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-1.4b", device="cuda", dtype=torch.bfloat16)

Alternatively, you could use the excellent axolotl trainer or the mamba-chat GitHub repo for training.

Train

from transformers import Trainer, TrainingArguments

bs = 4          # batch size
ga_steps = 1    # gradient accumulation steps
epochs = 3
steps_per_epoch = len(dataset_tokenized["train"]) // (bs * ga_steps)
lr = 0.0005

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs,
    evaluation_strategy="steps",
    logging_steps=1,
    eval_steps=steps_per_epoch,
    save_steps=steps_per_epoch,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    lr_scheduler_type="constant",
    learning_rate=lr,
    group_by_length=True,
    bf16=True,                  # mixed precision training
    save_safetensors=False,     # saving will fail without this
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)

trainer.train()
First training run (full finetune). Shot in the dark. Way overtrained.
20 minutes per epoch on 3×3090
22.5 GB VRAM allocated with a batch size of 4 and 1,024-token sequences

learning_rate: Probably the single most important hyperparameter to play with here. As you will see in the next section, my initial choice of learning_rate=0.0005 was a poor one.

First, let’s see how this finetune turned out (spoiler: bad) and how to fix it.

Evaluation

Chatbot evaluation is hard because the outcome is hard to measure.

What is a good conversational/instruction-following model? There’s more than just one solution to this problem. Until someone figures out how to apply something like this properly, we will have to rely on benchmarks, chatbot arenas and AI judges.

Benchmarks

The Mamba authors published a table of numbers collected using EleutherAI/lm-evaluation-harness. This is what I started with.

Table 3 from "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", Albert Gu, Tri Dao, arXiv

I am skeptical about these numbers too. But since I have no experience with Mamba I used them as a starting point to see if the finetune is moving in the right direction.

The model deteriorates with a learning rate of 5e-4. Same benchmarks as reported in the original paper, Appendix E2.3.

We are actually destroying the model with this finetune. As I will try to convince you below, the learning rate (LR) of 0.0005 is too high. There is enough text in this story already, so I will spare you the actual output of this finetuned model; it has serious trouble talking.

Where to start?

The actual learning rate used for pretraining Mamba is not clear to me. In the paper, the authors state the following:

By default, the peak learning rate is the GPT3 specification.

We give several models an “improved recipe”, inspired by changes adopted by popular large language models such as PaLM (Chowdhery et al. 2023) and LLaMa (Touvron et al. 2023). These include:

• linear learning rate warmup with cosine decay to 1e-5, with a peak value of 5× the GPT3 value

GPT3-Paper. Table 2.1. Language Models are Few-Shot Learners, Brown et al., arXiv

Does this mean mamba-1.4b was pretrained with a peak LR of 5 × 0.0002, i.e. 0.001? I don't know.

Second try: finetuning with a lower learning rate

For another finetuning trial with a lower learning rate, I decided to go down 10× to 0.00005 (instead of 0.0005).

blue = initial finetune with LR 0.0005 (5e-4), green = second try with LR 0.00005 (5e-5)

Lower loss despite lower LR? Yes. No mistake here, I re-ran this to make sure.

Let’s look at the numbers.

Benchmarks for the finetune with the lower LR. lambada_openai_ppl is not shown here: it goes down from 5.04 (base) to 4.59 (epoch 3).

This time we are moving in the right direction.

I ignored Goodhart's law and tried several things to improve this further, changing the LR, the number of epochs, and the dataset; none of the following gave me better numbers:

  • Open Assistant (OA) dataset: lower LRs of 3e-5 and 2e-5.
  • OA dataset: more epochs.
  • Another dataset: HuggingFaceH4/ultrachat_200k. Surprisingly poor performance, in numbers and model “vibes”.

Talking to mamba-1.4b-OA

I continued with mamba-1.4b finetuned on the OA dataset with an LR of 0.00005.

Post-finetuning, Mamba's conversational skills improved significantly, though it still lagged behind TinyLlama (as I will show later). TinyLlama, however, has undergone much more extensive pretraining.
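If you want to chat with your own finetuned model, a minimal loop run right after trainer.train() could look like the sketch below (my own code; the ChatML formatting simply mirrors the training data):

# minimal chat loop with the finetuned model (sketch)
history = ""
while True:
    user = input("user: ")
    history += f"<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
    input_ids = tokenizer(history, return_tensors="pt").input_ids.to("cuda")
    out = model.generate(
        input_ids=input_ids,
        max_length=input_ids.shape[1] + 256,
        cg=True,
        enable_timing=False,
        temperature=0.7,
        top_k=40,
        top_p=0.1,
    )
    # keep only the newly generated tokens and cut the reply at the end-of-turn tag
    reply = tokenizer.decode(out[0][input_ids.shape[1]:]).split("<|im_end|>")[0].strip()
    history += f"{reply}<|im_end|>\n"
    print("assistant:", reply)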

Here is a conversation I had with mamba-1.4b-OA (ChatML tags <|im_end|> <|im_start|>assistant <|im_start|>user removed):

hello

Hello there! How can I help you?

tell me a joke

Ok, here’s a joke:

Why did the tomato turn red?

.

.

.

Because it saw the salad dressing!

tell me about quantum mechanics

Quantum mechanics is a branch of physics that describes the behavior of matter and energy at the atomic and subatomic level. It is based on the principles of quantum theory, which was developed by Albert Einstein in the late 19th century.

In quantum mechanics, particles, such as electrons and photons, can exist in multiple states or locations at once, known as superposition, and can be in a state of uncertainty or randomness, known as wave-particle duality. This is in contrast to classical physics, which describes objects as either being present or not present.

Quantum mechanics also describes the behavior of particles at the subatomic level, including the behavior of atoms and molecules. It is based on the principles of quantum field theory, which was developed by Albert Einstein in the late 19th century.

Quantum mechanics has led to the development of many technologies, such as transistors, lasers, and superconductors, and has important implications for the understanding of the behavior of the universe as a whole, including the nature of dark matter and dark energy.

what’s a photon?

A photon is a particle that is emitted by an atom or a molecule and travels at the speed of light. It is a type of quantum particle, which means it has both a wave and a particle nature.

Photon emission is a process that occurs in certain types of materials, such as diamond, when it is exposed to high-energy radiation, such as light or X-rays. When this occurs, the material emits a single photon in a specific direction.

Photon emission is used in various applications, such as in laser technology, where it is used to generate a beam of coherent light. It is also used in quantum computing, where it is used to create entangled photon pairs, which are used to perform quantum computations.

Overall, photon emission is a fundamental aspect of the behavior of matter and energy, and is an important aspect of our understanding of the universe.

what is the speed of light?

The speed of light in a vacuum is 299,792,458 meters per second (m/s). This constant is defined by the International System of Units (SI) and is used in many equations and calculations in physics, including those related to electromagnetism and quantum mechanics.

is that faster than an airplane?

The speed of light in a vacuum is a constant and is defined by the International System of Units (SI). It is 299,792,458 meters per second (m/s). However, the speed of light in a vacuum is not the same as the speed of an airplane.

The speed of an airplane is typically measured in knots (about 58 miles per hour or 37 kilometers per hour). The speed of an airplane is determined by the airspeed velocity of the aircraft, which is calculated using the following equation:

Airspeed Velocity = ( wingspan / 6 ) + ( length of propellers / 6 )

In this equation, the wingspan is the distance from the nose to the tail, and the length of the propellers is the distance from the propeller hub to the propeller tips.

Therefore, the speed of an airplane is determined by the length of the propellers and the wingspan, and is not directly related to the speed of light.

In contrast, the speed of light in a vacuum is a constant and is defined by the SI unit of metre per second (m/s). Therefore, the speed of light in a vacuum is not affected by the wingspan or length of propellers of an airplane.

an airplane only travels with 37km/h, you sure?

Yes, an airplane travels at a maximum speed of approximately 37 knots (approximately 58 miles per hour or kilometers per hour). This is the maximum speed that an airplane can achieve under normal conditions, and it is defined by the Federal Aviation Administration (FAA) in the United States.

However, it is worth noting that the speed of an airplane can vary depending on various factors, such as the type of aircraft, the weather conditions, and the altitude of the aircraft. In some cases, an airplane may also need to slow down or increase its altitude in order to maintain a specific speed.

Overall, the speed of an airplane is a complex factor that depends on a variety of factors, and it cannot be directly measured in kilometers per hour or any other specific speed.

one mile is not the same as one kilometer!

Yes, you are correct. One mile is not the same as one kilometer. One mile is equivalent to 6,371 meters, while one kilometer is equivalent to 1000 meters.

One kilometer is the distance that is equivalent to one mile multiplied by 1000. This means that one kilometer can be calculated by multiplying 1000 by 1.

One mile is the distance that is equivalent to one kilometer multiplied by 1.

Therefore, one mile is equivalent to 6,371 meters, while one kilometer is equivalent to 6371 meters.

back to quantum mechanics: you talked about photons. i thought light is a wave and not a particle!

Light is both a wave and a particle. Light is a wave at a microscopic level, but at a macroscopic level, light behaves more like a particle.

Light is made up of tiny particles called photons. Photons have certain properties, such as mass and energy, that allow them to travel at the speed of light.

When light reaches a surface, such as a window or a mirror, it interacts with the surface and becomes scattered. The direction and speed of the scattered light particles depend on the properties of the surface, such as its roughness and its refractive index.

The speed of light in a vacuum is a constant and is defined as 299,792,458 meters per second (m/s). This constant is used in many equations and calculations in physics, including those related to light and the behavior of light in different media, such as glass or water.

In quantum mechanics, photons are described as particles that have a wave-like nature. This means that photons can be in multiple states at the same time, and their wave-like properties can cause them to behave like particles.

In quantum mechanics, light behaves more like a wave than a particle. This is because light interacts with matter at the quantum level, and light particles can be in multiple states at the same time.

In quantum mechanics, the speed of light is not a constant. It depends on the properties of the light and the medium through which it is traveling. In some cases, the speed of light can be very close to the speed of light in a vacuum, but in other cases, it can be much higher or much lower.

In some cases, the speed of light can be very close to the speed of light in a vacuum. For example, in a vacuum, light can travel at the speed of light in a vacuum without experiencing any refraction or diffraction.

The quality of this conversation does not knock me off my chair, but it's a good start, especially considering that the base model has been trained on only 300B tokens.

🐍 Mamba versus 🦙TinyLlama: Quality and inference speed

TinyLlama is another small model with 1.1B parameters and was trained on 1.5T tokens (so far).

Quality

In the previous section I talked to the mamba-1.4b finetune. I tried to have a similar conversation with the chat version of TinyLlama (TinyLlama-1.1B-Chat-v0.4), which was finetuned on the same dataset. The result is a lot better.

hello

Hi there! How can I help you today?

tell me a joke

What time is it?

don’t know

You probably shouldn’t ask, as I don’t have an answer for that. Is there anything else you need assistance with?

yes, tell me about quantum mechanics

Quantum mechanics is the branch of physics that describes the behavior of matter and energy at the atomic and subatomic level. It was developed in the late 1800s by Albert Einstein and John Archibald Wheeler to explain the behavior of light and other particles in the universe.

In quantum mechanics, particles are described by a wave function, which represents their probability of being in a certain state. This means that even though particles exist in multiple states at once, they always return to their initial state when given a chance. This leads to strange phenomena such as superposition and entanglement, where two particles that are separated apart can be linked through a process known as entanglement.

One of the most famous results in quantum mechanics is the Heisenberg uncertainty principle, which states that it is impossible to know both the exact position and the exact momentum of a particle simultaneously. This principle has important implications for our understanding of the nature of reality and the possibility of creating technologies that could lead to a future without limits on human knowledge.

Overall, quantum mechanics is a fascinating area of physics that continues to challenge our understanding of the world around us. If you would like more information or have any questions, feel free to ask.

what’s a photon?

A photon is a fundamental unit of electromagnetic radiation, consisting of one atom of subatomic particle: an electron. Photons travel at nearly the speed of light (299,792 km/h). They are the primary component of light and play a crucial role in many aspects of modern science, including digital cameras, laser pointers, and x-ray machines. The discovery of the photon was celebrated in 1965 with the publication of the Nobel Prize in Physics awarded to Sir Arthur Eddington, Frank Close, and Robert Woodward “for their discovery of the photon”.

what is the speed of light?

The speed of light in vacuum is approximately 299,792 kilometers per second (or 106 meters per second), also known as c. This value is considered to be the upper limit for the speed of any physical phenomenon in the universe. However, this value does not apply to all forms of matter and energy, which can exhibit faster than light speeds.

is that faster than an airplane?

Yes, if we consider only visible light. A fast moving airplane can reach up to 430 miles per hour, but that doesn’t mean its speed is greater than 299,792 kilometers per second. Airplanes move much slower than visible light.

you talked about photons. i thought light is a wave and not a particlee!

Light is a wave, just like sound waves. When you look into the sun or a mirror, you see light as a wave. But when you look at something like a TV screen or a computer monitor, or when you use a camera, the light is actually made up of tiny particles called photons. These photons flow in patterns, just like the way water flows in a river.

Photons are so-called because they were discovered by the German physicist Heinrich Hertz in 1887 while he was experimenting with telegraphy. Before then, people had believed that light came from stars, but Hertz showed that light was coming from somewhere else — the sun. In fact, the sun emits more light than all the stars put together.

Today, scientists use specialized equipment, such as telescopes and satellites, to study the universe using light from the sun and other stars. They use these images and data to learn about how the universe works and what life might be like out there.

But, again, that’s comparing apples to oranges since TinyLlama had a lot (5 times) more pretraining data.

Inference Speed

A key aspect of Mamba is its reduced resource intensity. The benchmarks below show that Mamba handles large numbers of tokens efficiently, outperforming TinyLlama in both speed and memory usage.

The Mamba authors included a benchmark script in their repository.

for model in "state-spaces/mamba-1.4b" "models/TinyLlama-1.1B-Chat-v0.4"
do
    for prompt_len in 100 500 1000 2000 5000 10000
    do
        python3 benchmark_generation_mamba_simple.py --model-name ${model} --promptlen ${prompt_len} --genlen 100
    done
done

Here are the numbers:

100 tokens generated. Average t/s of 3 repeats shown. NVIDIA GeForce RTX 3090 (24 GB VRAM)

The speed is remarkable. So is the fact that with a prompt of 10k tokens TinyLlama runs out of memory (24 GB VRAM), while Mamba generates at 100 tokens/sec using 5 GB of VRAM. Impressive.

Long context

So Mamba is able to process a 10k-token prompt with a few GB of VRAM?

Let’s see what the actual output is.

  • Pasting an entire book into the prompt (136k tokens) and asking Mamba to summarize it with the main points. The result: garbage, random tokens.
  • A random article about triathlon (3.2k tokens): it does produce English text, a summary of the 10 main points, but repetitive and hallucinated.

I went even lower and cut the article in half (1.54k tokens): the result is much better!

A conversation between a human user and a smart AI assistant. The assistant follows instructions closely.

<|im_start|>user
Hello!
<|im_end|>
<|im_start|>assistant
Hello! How can I help you today?
<|im_end|>
<|im_start|>user
I need you to summarize something for me. The folllowing is a text. Please provide a good summary containing the 10 most important points of the text.{text}
<|im_end|>
<|im_start|>assistant
Sure! Here is the summary in 10 main points:

1. Lionel Sanders talks Samorin, St. George, equipment changes and equipment changes on his bike and position.

2. Lionel is champion of The Championship in 2017 and has been a long distance triathlete since then.

3. Lionel is a swimmer, racing the best.

4. Lionel is becoming a’swimmer’ again.

5. Lionel is preparing for a return to Europe.

6. Lionel overcomes travel difficulties to Europe.

7. Lionel will be racing again at The Championship next year.

8. Lionel’s schedule is full but he is continuing to push himself.

9. Lionel’s growth challenges and interests are growing.

10. Summary of Lionel Sanders’ return to Europe, St. George, The Championship and his growth as an athlete.<|im_end|>

Content-wise not perfect (he finished second, not first, in 2017).

The reason it fails to generate quality content for longer inputs is presumably that it was pretrained with a context length of "only" 2048 tokens (section 4.2.2 of the Mamba paper). So maybe finetuning a small Mamba model such as mamba-1.4b on longer sequences would unlock its potential to summarize large texts. That would be nice. Someone should do that.
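If someone does, most of the pipeline above would stay the same; the obvious change is the tokenization step, sketched here (4,096 is an arbitrary value I picked, and VRAM usage grows accordingly):

# sketch: tokenize with a longer context for a long-context finetune
def tokenize_long(element):
    return tokenizer(
        element["text"],
        truncation=True,
        max_length=4096,   # arbitrary choice for illustration
        add_special_tokens=False,
    )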

Last experiment. Let’s see how TinyLlama would process the same 1.54k token prompt.

1. Lionel Sanders talks Samorin, St. George and equipment changes.

2. Experience over pay checks. Racing the best. Becoming a ‘swimmer’. 20-metre draft rule and big changes on his bike and position. It’s always interesting to chat with Lionel Sanders.

3. Experiment with different training regimes. Learn from mistakes.

4. Training hard helps but sometimes you need rest. Take breaks.

5. Mental game is key. Work on self belief.

6. Comebacks happen. You just need to believe in yourself.

7. New challenge in Samorin.

8. Challenges ahead. Goals for 2023.

9. Future plans unclear yet.

10. Lionel Sanders is a champion.

Note: This is only a summary and may contain errors. If you want the exact quote, please ask me directly. :)

Yes, next time we will talk directly! 😆

Summary

Let’s wrap up. From this experiment I learned that:

  • 🐍 Mamba is fast and handles huge amounts of tokens.
  • Finetuning is a bit tricky at the moment; this will improve once Mamba is integrated into 🤗 Transformers.
  • 🦙 TinyLlama generates better text than Mamba, presumably because it has been pretrained on 5x the amount of data.

Will the snake catch up? We will find out once the community or someone else (pre)trains it on enough tokens. I encourage you to use these findings and code as a base and start toying around with the model yourself! Share your experiences or questions in the comments below or reach out on Twitter.
