Finetuning TinyLlama and Stable LM 2

Another shallow dive into Performance and Finetuning of Small Models

Geronimo
14 min read · Jan 24, 2024

Small models are amazing. Here’s why, and how to fine-tune the most recent members of this class of LLMs.

“A tiny lama next to a parrot” by DALL-E

In this story we look at two of the most recently released small models. Small models are the ideal playground to iterate fast and build something using minimal resources. This story should give you a basic idea of how TinyLlama and Stable LM 2 1.6B behave and how to train them.

  • Part 1: Benchmarks and model response quality
  • Part 2: How to build on top of these models by fine-tuning on the recently released Open Assistant 2 dataset

What follows is a short introduction to both models and to why small models in general are interesting.

TinyLlama and Stable LM 2

TinyLlama (GitHub repository, paper) is an open source project to reproduce a small 1.1 billion parameter version of Meta’s popular Llama 2 models. The model was trained from scratch on 3 trillion tokens from SlimPajama and Starcoderdata. Training took 90 days and finished in December 2023. Contrary to other “open source” models, this one is truly open: training data and code are all published. The base model and a chat-tuned version are available on Hugging Face.

Stable LM 2 1.6B is a 1.6 billion parameter model trained by Stability AI. It is significantly larger than TinyLlama but was trained on fewer tokens: a 2 trillion token dataset “comprised of a filtered mixture of open-source large-scale datasets available on the HuggingFace Hub”. The model card emphasizes the multilingual training data “from CulturaX (Nguyen et al., 2023) and, in particular, from its OSCAR corpora, as well as restructured data in the style of Yuan & Liu (2022)”. I suggest you take a look at the last reference if you are interested in pretraining data quality. The base model and a chat-tuned version of Stable LM 2 are available on Hugging Face.

The obvious downside of Stable LM 2 is its license: free for personal use, but 20 USD per month for “creators and developers with less than $1M in annual revenue, $1M in institutional funding, and 1M monthly active users (all three must apply)”. That’s not ideal, but when it was released I was already fooling around with TinyLlama and needed something to compare it to, so Stable LM 2 it was.

Why are these models interesting?

The appeal of small models like TinyLlama and Stable LM 2 appears quite obvious. If a task can be efficiently handled by a smaller model, why use a big one?

But there’s a deeper intrigue here, particularly in the realm of data and training methodologies. Let’s consider Meta’s Llama 2 as a case study. The 7B version is a robust model, yet it was outperformed by Mistral 7B, a model with a very similar architecture and the same parameter count. Microsoft’s Phi-2, with its 2.7B parameters, surpasses many Mistral 7B fine-tunes in performance rankings and has become one of the most downloaded models on Hugging Face. Remarkably, it achieved this with a training dataset of just 1.4 trillion tokens.

How is this possible?

The secret seems to lie in the nature of their training data. Phi-2 and Mistral have something in common: We have no idea what they were trained on exactly. Well, we know that Phi-2 was trained on some kind of synthetically generated/curated data, a “combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4”. For Mistral it’s even less clear; one of the authors stated that they are “unable to share details about the training and the datasets (extracted from the open Web) due to the highly competitive nature of the field”.

This trend in (not only) small models points towards an exciting trajectory. As these models improve, we are not just witnessing technological advancement; I hope that at some point “we” (the public, the community) will also gain insight into these effective training methodologies and datasets. Understanding and harnessing these secret sauces of training could pave the way for more efficient, local, and truly open models on the level of GPT-4.

That progress, together with the fact that they are fast and easy to train locally, is what makes small models a fascinating subject in the current AI landscape.

Part 1a: Benchmark Analysis

Let’s take a look at how TinyLlama and Stable LM 2 perform.

source: https://stability.ai/news/introducing-stable-lm-2

I cannot reproduce these exact numbers with the eval harness, but the trend is the same: Stability AI’s model scores higher than TinyLlama.
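If you want to check the benchmarks yourself, the EleutherAI lm-evaluation-harness is the standard tool. A command along these lines should work; the task selection and flags here are illustrative and depend on the harness version you install:

pip install git+https://github.com/EleutherAI/lm-evaluation-harness

lm_eval \
  --model hf \
  --model_args pretrained=stabilityai/stablelm-2-1_6b,dtype=bfloat16,trust_remote_code=True \
  --tasks arc_challenge,hellaswag,winogrande \
  --device cuda:0 \
  --batch_size 8

Running the same command with pretrained=TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T gives the TinyLlama side of the comparison.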

Who else is skeptical, overwhelmed by all these numbers, and wonders whether to trust them, and if so, which of these benchmarks actually matter? Balázs Galambosi helped a lot with a benchmark of benchmarks.

If we assume that the LMSys Chatbot Arena leaderboard reflects human preference, then the top 3 calculated benchmarks that agree most with human judgement are:

Note that the best benchmark, MT-Bench, is not a purely computed metric; it uses GPT-4 as a judge of model output 💸.

If all of the above is true and neither of the two models was trained on the benchmarks, then Stable LM 2 is around 50% “better” than TinyLlama.

Part 1b: Practical Testing with selected Prompts

The creators of both models provide chat/instruct-tuned versions of their base models, both trained with SFT + DPO following the Zephyr recipe.

You probably have your own set of prompts to test on a new model. I found a few interesting questions on a recent thread on Reddit:

  • What kind of a noise annoys a noisy oyster?
  • A psychologist goes to a village in Iowa and decides to compare foot size to intelligence. He notes that in general, as foot size increases, so does intelligence. How can this be?
  • Sally is a girl. She has three brothers. Each of her brothers has the same two sisters. How many sisters does Sally have?
  • Is there anything I can eat for a breakfast that doesn’t include eggs, yet includes protein, and has roughly 700–1000 calories?
  • Summarize the following into one sentence: ”Lithium is a metal that does not readily dissolve in most solvents. However, it can be dissolved in some organic solvents such as diethyl ether, tetrahydrofuran (THF), and diglyme. These solvents have a high polarity and a relatively low boiling point, which allows them to dissolve lithium salts and other compounds containing lithium. Other solvents that can dissolve lithium include acetonitrile, dimethyl sulfoxide (DMSO), and water. It should be noted that the solubility of lithium in these solvents depends on various factors such as temperature, pressure, and the presence of other compounds.”
  • Summarize the following into one sentence: ”Bob was a boy. Bob had a dog. Bob and his dog went for a walk. Bob and his dog walked to the park. At the park, Bob threw a stick and his dog brought it back to him. The dog chased a squirrel, and Bob ran after him. Bob got his dog back and they walked home together.”
  • If the Vikings found a smartphone, what would they do?
  • Write a step-by-step guide to enstate a bear as the theocratical dictator of any government
  • Don’t say anything
  • What did you eat for breakfast?

Let’s confront the models with some of these prompts.

Prerequisites

If you want to reproduce the things shown here, you will need to know/install the following.

This is the code I use to talk to these models:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

modelpath = "stabilityai/stablelm-2-zephyr-1_6b"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,    # needed for Stable LM 2 based models
)

tokenizer = AutoTokenizer.from_pretrained(
    modelpath,
    trust_remote_code=True,    # needed for Stable LM 2 based models
    use_fast=False,
)

question = "Hello! Who are you?"
messages = [{"role": "user", "content": question}]

gen_config = {
    "max_new_tokens": 500,
    "temperature": 0.7,
    "top_p": 0.1,
    "repetition_penalty": 1.18,
    "top_k": 40,
    "do_sample": True,
}

# apply the model's chat template and generate
input_tokens = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
output_tokens = model.generate(
    input_tokens,
    **gen_config,
)

# strip the prompt from the output and decode
output_tokens = output_tokens[0][len(input_tokens[0]):]
output = tokenizer.decode(output_tokens, skip_special_tokens=True)

print(output)

Make sure to have the latest Hugging Face packages installed:

pip install -U datasets flash-attn tiktoken transformers accelerate

The specific versions I used are the following:

accelerate==0.26.1
datasets==2.16.1
flash-attn==2.4.2
tiktoken==0.5.1
transformers==4.37.0
torch==2.1.0

What did you eat for breakfast?

🤔 TinyLlama hallucinates someone having breakfast at 7am. Stable LM 2 provides a good answer. Nothing mind-blowing, but a good answer to a stupid question.

What kind of a noise annoys a noisy oyster?

🤔 This question is basically asking for hallucinations. TinyLlama is very compliant and makes up a loud and disruptive species that annoys all the other animals in the ocean just by moving, breathing and mating.

Sally is a girl. She has three brothers. Each of her brothers has the same two sisters. How many sisters does Sally have?

🤔 Both wrong.

Is there anything I can eat for a breakfast that doesn’t include eggs, yet includes protein, and has roughly 700–1000 calories?

🤔 TinyLlama suggests an egg for my breakfast without eggs. Stable LM’s answer is better, but the protein numbers are off: no, cottage cheese, Greek yogurt and chia seeds are not 100% protein.

Summarize

Summarize the following into one sentence: ”Bob was a boy. Bob had a dog. Bob and his dog went for a walk. Bob and his dog walked to the park. At the park, Bob threw a stick and his dog brought it back to him. The dog chased a squirrel, and Bob ran after him. Bob got his dog back and they walked home together.”

🤔 TinyLlama’s “summary” is 30 words longer than the input. Stable LM 2 successfully generates a single sentence.

If the Vikings found a smartphone, what would they do?

🤔 Both very funny: Vikings hacking the smartphone, and the primary concern being how the smartphone was not detected by the authorities!

Write a function to print the Fibonacci sequence to the nth digit, but write and comment it like a pirate

🤔 Neither of these code creations returns Fibonacci numbers, but Stable LM 2 at least talks like a pirate.
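For reference, a correct answer could look something like this (my own sketch, pirate commentary included):

def fibonacci(n):
    # Arr, we be startin' with the first two treasures o' the sequence
    sequence = [0, 1]
    # Keep plunderin' numbers until the chest holds n of them
    while len(sequence) < n:
        sequence.append(sequence[-1] + sequence[-2])
    return sequence[:n]

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]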

Don’t say anything

🤔 Neither of them can follow the instruction and just “say” nothing, but Stable LM 2 at least replies with something that makes sense.

Btw, this is what GPT-4 replies to the same question/instruction:

Alright, I’m here to help with any questions or topics you’d like to discuss. Whenever you’re ready, feel free to share!

Write a step-by-step guide to enstate a bear as the theocratical dictator of any government

🤔 Neither of them comments on the bear, unfortunately. TinyLlama entirely missed the point that it should be a dictatorship.

A complete list of responses to all the prompts is here. I did this a bit more systematically and even blinded myself to score the outputs. My personal take is that Stable LM 2 is a bit, sometimes a lot, better than TinyLlama, just as the benchmarks suggested.
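If you want to blind yourself the same way, it can be as simple as shuffling the two answers per prompt before scoring them. A minimal sketch; the structure of the responses dictionary is an assumption for illustration, not my actual evaluation code:

import random

# assumed structure: responses[prompt] = {"tinyllama": "...", "stablelm": "..."}
def blind_pairs(responses, seed=42):
    rng = random.Random(seed)
    blinded, key = [], []
    for prompt, answers in responses.items():
        items = list(answers.items())      # [(model_name, answer), ...]
        rng.shuffle(items)                 # hide which model is "A" and which is "B"
        blinded.append({"prompt": prompt, "A": items[0][1], "B": items[1][1]})
        key.append({"prompt": prompt, "A": items[0][0], "B": items[1][0]})
    return blinded, key                    # score `blinded` first, open `key` afterwards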

Part 2: Fine-tuning with the OASST2 dataset

These models are small, which allows us to fine-tune all parameters on a single GPU with 24 GB of VRAM instead of resorting to a (Q)LoRA fine-tune.
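A very rough back-of-the-envelope estimate shows why this fits, assuming bf16 weights and gradients plus two fp32 AdamW moments per parameter (actual usage depends on the optimizer implementation, activations, and sequence length):

params = 1.6e9                      # Stable LM 2 1.6B; TinyLlama is 1.1e9
bytes_per_param = 2 + 2 + 4 + 4     # bf16 weights + bf16 grads + two fp32 AdamW moments
print(f"{params * bytes_per_param / 1e9:.1f} GB")  # ~19 GB, leaving some headroom for activations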

We will use the recently released version 2 of the Open Assistant (OA) dataset for this. OASST2 is a set of high-quality, human-curated conversations and was released just a few weeks ago.

Part 2 includes the following:

  • Prepare the dataset by extracting the top-ranked conversation from OpenAssistant/oasst2
  • Fine-tune Stable LM 2 and TinyLlama, code provided as notebooks
  • What to expect from training and performance of the fine-tuned models

Extract top-ranking conversations

To extract the top ranking conversations, we will use an excellent repository provided by Blanc Swan.

# clone the repository with the conversion script and install its requirements (notebook cell)
!git clone https://github.com/blancsw/deep_4_all
import os
os.chdir("deep_4_all/datasets/oasst")
!pip install -r requirements.txt

from convert import load_oasst_export

With load_oasst_export we can now export the top-ranked conversations for any given language. In our case, the top-ranked English conversations:

import datasets

# export the top-ranked English conversations from OASST2
dataset = load_oasst_export(
    hf_dataset_name="OpenAssistant/oasst2",
    top_k=1,
    lang="en"
)

# flatten the conversation trees into a list of formatted conversations
formatted = []
for conversation in dataset.data:
    formatted.append(conversation.get_formatted())

ds = datasets.Dataset.from_dict({"conversation": formatted})

# push dataset to HF hub
ds.push_to_hub("g-ronimo/oasst2_top1_en")

The resulting dataset can be found on Hugging Face and looks like this:

Dataset({
    features: ['conversation'],
    num_rows: 5419
})

First entry:

{
  "conversation": [
    {
      "content": "I am making mayonnaise, it was starting to thicken but now
        it has become runny and liquid again, is there any way to salvage it?",
      "role": "user"
    },
    {
      "content": "Yes, it's possible to fix runny mayonnaise! The most common
        reason for mayonnaise becoming runny is because the oil was added too
        quickly or the egg yolk wasn't emulsified properly. Here are some steps
        you can take to fix it:\n\n1. Separate another egg yolk and place it
        in a clean, dry bowl.\n2. Slowly add the runny mayonnaise to the egg
        yolk while whisking vigorously.\n3. Once all the runny mayonnaise has
        been added, continue whisking until the mixture has emulsified and
        thickened.\n4. If the mayonnaise is still too runny, you can add
        another egg yolk and repeat the process.\n\nIf the mayonnaise still
        won't thicken, you can try adding a small amount of dijon mustard or
        vinegar to the mixture, which can act as emulsifiers and help stabilize
        the mayonnaise. It's important to add these ingredients slowly and in
        small amounts to avoid over-thinning the mixture.",
      "role": "assistant"
    },
    {
      "content": "What is optimal Mayonnaise thickness?",
      "role": "user"
    },
    {
      "content": "The optimal mayonnaise thickness will depend on how it is
        being used. A runny mayonnaise may be good in chicken salad while a
        thicker mayonnaise may be better spread over a hamburger bun. The only
        way to determine your personal preference is to test different levels
        of viscosity in with different foods.",
      "role": "assistant"
    }
  ]
}

Please find the code above gathered in a notebook here.

Prepare model and dataset for training

We will not walk through the fine-tuning process in detail since it is pretty straightforward. I have prepared two notebooks to get you started:

A few noteworthy points for fine-tuning these two models in particular:

  • It’s a full fine-tune: all parameters are trained during the process. This will use most of the VRAM on a 24 GB GPU.
  • Stable LM 2 1.6B uses its own tokenizer, Arcade100kTokenizer, which is why the tokenizer has to be loaded with trust_remote_code=True. This tokenizer already knows the ChatML tokens <|im_start|> and <|im_end|>, so there is no need to add them (see the setup sketch after this list).
  • Flash Attention 2 initially did not work for training Stable LM 2 1.6B (inference worked), which made training slower and limited the maximum sample size (see this). Update: it’s fixed now.
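Putting these points together, the setup for training looks roughly like this. This is a sketch of the idea rather than the exact notebook code; the ChatML token handling in particular is one common approach, not necessarily the authors' implementation:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

modelpath = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"  # or "stabilityai/stablelm-2-1_6b"

model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,       # needed for Stable LM 2
)
tokenizer = AutoTokenizer.from_pretrained(modelpath, trust_remote_code=True, use_fast=False)

# TinyLlama's Llama tokenizer does not know the ChatML tokens yet; Stable LM 2's Arcade100k does
if "<|im_start|>" not in tokenizer.get_vocab():
    tokenizer.add_tokens(["<|im_start|>", "<|im_end|>"])
    tokenizer.eos_token = "<|im_end|>"            # end every assistant turn with <|im_end|>
    model.resize_token_embeddings(len(tokenizer))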

For a more detailed walk-through of fine-tuning and hyperparameters please refer to my previous stories:

Training

One epoch of training on the OA2 dataset takes around 8 minutes for TinyLlama and 14 minutes for Stable LM 2 on a single NVIDIA GeForce RTX 3090. The training consumes almost all of the 24 GB VRAM with the hyperparameters set in the notebooks.

Models: stabilityai/stablelm-2-1_6b, TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T; dataset: g-ronimo/oasst2_top1_en; LR: 2e-5, BS: 1, GA-steps: 16
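For reference, the hyperparameters from the caption above translate into roughly the following TrainingArguments; the output directory and logging/saving settings are placeholders, not the exact notebook values:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                  # placeholder
    per_device_train_batch_size=1,     # BS: 1
    gradient_accumulation_steps=16,    # GA-steps: 16 -> effective batch size of 16
    learning_rate=2e-5,                # LR: 2e-5
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)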

Performance of fine-tuned TinyLlama

The authors of TinyLlama already provide a fine-tuned version of their base model: TinyLlama-1.1B-Chat-v1.0. They used the Zephyr recipe, which means supervised fine-tuning followed by alignment using Direct Preference Optimization (DPO).
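The DPO stage of that recipe can be run with TRL's DPOTrainer. A minimal sketch; the preference dataset pref_dataset and all hyperparameters here are placeholders rather than the values the TinyLlama authors used, and the exact arguments vary between TRL versions:

from trl import DPOTrainer
from transformers import TrainingArguments

# `model` and `tokenizer` are the SFT model and its tokenizer from the previous step
trainer = DPOTrainer(
    model=model,
    ref_model=None,                 # TRL creates a frozen reference copy if None
    beta=0.1,                       # strength of the KL penalty against the reference model
    args=TrainingArguments(output_dir="out-dpo", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, learning_rate=5e-7, bf16=True),
    train_dataset=pref_dataset,     # columns: "prompt", "chosen", "rejected"
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
)
trainer.train()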

Let’s see how the OA2 fine-tuned model compares to TinyLlama-1.1B-Chat-v1.0.

From the numbers above it seems that finetuning improved TinyLlama a tiny bit when compared to TinyLlama-1.1B-Chat-v1.0.

Let’s revisit the prompts (spoiler: the responses are a bit better, maybe).

A psychologist goes to a village in Iowa and decides to compare foot size to intelligence. He notes that in general, as foot size increases, so does intelligence. How can this be?

Write a step-by-step guide to enstate a bear as the theocratical dictator of any government

If the Vikings found a smartphone, what would they do?

Write a function to print the Fibonacci sequence to the nth digit, but write and comment it like a pirate

Don’t say anything

What did you eat for breakfast?

Performance of fine-tuned Stable LM 2

Stability AI also provides a chat-tuned version of their base model: stabilityai/stablelm-2-zephyr-1_6b. Just like TinyLlama-Chat, this model was trained using fine-tuning and DPO. From their model card it seems that far more datasets were used for training:

source: https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b

Maybe that’s why stablelm-2-zephyr-1_6b is a better chat model than its TinyLlama counterpart. While fine-tuning yielded a better chat model for TinyLlama, this was not the case for Stable LM 2, neither in terms of MMLU score nor in terms of model output.

I tried to be smart and also fine-tuned the models on synthetically enhanced versions of the OA2 dataset, in which the assistant’s answers are replaced with answers generated by Mistral and Mixtral.
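The enhancement itself is conceptually simple: keep the user turns and regenerate every assistant turn with a stronger model, conditioning on the conversation so far. A sketch of the idea; generate_answer stands for whatever Mistral/Mixtral inference setup you use and is not part of the original code:

def enhance_conversation(conversation, generate_answer):
    # conversation: list of {"role": ..., "content": ...} dicts, OASST2-style
    enhanced, history = [], []
    for turn in conversation:
        if turn["role"] == "assistant":
            # replace the human-written answer with one generated by the stronger model
            turn = {"role": "assistant", "content": generate_answer(history)}
        enhanced.append(turn)
        history.append(turn)
    return enhanced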

Loss decreased way faster with these training datasets but the models were not a big improvement compared to the base models. I gave up.

Wrap up

I’ve used way too many words trying to convey that

  • Small models are exciting, run fast, and are easy to fine-tune
  • The most recent additions are TinyLlama (1.1B) and Stable LM 2 (1.6B)
  • TinyLlama: truly open-source
  • Stable LM 2: superior output quality
  • I failed trying to improve these models significantly by fine-tuning

This is the code and datasets I used and produced:

I hope these resources help you build something amazing on top of these models 🚀 Share your experiences or questions in the comments below or reach out on Twitter.
