Train from scratch and Fine-tune an Instruct Llama2 model in PyTorch

Cindy Sridykhan
6 min read · Aug 29, 2023

Recently, Andrej Karpathy published a self-contained repository (llama2.c) to train, in Python and PyTorch, a small version of Llama2 that generates tiny stories. You would, for example, input the beginning of a sentence:

Once upon a time, Sam

and the model would continue with a story

Once upon a time, Sam the rabbit was out playing in the forest. He wanted to try and reach the top of a big tree. He wanted to be able to see the whole forest from up there. He tried his best, ….

I trained and fine-tuned my own models so that they are able to follow instructions and write tiny stories accordingly. Here is a demo:

Instruction Following Story Teller Bot

In comparison, when you give the same input to Karpathy’s model stories110M.bin, here is the output:

Write a story. In the story, try to use the verb “run”, the noun “flower” and the adjective “clever”. Possible story: wrote: “You should never follow anyone’s nose”. One day, a little girl named Lucy was feeling very restless. She wanted to do something exciting. So, she decided to follow her nose! At first, Lucy’s nose was easy to follow. But, soon, she was going too fast and getting lost. She tried to write the name of the event, but she couldn’t find it. Lucy felt very upset. Just then, Lucy saw a friendly old lady. She ran up to her and asked for help. The old lady told Lucy to follow her nose. Lucy followed the nose, and it lead her to a funny, wonderful place. The old lady taught Lucy an important lesson: when you go exploring, follow your nose. It always helps you to find your way. That’s why Lucy followed her nose, and now she was very happy.

In this blogpost, I share how I trained and fine-tuned models to follow simple instructions. For more details, you can refer to the code.

The TinyStories dataset

The dataset used is the TinyStories dataset. It contains short stories that have been generated by GPT-3.5 and GPT-4, using prompts like this:

Write a short story (3–5 paragraphs) which only uses very simple words that a 3 year old child would understand. In the story, try to at some point use the verb “hope”, the noun “search” and the adjective “comfortable”. Remember to only use simple words! Possible story:

A generated story would then be:

Once upon a time, there was a little girl named Lily. She had a teddy bear that she loved so much. One day, she lost it while playing in the park. She looked everywhere, but she couldn’t find it. She felt sad and scared without her teddy bear.
Lily’s mommy saw her crying and asked what was wrong. Lily told her that she lost her teddy bear. Mommy hugged her and said, “Don’t worry, we’ll search for it together.” They went back to the park and looked everywhere. After a while, they found the teddy bear under a tree. Lily was so happy!
She hugged her teddy bear and felt comfortable again. She said, “I hope I never lose you again, teddy bear.” Mommy smiled and said, “Me too, Lily. You and teddy bear are the best of friends.” And they all went home, happy and content. The end.

In Karpathy’s repo, the models are trained only on generated stories (and not on prompts). He concatenated all the stories together, and trained the network to do next token prediction.
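To make this concrete, here is a rough sketch of that setup (names like tokenized_stories, block_size and the batching details are illustrative, not taken verbatim from the repo):

```python
import torch

def build_stream(tokenized_stories):
    # Concatenate every story's token ids into one long training stream.
    return torch.tensor(
        [tok for story in tokenized_stories for tok in story], dtype=torch.long
    )

def get_batch(data, block_size, batch_size):
    # Sample random windows from the stream; Y is X shifted by one token,
    # which is exactly the next token prediction target.
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x, y
```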

That way, when you start with “Once upon a time, Sam”, the model outputs a coherent story that continues this beginning. However, when you input an instruction like “Write a story. In the story, try to use the verb “run”, the noun “flower” and the adjective “clever”. Possible story:”, the model is lost since it has never seen anything like that in the training set.

Building the Instruction Dataset

Incorporating the prompts in the training set

Since the prompts are available in TinyStories, we can use them to train our instruct model.

My first idea was simply to concatenate the prompts and the generated stories and train for next token prediction, as Karpathy did. However, with this method, the network also learnt to predict the prompt and therefore didn’t yield satisfying stories.

What worked better was to still train on next token prediction, but this time mask the prompt tokens in the loss. The model would therefore not be rewarded for predicting the prompt.

More specifically, I concatenated each pair (prompt + story). This gives one sample X, fed to the model. The corresponding sample Y is then shift(masked_prompt + story), where “shift” means shifting the tokens by one position, as usual for next token prediction. For masking, I simply replace the prompt tokens with -1 and pass ignore_index=-1 to PyTorch’s torch.nn.functional.cross_entropy function.
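Here is a minimal sketch of how such a sample pair and the masked loss can be built (variable names like prompt_ids and story_ids are mine, and the actual code may differ in the details):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -1  # label value that cross_entropy will skip

def build_sample(prompt_ids, story_ids):
    # X is the full (prompt + story) token sequence.
    x = torch.tensor(prompt_ids + story_ids, dtype=torch.long)
    # Y is the same sequence with prompt positions replaced by -1, then shifted by one.
    y = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + story_ids, dtype=torch.long)
    return x[:-1], y[1:]

# In the training loop, with logits of shape (batch, seq_len, vocab_size):
# loss = F.cross_entropy(
#     logits.view(-1, logits.size(-1)), y.view(-1), ignore_index=IGNORE_INDEX
# )
```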

Prompts preprocessing

Because the initial goal of the prompts in TinyStories was to have GPT models generate stories, they are a bit verbose. I therefore transformed the prompts. For example,

Write a short story (3–5 paragraphs) which only uses very simple words that a 3 year old child would understand. In the story, try to at some point use the verb “hope”, the noun “search” and the adjective “comfortable”. Remember to only use simple words! Possible story:

would become

Write a story. In the story, try to use the verb “hope”, the noun “search” and the adjective “comfortable”. Possible story:

I did this first because the model is specifically trained to only output short stories that “use very simple words that a 3 year old child would understand”, so that part of the prompt adds no information. And second, because it allowed me to use a maximum sequence length of 350 tokens and still keep each pair (prompt + story) complete without truncating stories. I removed all samples where the number of tokens for (prompt + story) was larger than 350 and could still keep 94% of the dataset.

I then padded all samples that were shorter than 350 tokens.
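Roughly, the preprocessing looks like this (the exact replacement strings, the tokenizer call and the padding id below are assumptions on my part):

```python
MAX_LEN = 350
PAD_ID = 0  # assumed padding id; the real value depends on the tokenizer

def shorten_prompt(prompt: str) -> str:
    # Strip the boilerplate that every TinyStories prompt shares.
    # The exact wording/punctuation in the dataset may differ slightly.
    prompt = prompt.replace(
        "Write a short story (3–5 paragraphs) which only uses very simple words "
        "that a 3 year old child would understand.",
        "Write a story.",
    )
    prompt = prompt.replace("Remember to only use simple words!", "")
    prompt = prompt.replace("try to at some point use", "try to use")
    return " ".join(prompt.split())  # normalize whitespace

def make_example(prompt, story, tokenizer):
    ids = tokenizer.encode(shorten_prompt(prompt) + " " + story)
    if len(ids) > MAX_LEN:
        return None  # drop the ~6% of samples that are too long
    return ids + [PAD_ID] * (MAX_LEN - len(ids))
```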

Training and Fine-Tuning

I trained two models using the dataset I’ve described in the last section.

The first is a 15M parameters model that I trained from scratch.

For the second, I took Karpathy’s 110M parameter model that was trained to generate stories (without instructions) and fine-tuned it with LoRA. For LoRA, I took inspiration from wlamond’s PR, which uses PyTorch’s parametrizations feature and is very handy. I used a rank of 2 and added LoRA to Wq, Wk, Wv and Wo. I chose this because in the LoRA paper, a rank of 2 applied to Wq, Wk, Wv and Wo already seemed to yield good results.
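As a sketch, LoRA via parametrizations looks roughly like this (I assume attention projections named wq/wk/wv/wo and a .layers list of transformer blocks, as in llama2.c’s model; the initialization and scaling are the usual LoRA-paper choices, not necessarily what the PR does):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class LoRAParametrization(nn.Module):
    """Re-parametrize a weight as W + (alpha / rank) * B @ A (low-rank update)."""
    def __init__(self, out_features, in_features, rank=2, alpha=2.0):
        super().__init__()
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))  # starts at zero
        self.scale = alpha / rank

    def forward(self, weight):
        return weight + self.scale * (self.lora_B @ self.lora_A)

def add_lora(model, rank=2):
    # Freeze the pre-trained weights; only the LoRA matrices receive gradients.
    for p in model.parameters():
        p.requires_grad = False
    for layer in model.layers:  # assumes a list of transformer blocks
        for name in ("wq", "wk", "wv", "wo"):
            linear = getattr(layer.attention, name)
            out_f, in_f = linear.weight.shape
            parametrize.register_parametrization(
                linear, "weight", LoRAParametrization(out_f, in_f, rank=rank)
            )
```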

In the end, the fine-tuned model seemed to perform better at crafting stories that follow the prompt. This was not surprising, since it is larger and had been pre-trained longer by Karpathy. (I trained it for ~5 hours to reach a loss of ~0.8.)

Generation and Limits

For generation, I stop the model whenever it outputs an end-of-sequence token or a “The end”. I would say this works about 70% of the time, but sometimes the model continues with a second story.
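The stopping rule is roughly the following (model.sample_next, the tokenizer interface and eos_id=2 are placeholders, not the actual API of the repo):

```python
def generate_story(model, tokenizer, prompt, max_new_tokens=350, eos_id=2):
    # Stop on the end-of-sequence token, or once the generated text says "The end".
    ids = tokenizer.encode(prompt)
    prompt_len = len(ids)
    for _ in range(max_new_tokens):
        next_id = model.sample_next(ids)  # placeholder for the model's sampling call
        if next_id == eos_id:
            break
        ids.append(next_id)
        if "The end" in tokenizer.decode(ids[prompt_len:]):
            break
    return tokenizer.decode(ids[prompt_len:])
```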

Currently, the models only support prompts like ‘Write a story. In the story, try to use the verb “{verb}”, the noun “{noun}” and the adjective “{adj}”. The story has the following features: it should contain a dialogue. Possible story:’, that is, prompts that look like the ones in the training set. Moreover, in order for the story to make sense, the given verb, noun and adjective must be common words that are present in the training set. It would be interesting to make the dataset more diverse.

Conclusion

Voilà! That’s how I trained my models to follow simple instructions to write tiny stories. You can try it on your own, by following the README of the repo. There are lots of things to improve of course and the models make mistakes, but I thought it was cool that it worked pretty decently!
