Replicating TinyStories paper

K.
10 min read · Aug 17, 2023

I tried to train my own tiny 8M-parameter large language model (LLM) locally using NanoGPT on an M1 MacBook and the public TinyStories dataset.

Here’s how it went.

What is TinyStories and NanoGPT?

The TinyStories paper is cool because it showed that very small LLMs — on the order of millions of parameters instead of billions — can have pretty good results.

NanoGPT is a repo that lets you train and finetune small GPT-like models easily by running a single terminal command. The accompanying YouTube video helps explain what is going on.

Dataset

I used TinyStoriesV2-GPT4-train.txt. The original dataset had a mix of GPT3.5 and GPT4 generated data. Why not just use the better, pure GPT4 data (assuming GPT4 >>> GPT3.5)?

The NanoGPT prepare script auto-splits a dataset file into train and validation, so my validation data technically comes from the original paper’s “training” data. I did not use the paper’s own validation set, TinyStoriesV2-GPT4-valid.txt.
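For reference, the split-and-encode step looks roughly like this. This is a character-level sketch in the spirit of nanoGPT’s data/shakespeare_char/prepare.py (my actual run may instead have used the GPT-2 BPE tokenizer via tiktoken, which is nanoGPT’s other common setup; the function name and 90/10 split here are illustrative assumptions):

```python
def prepare(text, val_fraction=0.1):
    """Split raw text 90/10 into train/val and encode each split as
    integer token ids (character-level vocabulary for simplicity)."""
    chars = sorted(set(text))                      # vocabulary = unique characters
    stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
    split = int(len(text) * (1 - val_fraction))    # e.g. first 90% is train
    train_ids = [stoi[c] for c in text[:split]]
    val_ids = [stoi[c] for c in text[split:]]
    return train_ids, val_ids, len(chars)
```

The real prepare.py then writes these id lists to train.bin and val.bin, which train.py memory-maps during training.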

Training script

I created a new config file called train_tinystories.py. Here’s what’s in it:

# Mostly copy/pasted from NanoGPT's config/train_shakespeare_char.py

out_dir = 'out-tinystories-train'
eval_interval = 250 # keep frequent because we'll overfit
eval_iters = 200
log_interval = 10 # don't print too too often

# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False

wandb_log = False # override via command line if you like
wandb_project = 'tinystories-train'
wandb_run_name = 'mini-gpt'

dataset = 'tinystories-train'
gradient_accumulation_steps = 1
batch_size = 64
block_size = 256 # context of up to 256 previous characters

# baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2

# Most of my changes are here
learning_rate = 5e-4 # with baby networks can afford to go a bit higher
max_iters = 35000
lr_decay_iters = 35000 # make equal to max_iters usually
min_lr = 5e-5 # learning_rate / 10 usually
beta2 = 0.99 # make a bit bigger because number of tokens per iter is small

warmup_iters = 100 # not super necessary potentially

# on macbook also add
# device = 'cpu' # run on cpu only
# compile = False # do not torch compile the model

The first run

python train.py config/train_tinystories.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

iter 1995: loss 3.0706, time 153.17ms, mfu 0.07%

iter 1996: loss 3.2295, time 162.26ms, mfu 0.07%

iter 1997: loss 2.7029, time 150.77ms, mfu 0.07%

iter 1998: loss 3.2265, time 150.58ms, mfu 0.07%

iter 1999: loss 3.0386, time 165.41ms, mfu 0.07%

step 2000: train loss 3.0382, val loss 2.9900

saving checkpoint to out-tinystories-train

iter 2000: loss 2.8831, time 2705.26ms, mfu 0.06%

Sample generation

Once upon a time, there was a little girl named Mia. She loved to eat her toys, Lily. One day, she saw a big and wanted to play a tree. She wanted to have a good noise. Lily said, “I’m sorry, mom, Mom. I say sorry, “Mom, you, I want to make you want to play too?”

Her mom sighed and said, “Yes, Mom, mom! I love you. I will be careful and a little bird to share with you. You will be my new friend.”

The old boy came out and got a moment and started to finish the toy. He was happy to eat the race. He was happy to see the car with the whiteberries. The boy and Sam were the toy all sad. The fish played with the ball, and they both lived happily ever after.

Analysis

This ran super quickly. I could already tell this wasn’t going to be good because the loss was still relatively high, at 2.8831. It had some good sentences, like “Once upon a time, there was a little girl named Mia.”, and a lot of grammatically incorrect nonsense like “One day, she saw a big and wanted to play a tree.”

The second run

~2 epochs, 8 layers, 8.01M parameters, learning rate 1e-3

iter 34995: loss 1.9393, time 188.79ms, mfu 0.06%

iter 34996: loss 2.2258, time 183.06ms, mfu 0.06%

iter 34997: loss 2.1812, time 183.48ms, mfu 0.06%

iter 34998: loss 2.3298, time 188.33ms, mfu 0.06%

iter 34999: loss 2.4382, time 181.67ms, mfu 0.06%

step 35000: train loss 2.3065, val loss 2.3500

iter 35000: loss 2.1792, time 3077.97ms, mfu 0.06%

Sample generation

Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say thank you to the pumpkin to a nice place where it was.

The train was so happy that it could roar. It stopped trying to talk and its friends on the path, even when it was hard as hard as its friend the bee could hear. The bee knew that it was important to share and not be able to be dangerous.

<|endoftext|>

Analysis

This time I trained it for longer (~2 epochs) and used 8 layers like in the paper. The loss was still high at 2.1792, and the results look pretty incoherent, as in the first run.
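The 8.01M parameter figure can be sanity-checked with a quick estimate. This assumes nanoGPT defaults as I understand them (GPT-2 BPE vocabulary of 50257, bias=False, a weight-tied output head, and position embeddings excluded from the reported count, which is how nanoGPT’s get_num_params reports it); treat these assumptions as mine, not confirmed from the run:

```python
def gpt_param_count(n_layer, n_embd, vocab_size=50257):
    """Estimate a nanoGPT model's reported parameter count
    (bias=False, lm_head tied to token embeddings, position
    embeddings excluded)."""
    tok_emb = vocab_size * n_embd          # wte, shared with lm_head
    per_layer = (
        2 * n_embd                         # two LayerNorm weights
        + n_embd * 3 * n_embd              # attention c_attn (Q, K, V)
        + n_embd * n_embd                  # attention output projection
        + n_embd * 4 * n_embd              # MLP up-projection (c_fc)
        + 4 * n_embd * n_embd              # MLP down-projection
    )
    final_ln = n_embd                      # final LayerNorm weight
    return tok_emb + n_layer * per_layer + final_ln
```

With n_layer=8 and n_embd=128 this comes out to 8,007,936 ≈ 8.01M, matching what train.py printed, and it shows most of those parameters are the token embedding table rather than the transformer itself.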

The third run

~2 epochs, 8 layers 8 heads, 8.01M parameters, learning rate 5e-4 (took ~2 days)

python train.py config/train_tinystories.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --batch_size=12 --n_layer=8 --n_head=8 --n_embd=128 --dropout=0.0

iter 34996: loss 2.0273, time 606.54ms, mfu 0.08%

iter 34997: loss 1.9364, time 621.76ms, mfu 0.08%

iter 34998: loss 2.0554, time 632.22ms, mfu 0.08%

iter 34999: loss 1.8574, time 624.75ms, mfu 0.08%

step 35000: train loss 2.0390, val loss 1.9897

iter 35000: loss 1.8282, time 12593.57ms, mfu 0.07%

Sample generation

Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say “please” when it was. It would make the pumpkin very happy.

One day, a little girl named Sue came by. She saw the pumpkin and asked, “Why are you sad, pumpkin?” The pumpkin replied, “I want to sleep, but I am a little anxious. I am too big for my pumpkin.” Sue thought for a moment and said, “It’s okay, pumpkin. Let’s play together.”

So, the pumpkin and the pumpkin played together every day.

Analysis / learnings

There were still some grammatically incoherent sentences, but most of them sounded reasonable. There’s very little overarching plot coherence, though (all of a sudden a story changes from being about a pumpkin & girl to an angel & puppy). Here’s what I changed to make it better:

Smaller learning rate

  • I mentioned I was having trouble training my own model to an AI researcher acquaintance and she said that usually the secret is in tweaking the learning rate. So I halved mine to 5e-4.

lr_decay_iters = max_iters

  • When increasing max_iters, I didn’t realize lr_decay_iters also needed to be updated; it should usually be equal to max_iters.

min_lr = learning_rate / 10

  • Another two parameters that are linked are min_lr (the minimum learning rate) and learning_rate: min_lr should usually be learning_rate / 10, so I set it to 5e-5.
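Taken together, these knobs drive nanoGPT’s warmup-then-cosine-decay learning rate schedule. Here’s a sketch of that schedule (paraphrased from my reading of nanoGPT’s train.py; the exact code may differ, so treat this as illustrative):

```python
import math

def get_lr(it, learning_rate=5e-4, min_lr=5e-5,
           warmup_iters=100, lr_decay_iters=35000):
    """Learning rate at iteration `it`: linear warmup, then cosine
    decay from learning_rate down to min_lr."""
    if it < warmup_iters:                  # 1) linear warmup from 0
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:                # 3) flat floor after decay ends
        return min_lr
    # 2) cosine decay: coeff goes 1 -> 0 as it goes warmup -> decay end
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```

This also makes clear why lr_decay_iters should equal max_iters: if the decay window ends early, the rest of training just crawls along at min_lr.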

Using GPT4 to grade the output

The TinyStories paper provided the prompt they used for getting GPT4 to evaluate the tiny model’s output.

Example GPT4 grading prompt from the paper

Here’s what happened when I used this prompt in ChatGPT with GPT4:

My prompt

In the following exercise, the student is given a beginning of a story. The student needs to complete it into a full story. The exercise tests the student’s language abilities and creativity. The symbol *** marks the separator between the prescribed beginning and the student’s completion:

Once upon a time there was a pumpkin. It was a very special pumpkin, it could speak. It was sad because it couldn’t move. Every day, it would say *** "please" when it was. It would make the pumpkin very happy. One day, a little girl named Sue came by. She saw the pumpkin and asked, "Why are you sad, pumpkin?" The pumpkin replied, "I want to sleep, but I am a little anxious. I am too big for my pumpkin." Sue thought for a moment and said, "It's okay, pumpkin. Let's play together." So, the pumpkin and the pumpkin played together every day.

Please provide your general assessment about the part written by the student (the one after the *** symbol). Is it gramatically correct? Is it consistent with the beginning of the story? Pay special attention to whether the student manages to complete the sentence which is split in the middle by the separator ***.

ChatGPT Response

Certainly! Below is my assessment of the student’s completion of the story:

Grammar: There are grammatical errors in the completion. The sentence fragment “So, the pumpkin and the pumpkin played together every day” is confusing, as it refers to the pumpkin twice without clearly defining who or what the second reference is meant to be. It’s likely the student meant to say “So, Sue and the pumpkin played together every day.”

Consistency: The completion is somewhat inconsistent with the beginning of the story. The fragment “Every day, it would say *** ‘please’ when it was.” is incomplete, and the student’s continuation doesn’t address or complete this sentence. Additionally, there is a new idea introduced about the pumpkin wanting to sleep and being too big for itself. While these are interesting additions, they are inconsistent with the previously stated problem of the pumpkin being sad because it couldn’t move.

Completion of the Split Sentence: The split sentence, “Every day, it would say *** ‘please’ when it was.” was not completed coherently by the student. The sentence is left hanging, and the word “please” does not seem to fit or provide a logical continuation of the thought.

Creativity and Storytelling: Despite the grammatical errors and inconsistencies, the student added an imaginative twist to the story by having Sue play with the pumpkin. There’s an attempt to create a heartfelt connection between Sue and the pumpkin.

Overall, the student’s continuation of the story shows creativity but struggles with grammatical correctness and consistency with the initial premise. Guidance on maintaining a coherent narrative and additional grammar instruction could be beneficial.

My prompt

Now, grade the student’s completion in terms of grammar, creativity, consistency with the story’s beginning and whether the plot makes sense. Moreover, please provide your best guess of what the age of the student might be, as reflected from the completion. Choose from possible age groups: A: 3 or under. B: 4–5. C: 6–7. D: 8–9. E: 10–12. F: 13–16.

ChatGPT Response

Certainly! Here’s a grading of the student’s completion:

Grammar: D (Poor grammar with sentence structure issues, including an incomplete sentence.)

Creativity: B (Creative ideas, though they might not fit together perfectly.)

Consistency with the Story’s Beginning: D (The continuation introduces new elements that don’t align well with the given beginning.)

Whether the Plot Makes Sense: C (While there is a connection between Sue and the pumpkin, the overall plot does not entirely make sense, especially with the lack of completion of the split sentence.)

Given the errors in grammar, inconsistencies, and the level of creativity, my best guess for the age of the student would be:

Age Group: D: 8–9.

The content reflects an imaginative approach that might be seen in this age group, though there might be a need for further development in grammar and storytelling cohesion.

Analysis

In the paper, GPT4 would return the grades like this: “Grammar: 8/10, Creativity: 7/10, Consistency: 7/10, Age group: E (10–12)”. Whereas for me, it gave letter grades.

TinyStories paper results. Note their scores for the model of comparable size 8.3M

I asked ChatGPT to give me the scores out of 10, and this is what it said:

Certainly! Here are the scores out of 10 for the student’s completion:

Grammar: 3/10 (The presence of several grammatical errors, including incomplete sentences, justifies this score.)

Creativity: 7/10 (The imaginative twist and attempt to build a connection between the characters supports this rating.)

Consistency with the Story’s Beginning: 3/10 (There are inconsistencies with the original prompt, including the introduction of new elements that aren’t aligned with the beginning of the story.)

Whether the Plot Makes Sense: 4/10 (While there is an attempt to create a narrative, the overall plot does not entirely make sense, especially with the lack of completion of the split sentence.)

These scores reflect the student’s ability to creatively expand on the given start, but also highlight areas where they struggled with grammatical correctness and consistency with the initial premise.

The scores were pretty bad compared to 8/10, 7/10, 8/10 that their 8.3M model got.

Conclusion

Overall, I don’t think my DIY 8.01M TinyStories model performed as well as the 8.3M one in the actual paper. Maybe there are more hyperparameter tweaks I needed to make, like block_size? More epochs? Using the GPT3.5 + GPT4 mixed dataset? Maybe it was because my model was only 8.01M compared to the paper’s 8.3M?

I was actually pretty happy with how well GPT4 did at grading the completion. I have some points of disagreement re: the creativity score, though. I don’t think Sue playing with the pumpkin was that creative.

Questions

  1. What is hidden size and how do I calculate it?
  • The paper says that “the ability to generate a completion that is consistent with the beginning of the story emerges when the hidden size of the model increases from 64 to 128.”

  2. Is block size the same as context window length?

  3. What is n_embd in NanoGPT?
