Joel Loh Wei En
15 min read · Nov 9, 2023

The Art of Hyperparameter Tuning a GPT: Finding Harmony Between Human and Machine Evaluation (Part 1 of 2)

Embarking on a journey through the realms of neural networks, you might have heard of the bombshell of a video that is Andrej Karpathy’s “GPT from scratch”. At this point you may have watched the video, or even replicated the code, after which you’d be inclined to ask: what’s next? How can we nurture this embryonic, raw neural network model, guiding its evolution, refining its predictions, and cultivating texts that mimic human-written expressions?

Hi! i’m stiww weawning human! how awe uwu today?

Welcome to the artful science that is hyperparameter tuning, where data science meets artistry. Here, choreographers of algorithms, data folk like you and I, weave a ballet, orchestrating metrics to dance to the unpredictable rhythms of human language and intent, composing performances that resonate with life’s multifaceted narratives.

TL;DR “Would the output suffice for broader real-life use by people?”

In this two-part series, we raise the curtains, and focus the spotlight on the “Tiny Shakespeare” dataset, as introduced by Karpathy’s oeuvre. On the stage, we have a cast of characters — letters and symbols — eager to perform a choreographed sequence, ever-ready to craft a theatrical text.

Bridging the gap between human and machine understanding

That said, the reality, like our humanity, is more nuanced than that. Machines do not understand and capture meaning the same way the human brain does (at least at the time of writing…). Consequently, moving the output from gibberish to a collage of nuanced articulations requires that the model be guided, via its hyperparameters, by the subtle touches of human intuition and the precision of metrics.

Don’t mind me! Just tuning some hyperparameters!

Setting the stage: Primer on Transformers and Character-level tokenizers

Let’s acquaint ourselves with the leading characters of the play: The Transformer and Character-level Tokenizer.

If “manners maketh a man”, “Transformers maketh a GPT”

Transformers have gained traction as the language modelling approach of choice due to their ability to process multiple parts of the data simultaneously. For our use case, this translates (pun not intended) to better text generation. This is achieved using:

  • Attention Mechanism: Allows the model to focus on different parts of the input text, paying varying levels of attention to different words, enabling it to capture the intricacies of language and context.
  • Layered Composition: The model stacks multiple layers, each containing several attention “heads” that work in tandem, each contributing to the learning process and enabling the Transformer to understand and generate text with remarkable complexity and nuance (see the sketch below).
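
To make the “attention” idea concrete, here is a minimal, illustrative sketch of scaled dot-product self-attention in PyTorch. The shapes are arbitrary, and using the raw input as query, key, and value is a simplification; the actual model uses learned linear projections, as we will see later.

import torch
import torch.nn.functional as F

B, T, C = 1, 8, 16                         # batch, sequence length, embedding size (illustrative)
x = torch.randn(B, T, C)                   # token representations
q, k, v = x, x, x                          # simplification: real models use learned linear projections
wei = q @ k.transpose(-2, -1) * C**-0.5    # (B, T, T) attention scores between all pairs of tokens
wei = F.softmax(wei, dim=-1)               # each token's "focus" distributed over the sequence
out = wei @ v                              # weighted mix of values, back to shape (B, T, C)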

Arrest this man! He talks in Maths!

A tokenizer is, in essence, a translator: it takes human language and translates it into numbers, the lingua franca of machines. That said, there is a catch. The tokenizer also plays a critical role in balancing meaning against computational efficiency. In our case, each character (letter, punctuation mark, space, etc.) becomes a token, an individual element that the model can learn from and generate.
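
As a toy illustration (the six-character string and the resulting vocabulary here are made up; the real vocabulary is built from the full dataset later), character-level tokenization looks like this:

toy_text = "To be!"
chars = sorted(set(toy_text))                   # [' ', '!', 'T', 'b', 'e', 'o']
stoi = {ch: i for i, ch in enumerate(chars)}    # character -> integer lookup
print([stoi[c] for c in "To be!"])              # [2, 5, 0, 3, 4, 1] -- one integer per character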

Viewing Hyperparameter Tuning as Artistry

  • Intuition Over Algorithms: Hyperparameter tuning leans into a practitioner’s intuition, akin to how an actor adds a unique flair to a character.
  • Unique Solutions for Unique Problems: Just as every artwork is unique, hyperparameter tuning requires a bespoke strategy for every dataset and model.
  • Trial and Iteration: The tuning process requires frequent reassessments and recalibrations, until the practitioner achieves a state of (near-)alignment with the desired outcome.
  • Balancing Objectivity and Creativity: While metrics are crucial, there is space for out-of-the-box thinking to achieve peak performance.

Model Architecture explained

In Andrej Karpathy’s ‘GPT from scratch’ video, we were introduced to the following model.

  • Simple Transformer, with a Character-Level Tokenizer: A Transformer-based neural network that learns and predicts text by examining interactions between individual characters (e.g. ‘A’, ‘a’, ‘ ’, ‘!’, ‘?’).
Simplified encoder-decoder transformer architecture

Delving into Hyperparameters

Fine-tuning a GPT-like model (or Large Language Models, in general) requires, at minimum, a working understanding of hyperparameters. Their mastery is vital to fully harness these models’ capabilities; this article focuses on the following:

  • Learning Rate: Sets the optimization pace, i.e. how much the model weights are updated at each training iteration.
  • Batch Size: Defines data samples per training iteration. Larger sizes stabilize gradient estimates, while smaller ones add randomness.
  • Model Architecture (Layers, Units, Attention Heads): The architecture, including the number of layers, hidden units, and attention heads, defines the model’s capacity and capability for data pattern recognition.
  • Dropout Rate: Dropout is a regularization technique that prevents overfitting by randomly deactivating neurons during training.

The Trade-offs to consider

  • Model Capacity vs. Overfitting: More capacity can recognize intricate patterns but risks overfitting.
  • Batch Size Dynamics: Smaller batches add randomness but aid regularization. Larger batches stabilize gradients at the cost of memory.
  • Learning Rate and Convergence: Setting the learning rate too high can destabilize training; too low can slacken the pace.
  • Regularization vs. Underfitting: Hyperparameters like dropout help strike the balance between over- and underfitting (see the short illustration below).
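
As a quick, self-contained illustration of what dropout does (the rate of 0.2 is arbitrary here; our own model starts with dropout set to 0.0):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.2)          # illustrative rate; our model below starts at 0.0
x = torch.ones(8)
print(drop(x))                    # training mode: roughly 20% of entries zeroed, survivors scaled by 1/0.8
drop.eval()
print(drop(x))                    # eval mode: dropout is a no-op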

In this guide, we’ll navigate the training journey by closely observing the loss values and directly analysing model outputs. Through visual snapshots, we will witness the tangible enhancements in Model 1 as we tweak parameters. However, with Model 2, there is a twist! As the loss values dip to certain levels, the outputs, while becoming seemingly intricate, take a detour into the land of gibberish. It’s all part of “IID”, the “Intricate Iterative Dance” that is hyperparameter tuning!

Code Walkthrough

Before we roll up our sleeves…

  • “Minimum” system requirements: I ran this walk-through on a 2017 2.5 GHz Dual-Core Intel Core i7 MBP with 16 Gigs of RAM. If you have a modern setup, you’re more than good to go!
  • Programming Basics: The next sections assume a working knowledge of Python and Jupyter notebooks.
  • Grab The Code: All the resources and code are available at my git repo

Let the tuning begin!

Loading in the dataset

To keep with the spirit of this article’s inspiration, we begin by loading the “Tiny Shakespeare” dataset that we will be working with.

# Dataset we are working with contains all the works of Shakespeare concatenated together
!curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

Importing required packages

To shape, mold, and refine our data and model for this project, we will rely on the following Python libraries.

  • torch: Our main library for building and training neural networks.
  • torch.nn: Contains pre-defined layers, loss functions, and other neural network building blocks.
  • torch.nn.functional (imported as F): Functional operations such as softmax and cross-entropy used throughout the model code.
# Importing required packages
import torch
import torch.nn as nn
from torch.nn import functional as F

The Prologue: Setting the Stage (Hyperparameters)

Before we begin our performance (training!), we need to set the stage, literally. Our model’s behavior and training process are governed by the hyperparameters specified.

  • Model Architecture: The foundational design of our model. The choices here influence the model’s ability to grasp and generate complex text patterns.
  • Training Dynamics: These parameters guide how our model learns from the data. It’s akin to the rehearsal process where actors refine their roles.
  • Training Control: Directs the overall training process, specifying, for instance, when to take a break and evaluate the performance.
  • Computational Setup: Do you have a GPU? If not, a CPU will suffice.
  • Early stopping setup: An automated director that halts the training if our model isn’t improving, saving us both time and computational resources.
## Model Architecture
n_embd = 64 # Embedding dimension: the depth of understanding the model can achieve
n_head = 4 # Attention heads per layer: balancing breadth and depth of focus
n_layer = 4 # Number of transformer blocks: layers of abstraction

## Training Dynamics
learning_rate = 1e-3 # Speed at which the model "learns"
batch_size = 16 # Number of samples to process in one iteration
block_size = 32 # Sequence length to process in each iteration
### Total tokens = batch_size * block_size
dropout = 0.0 # Going wabi-sabi, embracing beauty in imperfection

## Training control
max_iters = 5000 # Hard stop number for training iterations
eval_interval = 100 # How often (in iterations) to evaluate the model on the train and val sets
eval_iters = 200 # Number of batches to average over when estimating the loss
## Computational setup, for all you NVIDIA cool kids
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Early stopping setup
## Terminate training if validation performance doesn't improve
patience = 10 # Number of evaluation intervals to wait without improvement
best_val_loss = float('inf') # Track the best validation loss
interval_without_improvement = 0 # Counter used by the early-stopping check in the training loop

Prelude to Creation: Pre-Processing Rituals

Now that our stage is set, we need to prepare our script, the dataset. This involves converting our textual data into a language that our model can understand and process.

  • Reproducibility: Reproducibility is the cornerstone of consistency in science. Setting a manual seed empowers you, the reader, to replicate the “randomness” in my code. You can turn this off if desired.
  • Encoding: Our model does not understand language in the same way that we do; it requires an encoding scheme to understand the text. By mapping characters to unique integers, we create a bridge between human-readable text and machine-processable numbers.
  • Train-test split: To evaluate our model’s performance impartially, we reserve a part of our dataset. This ‘unseen’ data acts as a judge, assessing how well our model performs beyond the training script.
# Set the random seed for reproducibility
torch.manual_seed(1337)

# Pre-process data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Translating raw text to a language our model understands
chars = sorted(list(set(text)))                    # unique characters in the dataset
vocab_size = len(chars)
stoi = { ch:i for i,ch in enumerate(chars) }       # character -> integer
itos = { i:ch for i,ch in enumerate(chars) }       # integer -> character
encode = lambda s: [stoi[c] for c in s]            # string -> list of integers
decode = lambda l: ''.join([itos[i] for i in l])   # list of integers -> string

# Train-test split: 90% training and 10% validation
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
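
A quick, optional sanity check that the encoding round-trips cleanly (the sample string is arbitrary):

sample = "To be, or not to be"
print(encode(sample))               # one integer per character
print(decode(encode(sample)))       # should print the original string back
print(vocab_size)                   # 65 unique characters for Tiny Shakespeare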

Setting the Scene: Under the hood of our GPT

If our SimpleGPT module were a play, then every function, class, down to every line of code, sets the stage, paving the way for an intricate performance of calculations and operations.

  • get_batch : Our dependable stagehands, working tirelessly behind the scenes, ushering the data (the actors) to where they need to be by selecting the right batches of data, arranging them in order, and getting them ready for the spotlight.
  • estimate_loss : The perceptive director, constantly ensuring that the performance is on track by gauging the model’s loss on the training and validation splits, so that we stay on course toward the desired output.
  • Head, MultiHeadAttention, FeedForward & Block classes: These classes embody the protagonists and the supporting cast, each with their own role, ensuring the narrative flows well. Much like an intentionally convoluted Shakespearean plot, these classes interact, share data, and transform it, working in harmony to produce the final text.
  • SimpleGPT: This is the play itself. Using embeddings, multiple transformer blocks and layer normalization, the model takes in the data and produces a tale of its own!
  • Optimizer: Think of this as the critic during the rehearsal; this line of code reviews the performance of our tale, gauging where it shines and where it stumbles. It then provides feedback, suggesting tweaks to make the next rehearsal even more compelling.
# Function to generate mini-batches for training and validation
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
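
If you want to peek behind the curtain, here is a small, optional check of what get_batch hands to the model (the exact text printed depends on the random sampling):

xb, yb = get_batch('train')
print(xb.shape, yb.shape)           # torch.Size([16, 32]) twice, i.e. (batch_size, block_size)
print(decode(xb[0].tolist()))       # the raw text behind the first sample
print(decode(yb[0].tolist()))       # the same text shifted one character to the right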

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# creating a simple transformer
class SimpleGPT(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = SimpleGPT()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
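
Before the rehearsals begin, a quick, optional sanity check: an untrained model should do no better than random guessing over the vocabulary, i.e. a cross-entropy of roughly -ln(1/65) ≈ 4.17.

import math

xb, yb = get_batch('train')
logits, loss = model(xb, yb)
print(f"initial loss: {loss.item():.3f} vs. random baseline: {math.log(vocab_size):.3f}")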

Striving for perfection with each passing rehearsal

For every interval in the play, our model rehearses its steps, fine-tuning with every iteration. This is akin to an actor striving to have his true self disappear into the role: only after arduous rounds of feedback can he fully embody the character, much as our model tweaks its understanding with each iteration.

This, however, has to be balanced against the law of diminishing marginal returns, much like the director who “just knows” when further rehearsals would no longer move the needle on performance.

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # Check for early stopping
        if losses['val'] < best_val_loss:
            best_val_loss = losses['val']
            interval_without_improvement = 0 # reset the count
        else:
            interval_without_improvement += 1

        if interval_without_improvement >= patience:
            print("Early stopping due to no improvement in validation loss.")
            break

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

The “Finale”: Text Generation Showtime?

After rigorous rehearsals, we are finally ready to showcase SimpleGPT’s textual prowess. Time for some faux-Shakespeare!

# Time for some faux-Shakespeare
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
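
As an optional variation, you can seed the generation with a prompt of your own instead of a single zero token (the prompt string here is purely illustrative):

prompt = "ROMEO:"
context = torch.tensor([encode(prompt)], dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))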

Right?

We’re not quite at the finish line YET!
To achieve better text outputs, the human touch remains essential!

Refining the model via the “Iterative Tango” that is hyperparameter tuning

When running the model initially with a “base” set of hyperparameters, we observed it attempting to replicate text with a Shakespearean flavour. However, the output was a blend of genuine English and nonsensical gibberish.

First Pass of Model 1; n_embd = 32 n_head = 4 n_layer = 4 learning_rate = 1e-3 batch_size = 16 block_size = 32 dropout = 0

We proceeded to enhance the model’s performance, first by tuning the learning_rate. This hyperparameter plays a pivotal role in controlling the convergence of the training process.

If set too high, the model might overshoot the optimal parameters, leading to erratic behavior.
If set too low, the learning process can become painfully slow and might get stuck in local optima.
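
One pragmatic way to choose between candidate learning rates is a short sweep: train briefly at each rate and compare validation losses. The sketch below is illustrative only; the candidate values and the shortened run length are assumptions, not the exact procedure followed in this article.

for lr in [1e-3, 2e-3, 4.5e-3, 1e-2]:                     # candidate rates (illustrative)
    model = SimpleGPT().to(device)                        # fresh model for a fair comparison
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(500):                               # a short run is enough to compare trajectories
        xb, yb = get_batch('train')
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
    print(f"lr={lr}: val loss {estimate_loss()['val']:.4f}")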

Considering these factors, we adjusted the learning_rate to a value of 4.5e-3. With this new learning rate, our second pass:

  • Demonstrated a superior optimization trajectory, marked by lower loss values in both training and validation.
  • Generated more refined text outputs.
    Notable mentions include names like “HEN ELIZABETH” and dialogues such as “Peace, I her prayets! you.”, which echo authentic characters and phrases from Shakespeare’s canon.
Second Pass of Model 1; n_embd = 32 n_head = 4 n_layer = 4 learning_rate = 4.5e-3 batch_size = 16 block_size = 32 dropout = 0

Subsequently, we focused on the batch_size. This hyperparameter dictates the number of training samples used in one training iteration — fine-tuning this balance ensures we’re harnessing the computational efficiency while still allowing the model to generalize effectively to new data.

A larger batch size gives a more accurate estimate of the loss function, at the cost of more computational resources and reduced generalizability.
A smaller batch size, by contrast, requires fewer computational resources, at the cost of a noisier estimate of the loss function.

After several tests, a batch_size of 24 emerged as the most promising for further adjustments.

  • Here we see that character dialogues, while still interspersed with gibberish, move closer to comprehensible English.
  • This improvement is evident in phrases such as “You NAPsed made venum king that somes” & “But the pount emyple to this gance the fain”.
Third Pass of Model 1; n_embd = 32 n_head = 4 n_layer = 4 learning_rate = 4.5e-3 batch_size = 24 block_size = 32 dropout = 0

Having set the foundational dynamics, we opted to defer tuning the dropout rate. The aim is to first establish a stable model that can later harness the nuanced advantages dropout brings.

We then proceeded to finalize the model architecture. Changes here directly impact how the model processes information, essentially shaping its “thinking” pattern. For this article, we specifically adjusted n_embd, n_head & n_layer, setting them to 128, 4 and 5 respectively (as sketched below).
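
In code, that amounts to updating the hyperparameters and re-instantiating the model. The values below simply mirror the caption of the fourth pass; in practice you would then re-run the training loop from the previous section as before.

## Fourth-pass settings (matching the caption below)
n_embd = 128
n_head = 4
n_layer = 5
learning_rate = 4.5e-3
batch_size = 24

model = SimpleGPT().to(device)                            # re-build the model with the new architecture
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)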

Fourth Pass of Model 1; n_embd = 128 n_head = 4 n_layer = 5 learning_rate = 4.5e-3 batch_size = 24 block_size = 32 dropout = 0

In this fourth iteration, we see a marked improvement in the model capturing the Shakespeare essence, particularly in terms of:

  • Diction and Syntax: Phrases such as “Expiercy, thy sheek foul hime to eaven thy glad” and “The perselves once” appear more aligned with Shakespearean language. Also, the syntax, while still having room for improvement, appears more structured.
  • Character Dialogue: The output displays a more consistent adherence to the Shakespearean style in terms of language usage and character interaction, albeit with bouts of gibberish and abrupt transitions.

Curtain Call: The Iterative Tango of Model Refinement

As the curtains gently sway, hinting at their eventual descent, we pause to recount the strides and missteps of our protagonist, the plucky ‘SimpleGPT.’ From the introductory fanfare, where the ‘Tiny Shakespeare’ dataset took the stage, to the intricate pre-processing pas de deux that translated text to tokens, we’ve choreographed a dance of numbers and algorithms on Python’s versatile stage.

Our story has unfolded in acts of iteration, where ‘SimpleGPT’, as both learner and performer, stepped through its paces — adjusting, adapting, and aspiring to the eloquence of Shakespeare himself. With each tuning pass, our hyperparameters — the guiding light for our performance — conducted the computational dance, setting the tempo for learning rate, the rhythm for batch size, and the choreography for model architecture.

It is via this iterative tango, as we tweaked and twirled the dials, that a semblance of the bard’s cadence began to echo through the model’s outputs. The batch size balance, a meticulous measure between computational prowess and poetic generalizability, brought us closer still to a faux-Elizabethan grace.

Yet, for every stanza that soared, a line of gibberish firmly grounded us, a stark reminder of the journey yet ahead.

We close this act not with a grand finale, but with a promise of continuity. The fourth iteration has shown promising whispers of Shakespearean flair, with improved diction and syntax flowing from its digital quill. The artistry of human intuition remains our faithful director, ensuring that the final brushstrokes on this computational canvas will lead to a masterpiece of machine-generated literature.

So, dear audience, as the curtains close on the first part of our narrative, we anticipate the road ahead. Stay tuned for the second act, where we will delve deeper into the refinement of our model, striving for the day when our GPT can truly claim to have captured the spirit of Shakespeare. Until then, we bid you adieu, with the immortal words from the man himself: “Parting is such sweet sorrow that I shall say goodnight till it be morrow.”

References (links are embedded) — Yes they are free

  1. Attention Is All You Need — Vaswani et al
  2. Let’s build GPT: from scratch, in code, spelled out. — Andrej Karpathy
  3. Deep Learning With Python — Francois Chollet
  4. My git repo

Do consider dropping a follow if you found this valuable!