Building a GPT Model From Scratch in AWS Sagemaker

Building the Model using PyTorch, Python, and AWS services

Priyanthan Govindaraj
13 min read · Jul 26, 2024

Introduction

In this blog, we will create a Generative Pre-trained Transformer (GPT) model from scratch. This character-level language model will be built using AWS SageMaker and S3. AWS SageMaker is one of the leading managed services for machine learning. The entire model follows Andrej Karpathy's YouTube tutorial, which is an excellent resource for learning neural networks and GPT implementations. The implementation uses PyTorch and Python. Let's get started!

Setting Up AWS SageMaker Notebook

  • Navigate to SageMaker services and create a new notebook.
  • Provide a name and select an appropriate instance type.

GPU instances are the best fit for language model training because of their accelerated computation, so you can select any instance from the P2, P3, or G4dn families (note the higher cost). Here we will use an ml.p3.2xlarge instance.

  • Create an IAM role or select an existing role

Create or select an IAM role with access to read input data from an S3 bucket. Ensure the necessary policies are attached.

  • Click “Create Notebook” and then “Open Jupyter”.
  • Select the conda_pytorch_p310 kernel.

Jupyter Notebook is now ready for use!
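
If you prefer to script this setup instead of using the console, here is a rough boto3 sketch of the same steps (the notebook name and IAM role ARN below are placeholders to replace with your own values):

import boto3

sagemaker = boto3.client("sagemaker")

# create a GPU notebook instance (the same type used in this blog)
sagemaker.create_notebook_instance(
    NotebookInstanceName="gpt-from-scratch",                    # placeholder name
    InstanceType="ml.p3.2xlarge",
    RoleArn="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder IAM role ARN
)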

AWS S3 Bucket

Here, we will store our input data in an S3 bucket.

  • In the AWS S3 service, click the “Create Bucket” button.
  • Give the bucket a unique name and create it.
  • Go inside the Bucket and upload the input data.
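
Equivalently, a rough boto3 sketch of these steps (the bucket and file names below are placeholders):

import boto3

s3 = boto3.client("s3")

# create the bucket (outside us-east-1, a CreateBucketConfiguration with the region is also required)
s3.create_bucket(Bucket="<your-unique-bucket-name>")

# upload the training text file
s3.upload_file("input.txt", "<your-unique-bucket-name>", "input.txt")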

Now, let's start coding.

Implementation

Here we will build a GPT model, which is a decoder-only model used for text generation. In contrast, the ‘Attention Is All You Need’ paper describes an encoder-decoder architecture, as it is designed for machine translation.

There are two key differences between this architecture and our model: the absence of the encoder block and the cross-attention component between the encoder and decoder blocks. Therefore, our model architecture is structured as follows:

Here, the same attention and feed-forward blocks will be repeated multiple times. Therefore, our model will be a decoder with multiple blocks.

Dependencies

import torch
import torch.nn as nn
from torch.nn import functional as F
import datasets   # Hugging Face datasets (optional here; the raw text is read directly from S3)
import s3fs       # lets us read the input file directly from S3

Hyperparameters

You can adjust these values as needed.

batch_size = 16        # number of sequences processed in parallel
block_size = 32        # maximum context length for predictions
max_iters = 5000       # total training iterations
eval_interval = 500    # evaluate every 500 iterations
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200       # batches averaged per loss estimate
n_embedding = 384      # embedding dimension
n_heads = 6            # attention heads per block
n_layers = 6           # number of transformer blocks
dropout = 0.2

Training data

We will use an English text for the training and validation data.

# Create an S3 filesystem object
fs = s3fs.S3FileSystem()

# Specify the S3 path to your text file
s3_path = '<your S3 URI>'

# Read the text file directly
with fs.open(s3_path, 'r', encoding='utf-8') as f:
    text = f.read()

Example Input Data:

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is the chief enemy of the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Isn't it a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word: good citizens.

Before training, we need to encode the input text (using the encode function defined in the next section) and split it into training and validation sets.

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be training, rest val
train_data = data[:n]
val_data = data[n:]

Encode and Decode

This model does not understand characters, words, or sentences; it only understands numbers. Therefore, all inputs and outputs need a numeric representation: we encode the input data and decode the output sequence to read the generated text. This involves mapping characters to integers and vice versa.

characters = sorted(list(set(text)))
vocab_size = len(characters)

# characters to integer mapping
ctoi = {ch: i for i, ch in enumerate(characters)}

# integers to characters mapping
itoc = {i: ch for i, ch in enumerate(characters)}

# encoding and decoding functions
encode = lambda s: [ctoi[c] for c in s]
decode = lambda l: ''.join([itoc[i] for i in l])
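
As a quick sanity check (using the mappings above), decoding an encoded string should return it unchanged:

sample = "First Citizen:"
print(encode(sample))          # a list of integers, one per character
print(decode(encode(sample)))  # First Citizen: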

Data loading

In this step, we will create batches of input data for both training and validation.

torch.manual_seed(1337)

def get_batch_data(split):
    """
    Creates the input and target batches for the given split ('train' or 'val')
    """
    data = train_data if split == 'train' else val_data

    # random starting indices for each sequence in the batch
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)

    return x, y

An important consideration is how the input and target batches are selected. Language models predict the next token from the previous tokens, so a single block packs multiple training examples. For instance, an input chunk for a block size of 10 looks like this:

tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64])

We don't feed the entire tensor at once to get the logits. Instead, the data is consumed incrementally, so a chunk of 11 characters packs 10 individual examples:

when the input is [18], the target is 47
when the input is [18, 47], the target is 56
when the input is [18, 47, 56], the target is 57
when the input is [18, 47, 56, 57], the target is 58
when the input is [18, 47, 56, 57, 58], the target is 1
when the input is [18, 47, 56, 57, 58, 1], the target is 15
when the input is [18, 47, 56, 57, 58, 1, 15], the target is 47
when the input is [18, 47, 56, 57, 58, 1, 15, 47], the target is 58
when the input is [18, 47, 56, 57, 58, 1, 15, 47, 58], the target is 47
when the input is [18, 47, 56, 57, 58, 1, 15, 47, 58, 47], the target is 64

To predict the (n+1)-th element, we feed the previous n elements in sequence. That is why, during batching, we choose [i : i + block_size] for inputs and [i + 1 : i + block_size + 1] for targets. The short sketch below prints these packed examples from a real batch.
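
As a small illustration (a sketch using the get_batch_data function above, assuming the hyperparameters already defined), we can print the packed examples inside the first sequence of a batch:

xb, yb = get_batch_data('train')
for t in range(8):                   # first few positions of the first sequence
    context = xb[0, :t+1]
    target = yb[0, t]
    print(f"when input is {context.tolist()} the target is {target.item()}")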

Loss estimation function

To evaluate the model during training, we need to define a function that performs evaluations at specific iteration intervals and outputs the mean value of the training and validation loss.

The crucial part is that the loss calculation should happen without updating the model’s parameters.

@torch.no_grad() # this disables the gradient calculation
def estimate_loss():
    output = {}
    model.eval() # starting the model evaluation step

    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters) # initialize a tensor to store losses
        for j in range(eval_iters):
            X, Y = get_batch_data(split) # get batch data
            logits, loss = model(X, Y)   # get model prediction and loss
            losses[j] = loss.item()      # store loss in the losses tensor
        output[split] = losses.mean()

    model.train() # continuing the training

    return output
  1. The @torch.no_grad() decorator
  • disables gradient calculation during the evaluation.
  • This prevents backpropagation inside this function, which saves memory and compute.

  2. model.eval()
  • sets the model to evaluation mode.
  • This is important because certain layers, such as dropout and normalization, behave differently during training and evaluation.

  3. The function returns a dictionary containing the mean training and validation losses.

Self-Attention

Self-attention is the crucial part of the model. Here, we need to build a multi-head attention mechanism, starting from a single attention head. You can refer to my previous blog for a detailed explanation of self-attention.

class Head(nn.Module):
    """
    Single-head self-attention
    """
    def __init__(self, head_size):
        super().__init__()

        self.key = nn.Linear(n_embedding, head_size, bias=False)
        self.value = nn.Linear(n_embedding, head_size, bias=False)
        self.query = nn.Linear(n_embedding, head_size, bias=False)

        # lower-triangular mask, stored as a (non-learnable) buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Input : (B, T, C)
        Output: (B, T, head_size)
        """
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        # scaled dot-product attention with causal masking
        weights = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        weights = F.softmax(weights, dim=-1)
        weights = self.dropout(weights)

        v = self.value(x)
        output = weights @ v
        return output
  • In the key, query, and value projections, n_embedding is the input feature dimension and head_size is the output feature dimension of the linear layers.
  • bias=False keeps these projections as pure linear maps, which matches the formulation in the original Transformer paper.
  • The lower-triangular matrix prevents future tokens from contributing to the attention weights. This technique, called masking, ensures causality in autoregressive models (a small sketch of it follows this list).
  • In PyTorch, tril is registered as a buffer, meaning it is part of the model's state but not a learnable parameter.
  • Dropout is applied to the attention weights to prevent overfitting.
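
A minimal sketch (with toy values, not part of the model code) of how the lower-triangular mask produces causal attention weights:

import torch
import torch.nn.functional as F

T = 4
tril = torch.tril(torch.ones(T, T))                     # 1s on and below the diagonal
scores = torch.randn(T, T)                              # toy attention scores
scores = scores.masked_fill(tril == 0, float('-inf'))   # block future positions
weights = F.softmax(scores, dim=-1)                     # each row sums to 1 over past tokens only
print(weights)                                          # the upper triangle is all zeros

Each row t can attend only to positions 0..t, which is exactly what the Head class does with its tril buffer.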

Implementing the Multi-Head Attention Layer

To implement the multi-head attention layer, we can use the single-head attention layers defined above. In this case, we use 6 heads for the attention system.

class MultiHeadSelfAttention(nn.Module):
    """
    Creates multiple heads of self-attention using the Head class
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

        # transforms the concatenated output back to the embedding size
        self.proj = nn.Linear(head_size * num_heads, n_embedding)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        output = torch.cat([head(x) for head in self.heads], dim=-1)
        output = self.dropout(self.proj(output))

        return output
  • To hold the multiple attention heads, we use nn.ModuleList() in PyTorch, a container for sub-modules.
  • The outputs of all the heads are concatenated along the last dimension.
  • The self.proj layer projects the concatenated output of all attention heads back to the original embedding size (a quick shape check follows this list).
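
As a quick sanity check (a small sketch assuming the classes and hyperparameters defined above), the multi-head block should preserve the (B, T, n_embedding) shape:

mha = MultiHeadSelfAttention(num_heads=n_heads, head_size=n_embedding // n_heads).to(device)
dummy = torch.randn(batch_size, block_size, n_embedding, device=device)
print(mha(dummy).shape)  # torch.Size([16, 32, 384])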

Feed-Forward Neural Network

The output of the multi-head attention layer is normalized and fed into a feed-forward neural network. This step introduces non-linearity, enabling richer representations and transforming dimensions to facilitate downstream tasks.

In simple terms, while the self-attention layer captures the connections between input tokens, we need a component to understand the content of those connections. This is where these neural networks come into play.

class FeedForwardNN(nn.Module):
    """
    A simple position-wise feed-forward network: linear, ReLU, linear, dropout
    """
    def __init__(self, n_embedding):
        super().__init__()
        self.ff_net = nn.Sequential(
            nn.Linear(n_embedding, 4 * n_embedding),
            nn.ReLU(),
            nn.Linear(4 * n_embedding, n_embedding),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.ff_net(x)

Creating Self-Attention + Feed-Forward Block

The combination of the self-attention and feed-forward components is repeated multiple times in the decoder. In this case, we set n_layers = 6, so this combination is repeated six times.

class Block(nn.Module):
    """
    One masked multi-head attention layer and one feed-forward NN
    """
    def __init__(self, n_embedding, n_heads):
        super().__init__()

        head_size = n_embedding // n_heads   # integer division so head_size stays an int
        self.S_A = MultiHeadSelfAttention(n_heads, head_size)
        self.ffnn = FeedForwardNN(n_embedding)

        # layer normalization
        self.ln1 = nn.LayerNorm(n_embedding)
        self.ln2 = nn.LayerNorm(n_embedding)

    def forward(self, x):
        # residual connections around pre-norm attention and feed-forward
        x = x + self.S_A(self.ln1(x))
        x = x + self.ffnn(self.ln2(x))

        return x
  • Layer normalization stabilizes and accelerates training by normalizing inputs across the feature dimension, independent of the batch size.
  • Residual connections are employed to mitigate vanishing gradient problems, aiding in training deeper networks.

In the original paper, the layer normalization step is applied after the self-attention and feed-forward networks. However, recent improvements suggest that performing normalization before the attention and feed-forward networks yields better performance.

GPT Language Model

Now let’s combine all the individual processes and components to build our GPT model. This is a decoder-only transformer model that uses self-attention mechanisms to consider a broader context (multiple preceding words) for predicting the next word.

class GPTLanguageModel(nn.Module):
    """
    Implements a decoder-only model with multi-head self-attention
    """

    def __init__(self):
        super().__init__()

        self.token_embedding_table = nn.Embedding(vocab_size, n_embedding)
        self.position_embedding_table = nn.Embedding(block_size, n_embedding)
        self.blocks = nn.Sequential(*[Block(n_embedding, n_heads=n_heads) for _ in range(n_layers)])
        self.ln_final = nn.LayerNorm(n_embedding)
        self.lm_head = nn.Linear(n_embedding, vocab_size)

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)

        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        token_embd = self.token_embedding_table(idx)  # (B, T, n_embedding)
        pos_embd = self.position_embedding_table(torch.arange(T, device=device))  # (T, n_embedding)
        x = token_embd + pos_embd
        x = self.blocks(x)
        x = self.ln_final(x)
        logits = self.lm_head(x)                      # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]           # crop to the last block_size tokens
            logits, loss = self(idx_cond)
            logits = logits[:, -1, :]                 # focus on the last time step
            probs = F.softmax(logits, dim=-1)

            idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token

            idx = torch.cat((idx, idx_next), dim=1)   # append to the running sequence

        return idx

Weight initialization

The _init_weights() method is responsible for initializing the weights of the model components to ensure proper learning.

  1. For linear layers, the weights are initialized from a normal distribution (mean = 0.0, standard deviation = 0.02). If the layer has a bias, it is initialized to zero.
  2. For embedding layers, the weights are also initialized from a normal distribution (mean = 0.0, standard deviation = 0.02); embedding layers have no bias term.

Proper initialization helps with faster convergence and reduces vanishing or exploding gradient issues.

Generation

The generate() method handles text generation based on the input sequence:

  • idx is the initial input sequence. We keep only the last block_size tokens so the context never exceeds the position embedding table.
  • Softmax is applied to the logits of the last time step (the most recent token), giving a probability distribution over possible next tokens for each sequence in the batch.
  • torch.multinomial() takes a tensor of probabilities and samples one token per sequence from that distribution; higher-probability tokens are more likely to be selected (see the small sketch after this list).
  • The predicted token is concatenated to idx and used as context for the next prediction.
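
A minimal sketch (with toy logits for a 4-token vocabulary, not part of the model code) of this sampling step:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])        # toy logits, shape (1, vocab_size)
probs = F.softmax(logits, dim=-1)                     # probability distribution over tokens
next_token = torch.multinomial(probs, num_samples=1)  # sampled index, shape (1, 1)
print(probs, next_token)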

Training and validation

Let's start the training process.

model = GPTLanguageModel()
m = model.to(device)

parameters = sum(p.numel() for p in m.parameters())/1e6
print('number of model parameters :', parameters, 'M parameters')
number of model parameters : 10.702913 M parameters

This is a roughly 10M-parameter model, which is comparatively small. Training runs for 5000 iterations, though you can change this value.

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # evaluate at every eval_interval and at the final iteration
    if iter % eval_interval == 0 or iter == max_iters - 1:
        print("-------------")
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    xb, yb = get_batch_data('train')

    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

At every evaluation interval, and again at the final iteration, the model is evaluated to report training and validation losses. These metrics show how well the model is learning as training progresses.

-------------
step 0: train loss 4.2084, val loss 4.2108
-------------
step 500: train loss 2.1915, val loss 2.2019
-------------
step 1000: train loss 2.0039, val loss 2.0909
-------------
step 1500: train loss 1.9052, val loss 1.9984
-------------
step 2000: train loss 1.8381, val loss 1.9632
-------------
step 2500: train loss 1.7756, val loss 1.9147
-------------
step 3000: train loss 1.7193, val loss 1.8746
-------------
step 3500: train loss 1.6650, val loss 1.8287
-------------
step 4000: train loss 1.6545, val loss 1.8153
-------------
step 4500: train loss 1.6299, val loss 1.7999
-------------
step 4999: train loss 1.6032, val loss 1.7887

Once training is complete, we can assess the model’s performance.

Model Performance

To evaluate the model, we seed generation with a (1, 1) tensor of zeros and generate up to 1000 new tokens. The resulting text shows how much of the training data's structure the model has picked up.

context = torch.zeros((1,1), dtype = torch.long, device = device)
print(decode(m.generate(context, max_new_tokens=1000)[0].tolist()))

Although this text doesn’t have a clear meaning, it demonstrates that the model has learned some words, structures, and formats from the input data, as illustrated in the output below.

Like me:
Unlendong prosenave heaven of goose!
But disgn the trum mount to RiChalt that venteel; give.
Give my melt to Blear sting, and that my but plets,
In thinking made am our head you, the packin drim thee,
Bream comes right, MoreuStremplain sens
And sight up somemen men with two hands.

KING RICHARD I:
I thought you would be bein him on-famp, this and to
But for at plaid; Good at blanish
As tothing him truff me doats, must thee!
And aliver, thou arabow.

SARNCE:
And shrave pennort; then grease and make young thing: he should
Spirists upring to the truth. I why war man in dear, like no
deservisTouch thou too m ciky pland, that's all
herd hearl, I pray thy omplice flses to marrion your harb,
And first, no, is stukesse fear:
Capture your master time, unwerch thought it
Come alintime frong touch thever brong goodngciust. Romeo Randur's all men deams;'
That I live my lordly Rome.

KING RICH

PAURENCE:

CLAURENCEN:
I did posters, uppeM my briem, and show the compunty Cwarenciess him

Conclusion

Congratulations! You have successfully implemented a basic Generative Pre-trained Transformer (GPT) model and trained and validated it using custom data. Throughout this blog, I have aimed to explain critical components such as self-attention, feed-forward layers, dropout, and loss estimation. We then integrated these components to create the model and trained it for 5000 iterations on a GPU instance in SageMaker. Additionally, you have seen how the model performs in generating new text. I hope this blog has provided you with a clear understanding of how to build a GPT model from scratch.

Thank You
