Generative Pretrained Transformer (GPT)

Shravan Kumar
10 min read · Nov 25, 2023


A primer on the decoder-only model: causal language modelling

The goal is to predict a distribution over the vocabulary, and we want to see whether we can use a transformer as the function that predicts this distribution.

There are three possibilities we can consider for this: encoder-only, decoder-only, and encoder-decoder models.

What are decoder-only models?

This is how the vanilla decoder layer looks: it has self-attention, cross-attention, and a feed-forward network.

The input is a sequence of words, where the given k words are used to predict the (k+1)-th word, which is the basic task of language modelling. Here k can start from 0: if we are not given any word, we start with <go> and predict the first word; given the first word, we predict the second word; and this continues iteratively until the task is done. We want the model to see only the present and past inputs, and we achieve this by applying a mask that zeroes out the attention weights for all the future words.

During training we actually have all the data: the entire document or paragraph. Suppose we want to predict the 5th word. It does not make sense to look at what comes after the 5th word, because predicting it is exactly the model's task: it has to first generate the 5th word, and only then take those 5 words and generate the 6th. If we already assume we know what comes after the 5th word, predicting it becomes artificially easy. Moreover, in the real world, when the model is actually used, it will not have access to the 6th, 7th, or 8th word; we expect a user to give a prompt such as "I am going to" and then stop, without providing the rest of the context.

So at training time we have the liberty of knowing the entire context, but we cannot exploit it, and we have to let go of that information. Hence the model may only look at present and past inputs. We achieve this by applying a mask: we compute attention as usual, but at the end, when calculating the attention weights, we zero out the weights for all future words. In the summation over i = 1 to T (the sequence length) that produces the new j-th representation, which would otherwise mix in all key positions, we only want the first k positions we are allowed to see; for all other i, we force the corresponding alphas to 0. Mathematically, we add a mask matrix whose entries are minus infinity at the forbidden positions, which ensures that those weights become 0 after the softmax.
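Written out (a sketch in standard scaled dot-product attention notation; q_j, k_i, v_i are the query, key, and value vectors and d_k the key dimension, symbols not defined explicitly in the article):

\[
\alpha_{ji} = \mathrm{softmax}_i\!\left(\frac{q_j \cdot k_i}{\sqrt{d_k}} + M_{ji}\right),
\qquad
M_{ji} =
\begin{cases}
0 & \text{if } i \le j\\[2pt]
-\infty & \text{if } i > j
\end{cases}
\]

\[
z_j = \sum_{i=1}^{T} \alpha_{ji}\, v_i
\]

Because the softmax of minus infinity is 0, position j mixes in only positions i ≤ j.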

Instead of summing over all T elements, this sum effectively runs only over the first k elements, because the alphas for all other elements are forced to 0; this is what the concept of masking is all about.

We achieve this by applying the mask matrix. The masked multi-head attention layer is required, but we do not need any cross-attention, because we are dealing with a decoder-only model and there is no encoder here. This is how the decoder-only model looks.
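To make the masking concrete, here is a minimal sketch of masked (causal) attention for a single head in PyTorch; the function name and shapes are mine for illustration, not from the article:

```python
import torch
import torch.nn.functional as F

def masked_attention(Q, K, V):
    """Causal self-attention for one head. Q, K, V: (T, d_k) tensors."""
    T, d_k = Q.shape
    scores = Q @ K.T / d_k ** 0.5                       # (T, T) raw attention scores
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))  # -infinity on all future positions
    alphas = F.softmax(scores, dim=-1)                  # future weights become exactly 0
    return alphas @ V                                   # row j mixes only positions i <= j

# tiny usage example
T, d_k = 5, 64
out = masked_attention(torch.randn(T, d_k), torch.randn(T, d_k), torch.randn(T, d_k))
print(out.shape)  # torch.Size([5, 64])
```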

Each output represents one term in the chain rule, so the first position predicts the probability distribution of the first word over all V words in the vocabulary.

P(x1) is a marginal distribution: it is not conditioned on anything, it is just the first word. At every stage we are predicting a distribution over the vocabulary. This time, however, the probabilities are determined by the parameters of the model: assume that all the transformer parameters, such as the W_Q, W_K, W_V projection matrices and the parameters of the FFN, are given and are part of θ. Now, given an input with 3 tokens (<go>, I, am), whose n-dimensional embeddings are also trained and are part of θ, we pass them through the transformer, which performs its computations and produces an output through a softmax. This gives a probability distribution computed by the transformer using its parameters.
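Written out, the chain-rule factorisation and the likelihood that the next paragraph maximizes look like this:

\[
P(x_1, x_2, \ldots, x_T;\theta) = \prod_{i=1}^{T} P(x_i \mid x_1, \ldots, x_{i-1};\theta)
\]

\[
\mathcal{L}(\theta) = \sum_{i=1}^{T} \log P(x_i \mid x_{<i};\theta)
\]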

Therefore, the objective is to maximize the likelihood L(θ). If we start with the first word <go>, the first output distribution should put maximum probability on "I", i.e., maximize P(I); and when we then feed "I" as the second input token, the probability P(x2 = am | x1 = I) has to be maximized. This is done iteratively with the backpropagation algorithm: we keep adjusting the parameters θ until, over time, the probabilities align with what we want them to be and the correct word gets the maximum probability.
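In code, maximizing this likelihood is the same as minimizing the cross-entropy between each predicted distribution and the actual next token. A minimal sketch of one training step, assuming a hypothetical PyTorch model `gpt` that maps a (T,) tensor of token ids to (T, V) logits:

```python
import torch.nn.functional as F

def training_step(gpt, optimizer, tokens):
    """tokens: (T+1,) tensor of token ids, e.g. the encoding of '<go> I am going to ...'."""
    inputs, targets = tokens[:-1], tokens[1:]   # predict token i+1 from tokens 1..i
    logits = gpt(inputs)                        # (T, V): one distribution per position, in parallel
    loss = F.cross_entropy(logits, targets)     # negative log-likelihood of the true next tokens
    optimizer.zero_grad()
    loss.backward()                             # backpropagation computes gradients w.r.t. theta
    optimizer.step()                            # adjust theta a little
    return loss.item()
```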

Now let's move on to the Generative Pretrained Transformer (GPT) model and how it looks: we create a stack of n modified decoder layers (each called a transformer block). Let X denote the input sequence.

Now let's look at a few things:

  • What data is it trained on?
  • Architecture: how many dimensions? How many attention heads? What is the total number of parameters?

The input data is from a book corpus (BooksCorpus):

  • It contains 7000 unique books, 74 million sentences, and approximately 1 billion words across 16 genres.
  • It uses long-range contiguous text (i.e., no shuffling of sentences or paragraphs).
  • The sequence length T (the context size) is 512 tokens, and the text is contiguous.
  • The tokenizer is Byte Pair Encoding (BPE).
  • The vocabulary size is 40478 unique tokens.
  • The embedding dimension is 768.

Model

  • Contains 12 decoder layers (transformer blocks)
  • Context size : 512 → 512 inputs/Tokens at one go (T)
  • Attention heads : 12
  • Feed Forward Network layer size : 768 × 4 = 3072
  • Activation : Gaussian Error Linear Unit (GELU)

This activation is used in the FFN. Dropout, layer normalization, and residual connections are also used to improve convergence during training. The FFN takes the 768-dimensional output computed by the 12 attention heads (64 × 12): each head produces a 64-dimensional output, and these are concatenated into a 768-dimensional vector that matches the input dimension of each token. Hence each transformer block produces a 768-dimensional output, but inside each block the FFN expands it to 768 × 4 = 3072 dimensions and then shrinks it back to 768.

Let us take a sample input of 512 tokens (assuming, for illustration, that the text is split into tokens on spaces).

We give this entire 512-token input to the transformer block, with a word embedding corresponding to each token. The input is therefore T × d_model, i.e., 512 × 768: each embedding is of size 768 (d_model) and we have T such embeddings. Every layer then produces a T × d_model output, and at the last layer we apply a softmax and predict T probability distributions. Because the mask is applied throughout, all of this can be done in parallel: at every position the model never looks at the future, and all probability distributions are generated at once.

This is how one training batch looks (with batch size = 1): 512 tokens are passed into the first transformer block, through the masked multi-head attention, and then into the FFN.

Let's look at how multi-head masked attention works. The first head takes all 512 tokens, generates the Q, K, V vectors, performs the entire scaled dot-product operation, and again gives 512 vectors. The size of each vector is d_model / #heads = 768 / 12 = 64. Each individual head produces its output, and the outputs are concatenated to get a 768-dimensional representation.

All 12 head outputs are generated in parallel, each producing 64-dimensional vectors, before being concatenated into a 768-dimensional output. This happens for all 512 input tokens, so we end up with 512 × 768 vectors: the number of outputs equals the number of inputs and stays the same across every layer. We had a 512 × 768 input, it went into the masked attention and gave a 512 × 768 output, and each 768-dimensional output then went through a linear transformation, which completes one cycle of masked multi-head attention. At this point we have 512 new representations, each of size 768.
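As a shape check, here is a sketch of that bookkeeping in PyTorch (random weights, and the causal mask is omitted here since it was shown earlier):

```python
import torch

T, d_model, n_heads = 512, 768, 12
d_head = d_model // n_heads                          # 768 / 12 = 64

x = torch.randn(T, d_model)                          # block input: 512 x 768

head_outputs = []
for _ in range(n_heads):
    Wq = torch.randn(d_model, d_head)                # 768 x 64
    Wk = torch.randn(d_model, d_head)                # 768 x 64
    Wv = torch.randn(d_model, d_head)                # 768 x 64
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # each 512 x 64
    scores = Q @ K.T / d_head ** 0.5                 # 512 x 512 (causal mask omitted here)
    head_outputs.append(scores.softmax(dim=-1) @ V)  # one 512 x 64 output per head

concat = torch.cat(head_outputs, dim=-1)             # 512 x 768 after concatenating 12 heads
Wo = torch.randn(d_model, d_model)                   # final linear layer: 768 x 768
out = concat @ Wo                                    # 512 x 768: same shape as the input
print(out.shape)                                     # torch.Size([512, 768])
```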

After a dropout layer, the residual connection carries the original input forward, giving H1 = X + H, which is again of size 512 × 768. This is then passed into the FFN; let's see how that looks.

Now all 512 tokens of size 768 pass into the FFN, which expands them to an intermediate vector of size 3072 before producing a final 768-dimensional vector again. This is done for all 512 tokens. The illustration below shows it for a single token, with the GELU activation applied in the hidden layer.
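A minimal sketch of this position-wise FFN in PyTorch (the layer sizes follow the article; the module variable name is mine):

```python
import torch
import torch.nn as nn

d_model, d_ff = 768, 3072                 # 3072 = 4 x 768

# position-wise FFN applied to each token independently
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),             # expand: 768 -> 3072
    nn.GELU(),                            # GELU activation in the hidden layer
    nn.Linear(d_ff, d_model),             # project back: 3072 -> 768
)

tokens = torch.randn(512, d_model)        # all 512 token representations
print(ffn(tokens).shape)                  # torch.Size([512, 768])
```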

This is how the output of one transformer block looks, as shown above.

To review what has happened so far:

  • We take a large piece of contiguous text, 512 tokens long.
  • We feed it to the transformer layers, and every layer again produces 512 outputs, up to the last layer.
  • At the last layer we apply a softmax to convert these into 512 probability distributions: P(x1), P(x2|x1), P(x3|x1,x2), …, P(x512|x1,x2,…,x511).
  • All of these predictions happen in parallel, and the mask ensures that we never look at any of the future words we are not allowed to see.

Now let us count the number of parameters:

Token embeddings : |V| × embedding_dimension

= 40478 × 768 ≈ 31.1 × 10⁶ ≈ 31.1M

~40K tokens in the vocabulary and d_model is 768 dimensions.

Position embeddings : context_length × embedding_dimension

= 512 × 768 ≈ 0.4 × 10⁶ ≈ 0.4M

Total ≈ 31.5M parameters just in the input layer (token and positional embeddings)

The positional embeddings are also learned, unlike the original transformer which uses fixed sinusoidal embeddings to encode the positions.

For the attention parameters we have 3 matrices, Wq, Wk and Wv; each takes a 768-dimensional input and converts it into a 64-dimensional output, so each matrix is 768 × 64. This is per attention head, and with 3 such matrices we get 3 × (768 × 64) ≈ 147 × 10³ parameters. With 12 heads this becomes 12 × 147 × 10³ ≈ 1.77M. Then there is the linear layer Wo (768 × 768) that maps the concatenated output (64 × 12 = 768) back to a 768-dimensional vector, adding 768 × 768 ≈ 0.59M. That is roughly 2.4M per block, so across all 12 blocks the masked multi-head self-attention accounts for ≈ 28.3M parameters.

Each FFN has two weight matrices, 768 × 3072 and 3072 × 768, i.e., 2 × (768 × 3072) weights, plus biases of 3072 and 768, which together come to about 4.7M parameters per block. Across all 12 blocks that is 12 × 4.7M ≈ 56.7M.

At the end of the complete transformer we get a 768-dimensional output, which is transformed into 40478 dimensions, requiring a 768 × 40478 matrix. The input embedding matrix is of size 40478 × 768, so the input and output matrices can be shared (weight tying); hence the output projection is not counted as additional parameters.
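Putting the whole count together, here is a small sanity-check script that follows the article's accounting (it ignores attention biases and layer-normalization parameters):

```python
V, d_model, T = 40478, 768, 512
n_layers, n_heads, d_head, d_ff = 12, 12, 64, 3072

embeddings = V * d_model + T * d_model                  # token + position embeddings
attn_per_block = (3 * d_model * d_head * n_heads        # Wq, Wk, Wv for every head
                  + d_model * d_model)                  # plus the output projection Wo
ffn_per_block = 2 * d_model * d_ff + d_ff + d_model     # two weight matrices + their biases

total = embeddings + n_layers * (attn_per_block + ffn_per_block)
print(f"embeddings : {embeddings / 1e6:.1f}M")                 # ~31.5M
print(f"attention  : {n_layers * attn_per_block / 1e6:.1f}M")  # ~28.3M
print(f"FFN        : {n_layers * ffn_per_block / 1e6:.1f}M")   # ~56.7M
print(f"total      : {total / 1e6:.1f}M")                      # ~116.5M, near the ~117M usually quoted for GPT-1
```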

With roughly 117M parameters in total, this GPT-1 model is far smaller than the present SOTA models in the market.

Please do clap 👏 or comment if you find it helpful ❤️🙏

References:

Introduction to Large Language Models — Instructor: Mitesh M. Khapra
