Generative Pretrained Transformer (GPT) — Pre-training, Fine-Tuning & Different Use Case Applications

Shravan Kumar
6 min read · Nov 27, 2023


In the previous blog we went through an overall overview of the Generative Pretrained Transformer. Now let us look at the super important topics of pre-training, fine-tuning, and different use case applications.

Pre-training is the setting where there is no explicit supervision: we get automatic supervision from a large unlabeled corpus, where every next token is the label that we need to predict.

During pre-training they used:

  • Batch size: 64
  • Input size: (B, T, C) = (64, 512, 768), where T is the sequence length and C is the embedding dimension
  • Optimizer: Adam with a cosine learning rate scheduler
  • Strategy: Teacher forcing (instead of auto-regressive training) for quicker and more stable convergence

Why do we need to use teacher forcing?

During the first few epochs of training, the weights are still close to random. If we ask the model to generate a token, feed that token back as input, and then predict the next token, we run into a problem: the model is not yet sharp or accurate, so it keeps feeding itself wrong words, and those errors back-propagate and corrupt training. Instead, since we know the actual ground-truth sequence, we feed the right input at every step (i.e., teacher forcing) and train on that, sending the model the actual sequence instead of its own intermediate predictions. This leads to quicker and more stable convergence. The model needs this kind of training help initially; at some point we could remove it if we want.
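A minimal sketch of teacher forcing in PyTorch (toy dimensions and a random stand-in corpus, not GPT’s actual code): the ground-truth tokens are the inputs, and the same sequence shifted by one position supplies the labels, so every position is trained in one parallel pass.

```python
import torch
import torch.nn as nn

# Toy dimensions so the sketch runs quickly; the blog's config is (B, T, C) = (64, 512, 768)
vocab_size, B, T, C = 100, 4, 16, 64

embed = nn.Embedding(vocab_size, C)
block = nn.TransformerEncoderLayer(d_model=C, nhead=4, batch_first=True)
lm_head = nn.Linear(C, vocab_size)

tokens = torch.randint(0, vocab_size, (B, T + 1))  # stand-in for a slice of the corpus
inputs, targets = tokens[:, :-1], tokens[:, 1:]    # teacher forcing: ground-truth tokens in,
                                                   # the next token at each position is the label
causal_mask = nn.Transformer.generate_square_subsequent_mask(T)

hidden = block(embed(inputs), src_mask=causal_mask)  # (B, T, C)
logits = lm_head(hidden)                             # (B, T, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()  # every position contributes a loss term in one parallel pass
```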

Now let’s focus on fine-tuning. Fine-tuning involves adapting the model to various downstream tasks (with minimal change to the architecture), e.g. sentiment analysis, question answering, summarization, and determining the relation between multiple sentences.

  • Each sample in a labelled dataset C consists of a sequence of tokens x1, x2, …, xm with the label y
  • Initialize the parameters with the parameters learned by solving the pre-training objective
  • At the input side, add additional tokens based on the type of downstream task. For example, start <s> and end <e> tokens for classification tasks
  • At the output side, replace the pre-training LM head with a classification head (a linear layer Wy)

The representation of the final token has seen all the previous tokens: no mask hides anything from it, so it carries knowledge of the entire sequence. Whatever output we get at that position can be used to make the prediction we want. Since this final representation has seen the entire document, we can use it to decide, for example, whether a review is positive or negative. The last token’s representation is 768-dimensional; multiplying it by the matrix Wy converts it into a 1-dimensional output indicating 0/1. If the target has 10 classes, Wy instead produces a 10-dimensional output, and applying softmax on top gives the probability of each class, from which we pick the maximum.
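A minimal sketch of this head swap, assuming PyTorch and a random stand-in for the transformer output:

```python
import torch
import torch.nn as nn

C, num_classes = 768, 2          # 2 classes for binary sentiment; 10 would work the same way
W_y = nn.Linear(C, num_classes)  # replaces the pre-trained LM head

hidden = torch.randn(1, 512, C)  # stand-in for the transformer's output, (B, T, C)
last_token = hidden[:, -1, :]    # only the final position has attended to every token
logits = W_y(last_token)         # (1, num_classes)
probs = torch.softmax(logits, dim=-1)  # probability of each class
```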

Now our objective is to predict the label of the input sequence.

Here the number of layers is l = 12, and m = 512 is the position of the last token.

Task : Sentiment Analysis

Consider a review with only 5 words:

Text: Wow India has reached moon

Sentiment: Positive

The output at the last step is 768-dimensional, and Wy transforms it into a 2-dimensional vector. We check which class has the maximum probability, back-propagate, and all the parameters (attention parameters, FFN parameters, etc.) change accordingly.
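A hypothetical fine-tuning step in PyTorch (the backbone, the embeddings, and the label are stand-ins): the cross-entropy loss back-propagates through both the new head Wy and the pre-trained parameters.

```python
import torch
import torch.nn as nn

# Stand-ins: a tiny "pre-trained" block plus the new head, following the blog's 768-d example
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
W_y = nn.Linear(768, 2)
optimizer = torch.optim.Adam(list(backbone.parameters()) + list(W_y.parameters()), lr=1e-5)

x = torch.randn(1, 5, 768)           # stand-in embeddings of "Wow India has reached moon"
label = torch.tensor([1])            # 1 = positive (hypothetical label encoding)

logits = W_y(backbone(x)[:, -1, :])  # classify from the last token's representation
loss = nn.functional.cross_entropy(logits, label)
loss.backward()
optimizer.step()                     # attention, FFN and W_y parameters all move
```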

Task : Textual Entailment / Contradiction

Text: A soccer game with multiple males playing

Hypothesis : Some men are playing a sport

Entailment : True

Here we have 2 inputs, the text and the hypothesis. In this case we need a delimiter token ($) to differentiate the text from the hypothesis. Assuming we have 3 classes (True/False/Can’t say), we get a 768-dimensional output; Wy takes this 768-dimensional input, maps it to the 3 classes, and predicts the probability distribution after applying softmax. Taking -log(predicted probability of the correct class) as the loss function, we back-propagate through the network and fine-tune all the parameters.
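A sketch of the entailment setup, assuming PyTorch; the final-token representation here is a random stand-in, but it shows the $-delimited input format and the -log(probability of the correct class) loss:

```python
import torch
import torch.nn as nn

premise = "A soccer game with multiple males playing"
hypothesis = "Some men are playing a sport"
sequence = f"<s> {premise} $ {hypothesis} <e>"  # delimiter $ separates the two inputs;
                                                # a real model would tokenize and encode this

C, num_classes = 768, 3                         # True / False / Can't say
W_y = nn.Linear(C, num_classes)
last_token = torch.randn(1, C)                  # stand-in for the final representation of <e>
log_probs = torch.log_softmax(W_y(last_token), dim=-1)
loss = -log_probs[0, 0]                         # -log p(True), the correct class here
loss.backward()                                 # fine-tunes W_y (and, in the full model, everything else)
```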

Why are we calling it fine-tuning? Because we have already pre-trained the network, so the weights are at a certain configuration, and now we are just adjusting them for this particular task. Starting from random initialization and adjusting all the weights for the task would have been training; this is fine-tuning because we are already at some configuration and are only adjusting it for this task.

Task : Multiple Choice

Question: Which of the following animals is an amphibian?

Choice-1 : Frog

Choice-2 : Fish

Feed in the question along with choice-1

Feed in the question along with choice-2

Repeat this for all choices

Normalize via softmax

Whichever choice is the correct one, we back-propagate so that its probability is maximized.
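A sketch of this scoring scheme in PyTorch (the scalar head and the per-choice representations are hypothetical stand-ins): each (question, choice) pair gets a score, and softmax normalizes across the choices.

```python
import torch
import torch.nn as nn

C = 768
score_head = nn.Linear(C, 1)           # one scalar score per (question, choice) pair

# Stand-ins for the final-token representation of "<s> question $ choice <e>" per choice
reps = torch.randn(2, C)               # choice-1: Frog, choice-2: Fish
scores = score_head(reps).squeeze(-1)  # (2,)
probs = torch.softmax(scores, dim=0)   # normalize across the choices
loss = -torch.log(probs[0])            # Frog is the correct (amphibian) answer
loss.backward()                        # push the correct choice's probability up
```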

All of these NLP tasks were handled by adapting the pre-trained network to the task. We found the appropriate input representation (in some cases we had to add the $ delimiter), and we found the appropriate thing to do at the output: ignore the next-token prediction head and add just one layer that predicts the classes required for the task. This is what is done in fine-tuning.

Task: Text Generation

Input:

Prompt : I like

Does it produce the same output sequence for the given prompt?

  • Yes, it gives the same sequence because decoding is deterministic. This is not a favorable outcome, because we might want more creative output when we use the same prompt again and again. For that we need to understand different decoding strategies, which help produce varied output from the same prompt (see the sketch after the wishlist below).

Hence the wishlist for text generation cases would be :

  • Discourage degenerate (that is, repeated or incoherent) texts

like — I like to think that I like to think ……..

I like to think that reference know how to think best selling book

  • Encourage it to be creative in generating a sequence for the same prompt

I like to read a book

I like to buy a hot beverage

I like a person who cares about others

  • Accelerate the generation of tokens
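A toy illustration of the second wish, in PyTorch: greedy decoding always picks the same token for the same logits, while temperature sampling draws from the softmax distribution and can produce a different continuation on every run. The logits here are made up.

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.3, 0.1])  # made-up scores for 4 candidate next tokens

greedy = torch.argmax(logits)                # always the same token for the same prompt

temperature = 0.8                            # <1 sharpens, >1 flattens the distribution
probs = torch.softmax(logits / temperature, dim=0)
sampled = torch.multinomial(probs, num_samples=1)  # different runs can pick different tokens
```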

Please do clap 👏 or comment if you find it helpful ❤️🙏

References:

Introduction to Large Language Models — Instructor: Mitesh M. Khapra

