Fine-Tuning GPT-1 with Swiftkey Data for Next Word Prediction
This is a continuation of the main article, Next Word Prediction using Swiftkey Data.
GPT-1 is a decoder-only transformer that uses masked self-attention to predict the next word from a probability distribution over the vocabulary. The pre-trained GPT-1 has a vocabulary size of 40,478 and a maximum sequence length of 512, and it provides a general-purpose language model. I have fine-tuned it on the Swiftkey data. In the following sections, I discuss the architecture and the details of the fine-tuning.
Feature Development
The GPT model takes sentences as input and builds a probabilistic model over them during training.
Steps for data generation:
- Cleaning the corpus
- Encoding the words in the corpus with the GPT tokenizer
- Creating fixed-length sequences of 19 tokens from the corpus
- Batching these sequences with a data collator for language modelling
Code Sample
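Below is a minimal sketch of this pipeline using the Hugging Face transformers library. The corpus file name, the variable names and the batch size are illustrative, not the exact code from the notebook.

```python
# Minimal sketch: tokenise the cleaned corpus and cut it into 19-token blocks.
# The file name "swiftkey_clean.txt" and the variable names are illustrative.
from transformers import OpenAIGPTTokenizer, DataCollatorForLanguageModeling

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
# openai-gpt has no padding token; reuse <unk> so the collator can batch
# dict-style examples (all blocks have the same length, so nothing is padded).
tokenizer.pad_token = tokenizer.unk_token

# Encode the whole cleaned corpus into one long list of token ids.
with open("swiftkey_clean.txt", encoding="utf-8") as f:
    ids = tokenizer(f.read())["input_ids"]

# Split the ids into fixed-length "sentences" of 19 tokens each.
block_size = 19
examples = [
    {"input_ids": ids[i : i + block_size]}
    for i in range(0, len(ids) - block_size + 1, block_size)
]

# The data collator batches the blocks and copies the input ids into the
# labels for causal language modelling (mlm=False); the model shifts them
# internally when computing the loss.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator(examples[:8])  # one batch of 8 sequences, with "labels"
```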
How does fine-tuning of GPT-1 work?
First, GPT is pre-trained on an unsupervised corpus to learn a language model. Following the original GPT paper, pre-training maximises the log-likelihood objective:
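$$L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)$$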
where the $u_i$ are the tokens of the unsupervised corpus $\mathcal{U}$, $k$ is the size of the context window and $\Theta$ are the model parameters.
During fine-tuning the model is trained on another, supervised corpus, supervised in the sense that each example pairs the history words with the word to be predicted given that history. Here a second objective is maximised:
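$$L_2(\mathcal{C}) = \sum_{(x,\, y)} \log P\left(y \mid x^1, \ldots, x^m\right)$$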
where each example in the fine-tuning corpus $\mathcal{C}$ consists of a sequence of input tokens $x^1, \ldots, x^m$ of length $m$ together with the target $y$.
Combining these two objectives with a weighting factor $\lambda$ gives the final objective:
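$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$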
Maximising this combined objective is how fine-tuning is achieved.
Architecture of the model
I have used GPT-1 because it has far fewer parameters than GPT-2 and GPT-3, which makes it suitable for low-memory environments.
Code Sample:
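As a sketch of how the fine-tuning itself can be set up, continuing from the data-preparation snippet above: the hyperparameters, the train/test split and the output directory below are illustrative and not necessarily the values used in the notebook.

```python
# Fine-tuning sketch with the Hugging Face Trainer.
# Hyperparameters, split ratio and output_dir are illustrative.
from transformers import (
    DataCollatorForLanguageModeling,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
tokenizer.pad_token = tokenizer.unk_token  # openai-gpt has no pad token

# GPT-1: 12 decoder blocks, hidden size 768, roughly 117M parameters.
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# `examples` are the 19-token blocks built in the previous snippet.
split = int(0.9 * len(examples))
train_examples, test_examples = examples[:split], examples[split:]

args = TrainingArguments(
    output_dir="gpt1-swiftkey",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_examples,
    eval_dataset=test_examples,
    data_collator=collator,
)
trainer.train()
print(trainer.evaluate())  # reports eval_loss on the held-out blocks
```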
GitHub link: https://github.com/kurchi1205/Next-word-Prediction-using-Swiftkey-Data/blob/main/GPT%20Model.ipynb
Train Results
Train loss: 4.778
Test Results
Test loss: 5.52
Next Word Prediction
I have used three different strategies for predicting the next word (see the sketch after this list):
- Greedy search: chooses the most probable next word, keeping only a single hypothesis
- Beam search: chooses the most probable next word from n hypotheses
- Random sampling: samples the next word from the predicted distribution; with the temperature set low (which sharpens the distribution), low-probability words are effectively ignored
Code Sample
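A sketch of how these three strategies can be run with `model.generate`; the prompt, the number of beams, the top-k value and the temperature are illustrative choices, not necessarily the ones used in the notebook.

```python
# Decoding sketch: greedy search, beam search and temperature sampling.
# Prompt, num_beams, top_k and temperature values are illustrative.
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")  # or the fine-tuned checkpoint directory
model.eval()

prompt = "i am going to the"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Greedy search: always pick the single most probable next token.
    greedy = model.generate(input_ids, max_new_tokens=1, do_sample=False)

    # Beam search: keep the n most probable hypotheses (here n = 5).
    beam = model.generate(input_ids, max_new_tokens=1, num_beams=5, do_sample=False)

    # Random sampling: sample from the predicted distribution; top-k filtering
    # and a temperature below 1 keep low-probability words out of the picture.
    sampled = model.generate(
        input_ids, max_new_tokens=1, do_sample=True, top_k=50, temperature=0.7
    )

for name, output in [("greedy", greedy), ("beam", beam), ("sampling", sampled)]:
    next_word = tokenizer.decode(output[0][input_ids.shape[1]:])
    print(name, "->", next_word)
```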
Prediction Examples
Overall, GPT-1 produces quite good results, and I consider it the best model so far for next word prediction. Please refer to the main article for further discussion.