Fine-Tuning GPT-1 with Swiftkey Data for Next Word Prediction
This is a continuation of the main article, Next Word Prediction using Swiftkey Data.
GPT-1 is a decoder-only transformer that uses masked self-attention to predict the next word from a probability distribution over the vocabulary. The pre-trained GPT-1 has a vocabulary size of 40,478 and a maximum sequence length of 512, and it provides a general-purpose language model. I have fine-tuned it on the Swiftkey data. In the following sections, I discuss the architecture and the details of the fine-tuning.
Feature Development
The GPT model takes sentences as input and builds a probabilistic model over them during training.
Steps for data generation:
- Cleaning the corpus
- Encoding the words in the corpus with the GPT tokenizer
- Creating fixed-length sequences of 19 tokens from the corpus
- Batching these sequences with a data collator for language modelling
Code Sample
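Below is a minimal sketch of this pipeline using the Hugging Face transformers library. The corpus file name, the variable names and the batch size are illustrative, not the exact code from the notebook.

```python
# Minimal sketch: tokenise the cleaned corpus and cut it into 19-token blocks.
# The file name "swiftkey_clean.txt" and the variable names are illustrative.
from transformers import OpenAIGPTTokenizer, DataCollatorForLanguageModeling

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
# openai-gpt has no padding token; reuse <unk> so the collator can batch
# dict-style examples (all blocks have the same length, so nothing is padded).
tokenizer.pad_token = tokenizer.unk_token

# Encode the whole cleaned corpus into one long list of token ids.
with open("swiftkey_clean.txt", encoding="utf-8") as f:
    ids = tokenizer(f.read())["input_ids"]

# Split the ids into fixed-length "sentences" of 19 tokens each.
block_size = 19
examples = [
    {"input_ids": ids[i : i + block_size]}
    for i in range(0, len(ids) - block_size + 1, block_size)
]

# The data collator batches the blocks and copies the input ids into the
# labels for causal language modelling (mlm=False); the model shifts them
# internally when computing the loss.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
batch = collator(examples[:8])  # one batch of 8 sequences, with "labels"
```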
How does fine-tuning of GPT-1 work?
First, GPT is pre-trained on an unsupervised corpus to learn a language model. Following the original GPT paper, pre-training maximises the log-likelihood objective:
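$$L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)$$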
where the $u_i$ are the tokens of the unsupervised corpus $\mathcal{U}$, $k$ is the size of the context window and $\Theta$ are the model parameters.
During fine-tuning the model is trained on another, supervised corpus, supervised in the sense that each example pairs the history words with the word to be predicted given that history. Here a second objective is maximised:
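$$L_2(\mathcal{C}) = \sum_{(x,\, y)} \log P\left(y \mid x^1, \ldots, x^m\right)$$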
where each example in the fine-tuning corpus $\mathcal{C}$ consists of a sequence of input tokens $x^1, \ldots, x^m$ of length $m$ together with the target $y$.
Combining these two objectives with a weighting factor $\lambda$ gives the final objective:
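$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$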
Maximising this combined objective is how fine-tuning is achieved.
Architecture of the model
I have used GPT-1 because it has far fewer parameters than GPT-2 and GPT-3, which makes it suitable for low-memory environments.
Code Sample:
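As a sketch of how the fine-tuning itself can be set up, continuing from the data-preparation snippet above: the hyperparameters, the train/test split and the output directory below are illustrative and not necessarily the values used in the notebook.

```python
# Fine-tuning sketch with the Hugging Face Trainer.
# Hyperparameters, split ratio and output_dir are illustrative.
from transformers import (
    DataCollatorForLanguageModeling,
    OpenAIGPTLMHeadModel,
    OpenAIGPTTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
tokenizer.pad_token = tokenizer.unk_token  # openai-gpt has no pad token

# GPT-1: 12 decoder blocks, hidden size 768, roughly 117M parameters.
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# `examples` are the 19-token blocks built in the previous snippet.
split = int(0.9 * len(examples))
train_examples, test_examples = examples[:split], examples[split:]

args = TrainingArguments(
    output_dir="gpt1-swiftkey",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_examples,
    eval_dataset=test_examples,
    data_collator=collator,
)
trainer.train()
print(trainer.evaluate())  # reports eval_loss on the held-out blocks
```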
GitHub link: https://github.com/kurchi1205/Next-word-Prediction-using-Swiftkey-Data/blob/main/GPT%20Model.ipynb
Train Results
Train loss: 4.778
Test Results
Test loss: 5.52
Next Word Prediction
I have used three different strategies for predicting the next word (see the sketch after this list):
- Greedy search: chooses the most probable next word, keeping only a single hypothesis
- Beam search: chooses the most probable next word from n hypotheses
- Random sampling: samples the next word from the predicted distribution; with the temperature set low (which sharpens the distribution), low-probability words are effectively ignored
Code Sample
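A sketch of how these three strategies can be run with `model.generate`; the prompt, the number of beams, the top-k value and the temperature are illustrative choices, not necessarily the ones used in the notebook.

```python
# Decoding sketch: greedy search, beam search and temperature sampling.
# Prompt, num_beams, top_k and temperature values are illustrative.
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")  # or the fine-tuned checkpoint directory
model.eval()

prompt = "i am going to the"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Greedy search: always pick the single most probable next token.
    greedy = model.generate(input_ids, max_new_tokens=1, do_sample=False)

    # Beam search: keep the n most probable hypotheses (here n = 5).
    beam = model.generate(input_ids, max_new_tokens=1, num_beams=5, do_sample=False)

    # Random sampling: sample from the predicted distribution; top-k filtering
    # and a temperature below 1 keep low-probability words out of the picture.
    sampled = model.generate(
        input_ids, max_new_tokens=1, do_sample=True, top_k=50, temperature=0.7
    )

for name, output in [("greedy", greedy), ("beam", beam), ("sampling", sampled)]:
    next_word = tokenizer.decode(output[0][input_ids.shape[1]:])
    print(name, "->", next_word)
```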
Prediction Examples
Overall, GPT-1 produces quite good results, and I consider it the best model so far for next word prediction. Please refer to the main article for further discussion.