Text Generation in any language with GPT-2
A step-by-step guide to train your own GPT-2 model for text generation in your choice of language from scratch
Note: This blog was originally posted at the following link.
We have all heard that modern-day Natural Language Processing (NLP) has progressed by leaps and bounds in the past couple of years, following the development of attention networks and transformers. This paved the way for a plethora of new algorithms achieving State-Of-The-Art (SOTA) results on the different tasks of NLP.
OpenAI has been one of the leaders in providing their own language model (now the released GPT-3), which is trained on a huge corpus of internet data. Since GPT-3 is a recent phenomenon, available only in English at the moment, and accessible only through the API provided by OpenAI, we shift our focus to its earlier version, GPT-2. To learn about the internal nuts and bolts of GPT-2, I suggest you go through this link. For more depth on Attention and Transformers, here are some excellent links:
- The illustrated Transformer by Jay Alammar
- The Annotated Transformer by Harvard NLP
GPT-2, too, was released only for English, which makes it difficult for anyone trying to generate text in a different language.
So why not train your own GPT-2 model on your favorite language for text generation? That is exactly what we are going to do. So, without further ado, let us jump in.
For the demo, I have considered a non-Latin alphabet script (Bengali here), because why not? I have used Huggingface’s implementation for the model.
1. Gathering the data
Gathering good-quality data is one of the most important stages, as all data scientists would agree. So we are going to assume that you already have a folder of .txt files containing all the data, cleaned and stored. For ease, you can use Wikipedia article data, which can be downloaded with the following command:
python wikipedia_download.py --lang bn
This will create a folder containing all the Wikipedia articles as text files.
Note: Due to resource constraints, and since this is for demonstration purposes, I have trained the model on a small subset of books by Satyajit Ray, especially his detective Feluda series.
2. Tokenization
Now, the second step will be to tokenize the data. For that, we use the following class:
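This tokeniser class lives in a file called tokenise.py, which we import in the next step. A minimal sketch of what it can look like, built on Huggingface's tokenizers library, is given below; the vocabulary size and the exact set of special tokens chosen here are illustrative assumptions.

import os
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.trainers import BpeTrainer

class BPE_token(object):
    def __init__(self):
        # byte-level BPE tokenizer with NFKC unicode normalization,
        # which matters for non-Latin scripts such as Bengali
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.normalizer = Sequence([NFKC()])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        # vocab_size and special tokens here are assumed values for the sketch
        trainer = BpeTrainer(
            vocab_size=50000,
            show_progress=True,
            initial_alphabet=ByteLevel.alphabet(),
            special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
        )
        self.tokenizer.train(paths, trainer)

    def save_tokenizer(self, location, prefix=None):
        # writes vocab.json and merges.txt to the given folder
        if not os.path.exists(location):
            os.makedirs(location)
        self.tokenizer.model.save(location, prefix)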
Some notes on the tokenization:
- We use BPE (Byte Pair Encoding), a sub-word encoding. This generally takes care of not treating different forms of a word as completely unrelated. For example, ‘greatest’ is split into two tokens, ‘great’ and ‘est’; this retains the similarity between ‘great’ and ‘greatest’, while the extra ‘est’ token marks the difference between them. At the same time, it is not as low-level as character-level encoding, which does not retain any meaning of a particular word.
- Another small but subtle point is the NFKC (Normalization Form Compatibility Composition) normalizer set in the tokenizer class. It is one of the standard Unicode compatibility forms. It would not matter much if the language were English, but since we are using Bengali, which uses a different script, we use this specific one. More on it can be found at this link. A short illustration of normalization follows this list.
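Here is a tiny, language-agnostic illustration of why normalization matters (an added example using Python's built-in unicodedata module, not tied to the Bengali corpus): two strings that render identically can still differ at the codepoint level until they are normalized.

import unicodedata

composed = "\u00e9"      # 'é' as a single codepoint
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)                      # False
print(unicodedata.normalize("NFKC", composed) ==
      unicodedata.normalize("NFKC", decomposed))   # True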
So what we do here is tokenize our data and save it in a folder. Two files are created (merges.txt and vocab.json) in the specified directory. To run the tokenization, use the following code:
from tokenise import BPE_token
from pathlib import Path
import os

# the folder 'text' contains all the files
paths = [str(x) for x in Path("./text/").glob("**/*.txt")]

tokenizer = BPE_token()

# train the tokenizer model
tokenizer.bpe_train(paths)

# saving the tokenized data in our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
3. Model Initialization
Before the real magic begins, we need to make sure the artillery is ready. Let us start with some initializations.
import tensorflow as tf
from transformers import GPT2Config, TFGPT2LMHeadModel, GPT2Tokenizer

# loading tokenizer from the saved model path
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})

# creating the configuration from which the model can be made
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)
We also create a single string from all our documents and tokenize it.
single_string = ''
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token

string_tokenized = tokenizer.encode(single_string)
After we have encoded the whole string, we move on to making a TensorFlow dataset, slicing the data into equal blocks so that our model can learn from them. Here we use a block size of 100 (the number of tokens in each example) and a batch size of 16. These are kept low so that training can run with ease on an RTX 2060 GPU.
examples = []
block_size = 100
BATCH_SIZE = 16
BUFFER_SIZE = 1000

for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])

inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
4. Model Training
Now comes the part we have been waiting for: building the model and training it. So we define our optimizer, loss function and metric, and start training.
# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)

# defining our loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# defining the metric which we want to observe
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# compiling the model; the extra None entries tell Keras not to compute a loss
# for the model's additional outputs, so only the logits are trained against
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
Now, let’s train the model.
num_epoch = 10
history = model.fit(dataset, epochs=num_epoch)
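Once training finishes, the History object returned by fit() can be used to check how the loss evolved over the epochs (a small optional inspection step, not part of the training code above):

# the available keys depend on the compiled losses and metrics, so list them first
print(history.history.keys())

# total training loss per epoch
print(history.history['loss'])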
5. Prediction
To predict, we just need to simply encode the input text and pass it to the model.
text = "লালমোহনবাবু "# encoding the input text
input_ids = tokenizer.encode(text, return_tensors='tf')# getting out output
beam_output = model.generate(
input_ids,
max_length = 50,
num_beams = 5,
temperature = 0.7,
no_repeat_ngram_size=2,
num_return_sequences=5
)
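Since generate() returns token ids, we decode each returned sequence back into readable text; one simple way (an added snippet) is:

# decoding the generated token ids back into text
for i, output in enumerate(beam_output):
    print("{}: {}".format(i, tokenizer.decode(output, skip_special_tokens=True)))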
Now, if you know Bengali, you may point out that although the generated sentences are syntactically correct, they do not look cohesive. True, but for this post I have kept the demo as minimal as possible.
6. Save the Model
Well, after a long training time, what good will it do if we close our session and our trained model is simply lost, forcing us to train it again from scratch? So, let us save the model and the tokenizer so that we can retrain from where we left off.
from transformers import WEIGHTS_NAME, CONFIG_NAME

output_dir = './model_bn_custom/'

# creating the directory if it is not present
if not os.path.exists(output_dir):
    os.mkdir(output_dir)

model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)

# save tokenizer
tokenizer.save_pretrained(output_dir)
Bonus
We have already done all the hard work. So to load the saved model and tokenizer, we only need to execute two lines of code and we’re all set.
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)
Voila! Now you can train your own model in your own language, and create content that can compete with some of the best literary works in any language.
Future scope:
This blog gives a framework for how one can train a GPT-2 model in any language. The result is not on par with some of the pre-trained models available, but to reach that state, we would need a lot more training data and computational power.