Text Generation in any language with GPT-2

A step-by-step guide to train your own GPT-2 model for text generation in your choice of language from scratch

Arshabhi Kayal
Engineered @ Publicis Sapient
6 min read · Sep 1, 2020


Photo by Natalia Y on Unsplash

Note: This blog was originally posted at this link.

We have all heard that modern-day Natural Language Processing (NLP) has progressed by leaps and bounds in the past couple of years, following the development of attention mechanisms and transformers. This paved the way for a plethora of new algorithms achieving State-Of-The-Art (SOTA) results on the different tasks of NLP.

OpenAI has been one of the leaders in providing its own language models (the latest release being GPT-3), trained on a huge corpus of internet data. Since GPT-3 is a recent phenomenon, available only in English at the moment, and accessible only through an API provided by OpenAI, we shift our focus to its earlier version, GPT-2. To learn about the internal nuts and bolts of GPT-2, I suggest you go through this link. For a deeper dive into attention and transformers, here are some excellent links:

GPT-2, too, was released only for English, which makes it difficult for someone trying to generate text in a different language.

So why not train your own GPT-2 model on your favorite language for text generation? That is exactly what we are going to do. So, without further ado, let us jump in.

For the demo, I have considered a non-Latin alphabet script (Bengali here), because why not? I have used Huggingface’s implementation for the model.

1. Gathering the data

Gathering good-quality data is one of the most important stages, as all data scientists would agree. So we are going to assume that you already have a folder containing .txt files with all the data cleaned and stored. For ease, you can use Wikipedia article data, which is freely available and can be downloaded with the following code:
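The original snippet isn't embedded here, so below is a minimal sketch of the idea using the `wikipedia` package (an assumption on my part; any crawler that leaves you with a folder of cleaned .txt files works just as well). The language code, article count, and `data` folder name are placeholders.

```python
# Sketch: pull random Wikipedia articles in the target language and dump each one to a .txt file.
import os
import wikipedia

wikipedia.set_lang("bn")          # target language code; "bn" = Bengali
output_dir = "data"               # hypothetical folder for the .txt files
os.makedirs(output_dir, exist_ok=True)

for title in wikipedia.random(pages=100):        # number of articles is up to you
    try:
        page = wikipedia.page(title)
    except (wikipedia.DisambiguationError, wikipedia.PageError):
        continue                                 # skip ambiguous or missing pages
    filename = os.path.join(output_dir, f"{title.replace('/', '_')}.txt")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(page.content)                    # plain text of the article
```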

This will create a folder containing all the Wikipedia files, which looks like this:

Screenshot of File List (Source: Author)

Note: Due to resource constraints, and since this is for demo purposes, I have trained the model on a small subset of books by Satyajit Ray, especially his detective Feluda series.

2. Tokenization

Now, the second step will be to tokenize the data. For that, we use the following class:
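A sketch of such a tokenizer class, assuming a recent version of Huggingface's `tokenizers` library: a byte-level BPE model with an NFKC normalizer. The class name `BPE_token`, the vocabulary size, and the special tokens are illustrative choices.

```python
import os
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.decoders import ByteLevel as ByteLevelDecoder


class BPE_token:
    def __init__(self):
        self.tokenizer = Tokenizer(BPE(unk_token="<unk>"))
        # NFKC Unicode normalization (see the note below on why this matters)
        self.tokenizer.normalizer = Sequence([NFKC()])
        # byte-level pre-tokenization/decoding, as used by GPT-2
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        # learn the BPE merges from the corpus files
        trainer = BpeTrainer(
            vocab_size=50000,
            show_progress=True,
            initial_alphabet=ByteLevel.alphabet(),
            special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
        )
        self.tokenizer.train(files=paths, trainer=trainer)

    def save_tokenizer(self, location, prefix=None):
        # writes vocab.json and merges.txt into `location`
        os.makedirs(location, exist_ok=True)
        self.tokenizer.model.save(location, prefix)
```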

Some notes on the tokenization:

  • We use BPE (Byte Pair Encoding), which is a sub-word encoding. This generally avoids treating different forms of a word as completely unrelated. (E.g., ‘greatest’ will be treated as two tokens, ‘great’ and ‘est’, which is advantageous since the shared ‘great’ token retains the similarity between ‘great’ and ‘greatest’, while the extra ‘est’ token distinguishes them.) Also, it is not as low-level as character-level encoding, which doesn’t retain any of the meaning carried by a whole word.
  • Another small but subtle point is the NFKC (Normalization Form Compatibility Composition) normalizer in the tokenizer code. It is one of the standard Unicode compatibility normalization forms. It would not matter much if the language were English, but since we are using Bengali, which contains differently composed character forms, we use this specific one. More on it can be found at this link.

So what we do here is tokenize our data and save it in a folder. Two files will be created (merges.txt and vocab.json) in the specified directory. To run the tokenization, use the following code:
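A hypothetical driver snippet, assuming the `BPE_token` class sketched above and a `./data/` folder of .txt files:

```python
from pathlib import Path

# adjust to wherever your cleaned .txt files live
paths = [str(p) for p in Path("./data/").glob("**/*.txt")]

tokenizer = BPE_token()
tokenizer.bpe_train(paths)          # learn the BPE merges from the corpus

save_path = "tokenized_data"        # hypothetical output directory
tokenizer.save_tokenizer(save_path) # writes merges.txt and vocab.json
```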

3. Model Initialization

Before the real magic begins, we need to make sure the artillery is ready. Let us start with some initializations.
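A sketch of those initializations with the `transformers` TensorFlow classes: load the tokenizer files we just saved, register the special tokens, and build a fresh (untrained) GPT-2 model from a matching config. `save_path` is the directory from the previous step.

```python
from transformers import GPT2Config, GPT2Tokenizer, TFGPT2LMHeadModel

# load vocab.json / merges.txt produced by the tokenization step
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
})

# configuration sized to our vocabulary; other GPT-2 hyperparameters keep their defaults
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# a brand-new model with randomly initialized weights
model = TFGPT2LMHeadModel(config)
```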

We also create a single string from all our documents and tokenize it.
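Roughly like this, with the end-of-sequence token used as a document separator (a sketch, reusing `paths` and `tokenizer` from earlier):

```python
# concatenate every document into one long string, separated by the EOS token
single_string = ""
for filename in paths:
    with open(filename, "r", encoding="utf-8") as f:
        single_string += f.read() + tokenizer.eos_token

# encode the whole corpus into a single list of token ids
string_tokenized = tokenizer.encode(single_string)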

After we have encoded the whole string, we move on to making a TensorFlow dataset, slicing the data into equal-length blocks so that our model can learn. Here we use a block size of 100 (the number of tokens in each example) and a batch size of 16. These are kept low so that we can run the training with ease on an RTX 2060 GPU.
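A sketch of the slicing and batching, where inputs and labels are the same block shifted by one token for next-token prediction (the buffer size is an arbitrary choice):

```python
import tensorflow as tf

block_size = 100
BATCH_SIZE = 16
BUFFER_SIZE = 1000

# cut the token stream into non-overlapping blocks of `block_size` tokens
examples = [
    string_tokenized[i : i + block_size]
    for i in range(0, len(string_tokenized) - block_size + 1, block_size)
]
inputs = [ex[:-1] for ex in examples]   # tokens 0..98
labels = [ex[1:] for ex in examples]    # tokens 1..99 (shifted by one)

dataset = (
    tf.data.Dataset.from_tensor_slices((inputs, labels))
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
)
```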

4. Model Training

Now comes the part we’ve been waiting for: building the model and training it. So we define our optimizer, loss function and metrics, and start training.
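For example, along these lines (a sketch that follows the older `transformers` TF API, where the model's first output is the logits; the extra `None` losses tell Keras to ignore the auxiliary past-key-value outputs):

```python
# optimizer, loss and metric for next-token prediction
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy("accuracy")

# attach the loss to the logits output only; the None entries skip the model's
# secondary outputs (this follows the older transformers TF interface)
model.compile(
    optimizer=optimizer,
    loss=[loss, *[None] * model.config.n_layer],
    metrics=[metric],
)
```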

Now, let’s train the model.
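For instance (the number of epochs is up to you and your hardware):

```python
num_epoch = 10                      # illustrative; more epochs generally help
history = model.fit(dataset, epochs=num_epoch)
```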

5. Prediction

To predict, we simply encode the input text and pass it to the model.
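A generation sketch with beam search; the seed phrase, length, and beam settings are placeholders you should tune:

```python
text = "ফেলুদা"  # hypothetical Bengali seed text; use any phrase in your training language
input_ids = tokenizer.encode(text, return_tensors="tf")

# beam search decoding with a cap on repeated n-grams
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    early_stopping=True,
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```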

Screenshot of the Output (Source: Author)

Now, if you are a Bengali speaker, you may point out that although the output is syntactically correct, it doesn’t look cohesive. True, but for this demo I have kept things as minimal as possible.

6. Save the Model

Well, after a long training time, what good would it do if we closed our session and all of our trained model were simply lost, forcing us to train it again from scratch? So let’s save the model and the tokenizer so that we can resume from where we left off.
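A sketch of saving both pieces with `save_pretrained`; the directory name is just an example:

```python
import os

output_dir = "./model_bn_custom/"      # hypothetical directory for the trained artifacts
os.makedirs(output_dir, exist_ok=True)

model.save_pretrained(output_dir)      # writes the weights and config.json
tokenizer.save_pretrained(output_dir)  # writes vocab.json, merges.txt, special tokens map
```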

Bonus

We have already done all the hard work. So to load the saved model and tokenizer, we only need to execute two lines of code and we’re all set.
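Something like the following, assuming the same `output_dir` as above:

```python
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# reload the saved tokenizer and model from disk
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)
```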

Voila! Now you can train your own model in your own language, and create content that can compete with some of the best literary works in any language.

Future scope:

This blog gives a framework for how one can train a GPT-2 model in any language. The result is not on par with some of the pre-trained models available, but to reach that level we would need a lot more training data and computational power.

References:
