Custom Text Generation Using GPT-2

Raji Rai · Published in WiCDS · Jan 16, 2021 · 6 min read

Build a custom text generator in Python using the powerful OpenAI’s GPT-2 language model


Generative Pre-trained Transformer 2 (GPT-2), the successor to GPT, is a state-of-the-art language model from OpenAI. It was trained on about 40 GB of internet text for next-word prediction and text generation. To reduce the risk of malicious use, OpenAI initially released only smaller versions of the model for researchers to experiment with. Of its successor, the MIT Technology Review wrote: “GPT-3 is shockingly good — and completely mindless”.

In this article you will get an overview of the GPT-2 framework, learn how to run a pretrained GPT-2 model, and then finetune it on your own text. GPT-2 is a pre-trained language model that can be used for various NLP tasks such as text generation, summarization, and translation. A language model is a statistical tool that predicts the next word(s) in a sequence based on the preceding word(s).

The GPT-2 architecture is based on the Transformer. The original Transformer uses an encoder-decoder design with attention to capture dependencies between input and output; GPT-2 keeps only the decoder-style stack and generates text autoregressively: at every step, the model takes the previously generated tokens as additional input when producing the next token. GPT-2 has outperformed other language models at generating articles from small input prompts, and with its chameleon-like ability to adapt to the context of the text, it produces realistic and coherent output. On a range of domain-specific language modeling benchmarks, GPT-2 achieved state-of-the-art results in a zero-shot setting (see openai.com for the detailed scores).
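
To make the autoregressive idea concrete, here is a minimal sketch of prompt-based generation using the Hugging Face transformers package (an alternative interface to the same pretrained weights, not part of the Colab steps below; the prompt text and sampling values are just examples):

# Minimal sketch: autoregressive sampling with pretrained GPT-2 small
# (requires: pip install transformers torch)
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("Newton invented", return_tensors="pt")
# Each new token is sampled conditioned on the prompt plus everything generated so far
output = model.generate(input_ids, max_length=50, do_sample=True, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))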

OpenAI has released four versions of GPT-2 that differ in the number of trained parameters: small (124M parameters), medium (355M), large (774M), and XL (1.5B). I am using Google Colab to show how to run and finetune the GPT-2 model, as Colab provides a GPU runtime. If your machine has enough compute, you can also run everything locally.
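
Once you have switched the Colab runtime to GPU (Runtime → Change runtime type), you can optionally confirm that a GPU is attached before starting:

!nvidia-smi

If a GPU such as a Tesla K80 or T4 is listed in the output, the runtime is ready.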

Steps to Execute GPT-2 Model:

  1. Set your Colab runtime to GPU.
  2. Connect to your Google drive.
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir gpt2
%cd gpt2/

3. Clone the GPT-2 repository from OpenAI's GitHub.

!git clone https://github.com/openai/gpt-2.git
%cd gpt-2

4. Use the Colab "magic" command that tells the environment to use the latest stable release of TensorFlow 1.x. This is necessary because the GPT-2 code was written for TensorFlow 1 and will not run correctly on TensorFlow 2.

%tensorflow_version 1.x
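
To confirm that the runtime actually picked up a 1.x release, you can optionally print the version:

import tensorflow as tf
print(tf.__version__)  # expect something like 1.15.x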

5. Install the dependencies listed in requirements.txt. You can ignore some of the warnings generated during installation.

!pip install -r requirements.txt

6. Download one of the four available GPT-2 models: 124M, 355M, 774M, or 1558M. Note that the bigger models need TensorFlow with GPU support for reasonably fast execution.

!python download_model.py 124M
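
If the download succeeded, the model folder should now contain the vocabulary and checkpoint files; you can verify this with:

!ls models/124M

Expect files such as hparams.json, encoder.json, vocab.bpe and the model.ckpt.* checkpoint files.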

7. Generate samples. You can generate unconditional samples, where you have no control over the generated content, or conditional samples, where content is generated based on the prompt text you provide. Both options come with a few flags, each with a default value; see https://github.com/openai/gpt-2/blob/master/DEVELOPERS.md for details.

!python src/interactive_conditional_samples.py --top_k 40

8. At the model prompt, enter any text; GPT-2 will generate samples that continue it.

Model prompt >>> newton invented
======================================== SAMPLE 1 ========================================
a new form of light, so that you can see all aspects of it, from the inner workings of the universe to its internal laws. He was able to tell us what we did not need to know. A common denominator among all this is that light is a form of knowledge, so we would see an infinite number of laws as being necessary to explain some idea. If he had, he would have figured out exactly what they were. He should have been able to tell us from one to the next. Yet, this doesn’t happen……

Give the model different kinds of prompt text, such as a poem, a song, or a blog extract, and see how it performs. You can also experiment with the other GPT-2 models and compare the output.
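
If you want to try the unconditional option mentioned in step 7, the repository provides a separate script for it; the flag values below are only illustrative, and the full list of flags is documented in DEVELOPERS.md:

!python src/generate_unconditional_samples.py --top_k 40 --temperature 0.7 --nsamples 2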

Now that you know how to run the pretrained GPT-2 model, I will show you the steps to finetune and generate customized text.

Steps to Finetune GPT-2 Model:

  1. Set your Colab runtime to GPU.
  2. Connect to your Google drive.
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/
!mkdir gpt2finetune
%cd gpt2finetune/

3. Clone the nshepperd fork of GPT-2, which adds the training script needed for finetuning.

!git clone https://github.com/nshepperd/gpt-2
%cd gpt-2

4. Download the required GPT-2 model from the four available options: 124M, 355M, 774M, or 1558M. Here I am using the 355M model.

!python download_model.py 355M

5. Install the dependencies listed in requirements.txt.

!pip install -r requirements.txt

6. Grant Colab read and execute access to the cloned folder.

!chmod 755 -R /content/drive/MyDrive/gpt2finetune/gpt-2

7. Prepare a file with the sample text you want the GPT-2 model to be finetuned on, and upload it to your Google Drive; here I have placed a text file, india.txt, in a Corpus folder. Set the standard-stream encoding to UTF-8 so that any Unicode characters in the sample text are handled correctly (use the %env magic rather than !export, since a ! command runs in its own shell and the setting would not carry over to later cells).

%env PYTHONIOENCODING=UTF-8
%cd /content/drive/MyDrive/gpt2finetune/gpt-2
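
If your training material is spread across several text files, a small helper like this can merge them into a single corpus file (a rough sketch; the combined.txt output name is hypothetical, only the Corpus folder path comes from this example):

import glob

# Hypothetical helper: merge every .txt file in the Corpus folder into one training file
corpus_dir = "/content/drive/MyDrive/Corpus"
output_path = corpus_dir + "/combined.txt"  # hypothetical merged corpus name
with open(output_path, "w", encoding="utf-8") as out:
    for path in sorted(glob.glob(corpus_dir + "/*.txt")):
        if path == output_path:
            continue  # do not read the output file into itself
        with open(path, encoding="utf-8") as f:
            out.write(f.read().strip() + "\n\n")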

8. Install the GPU build of TensorFlow 1.15 (and a compatible tensorflow-estimator) for faster execution.

!pip install tensorflow-gpu==1.15.0
!pip install 'tensorflow-estimator<1.15.0rc0,>=1.14.0rc0' --force-reinstall

9. Start training the downloaded GPT-2 model on your custom text.

!PYTHONPATH=src ./train.py --dataset /content/drive/MyDrive/Corpus/india.txt --model_name '355M'

The script will load the latest checkpoint of the specified model and train from there. You can specify the batch size and the learning rate; the default batch size is 1 and the default learning rate is 0.00001. A checkpoint is saved every 1000 steps. You can interrupt the training at any point, and the most recent state will be saved before it exits.
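
For example, a run that sets the batch size, learning rate and checkpoint interval explicitly might look like this (flag names as defined in the fork's train.py at the time of writing; run ./train.py --help to confirm the exact options and defaults):

!PYTHONPATH=src ./train.py --dataset /content/drive/MyDrive/Corpus/india.txt --model_name '355M' --batch_size 1 --learning_rate 0.00001 --save_every 1000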

10. Copy the newly created checkpoints into the downloaded model’s folder.

!cp -r /content/drive/MyDrive/gpt2finetune/gpt-2/checkpoint/run1/* /content/drive/MyDrive/gpt2finetune/gpt-2/models/355M/

11. Generate sample text based on your finetuned model.

!python src/interactive_conditional_samples.py --top_k 40 --model_name '355M'

12. At the model prompt, enter a line of sample content related to the text you used for training.

Model prompt >>> national parks in India
======================================== SAMPLE 1 ========================================
Catch a glimpse of the furry wild cats, gaurs, blazing black deer, and many more species as you drift through the dense forest of Kanha National Park. Dwell in with the animals by wearing neutral colours for the safari as you drive slowly in your open jeep. The Jeep Safari in Kanha National Park can be enjoyed twice a day for a specified time which neither disturbs the animals on odd hours nor does it lessen your possibility to spot the wild beings. Stay cautious and aware of what is happening in the Safari because if lucky you might sight a ‘barasingha’ which is found only here....

The content I used for training was about various national parks in India, hence the text generated by the model stays on that topic. Isn't it cool?

In this way you can finetune the GPT-2 model for different use cases such as academic writing, wiki content, technical papers, marketing copy, and so on. There are a few notable Python packages with custom finetuning capabilities that you can build on; one of them is gpt-2-simple, which wraps the fine-tuning and generation scripts for OpenAI's GPT-2 model. Go ahead and create your own custom text generator.
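
For reference, here is a rough sketch of the same finetune-and-generate loop with gpt-2-simple (function names as in its README; check the package documentation for the current API and defaults):

!pip install gpt-2-simple

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")       # fetch the pretrained weights
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="india.txt",          # your training corpus
              model_name="124M",
              steps=1000)                   # number of finetuning steps
gpt2.generate(sess, prefix="national parks in India")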
