Fine Tuning GPT-2 for Magic the Gathering Flavour Text Generation
A template for fine-tuning your own GPT-2 model.
GPT-3 has dominated the NLP news cycle recently with its borderline magical performance in text generation, but for everyone without $1,000,000,000 of Azure compute credits there are still plenty of ways to experiment with language models on your own. Hugging Face is a company focused on free, open source NLP tooling, and it provides one of the easiest ways to access pre-trained models and tokenizers for NLP experiments. In this article, I will share a method for fine-tuning the 117M-parameter GPT-2 model on a corpus of Magic the Gathering card flavour texts to create a flavour text generator. This will all be captured in a Colab notebook, so you can copy and edit it to create generators for your own tasks!
Starting Point
Generative language models require billions of data points and millions of dollars in compute to train successfully from scratch. For example, GPT-3 cost an estimated $4.6 million and 355 GPU-years of compute time to train. However, fine-tuning many of these models for custom tasks is easily within reach of anyone with access to even a single GPU. For this project we will be using Colab, which comes with many common data science packages pre-installed, including PyTorch, and offers free access to GPU resources.
First, we will install the Hugging Face transformers library, which will also fetch the excellent (and fast) tokenizers library. Although Hugging Face provides a collection of text datasets in its nlp library, I will be sourcing my own data for this project. If you don't have a dataset or application in mind, the nlp library would be an excellent starting point for easy data acquisition.
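In a Colab cell the install is a single line (the leading ! runs it as a shell command):

```python
# Install the Hugging Face transformers library; this also pulls in the fast tokenizers package.
!pip install transformers
```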
The Hugging Face libraries give us access to the GPT-2 model as well as its pretrained weights and biases, a configuration class, and a tokenizer to convert each word in our text dataset into a numerical representation to feed into the model for training. Tokenization is important because the model can't work with text directly; the text needs to be encoded into something more manageable. Below is a small example of tokenization on some sample text to illustrate what the encoding provides.
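As a rough sketch of what that looks like in practice (the exact token ids will depend on your tokenizer version):

```python
from transformers import GPT2Tokenizer

# Load the pretrained GPT-2 byte-pair-encoding tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

sample = "Some sample flavour text."
encoded = tokenizer.encode(sample)

print(encoded)                    # a list of integer token ids
print(tokenizer.decode(encoded))  # decodes back to the original string
```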
The Data
Now it’s time to grab our data. For this project I’ll be using Magic the Gathering card flavour text from the Scryfall API, which returns an easily parsable JSON object of card data. Here, I extracted only English flavour text to avoid introducing new tokens for non-English words, as the GPT-2 model was originally trained on English-only data. After parsing, I was left with an iterable list of 29,222 MtG flavour texts, a preview of which is shown below.
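A minimal sketch of this step is below. The bulk-data endpoint and the flavor_text/lang field names are assumptions based on Scryfall's public API documentation, so check the current docs if anything has changed:

```python
import requests

# Scryfall exposes bulk card dumps; find the download link for the "oracle_cards" file.
bulk_index = requests.get("https://api.scryfall.com/bulk-data").json()
oracle_uri = next(item["download_uri"] for item in bulk_index["data"]
                  if item["type"] == "oracle_cards")

# Download the full card list (a large JSON array of card objects).
cards = requests.get(oracle_uri).json()

# Keep only English cards that actually have flavour text.
flavour_texts = [card["flavor_text"] for card in cards
                 if card.get("lang") == "en" and card.get("flavor_text")]

print(len(flavour_texts))  # roughly 29,000 texts in my pull
print(flavour_texts[:3])   # a quick preview
```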
The Loader
Now that we have our text data, we need to create a structured dataset and dataloader to feed it into the model appropriately. For this step, we will use the built-in PyTorch Dataset and DataLoader classes. A dataloader combines a dataset and a sampler, and provides single- or multi-process iteration over the dataset (see the official documentation for further information). There are a lot of details here, but the important points are:
- The dataset object creates a list of tuples of tensors.
- The first tensor is the encoded flavour text, wrapped in a start-of-text token and an end-of-text token and padded up to a maximum sequence length (if the string is shorter than that maximum).
- The second tensor is an attention mask: a list of 1s and 0s that tells the model which tokens matter (the 1s) and which should be ignored (the 0s).
The code for creating this dataset object is below and has been generalized to work with any tokenizer and list of texts. The sequences are padded up to a maximum length, which can be specified. The longest string in my corpus was 98 tokens, so my tensors are only padded to a maximum length of 98. GPT-2 itself can handle sequences of up to 1,024 tokens (the 768 figure often quoted is the model's embedding dimension, not its context length), so keep in mind that the padding length you specify will affect the training speed of the model and the batch size you are able to allocate.
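Here is a generalized sketch of such a dataset class. The <|startoftext|> and <|pad|> token strings are a common GPT-2 fine-tuning convention rather than anything mandated by the library:

```python
import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer

# Re-create the tokenizer with explicit start, end, and padding tokens.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2",
                                          bos_token="<|startoftext|>",
                                          eos_token="<|endoftext|>",
                                          pad_token="<|pad|>")

class MTGDataset(Dataset):
    """Turns a list of strings into (input_ids, attention_mask) tensor pairs."""

    def __init__(self, txt_list, tokenizer, max_length=98):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each text in start/end tokens, truncate, and pad to a fixed length.
            encodings = tokenizer("<|startoftext|>" + txt + "<|endoftext|>",
                                  truncation=True,
                                  max_length=max_length,
                                  padding="max_length")
            self.input_ids.append(torch.tensor(encodings["input_ids"]))
            self.attn_masks.append(torch.tensor(encodings["attention_mask"]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]
```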
We now need to split the dataset into training and validation sets before creating the dataloaders. The code below shows an example of doing this with the MTGDataset we created from the dataset template code, using the GPT2Tokenizer we instantiated and dividing the data into an 80:20 training/validation split. It is important to note that different samplers are used for the training and validation dataloaders: we want random sampling for the training data, but that isn't required for validation, so those samples are drawn sequentially.
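A sketch of the split and the two dataloaders, reusing the MTGDataset and flavour_texts names from above:

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, random_split

dataset = MTGDataset(flavour_texts, tokenizer, max_length=98)

# 80:20 train/validation split.
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

batch_size = 32

# Random sampling for training, sequential sampling for validation.
train_dataloader = DataLoader(train_dataset,
                              sampler=RandomSampler(train_dataset),
                              batch_size=batch_size)
validation_dataloader = DataLoader(val_dataset,
                                   sampler=SequentialSampler(val_dataset),
                                   batch_size=batch_size)
```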
The Model
Before training, we need to instantiate a few more things. First of all, we should load and set the parameters of the GPT-2 model. Next, we create an instance of the GPT-2 language model itself and configure it with the parameters we just set. Lastly, to speed up training we should run this on the available GPU, which means instructing PyTorch to move the model and data to the cuda device.
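A minimal version of that setup might look like this (the resize_token_embeddings call is needed because we added special tokens to the tokenizer):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Load the configuration for the 117M-parameter GPT-2 model.
configuration = GPT2Config.from_pretrained("gpt2")

# Instantiate the language model with its pretrained weights.
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)

# Grow the embedding matrix to match the tokenizer's new vocabulary size.
model.resize_token_embeddings(len(tokenizer))

# Move the model to the GPU if one is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
```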
At this point, we need to check what type of instance we have connected to in Colab. We can do this by running !nvidia-smi, which displays the device information, including the GPU model (P100, K80, T4, etc.) and the amount of VRAM available. This information is crucial because it informs our choice of batch size. Setting the batch size to the maximum you can fit into memory is generally good practice. On a T4 or K80 we can set the batch size to 32 for this particular data; otherwise the batch size must be set smaller or the data will fail to load onto the GPU and training won't start. More VRAM enables larger batch sizes, which makes training faster.
Now we can set the epoch number (number of training cycles) and create the optimizer we will use for training. We will be using the Hugging Face implementation of AdamW, though other optimizers are acceptable. Fastai have a wonderful blog post explaining the AdamW optimizer, including a brief history and the recent tweaks that led to its current state.
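Something like the following; the epoch count, learning rate, and epsilon values here are placeholder defaults to tune rather than recommendations:

```python
from transformers import AdamW

epochs = 4            # number of full passes over the training data (an assumption; tune to taste)
learning_rate = 5e-4  # illustrative value
epsilon = 1e-8

# Hugging Face's AdamW implementation; newer transformers versions deprecate this
# in favour of torch.optim.AdamW, which accepts the same core arguments.
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=epsilon)
```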
At this stage, we could tune other hyperparameters, such as the learning rate, the beta and epsilon values of the optimizer, or vary the batch size or epoch number. If we are otherwise happy with the defaults, we can establish the training loop and begin!
Training
The code for the training loop is below. For anyone unfamiliar with neural network training, I'll try to provide an accessible description of the basic workflow this code encapsulates:
- A training batch is loaded to the GPU and the network makes predictions against some labels.
- The performance of the model is assessed by computing the loss: how far the predictions are from the truth.
- The loss is backpropagated through the model to calculate the gradient of the loss with respect to the weights at each layer.
- The optimizer then takes a step down that gradient towards a minimum, updating the weights, and the next batch is processed.
This process repeats for every training batch, and ideally the model converges towards a minimum of the loss. To check whether the model generalizes well to data it hasn't seen, it is tested on the validation set. After this, the model has been fine-tuned on our new dataset and we can examine the overall performance and inspect the outputs to see how well this worked!
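A condensed sketch of that loop, using the objects defined earlier, is below. For causal language modelling the labels are simply the input ids, which the model shifts internally when computing the loss (for a cleaner loss you could additionally set padding positions in the labels to -100 so they are ignored):

```python
import torch

for epoch in range(epochs):
    # ---- Training ----
    model.train()
    total_train_loss = 0

    for batch in train_dataloader:
        b_input_ids = batch[0].to(device)
        b_masks = batch[1].to(device)

        model.zero_grad()

        outputs = model(b_input_ids,
                        attention_mask=b_masks,
                        labels=b_input_ids)
        loss = outputs[0]
        total_train_loss += loss.item()

        loss.backward()   # backpropagate to compute gradients
        optimizer.step()  # update the weights

    avg_train_loss = total_train_loss / len(train_dataloader)

    # ---- Validation ----
    model.eval()
    total_eval_loss = 0

    with torch.no_grad():
        for batch in validation_dataloader:
            b_input_ids = batch[0].to(device)
            b_masks = batch[1].to(device)

            outputs = model(b_input_ids,
                            attention_mask=b_masks,
                            labels=b_input_ids)
            total_eval_loss += outputs[0].item()

    avg_val_loss = total_eval_loss / len(validation_dataloader)
    print(f"Epoch {epoch + 1}: train loss {avg_train_loss:.3f}, val loss {avg_val_loss:.3f}")
```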
Training Made Easy
Shortly after this article was first published, Julien Chaumond pointed me to the new Trainer class in transformers, which makes this training loop significantly more concise and offers several other benefits as well. The Trainer even makes some of the DataLoader instantiation obsolete: you only need to provide dataset objects, and it will automatically create the loaders, using random sampling for training and sequential sampling for validation, precisely as we configured manually. It will even prompt you to log in to a service like Weights & Biases to log your model training, and it can configure your model to train across multiple devices, including TPUs. Unless you really want to specify every detail of the training cycle manually, I would highly recommend using this method.
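A sketch of what that might look like with our tuple-returning dataset is below; the small collate function just maps each batch into the dictionary format the Trainer expects, and the TrainingArguments values are illustrative rather than tuned:

```python
import torch
from transformers import Trainer, TrainingArguments

def collate(batch):
    # Our MTGDataset returns (input_ids, attention_mask) tuples, so stack them
    # into the dict the Trainer expects; with a dict-returning dataset this
    # helper would be unnecessary.
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])
    return {"input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": input_ids}

training_args = TrainingArguments(
    output_dir="./mtg-gpt2",                  # where checkpoints are written
    num_train_epochs=epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=collate,
)

trainer.train()
```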
Evaluation
First, we will examine the shape of the training and validation loss curves, shown below. This is a good outcome given that many of the model parameters are defaults: the training loss doesn't dip far below the validation loss, which would have indicated possible over-fitting.
Now for the fun part! We will evaluate the model outputs from a human perspective. Below are five examples of the model outputs. I'm pretty pleased with these! They read like the flavour texts I put into the model, and a quick check shows they aren't duplicates of any existing cards from the corpus. In places it even attributes quotes to real entities from MtG, in the correct position within the flavour text, so the model definitely appears to have learned the structure of our data. Overall, I would say that the training has gone really well and that the fine-tuning has produced a model that successfully generates novel MtG flavour texts!
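For reference, samples like these can be drawn from the fine-tuned model with something along these lines (the sampling parameters are illustrative, not the exact settings used for the examples above):

```python
model.eval()

# Prompt generation with the start-of-text token and sample several outputs.
prompt = torch.tensor(tokenizer.encode("<|startoftext|>")).unsqueeze(0).to(device)

sample_outputs = model.generate(
    prompt,
    do_sample=True,                       # sample rather than greedy decode
    top_k=50,                             # consider only the 50 most likely next tokens
    top_p=0.95,                           # nucleus sampling
    max_length=98,
    num_return_sequences=5,
    pad_token_id=tokenizer.pad_token_id,  # silence the missing-pad-token warning
)

for i, sample in enumerate(sample_outputs):
    print(f"{i + 1}: {tokenizer.decode(sample, skip_special_tokens=True)}")
```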
Future Changes
There are some obvious changes we could make to this workflow that might improve the model. Some hyperparameter tuning could create more ‘accurate’ outputs, although that would be a difficult metric to define in this instance. Different language models, or a larger version of GPT-2 than the 117M-parameter model, could have been fine-tuned as well. It is also possible that my scraping removed too many data points, or that the API I chose doesn't contain every possible flavour text and a more exhaustive search would return a richer dataset.
If replicating this workflow isn't for you, but you want to use this generator to create something of your own, the model is now hosted by Hugging Face. The embedded link below will take you straight to the model's home page.
By following the instructions on that page, or using the code chunk below, you can load the Magic-The-Generating model straight from the transformers library into your local Python environment.
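A sketch of that loading step is below; the model id is a placeholder, so substitute the id shown on the model page:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Placeholder id: replace with the actual "<username>/Magic-The-Generating"
# identifier from the Hugging Face model page.
model_id = "<username>/Magic-The-Generating"

tokenizer = GPT2Tokenizer.from_pretrained(model_id)
model = GPT2LMHeadModel.from_pretrained(model_id)
```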
Acknowledgements
I think it’s really important to give credit where credit is due within the open source community. So if you found this blog entertaining or informative, please take a moment to visit these excellent resources for more language modelling content, tutorials and projects:
- Rey Farhan, who was my inspiration for this particular project
- Chris McCormick’s BERT fine-tuning tutorial (heavily cited by Rey)
- Ian Porter’s GPT-2 tutorial
- Hugging Face Language model fine-tuning script