Training a Dutch GPT-2 base model

Geert Jonker
Deepdesk
Dec 4, 2020

Deepdesk suggests texts to agents in customer contact centers, in a form known as autocomplete or smart compose. We use several techniques to do that — fast algorithms for the easier cases and heavy algorithms for the hard cases. GPT-2 is an example of the latter. It currently produces 10–15% of the text we suggest, and we aim to increase that.

The publicly available, pre-trained GPT-2 models are trained on English text. One can fine-tune these on another language, say Dutch, and get Dutch text generation. While this is nice, the quality is not optimal. It would be better to train a Dutch model from scratch. This is a daunting task, however. First, one must obtain tens of gigabytes of Dutch text of reasonable quality. Second, one must train GPT-2 models on this data, a very resource-intensive task, especially for the larger model variants.

Not deterred by any of this, we set out to do it. Our goal was to train a Dutch base model on generic text and then fine-tune it on customer-specific text for each of our customers.

First, a Dutch tokenizer

Since we start from scratch, we can train a tokenizer from scratch as well. This is beneficial, since the original tokenizer is trained on English and is thus optimized for English. An experiment by Pierre Guillou with Portuguese text showed that a tokenizer trained on Portuguese was 41% more efficient in encoding Portuguese than the pre-trained GPT-2 tokenizer trained on English. Fewer tokens to encode a given piece of text means that more text fits into the fixed-size input of the model, and more text is generated per inference step.

“On average, when a Portuguese word is tokenized with 2.25 tokens by the English tokenizer, it is tokenized with only 1.36 tokens by the Portuguese one: an increase rate of 66%!”
Source: Byte-level BPE, an universal tokenizer but… — Pierre Guillou
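To make the difference concrete, here is a minimal sketch of such a comparison using Hugging Face's GPT2TokenizerFast. The ./dutch-tokenizer path and the example sentence are placeholders; the Dutch tokenizer itself is trained in the next sections.

```python
from transformers import GPT2TokenizerFast

# Pre-trained English GPT-2 tokenizer from the Hugging Face hub.
english_tok = GPT2TokenizerFast.from_pretrained("gpt2")

# Placeholder path to a tokenizer trained on Dutch text (see below).
dutch_tok = GPT2TokenizerFast.from_pretrained("./dutch-tokenizer")

text = "De klantenservice helpt u graag verder met uw vraag."
n_en = len(english_tok.encode(text))
n_nl = len(dutch_tok.encode(text))

print(f"English tokenizer: {n_en} tokens")
print(f"Dutch tokenizer:   {n_nl} tokens")
print(f"Reduction: {1 - n_nl / n_en:.0%}")
```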

Now we need training data

For generic Dutch text we fetched all Dutch texts from Common Crawl, retaining only those from the “.nl” domain. From those we kept only the sentences longer than 100 characters whose character distribution resembled natural text. This was needed to filter out all the menu items, JavaScript and other gibberish. This left us with 67GB of data.
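The exact filtering heuristics are specific to our pipeline, but the general idea can be sketched like this; the 0.85 letter-ratio threshold and the sentence splitter are assumptions for illustration, not our actual values.

```python
import re

MIN_LENGTH = 100          # keep only sentences longer than 100 characters
MIN_LETTER_RATIO = 0.85   # assumed threshold: share of letters/spaces in natural text

def looks_like_natural_text(sentence: str) -> bool:
    """Rough filter for menu items, JavaScript and other gibberish."""
    if len(sentence) <= MIN_LENGTH:
        return False
    # Fraction of characters that are letters or whitespace.
    letters = sum(ch.isalpha() or ch.isspace() for ch in sentence)
    return letters / len(sentence) >= MIN_LETTER_RATIO

def filter_sentences(lines):
    """Yield sentences from raw crawled lines that look like natural text."""
    for line in lines:
        for sentence in re.split(r"(?<=[.!?])\s+", line.strip()):
            if looks_like_natural_text(sentence):
                yield sentence
```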

We trained a tokenizer on this data using Hugging Face's tokenizers library. The efficiency gain was identical to the one observed on Portuguese: 41% fewer tokens to encode a given piece of text!
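Training such a byte-level BPE tokenizer with the tokenizers library is short; a minimal sketch follows. The corpus.txt file name, the vocabulary size and the output directory are assumptions for illustration.

```python
from tokenizers import ByteLevelBPETokenizer

# Byte-level BPE, the same tokenizer family GPT-2 uses.
tokenizer = ByteLevelBPETokenizer()

# corpus.txt stands in for the filtered Common Crawl data; the vocabulary
# size is an assumed value close to the original GPT-2 vocabulary.
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Writes vocab.json and merges.txt, which GPT2TokenizerFast can load.
tokenizer.save_model("./dutch-tokenizer")
```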

Finally, train GPT-2

Now it was time to train GPT-2 from scratch. We trained the ‘small’ model with 117M parameters. Larger models are better of course, but more resource-intensive to train and to use. The 117M model proved challenging enough! We used Hugging Face’s transformers project for this.
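The difference with fine-tuning is that the model starts from a fresh configuration with randomly initialized weights instead of a pre-trained checkpoint. A minimal sketch with transformers, assuming the Dutch tokenizer from the previous section lives in ./dutch-tokenizer:

```python
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

# Dutch tokenizer trained above (path is a placeholder).
tokenizer = GPT2TokenizerFast.from_pretrained("./dutch-tokenizer")

# 'small' GPT-2 architecture: 12 layers, 12 heads, 768-dimensional embeddings.
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    n_positions=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
)

# Randomly initialized weights, i.e. trained from scratch rather than
# fine-tuned from the English checkpoint.
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters() / 1e6:.0f}M parameters")
```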

The first obstacle was preprocessing the corpus, where text is tokenized and training samples are created. The preprocessing code in transformers is not written for huge corpora: preprocessing 1GB of text took a whopping 40GB of memory. We solved this by rewriting the code to be more memory-efficient, processing the data in a streaming fashion. Skipping the step that pickles the result to disk also helped, as that is very memory-intensive. This brought the memory usage for preprocessing 1GB of data down to 6GB, and we were able to preprocess the full 67GB corpus on a machine with 190GB of memory.
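The core of the streaming rewrite can be sketched as follows: read the corpus line by line, tokenize incrementally, and emit fixed-length blocks of token ids without ever materializing the full token list in memory. The block size and file names are placeholders.

```python
from transformers import GPT2TokenizerFast

def stream_blocks(path, tokenizer, block_size=512):
    """Yield fixed-length blocks of token ids without loading the whole corpus."""
    buffer = []
    with open(path, encoding="utf-8") as f:
        for line in f:                         # one line at a time, never f.read()
            buffer.extend(tokenizer.encode(line))
            while len(buffer) >= block_size:
                yield buffer[:block_size]
                buffer = buffer[block_size:]

# Usage sketch: feed blocks straight to the training pipeline (or write them
# to a compact binary file) instead of pickling one huge list of samples.
tokenizer = GPT2TokenizerFast.from_pretrained("./dutch-tokenizer")
for block in stream_blocks("corpus.txt", tokenizer):
    ...  # hand the block to the trainer
```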

Another problem was particularly nasty: large training runs would randomly crash with nothing more than a segmentation fault. Juggling base image versions, drivers and other moving parts didn't help. We worked around it by designating the checkpoint directory as an output directory in Kubeflow, so that its contents were stored in a cloud bucket and could, with some effort, be salvaged and trained upon further.

Training the base model was really a repeated exercise of hitting a resource limit, increasing that limit or implementing savings, and starting again: GPUs, memory, disk, everything. When it finally succeeded, the training took 12 days on 8 V100 GPUs, costing about $6k.

The results

Perplexity. Huh?

The model has a perplexity of 11.1. Is that good? Our current model, an English base model fine-tuned on 3GB of Dutch text, has a perplexity of 6, so it looks like we are worse off. But there is a catch. Perplexity is defined (at least in the transformers project) at token level: given a sequence of tokens, how well is the model able to predict the next token? How text is tokenized depends on the tokenizer. A tokenizer trained on English will tokenize Dutch relatively inefficiently, with more tokens per word on average than a tokenizer trained on Dutch. So using the English tokenizer on Dutch text makes predicting the next token more like predicting the next syllable, whereas for the English tokenizer on English text it is closer to predicting the next word. Predicting the next syllable is on average an easier task than predicting the next word, so the perplexity comes out lower with the English tokenizer. Ergo, you should not compare perplexities computed with different tokenizers, or on validation texts in different languages.
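Concretely, perplexity here is the exponential of the average cross-entropy loss per token, so the unit it averages over is whatever the tokenizer happens to call a token. A minimal sketch of the computation (single short text, no sliding window; the model and tokenizer paths are placeholders):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("./dutch-tokenizer")   # placeholder
model = GPT2LMHeadModel.from_pretrained("./dutch-gpt2-small")        # placeholder
model.eval()

text = "Bedankt voor uw bericht, we nemen zo snel mogelijk contact met u op."
input_ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over
    # next-token predictions; perplexity is its exponential.
    loss = model(input_ids, labels=input_ids).loss

print(f"perplexity = {torch.exp(loss).item():.1f}")
```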

To be able to compare the performance of models independent of the tokenizer used, we wrote a benchmark that simply counts the number of characters correctly generated by the model, given inputs taken randomly from a validation set.
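A minimal sketch of that benchmark is shown below; the greedy decoding, the number of generated tokens and the model/tokenizer paths are assumptions for illustration, not our production settings.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("./dutch-tokenizer")   # placeholder
model = GPT2LMHeadModel.from_pretrained("./dutch-gpt2-small")        # placeholder
model.eval()

def chars_correct(prompt: str, reference: str, max_new_tokens: int = 20) -> int:
    """Generate a continuation for `prompt` and count how many leading
    characters match the reference continuation from the validation set."""
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens, do_sample=False)
    generated = tokenizer.decode(output[0, input_ids.shape[1]:])
    n = 0
    while n < min(len(generated), len(reference)) and generated[n] == reference[n]:
        n += 1
    return n

# The benchmark score is the average of chars_correct over prompt/continuation
# pairs sampled randomly from the validation set.
```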

Figure: average number of characters correct per generation. The big jumps are caused by larger pieces of often-occurring text that the fine-tuned models do well on.

At the bottom are the Dutch base models, one trained on 10GB of text and one trained on the full 67GB, without fine-tuning, averaging fewer than 3 correct characters per generation. Then come two English base models fine-tuned with Dutch text, averaging around 5 and 9 correct characters respectively. Finally, the Dutch base models (10GB and 67GB) fine-tuned with Dutch text average around 14 correct characters.

A bit surprising to us is the relatively small improvement that the 67GB base model brings over the 10GB base model, both fine-tuned and not fine-tuned. We had expected more.

Speed

For the all-Dutch model, we see an increase in speed, both in processing the input and in generating text, of 20% compared to the English/Dutch model. We generate relatively short amounts of text and stop generating when the cumulative probability falls under a threshold. As the all-Dutch model generates more characters before stopping, the total time spent is the same in the end.
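The stopping rule can be sketched roughly as follows; the greedy token-by-token loop and the 0.1 threshold are illustrative assumptions, not our actual implementation.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("./dutch-tokenizer")   # placeholder
model = GPT2LMHeadModel.from_pretrained("./dutch-gpt2-small")        # placeholder
model.eval()

def suggest(prompt: str, threshold: float = 0.1, max_tokens: int = 30) -> str:
    """Generate token by token and stop once the cumulative probability of the
    generated sequence falls below the threshold."""
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    cumulative = 1.0
    generated = []
    with torch.no_grad():
        for _ in range(max_tokens):
            probs = torch.softmax(model(ids).logits[0, -1], dim=-1)
            next_id = int(torch.argmax(probs))
            cumulative *= float(probs[next_id])
            if cumulative < threshold:
                break
            generated.append(next_id)
            ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
    return tokenizer.decode(generated)
```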

Conclusion

We didn’t expect training a GPT-2 model from scratch to be a walk in the park, and it wasn’t. GPUs, drivers, out-of-memory errors, segmentation faults: you gotta like it. The work did pay off, however: we realized a nice increase in text generation quality and speed. We would like to train a medium model, and train for more than one epoch, but we will probably need to move to TensorFlow and TPUs for that. Anyone willing to help? 🙂
