The murky world of natural language processing [1].

The Consequences of Modern NLP Approaches…

…and how to remedy them using Large Model Support

Tom Farrand
Oct 31, 2019 · 10 min read

Recent advances within the world of Natural Language Processing (NLP) have heralded what some are calling NLP’s ImageNet moment [2], referring to the watershed moment in 2012 when a deep neural network trained on the ImageNet dataset achieved a visual recognition error rate roughly 40% lower than that of the closest competitor. This ushered in a new era of machine learning research.

In a nutshell, the shift in NLP models has been the use of deeply pre-trained language models, which can then be used for a wide range of language tasks. Breaking the italicised text down further should help to illuminate what this really means. Working backwards:

  • language models: The goal of language modelling is to estimate the probability distribution of various linguistic units, e.g. words, sentences, etc. [3]. This can be thought of as similar to the internal model of English we each carry in order to read this article and extract information from it. A performant language model allows you to tackle a number of language-related tasks, such as relationship extraction, named entity recognition, sentiment analysis and so on (a toy sketch follows this list).
  • pre-trained: The weights of the model are not randomly initialised; instead they have been set through training on a large, general corpus (commonly the entirety of Wikipedia, plus a massive number of books). The advantage of pre-training is that it reduces the volume of data required to get meaningful results, compared to training from scratch. This is because the pre-trained model will have already learned a distribution that overlaps with the distribution of the problem you are trying to solve. This is an example of the power of transfer learning.
  • deeply: Pre-trained language models have been around for some time, e.g. word2vec and GloVe (used to produce word embeddings). However, these have always been shallow: only a single layer or a couple of layers were pre-trained, with the remaining layers being randomly initialised. Modern architectures, typically based on the Transformer [4], make use of deep pre-training, where all of the layer parameters within the network are generated from pre-training on a massive initial corpus of text.
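
To make the first bullet concrete, here is a toy sketch of language modelling, assuming nothing beyond the definition above: estimate the probability of the next word from bigram counts over a tiny corpus. Modern language models learn far richer representations, but the underlying objective is the same in spirit.

# A toy bigram language model: estimate P(next word | previous word) from counts.
from collections import Counter, defaultdict

corpus = "the model reads text . the model predicts the next word .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probability(prev, nxt):
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

print(next_word_probability("the", "model"))  # 2/3: "the" is followed by "model" twice and "next" once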

Machine learning practitioners can now make use of readily available pre-trained models, which can then be fine-tuned on a much smaller labelled dataset. This leads to a greater range of problems which can be tackled using NLP, as well as faster time-to-results in existing domains.
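
As a minimal sketch of that workflow, the snippet below fine-tunes a pre-trained BERT on a toy two-example "dataset" using the Hugging Face transformers library. This is an assumption chosen purely for illustration (the experiments later in this article use Google's TensorFlow BERT release), and the texts and labels are stand-ins for a real labelled dataset.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load weights pre-trained on a large general corpus, plus a fresh classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["the model converged quickly", "training diverged after one epoch"]
labels = torch.tensor([1, 0])  # toy sentiment-style labels

# Tokenise into the fixed-length integer sequences BERT expects.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    optimizer.zero_grad()
    outputs = model(**inputs, labels=labels)  # the model returns a loss when labels are supplied
    outputs.loss.backward()
    optimizer.step()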

The Consequences…

The rise in popularity of deep pre-training, fuelled by its ability to achieve state-of-the-art results, has led to an exponential increase in the size of modern NLP architectures. This has been mirrored by an exponential increase in the computational power required both to train these networks and to perform inference against them.

Figure 1: An exponential increase in parameters!

First, consider the impact on inference. The mean pause between replies in a conversation is around 250ms [5]. For conversational agents to engage at a natural cadence they must fit within this time envelope. Modern conversational agents differ in their component parts, but as a whole their data pipelines are likely to be complex, stringing together multiple neural networks. This leaves each individual neural network a mere 10ms for inference [6].

A real-world limitation of this occurs when using a BERT-Base* [7] model within a CPU-only environment. Even after model optimisation and fine-tuning, the model returns predictions with a delay of roughly 40ms [8]. GPU-based platforms are significantly more performant and achieve inference under the 10ms latency limit. However, given current trends in model footprint and the scarcity of GPU-based systems in production environments, this still poses a real barrier to deploying BERT and similar models at scale.
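
Checking whether a given model fits that ~10ms envelope only needs a simple timing harness. Here is a minimal sketch, where predict_fn is a placeholder for whatever serving call your pipeline actually makes.

import time
import statistics

LATENCY_BUDGET_MS = 10.0

def median_latency_ms(predict_fn, sample, warmup=10, runs=100):
    # Warm-up iterations let caches, JIT compilation and GPU kernels settle.
    for _ in range(warmup):
        predict_fn(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)

# Stand-in predict function; replace with a real model call.
latency = median_latency_ms(lambda x: sum(x), list(range(1000)))
print(f"median latency: {latency:.3f} ms "
      f"({'within' if latency <= LATENCY_BUDGET_MS else 'over'} the budget)")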

On the training side, ever greater amounts of computation are being employed. Nvidia recently trained BERT-Large from scratch in under 47 minutes, using 1,472 V100 GPUs [9]. In the original paper it took 4 days, using a cluster of 64 TPU chips.

However, throwing exponentially more compute at these models is clearly impractical, leading to ever-rising financial and environmental costs. Therefore, researchers have been investigating ways of practically training these monsters. In this article I will focus on a tensor-swapping technique pioneered by IBM called Large Model Support (LMS). Other methods, such as gradient accumulation [10] and gradient checkpointing [11], are also worth exploring if you wish to drive down memory footprint.

Applying Large Model Support

LMS works by swapping tensors created during training from GPU memory to system memory, thereby reducing the footprint of the training process. The TensorFlow implementation does this by adding swap-in/swap-out nodes to the computational graph before the graph is executed. The PyTorch implementation is more dynamic and swaps tensors based upon the current GPU memory utilisation; with the introduction of TensorFlow 2.0 it is likely that this will become the de facto approach for TensorFlow too.
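
Conceptually, the swap looks like the snippet below: a hand-written PyTorch illustration, not the LMS API itself, which performs the equivalent moves for you (as inserted graph nodes in TensorFlow, or dynamically in PyTorch).

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A forward pass produces a large activation tensor on the GPU...
activation = torch.randn(64, 512, 1024, device=device)

# ...which is "swapped out" to host (system) memory while it is not needed,
# freeing GPU memory for the layers that follow.
host_copy = activation.to("cpu")
del activation
if device.type == "cuda":
    torch.cuda.empty_cache()

# When the backward pass needs the tensor again, it is swapped back in.
activation = host_copy.to(device)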

LMS is a Python library available for free through the Watson Machine Learning Community Edition [12] bundle of Python packages. LMS supports both POWER and x86 processor architectures, and has been tested on P100 and V100 GPUs. It is optimised for use on the IBM Accelerated Computing [13] platform, which uses a unique CPU-GPU NVLink to enable near-coherent tensor swapping, reducing the overhead introduced by shuttling tensors between GPU and CPU memory.

To demonstrate the potential gains of using LMS we will fine-tune BERT-Large on a 16GB P100 GPU. BERT-Large is so tricky to train on cards with 12–16GB of memory that the researchers have a section within their GitHub repository explaining that attempting to fine-tune the model would do more harm than good. Hopefully, with the use of LMS, I can demonstrate a more efficient way to train the model without compromising on performance. To be consistent with the original research paper we will use 32-bit floats.

The two parameters which are the main determinants of BERT’s memory usage are the maximum sequence length and the batch size. The maximum sequence length sets the number of tokens the BERT model considers as input (Figure 2 displays the BERT input representation of tokens and the three different embeddings), while the batch size determines how many of those sequences are fed to the GPU at any one time during an epoch. If the maximum sequence length can be increased, more semantic information is passed through to the model in each forward/backward pass, while a higher batch size improves time to result by allowing faster complete passes through the dataset.

Figure 2: Sequence representation given as input to BERT [14].
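
The rough sketch below illustrates why these two parameters dominate memory usage. It is not BERT's real memory model, only an illustrative scaling argument using BERT-Large's published hidden size (1024) and head count (16): activation cost grows linearly with batch size, and faster than linearly with sequence length, because the attention score matrices grow with the square of the sequence length.

def relative_activation_cost(batch_size, seq_length, hidden=1024, heads=16):
    # (batch, seq, hidden) shaped tensors produced throughout each layer.
    hidden_terms = batch_size * seq_length * hidden
    # Attention score matrices: one (seq, seq) matrix per head per example.
    attention_terms = batch_size * heads * seq_length ** 2
    return hidden_terms + attention_terms

baseline = relative_activation_cost(batch_size=1, seq_length=128)
for batch, seq in [(1, 128), (8, 128), (1, 512), (10, 512)]:
    ratio = relative_activation_cost(batch, seq) / baseline
    print(f"batch={batch:2d} seq={seq:3d} -> ~{ratio:.0f}x the baseline activation cost")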

I will make use of SQuAD (the Stanford Question Answering Dataset) 1.1. The research team behind BERT provide a run_squad.py script to make it easy to kick off fine-tuning jobs of different BERT models on SQuAD.

Initially, I performed a simple verification training run on BERT-Base**, which produced metrics comparable to those recorded by the research team. I then focused on finding the highest values of max_seq_length and train_batch_size that did not produce out-of-memory errors (very easy to trigger!), without enabling LMS. To be as comparable as possible to the results in the original research paper, I made use of the full BERT-Large model.
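
A sketch of that search is shown below. The flag names come from the run_squad.py script in the google-research/bert repository, while the file paths, the very short probe run and the out-of-memory check are illustrative assumptions.

import subprocess

BERT_LARGE_DIR = "/path/to/bert_large"  # assumed checkpoint location
SQUAD_DIR = "/path/to/squad"            # assumed dataset location

def fits_in_memory(max_seq_length, train_batch_size):
    result = subprocess.run(
        ["python", "run_squad.py",
         f"--vocab_file={BERT_LARGE_DIR}/vocab.txt",
         f"--bert_config_file={BERT_LARGE_DIR}/bert_config.json",
         f"--init_checkpoint={BERT_LARGE_DIR}/bert_model.ckpt",
         "--do_train=True",
         f"--train_file={SQUAD_DIR}/train-v1.1.json",
         f"--max_seq_length={max_seq_length}",
         f"--train_batch_size={train_batch_size}",
         "--num_train_epochs=0.01",  # a very short probe run
         "--output_dir=/tmp/squad_probe"],
        capture_output=True, text=True)
    # TensorFlow reports a ResourceExhaustedError when the GPU runs out of memory.
    return result.returncode == 0 and "ResourceExhaustedError" not in result.stderr

largest = None
for batch_size in (1, 2, 4, 8, 16):
    if not fits_in_memory(max_seq_length=512, train_batch_size=batch_size):
        break
    largest = batch_size
print(f"Largest train_batch_size without OOM at max_seq_length=512: {largest}")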

With max_seq_length = 512 a maximum value of train_batch_size = 1 is achieved without encountering out-of-memory errors, with a throughput of 2.9 examples per second. When using a much smaller max_seq_length = 128 a maximum value of train_batch_size = 8 is obtained, with a throughput of 16.7 examples per second. The throughput value is important as it will be used to measure the swapping overhead I pay to make use of LMS.

Next I will enable LMS in the run_squad.py script. This is very simple to do, and requires only a few lines of code.

# Import the TF LMS module
from tensorflow_large_model_support import LMS

# Instantiate the LMS object with maximum swapping parameters.
# To use the auto-tuning feature instead, simply initialise the
# LMS object without any arguments, e.g. lms_hook = LMS()
lms_hook = LMS(swapout_threshold=1,
               swapin_ahead=0,
               swapin_groupby=0,
               sync_mode=0)

# Make LMS aware of the train_batch_size parameter
lms_hook.batch_size = FLAGS.train_batch_size

# Include the lms_hook object in the estimator hooks list
estimator.train(input_fn=train_input_fn,
                max_steps=num_train_steps,
                hooks=[lms_hook])

You will notice that LMS introduces four hyper-parameters to work with. Typically I would not need to worry about them, as LMS provides an auto-tuning feature which automatically evaluates your computational graph and sets appropriate values for these hyper-parameters, based upon estimated memory consumption throughout training. However, manual tuning allows for closer control, squeezing out maximum performance. The four hyper-parameters introduced are:

  • swapout_threshold: The number of tensors to hold within GPU memory before pushing them to system memory.
  • swapin_ahead: The larger swapin_ahead is, the earlier a tensor is swapped in to the GPU memory from the host memory.
  • swapin_groupby: Multiple swap-in operations of the same tensor will be grouped or fused into one swap-in operation for better performance if they are close to each other (the distance between them is within swapin_groupby).
  • sync_mode: Whether to do synchronisation between data transfer and kernel computation or not.

Initially, I specify a LMS hyper-parameter set which introduces the maximum amount of swapping. This generates the greatest swapping overhead, but minimises the memory footprint. Once I understand the maximum values of max_seq_length and train_batch_size possible using LMS, I can then begin to increase these hyper-parameters, and remove much of the swapping overhead.
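
As a rough sketch of that tuning loop, the snippet below reuses the constructor arguments shown above; the candidate values and the throughput measurement (run_short_benchmark, a placeholder for a brief training run that returns examples/sec) are illustrative assumptions, not part of the LMS library.

from tensorflow_large_model_support import LMS

# Candidate settings, ordered from maximum swapping towards less swapping.
candidate_settings = [
    dict(swapout_threshold=1,   swapin_ahead=0,   swapin_groupby=0,  sync_mode=0),
    dict(swapout_threshold=256, swapin_ahead=64,  swapin_groupby=8,  sync_mode=0),
    dict(swapout_threshold=712, swapin_ahead=248, swapin_groupby=16, sync_mode=0),
]

for settings in candidate_settings:
    lms_hook = LMS(**settings)
    lms_hook.batch_size = FLAGS.train_batch_size
    # run_short_benchmark is a hypothetical helper: run a few hundred training
    # steps with this hook and report examples/sec; stop relaxing the
    # parameters once out-of-memory errors reappear.
    throughput = run_short_benchmark(estimator, train_input_fn, lms_hook)
    print(settings, f"{throughput:.1f} examples/sec")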

So, what impact did this have on fine-tuning BERT?

For the smaller sequence length of 128, applying LMS allowed the batch size to be increased from 8 all the way up to 96, a 12x jump. Meanwhile, for the max_seq_length = 512 example, applying LMS meant that the batch size could be increased by 10x, from 1 to 10.

In the max_seq_length = 512 case this resulted in an increase in throughput as well, with the number of examples processed per second during training increasing by 1.5x, from 2.9 to 4.3 examples/sec.

Focusing on the case where max_seq_length = 512 and train_batch_size = 10, the final LMS parameter set which I settled on was:

lms_hook = LMS(swapout_threshold=712,
               swapin_ahead=248,
               swapin_groupby=16,
               sync_mode=0)

However, these increases in batch size and throughput are only relevant if they improve the accuracy and/or the time to result. The research team behind BERT even state the following on the official GitHub page, regarding training on a GPU with 12–16GB of memory:

Unfortunately, these max batch sizes for BERT-Large are so small that they will actually harm the model accuracy, regardless of the learning rate used.

With this in mind, I’ll try to improve both the accuracy of the fine-tuned BERT-Large model, and the time to result.

The authors report that training BERT-Large on a cloud TPU, using max_seq_length = 384 and train_batch_size = 24, consistently results in an F1 score between 90.5% and 91.0%. They present the results of a single run:

{"exact_match": 84.38978240302744, "f1": 90.87081895814865}

This will be the accuracy score to beat for our benchmarking runs.

Benchmarking BERT-Large without LMS enabled, using max_seq_length = 512 and train_batch_size = 1, verifies the researchers’ finding that accuracy is harmed at such low batch sizes:

{"exact_match": 82.62062440870388, "f1": 89.61706821966973}

Both the exact match and the F1 metric are damaged by the low batch size, despite the higher sequence length used. This training run took a total of 17 hours and 20 minutes on a single 16GB P100 card.

When enabling LMS to allow the training batch size to be increased to 10, we see a resultant rise in the accuracy metrics:

{"exact_match": 86.58467360454115, "f1": 92.8553206131807}

This exceeds the accuracy obtained on a cloud TPU, which has 4x more memory than the P100 card we trained on, and demonstrates that BERT-Large can be functionally trained on a 16GB card. The training run took a total of 12 hours and 23 minutes, roughly 5 hours faster than the run without LMS enabled.

Conclusion

To recap, we first considered the advances in NLP and how deeply pre-trained language models have been at the forefront of them. We then considered the memory limitations of these language models, due to their exponentially growing parameter sets. BERT-Large was then trained with and without LMS, by changing the memory-hungry hyper-parameters max_seq_length and train_batch_size. The addition of LMS to the training routine allowed BERT-Large to be trained on the memory-constrained P100 card, improving both the accuracy metrics and the time to result.

Future steps

This was an initial foray into applying LMS to a modern transformer architecture. In the future, more architectures should be considered, and perhaps novel changes to these architectures will be permitted thanks to the coherent data transfer enabled by LMS.

Specific to training BERT-Large, in the future I would like to: run further independent training runs to verify that these were not one-off results; run on a 32GB V100 card to understand what the next generation of GPUs can achieve on this model; and use 16-bit floats to see the impact of half-precision training on accuracy and time to result.

Furthermore, I would need to provision a cloud TPU to understand the time to result on that platform versus the LMS-enabled P100 GPU.

Notes

*For the remainder of this article BERT [7] will be used as an example. BERT stands for Bidirectional Encoder Representations from Transformers, and is a method for pre-training language models. It obtains state-of-the-art results on a wide range of language processing tasks. BERT-Base is a condensed 12-layer, 110 million parameter network released alongside the state-of-the-art BERT-Large model, which has double the number of layers and approximately triple the number of parameters.

**Results are comparable to those published, verifying the platform being used. The fine-tuning took 166 minutes on a single P100, using max_seq_length = 384 and train_batch_size = 12. The prediction scores of the fine-tuning run are: {"exact_match": 81.00283822138127, "f1": 88.30995482677369}. Platform specification:

  • POWER8 with two P100 16GB cards, and 512 GiB of RAM
  • Ubuntu 18.04
  • Watson ML Community Edition 1.6.1
  • TensorFlow 1.14.0
  • TensorFlow LMS 2.0.1
  • CUDA 10.1

References

[1] https://www.nasa.gov/multimedia/imagegallery/iotd.html

[2] http://ruder.io/nlp-imagenet/

[3] https://medium.com/syncedreview/language-model-a-survey-of-the-state-of-the-art-technology-64d1a2e5a466

[4] https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

[5] https://www.pnas.org/content/106/26/10587.full

[6] https://blogs.nvidia.com/blog/2019/08/19/what-is-conversational-ai/

[7] https://github.com/google-research/bert

[8] https://devblogs.nvidia.com/nlu-with-tensorrt-bert/

[9] https://devblogs.nvidia.com/training-bert-with-gpus/

[10] https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255

[11] https://github.com/openai/gradient-checkpointing

[12] https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.1/navigation/welcome.html

[13] https://www.ibm.com/uk-en/marketplace/power-systems-ac922

[14] https://arxiv.org/abs/1810.04805
