The Transformers library implements several state-of-the-art transformer architectures used for NLP tasks such as text classification, information extraction, question answering, and text generation. It is used by researchers and companies alike, and offers PyTorch and TensorFlow front-ends.
Since the release of our TensorFlow implementation, we have been working on productionizing the models and making them available on TPU, slowly gearing ourselves towards performance.
This post compares the performance of our models in several environments. We compare them for inference, on CPU and GPU, for PyTorch (1.3.0) as well as TensorFlow (2.0). As several factors affect benchmarks, this is the first of a series of blog posts concerning benchmarks and subsequent performance optimizations.
In addition to this post, we are creating a Benchmark section in our documentation, which will evolve as we work further on our models and benchmark them in different settings.
The results are visible in this Google Spreadsheet. The average results are visible in the table below. The results are detailed in the discussion section.
N/A entries in the spreadsheet indicate either an out-of-memory error or an inappropriate sequence length. Transformer-XL does not have TorchScript results as it is not currently serializable by TorchScript.
In most cases, the TensorFlow and PyTorch models obtain very similar results, both on GPU and CPU. Below is a short discussion of the results, comparing PyTorch with TensorFlow as well as the models with one another.
Inference time is an important metric when putting a model in production. In order to evaluate the inference times of our models, we compare them with different batch sizes and different sequence lengths. We compare the batch sizes [1, 2, 4, 8] with the sequence lengths [8, 64, 128, 256, 512, 1024]. The batch sizes remain small as we are exclusively looking at an inference setup. BERT and other similar models have a maximum sequence length of 512, or 256 for CTRL, and are therefore not measured at the sequence lengths beyond their limit.
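The grid of configurations described above can be sketched as follows. This is only an illustration, not the benchmark script itself: the `MAX_SEQUENCE_LENGTH` mapping and the `benchmark_grid` helper are hypothetical names, and the two limits simply restate the 512 and 256 figures given above.

```python
from itertools import product

BATCH_SIZES = [1, 2, 4, 8]
SEQUENCE_LENGTHS = [8, 64, 128, 256, 512, 1024]

# Hypothetical lookup for illustration; models not listed here are
# assumed to have no sequence length limit.
MAX_SEQUENCE_LENGTH = {"bert": 512, "ctrl": 256}


def benchmark_grid(model_name):
    """Yield the (batch_size, sequence_length) pairs measurable for a model."""
    limit = MAX_SEQUENCE_LENGTH.get(model_name)
    for batch_size, sequence_length in product(BATCH_SIZES, SEQUENCE_LENGTHS):
        if limit is not None and sequence_length > limit:
            continue  # would be reported as N/A: inappropriate sequence length
        yield batch_size, sequence_length


for name in ("bert", "ctrl"):
    print(name, len(list(benchmark_grid(name))), "configurations")
```

A model limited to 512 tokens keeps five of the six sequence lengths, so it is measured on 20 of the 24 configurations.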
We test the results in two different environments:
- on CPU, using a GCP n1-standard-32 which has 32 vCPUs and 120GB of RAM. The CPU model is an Intel Xeon @ 2.3GHz.
- on GPU, using a custom GCP machine that has 12 vCPUs, 40GB of RAM and a single V100 GPU (16GB VRAM).
Experiment details & best practices
In order to maximize performance, further optimizations are made:
- The Intel Xeon CPU on which we measure the CPU inference comes with AVX and AVX2 extensions. As TensorFlow must be compiled from source to leverage those extensions, we do so.
- We make sure we are not using TensorFlow’s eager mode by using tf.function and tracing the models beforehand.
- We compare the inference with and without the library-dependent tools: TorchScript for PyTorch, and XLA (Auto-clustering) for TensorFlow with GPUs. These two tools are detailed below.
- We use the native Python module timeit to measure the inference time. We run each of our experiments 30 times with number=3, then average over the 30 values to get the expected average inference time. Averaging over 30 values yields very stable results.
- We do not make use of production environments such as TFX, and we measure the models’ callable method: nn.Module.forward for PyTorch and the equivalent callable for TensorFlow.
- We are careful to use the appropriate CUDA versions for both TensorFlow and PyTorch.
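The timing procedure from the list above can be sketched with the standard library alone. This is a minimal sketch under assumptions: `repeat=30` is inferred from the 30 values mentioned above, dividing each total by `number` to get a per-call time is our choice rather than something the post specifies, and `average_inference_time` and `dummy_forward` are hypothetical names standing in for a real model call.

```python
import timeit


def average_inference_time(run_inference, repeat=30, number=3):
    """Return the average time of a single call, in seconds."""
    # timeit.repeat returns `repeat` totals, each covering `number` calls.
    totals = timeit.repeat(run_inference, repeat=repeat, number=number)
    # Assumption: normalize each total to a per-call time before averaging.
    per_call = [total / number for total in totals]
    return sum(per_call) / len(per_call)


def dummy_forward():
    """Stand-in workload; a real run would call the model here."""
    return sum(i * i for i in range(1000))


# One untimed call first, mirroring the "trace beforehand" step.
dummy_forward()
print(f"{average_inference_time(dummy_forward):.6f} s per call")
```

Running one untimed call before measuring keeps one-off costs, such as tracing, out of the reported average.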
PyTorch and TensorFlow
Both libraries obtain similar results in most cases, with TensorFlow generally being a bit slower on CPU compared to PyTorch, but a bit faster on GPU:
- Across all models, on CPU, PyTorch has an average inference time of 0.748s while TensorFlow has an average of 0.823s.
- Across all models, on GPU, PyTorch has an average inference time of 0.046s whereas TensorFlow has an average inference time of 0.043s.
These results compare the inference time across all models by averaging the results. As a consequence, the larger the input size, the larger the impact on the final result. PyTorch runs out of memory when the input sizes are too large; the corresponding configurations are removed from all measures when averaging, as keeping them for TensorFlow alone would skew the results in PyTorch’s favor.
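To make the averaging rule concrete, here is a small sketch that only averages over configurations both frameworks completed. The helper name `paired_averages` and all timings are made up for illustration; they are not taken from the spreadsheet.

```python
def paired_averages(pytorch_times, tensorflow_times):
    """Average both frameworks over the configurations both completed."""
    shared = [
        key
        for key in pytorch_times
        if pytorch_times[key] is not None and tensorflow_times.get(key) is not None
    ]
    if not shared:
        raise ValueError("no configuration was measured by both frameworks")
    pytorch_mean = sum(pytorch_times[key] for key in shared) / len(shared)
    tensorflow_mean = sum(tensorflow_times[key] for key in shared) / len(shared)
    return pytorch_mean, tensorflow_mean


# Made-up timings in seconds, keyed by (batch_size, sequence_length).
# None marks an out-of-memory run.
pytorch_times = {(1, 8): 0.010, (8, 512): 0.200, (8, 1024): None}
tensorflow_times = {(1, 8): 0.012, (8, 512): 0.180, (8, 1024): 0.900}

pytorch_mean, tensorflow_mean = paired_averages(pytorch_times, tensorflow_times)
print(f"PyTorch: {pytorch_mean:.3f} s, TensorFlow: {tensorflow_mean:.3f} s")
```

With these made-up numbers, keeping the (8, 1024) entry for TensorFlow alone would raise its average from 0.096s to 0.364s, which is the skew the rule avoids.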
The PyTorch models tend to run out of memory earlier than the TensorFlow models: apart from the Distilled models, PyTorch runs out of memory when the input size reaches a batch size of 8 and a sequence length of 1024.
TorchScript is PyTorch’s way of creating serializable models that can run on different runtimes, such as C++ environments, with no need for Python dependencies. Our tests were done by tracing the model in Python and re-using that traced model in the same environment. We make sure to trace the model before measuring its inference by executing a forward pass beforehand.
Disclaimer: while TorchScript does not seem to be inherently created for speed-up in a Python environment, our results show that tracing the model with TorchScript can yield performance improvements.
The effect of TorchScript seems to depend heavily on the model and the input size (batch size * sequence length); as an example, using TorchScript yields a consistent performance boost on XLNet, whereas its use may be questionable on XLM, where it increases performance at smaller input sizes but decreases performance at larger input sizes.
On average, an inference with a model traced with TorchScript is 20% faster than an inference with the same PyTorch non-traced model.
XLA is a linear algebra compiler that can accelerate TensorFlow models. We’re using it solely on GPU where it is based on TensorFlow’s Auto-clustering which compiles some of our models’ subgraphs.
The results are improvements in speed and memory usage: most internal benchmarks run ~1.15x faster after XLA is enabled.
We obtain an increase in performance with all of our models when XLA is enabled. In some extreme cases, we obtain a decrease of 70% in inference time, especially at smaller input sizes.
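The figures above are quoted in two different ways: as a speedup factor (~1.15x faster) and as a decrease in inference time (70%). The sketch below converts between the two. It assumes that "N times faster" means the old time divided by the new time equals N; the function names are hypothetical.

```python
def speedup_from_time_reduction(reduction):
    """Convert a fractional decrease in inference time into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1 / (1 - reduction)


def time_reduction_from_speedup(factor):
    """Convert a speedup factor into a fractional decrease in inference time."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return 1 - 1 / factor


# A 70% decrease in inference time, expressed as a speedup factor.
print(f"{speedup_from_time_reduction(0.70):.2f}x")
# A 1.15x speedup, expressed as a decrease in inference time.
print(f"{time_reduction_from_speedup(1.15):.1%}")
```

Under that assumption, a 70% decrease in inference time corresponds to roughly a 3.33x speedup, while a 1.15x speedup corresponds to roughly a 13% decrease.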
Models and their distilled version
Distilled models shine in this test as being very quick to benchmark. Both of the Hugging Face-engineered models, DistilBERT and DistilGPT-2, see their inference times halved when compared to their teacher models.
As benchmarking on all the different setups, with every tool, isn’t achievable by a single organization, we welcome benchmarks from the community. The GitHub user @tlkh has already contributed by benchmarking the performance that can be achieved using AMP, XLA and distributed strategies on our TensorFlow models. This contribution is currently being added to the benchmarking section of the documentation.
How to contribute
If you would like to contribute, we have set up issue templates on our GitHub to make it easier. Feel free to open an issue with your results, or to open a pull request with your additions to the benchmark section of the documentation.
Accompanying the release of this blog post and the Benchmark page in our documentation, we add a new script to our examples section: benchmarks.py, which is the script used to obtain the results detailed in this post. It can run benchmarks on TensorFlow, on PyTorch, using XLA or TorchScript, and save the results to a CSV file.
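As a rough idea of what saving such results to a CSV file can look like, here is a sketch using Python's csv module. It does not reproduce the actual script: the column names, the `write_results` helper, the model name and the timing are all hypothetical.

```python
import csv
import io

# Hypothetical column layout, chosen for illustration only.
FIELDNAMES = ["model", "framework", "batch_size", "sequence_length", "seconds"]


def write_results(rows, handle):
    """Write benchmark rows as CSV, marking missing measurements as N/A."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        seconds = "N/A" if row["seconds"] is None else row["seconds"]
        writer.writerow({**row, "seconds": seconds})


# Made-up rows; None stands for an out-of-memory run.
rows = [
    {"model": "example-model", "framework": "pytorch",
     "batch_size": 1, "sequence_length": 8, "seconds": 0.01},
    {"model": "example-model", "framework": "pytorch",
     "batch_size": 8, "sequence_length": 1024, "seconds": None},
]

buffer = io.StringIO()  # swap for open("results.csv", "w", newline="") to write a file
write_results(rows, buffer)
print(buffer.getvalue())
```

One row per (model, framework, batch size, sequence length) keeps the file easy to filter and average later.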
Benchmarking our models is but the first step on our road to better performance. We believe this introductory article may be of help when comparing the current state of our models, especially when looking at the difference between PyTorch and TensorFlow. As we delve into the production aspects of Transformers, we are bound to work on performance-oriented improvements.
Automated scripts, new architectures and custom TPU training for PyTorch and TensorFlow: keep an eye out for future releases!