BetterTransformer, Out of the Box Performance for Hugging Face Transformers

Younes Belkada
Published in PyTorch · 6 min read · Nov 17, 2022
Hugging Face meets PyTorch to integrate ‘BetterTransformer’ in its ecosystem

A few months ago, PyTorch launched BetterTransformer (BT), which provides a significant speedup on encoder-based models for all modalities (text, image, audio) using the so-called fastpath execution and fused kernels. We joined forces and made the feature available as a one-liner for the most important models in the Hugging Face ecosystem:

better_model = BetterTransformer.transform(model)

In this blogpost, let us explore how this technology works under the hood and how to use it efficiently on your Hugging Face models with the 🤗 Transformers and 🤗 Optimum libraries!

Also check out the Google Colab demo to reproduce the experiments and try them out on your own, and the 🤗 Optimum documentation for a list of currently supported architectures!

How to use BetterTransformer in the Hugging Face ecosystem

Install dependencies

Make sure to use the latest stable version of PyTorch by following the installation guidelines from the official website. With the optimum and transformers packages installed, you should be ready to go!

pip install transformers optimum

Classic usage

Once you have loaded your model using transformers, use the BetterTransformer API from 🤗 Optimum to convert the original model to its BetterTransformer version.

from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model_name = "roberta-base"
model = AutoModel.from_pretrained(model_name).to("cuda:0")

better_model = BetterTransformer.transform(model)
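
The converted model is used exactly like the original one. Below is a minimal, hedged sketch of running inference with it; the example sentences and the printed shape are illustrative assumptions rather than part of the official example.

from transformers import AutoTokenizer
import torch

# Illustrative only: tokenize a small padded batch and run a forward
# pass through the converted model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(
    ["Hello, BetterTransformer!", "A longer sentence that forces some padding."],
    padding=True,
    return_tensors="pt",
).to("cuda:0")

with torch.no_grad():
    outputs = better_model(**inputs)

print(outputs.last_hidden_state.shape)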

BetterTransformer and pipeline

The BetterTransformer API from Optimum also includes a pipeline integration. If you are familiar with pipelines from transformers, you can simply import pipeline from optimum and run your preferred pipelines as follows!

from optimum.pipelines import pipeline
unmasked = pipeline(
    "fill-mask",
    "distilbert-base-uncased",
    accelerator="bettertransformer",
    device=0,
)

unmasked("I am a student at [MASK] University")

For a more detailed example, we put up a Google Colab that shows how to benchmark BetterTransformer against the original model.
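
As a rough companion to that notebook, here is a minimal, hedged timing sketch on GPU; the batch contents, iteration counts and the average_latency_ms helper are illustrative assumptions and not the notebook's exact methodology.

import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to("cuda:0").eval()

# A padded batch: the short sentences create padding the fastpath can skip.
inputs = tokenizer(
    ["short sequence", "a much longer sentence that forces some padding"] * 16,
    padding=True,
    return_tensors="pt",
).to("cuda:0")

def average_latency_ms(m, n_iters=50):
    # Warm up, then time with CUDA events and synchronize before reading.
    with torch.no_grad():
        for _ in range(10):
            m(**inputs)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n_iters):
            m(**inputs)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

eager_ms = average_latency_ms(model)
better_model = BetterTransformer.transform(model)
better_ms = average_latency_ms(better_model)
print(f"eager: {eager_ms:.2f} ms | bettertransformer: {better_ms:.2f} ms")

On hardware with fused-kernel support, the second number should come out noticeably lower, especially when the batch contains a lot of padding.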

BetterTransformer in a nutshell

BetterTransformer takes advantage of two optimization techniques in its fastpath execution. First, a base speedup is achieved through the use of fused kernels, which implement multiple operations more efficiently in a single kernel. Second, BetterTransformer exploits the sparsity that comes from padding tokens, with gains that depend on the padding ratio of the input token sequences.

Let us now quickly go over the concept of a fused operator and how sparsity is leveraged to make the execution of transformer encoders faster!

What is a fused operator?

“Kernel fusion” means writing a set of sequential operations as a single compiled and optimized “kernel” that is called at runtime for faster execution. In practice, these kernels enhance the throughput of model execution during both training and inference.

Diagram of the Transformer encoder architecture (from “Attention Is All You Need”):
The fused TransformerEncoder operator combines its constituent operations into a single optimized operator and processes both dense torch.tensor and variable-length torch.nested_tensor inputs. TransformerEncoder calls the fused Multi-Head Attention operator, which itself includes several fused kernels, each combining multiple kernels. Image from the PyTorch blogpost.

In BetterTransformer, TransformerEncoderLayer uses several fused kernels that combine, for example, the ReLU activation as an epilogue of the preceding matrix multiplication kernel, and many similar fusions. This reduces both the number of instructions to be executed and the memory bandwidth used, by performing several computation steps within a single kernel.

The entire encoder layer operation is implemented as a single operator, which makes its execution faster than running each operation of the transformer encoder step by step.
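
To build intuition, here is a minimal sketch of kernel fusion in general, not of the actual BetterTransformer kernels: a pointwise bias-add followed by GELU runs as two separate kernels in eager mode, while scripting the function lets the TorchScript fuser combine them into one GPU kernel. The function name and tensor shapes are illustrative assumptions.

import torch

def bias_gelu(x, bias):
    # Eager mode launches two kernels here: an elementwise add, then a GELU.
    return torch.nn.functional.gelu(x + bias)

# Scripting the function allows the JIT fuser to emit a single fused kernel
# for the pointwise add + GELU on CUDA, cutting kernel launches and
# memory round-trips.
bias_gelu_fused = torch.jit.script(bias_gelu)

if torch.cuda.is_available():
    x = torch.randn(64, 1024, device="cuda", dtype=torch.float16)
    bias = torch.randn(1024, device="cuda", dtype=torch.float16)
    for _ in range(3):  # warm-up so the fuser compiles before real use
        bias_gelu_fused(x, bias)

BetterTransformer applies the same idea at a much larger scale, fusing the whole encoder layer rather than a pair of pointwise operations.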

How is sparsity exploited?

In most NLP systems, input sequences of different lengths are padded to a common length so that they can be processed as a batch, and attention scores are masked so that they are not computed on the padding tokens. Note that this concept applies to NLP and most audio models.

BetterTransformer exploits this sparsity by simply skipping the unnecessary computation on padding tokens, using nested tensors.

This is particularly effective for large inputs (long sequence lengths and large batch sizes), and models can benefit from very significant speedups when the percentage of padding tokens is high.
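
For a concrete picture of what nested tensors look like, here is a minimal sketch (assuming a PyTorch version that ships the prototype torch.nested API); the sequence lengths and hidden size are made up for illustration.

import torch

# Two "sequences" of different lengths with a 768-dim hidden state.
seq_a = torch.randn(5, 768)    # 5 tokens
seq_b = torch.randn(12, 768)   # 12 tokens

# Padded batch: the shorter sequence is padded up to 12 positions, and a
# vanilla encoder would still spend compute on those 7 padded positions.
padded = torch.nn.utils.rnn.pad_sequence([seq_a, seq_b], batch_first=True)
print(padded.shape)   # torch.Size([2, 12, 768])

# Nested tensor: each sequence keeps its true length, so no padding is
# stored and the fastpath can skip computation on pad tokens entirely.
nested = torch.nested.nested_tensor([seq_a, seq_b])
print(nested.is_nested)  # True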

BetterTransformer takes advantage of variable-length sequence information:
The speedup and efficiency of BetterTransformer fastpath execution with the torch.nested_tensor sparsity optimization scale as the amount of padding in a batch increases. Image from the PyTorch blogpost.

For modalities that do not use padding tokens (e.g., vision models), the speedup mainly varies with the size of the sequences being processed and the hardware type.

Expected speedups

The base speedup is very sensitive to the hardware you are using: on recent hardware (NVIDIA A100, V100), the speedup is much more significant than on older GPUs (NVIDIA T4, etc.). Note that a speedup is also observed on CPUs, which makes the method generally faster than the native implementation on the most commonly used hardware.

Check below the speedup results on an NVIDIA T4 using distilbert-base-uncased in half-precision mode (torch.float16):

Experiment showing the speedup of BetterTransformer transformation using `distilbert-base-uncased` on a NVIDIA T4 GPU, in half precision. The speedup is computed with respect to the padding percentage (on the x-axis) and each graph represents an experiment with a specific sequence length

And the speedup results on an NVIDIA A100 using bert-large-uncased in half precision are shown in the graph below:

Speedup results on `bert-large-uncased` model using a NVIDIA A100. The speedup is computed with respect to the padding percentage (on the x-axis) and each graph represents an experiment with a specific sequence length

For more detailed numbers, please check the report for bert-large-uncased and the report for bert-base-uncased.

Note that a few encoder-decoder models (Whisper & BART) are currently supported, but the speedup for these models is not considerable for generation tasks, because most of the operations happen on the decoder side.

As stated in this blogpost, text encoder-based models can benefit from interesting speedups; this includes ALBERT, BERT, CamemBERT, RoBERTa and XLM-RoBERTa. Note also that sparsity can be leveraged in a CPU setting as well; check our example Colab notebook for more details.

What about other modalities?

We have benchmarked the speedup on several models of different modalities (image and audio).

For vision models (ViT, DeiT, YOLOS), a speedup is observed on GPU only. Since the notion of a padding token cannot be defined for vision-based transformers, these models can benefit only from the base speedup.

For small batch sizes, we saw up to a 3x speedup for the google/vit-huge-patch14-224-in21k model. Note that the speedup decreases with the batch size, as we cannot leverage the padding-token optimizations for this modality.

Speedup of the `BetterTransformer` integration on the `google/vit-huge-patch14-224-in21k` model (using float16) on an NVIDIA A100 GPU with respect to the batch size.

For audio models, a speedup is observed on feature extraction tasks. It decreases for larger batch sizes, which can be explained by the long sequence lengths used for audio processing.

Impact of using `BetterTransformer` on `facebook/wav2vec2-base-960h` with respect to the padding percentage and batch size of the input sequences. Speedup was measured on feature extraction (not audio generation) on an NVIDIA A100 GPU with a sequence length of 8000, using half precision (float16).

How to contribute and add support for more models

The integration of BetterTransformer with Hugging Face currently supports some of the most widely used transformer models, and support for all compatible transformer models is in progress. You can find a table of supported models, as well as models that could potentially be supported, in the Optimum documentation for BetterTransformer.

To extend the support to a wider range of models, we welcome your contributions! We put up a tutorial to guide you through the steps for adding support for new models; make sure to check it out!

Future work

In upcoming PyTorch releases, users will benefit from exciting additional features, including support for decoder-based models as well as even faster attention using FlashAttention and xFormers kernels. These features will make these models even faster! You can already try your hand at them by downloading the latest nightlies of PyTorch.

Since these custom fused kernels are planned to be integrated into a future PyTorch release, you can expect further speedups in upcoming PyTorch versions. The next release will also likely include support for decoder-based models, which will make models such as BLOOM or OPT much faster! Stay tuned…

Acknowledgements

The Hugging Face team would like to acknowledge the entire PyTorch team responsible for building this amazing tool and for helping us achieve a smooth integration. A big thanks to: Eric Han, Hamid Shojanazeri, Christian Puhrsch, Driss Guessous, Michael Gschwind and Geeta Chauhan.
