Optimizing Transformers Models for Peak Performance

Jeremy K
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
7 min read · Oct 31, 2023

With each passing day, a fresh wave of AI models emerges, surpassing their predecessors in both intelligence and intricacy. This advancement isn’t confined to large language models with billions of parameters; it resonates across various domains. In computer vision, for instance, behemoths like the 80-billion-parameter IDEFICS model, which tips the scales at an astounding 160 GB, exemplify this trend.

Yet, for many users, a significant hurdle arises: gaining access to the potent GPUs necessary for running these state-of-the-art models can be a real challenge.

The encouraging news is that Transformers offers an array of clever solutions. These optimizations provide a pathway to harnessing the cutting-edge capabilities of these models without incurring exorbitant expenses on servers or cloud services. In the upcoming sections, we will delve into strategies to run models smarter, not just harder.

Fine-tuning precision levels: balancing accuracy and speed

AI models operate at varying levels of precision, a key factor in computations. Higher precision, like the standard 32 bits (single precision), ensures accurate results during inference. Conversely, lower precision, such as 16 bits (half-precision), accelerates the inference process significantly.

The art lies in striking a balance between faster inferences and increased accuracy. While higher precision offers greater numerical accuracy, it demands more memory and computational resources. On the other hand, lower precision can lead to memory and speed advantages, making it particularly valuable in resource-constrained environments.

It’s worth noting that the choice of precision may depend on the specific hardware capabilities. Some older GPUs, for instance, may not support certain lower precision modes. Therefore, understanding the intricacies of your hardware is essential when determining the optimal precision for your AI models.
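If you are unsure what your hardware supports, a quick check along the following lines can help. This is a minimal sketch using PyTorch’s device queries; adapt it to your own setup:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name(0)}, compute capability {major}.{minor}")
    #FP16 Tensor Cores arrived with compute capability 7.0 (Volta and newer)
    print("FP16 well supported:", major >= 7)
    #bfloat16 requires compute capability 8.0 (Ampere) or newer
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA GPU detected; float32 on CPU is the safest choice.")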

  • Model size and GPU memory

The choice of precision significantly impacts GPU memory usage, as illustrated in the table below featuring the Whisper large v2 model:

Model size and GPU memory used for single, half and 8-bit precision

For each precision, the model is loaded in a specific way:

import torch
from transformers import AutoModelForSpeechSeq2Seq

#Single precision (float32, the default)
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")
#Half-precision (float16)
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2", torch_dtype=torch.float16)
#8-bit precision
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2", load_in_8bit=True)

To use 8-bit precision, additional packages must be installed:

pip install accelerate bitsandbytes scipy

Reducing precision immediately cuts down model size and GPU memory usage, democratizing access to large models. However, the question remains regarding performance gains.

  • Processing time and accuracy

Building on the methodology detailed in a previous benchmarking study, we’ll compare the computational time of Whisper large v2 across multiple precisions, along with Word Error Rate (WER).

For this test, we took 1,000 English samples from the Common Voice dataset, yielding the following results:

Comparison of WER — Whisper large v2

Overall, the WER is barely affected by lower precision. Surprisingly, it even improves slightly, mostly because Whisper hallucinates less at lower precision, and hallucinations have a significant impact on the WER.

Comparison of processing time — Whisper large v2

The choice of precision had a pronounced effect on processing time. Half-precision delivered a remarkable 2.2-fold speedup. Intriguingly, however, 8-bit precision was even slower than single precision. For an in-depth explanation of this observation, refer to this informative article.

A crucial note regarding half-precision: the input features of the model must also be cast to a half-precision tensor, achieved as follows:

input_features = processor(audio, sampling_rate=16000.0, return_tensors="pt").input_features
input_features = input_features.to(device).to(dtype=torch.float16)
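For context, a minimal end-to-end half-precision inference sketch could look as follows, assuming audio holds a 16 kHz mono waveform (e.g. a NumPy array) and a CUDA device is available:

import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = "cuda"
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2", torch_dtype=torch.float16).to(device)

#Cast the input features to float16 to match the model
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device, dtype=torch.float16)

#Generate and decode the transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]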

In summary, opting for a lower precision can result in a notable performance boost without compromising inference quality. It’s important to acknowledge, however, that as of now, 8-bit precision primarily serves to reduce GPU memory consumption rather than accelerate the underlying model.

Faster models with BetterTransformer

BetterTransformer is a potent optimization offered by the Optimum module, designed to enhance Transformers models. By leveraging techniques such as sparsity and fused kernels, in particular flash attention, BetterTransformer achieves notable speedups on both CPU and GPU. It’s important to note, however, that not all models currently support it.

Implementing BetterTransformer is a seamless process:

from transformers import CLIPModel
from optimum.bettertransformer import BetterTransformer

#device is "cuda" or "cpu", defined beforehand
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
model = BetterTransformer.transform(model, keep_original_model=False)

To use BetterTransformer, load your model first, then transform it. Setting keep_original_model=False releases the memory initially used by the original model.

A test was conducted with the CLIP model, extracting features from a dataset of 5,000 images. This yielded the following results:

Comparison Transformers VS BetterTransformer on the CLIP model

The utilization of BetterTransformer demonstrated a marked improvement in performance:

  • 18% faster on CPU
  • 9.5% faster on GPU

Additionally, we conducted a similar test with the Whisper large v2 model, using the benchmark described above. With single precision, the WER remained unchanged while processing time was 8.6% faster. With half-precision, the WER improved slightly (39.2 instead of 39.35) and processing time was 19.9% faster.
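As a sketch, combining BetterTransformer with the half-precision Whisper model from the previous section might look like this (assuming a CUDA device):

import torch
from transformers import AutoModelForSpeechSeq2Seq
from optimum.bettertransformer import BetterTransformer

model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2", torch_dtype=torch.float16).to("cuda")
model = BetterTransformer.transform(model, keep_original_model=False)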

For a comprehensive list of models compatible with BetterTransformer, refer to the documentation.

Model compilation

For computer vision models only, an additional optimization known as model compilation can be employed. A single line of code can yield a speed boost of up to 30% during inference:

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
model = torch.compile(model)

In the case of CLIP, extracting features from the same set of 5,000 images resulted in a 14.4% increase in processing speed.

Comparison of number of images processed per second
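Note that torch.compile compiles the model lazily on the first forward pass, so any benchmark should include a warm-up call. The following is a hypothetical timing sketch, assuming images is a list of PIL images and a CUDA device is available; a dummy text prompt is included because CLIPModel’s forward pass expects both text and image inputs:

import time
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
model = torch.compile(model)

inputs = processor(text=["a photo"], images=images, return_tensors="pt", padding=True).to(device)

#The first call triggers compilation, so warm up before timing
with torch.no_grad():
    model(**inputs)

start = time.perf_counter()
with torch.no_grad():
    outputs = model(**inputs)
elapsed = time.perf_counter() - start
print(f"{len(images) / elapsed:.1f} images per second")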

Memory management

Running out of RAM when loading a model can pose a significant challenge. Transformers offers a range of techniques to address this concern.

Understanding the process

When a model is loaded, it involves several steps:

  1. The model is created with random weights.
  2. The pre-trained model is loaded.
  3. The weights of the pre-trained model are loaded into the randomly initialized model.

A key issue is that steps 1 and 2 each consume RAM proportional to the model’s size. For large models, this can exhaust available memory.

Sharded checkpoints

Models are often distributed as sharded checkpoints, with individual chunks typically reaching sizes of up to 10 GB. To mitigate RAM consumption, reducing the size of these checkpoints can be very beneficial.

In practice, when loading the pre-trained model during step 2, each shard of the checkpoint is loaded sequentially after the previous one. This caps the RAM usage to the model’s size plus the size of the largest shard. Transitioning from 10 GB shards to smaller sizes (e.g. 2 GB) can result in substantial savings of RAM memory.

Here’s how to proceed:

from transformers import Blip2ForConditionalGeneration

#Load the model
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
#Save it as shards of 2 GB max
model.save_pretrained("blip2-2GB", max_shard_size="2000MB")

Subsequently, you can load the model from your local folder with 2GB-sized shards.
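For instance, assuming the shards were saved to the local blip2-2GB folder as above:

#Reload the model from the local folder containing the 2 GB shards
model = Blip2ForConditionalGeneration.from_pretrained("blip2-2GB")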

Low CPU memory usage

Another viable approach to streamline model loading is to leverage the low_cpu_mem_usage parameter, provided by the accelerate package (ensure it's installed beforehand). With this option, step 1 creates an empty model, and the weights are then loaded progressively, shard by shard, during step 3, capping RAM usage at roughly the model’s full size.

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", low_cpu_mem_usage=True)

In conclusion, several options are available to optimize RAM memory usage when loading a model.

Combining everything together

Ultimately, the techniques showcased above can be combined. This comprehensive approach aims not only to curb RAM and GPU usage but also to attain swifter inference.
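As an illustration, one possible way to stack these techniques for the CLIP example is sketched below; this is not necessarily the exact configuration benchmarked here, and the input tensors would also need to be cast to float16:

import torch
from transformers import CLIPModel
from optimum.bettertransformer import BetterTransformer

device = "cuda"

#Half precision and memory-friendly loading, then BetterTransformer, then compilation
model = CLIPModel.from_pretrained(
    "openai/clip-vit-base-patch32",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(device)
model = BetterTransformer.transform(model, keep_original_model=False)
model = torch.compile(model)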

In our experiments with CLIP, the fully optimized version ran 1.55 times faster than its non-optimized counterpart.

Keep in mind that the impact of these optimizations varies depending on the model being utilized. For example, employing Whisper half-precision alone resulted in a notable 2.2-fold acceleration compared to the single precision version. To determine the most effective combination, it’s imperative to undertake a thorough benchmarking study using a suitably extensive dataset.

Additionally, ensuring the quality of inferences remains stable throughout the optimization process is of paramount importance.

Conclusion

In the dynamic field of AI, achieving peak performance is crucial. Precision adjustments, specialized optimizations, and memory-saving techniques provide powerful tools. However, their impact varies based on models, necessitating meticulous benchmarking. By integrating these strategies, users can strike a balance between resources and accelerated inferences, propelling AI applications to new heights.

As this field is constantly evolving, I encourage you to stay updated by exploring the articles below and adapting your code to incorporate the latest advancements.
