Optimize artificial intelligence BERT-based language apps

Published in Intel Tech, May 3


This open source optimization lightens transformer architecture.

Photo by Patrick Tomasso on Unsplash

Authors: Ezequiel Lanza, Imtiaz Sajwani

If speed and efficiency always matter to devs, they’re key in artificial intelligence where large models can get bogged down during the training or inferencing stages.

This happens often with natural language processing (NLP), a crucial element in every application involving human language. Transformer models are state-of-the-art for NLP because they enable machines to understand language like never before. Optimizing transformer performance is essential for ensuring that these models run quickly and efficiently.

In this post, we’ll show you a library that can optimize Bidirectional Encoder Representations from Transformers (BERT) models so they can be executed on CPUs by converting them to bfloat16. We’ll explain how these models work and why the optimized versions matter.

What weighs transformers down

Although deep learning models abound, a simple network architecture sets transformers apart. First presented in the 2017 paper “Attention Is All You Need,” they rapidly became the state-of-the-art method for NLP thanks to their capacity to handle language tasks that had previously required other methods.

Researchers at Google* first trained BERT with about 340 million parameters. Then multiple variations began to emerge, from the generative pre-trained transformers (GPT) from OpenAI*, with about 175 billion parameters for GPT-3, to the mammoth Wu Dao* with about 1.75 trillion parameters. These numbers are impressive but carry a compute burden for both training and inference.

Training a transformer model from scratch requires specialized hardware, managing a prodigious number of parameters and dealing with vast amounts of data. That makes it almost impossible for individuals or even a single company to succeed. For example, researchers used around 3.3 billion words to train BERT and 500 billion words for GPT, allowing those models to achieve an impressive understanding of language.

Fortunately, after being trained, those generic models are now available through open source APIs (such as Hugging Face*, PyTorch* or TensorFlow*). The next step is using these trained models, which still requires fine-tuning the base models for your specific topic, given that these base models only have a general understanding of language drawn from the sources they were trained on, such as Wikipedia* and Reddit*, among others. After you’ve fine-tuned the model for your use case, you still need to overcome the inference challenge.

There are two main hurdles, due to model size:

  • Memory
    Transformer models like BERT consist of a graph with many operators. Since it’s mainly composed of stacked transformer cells, there’s an intensive memory copy between numerous elementary computations. Most methods target optimizing each cell by fusing key sub-graphs of multiple elementary operators into single kernels, including Self-Attention, Layer Normalization, and Gaussian Error Linear Unit (Gelu) layers. These can significantly reduce computation cost and memory bandwidth.
  • CPU
    Models can take advantage of multiple cores through parallelization. This happens when the Q, K and V vectors of the transformer’s Self-Attention layer are partitioned based on the number of Self-Attention heads. This can boost parallelization and allow the machine to fully leverage available CPU cores.
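
To make the per-head partitioning concrete, here is a minimal sketch in plain Python. It is not the library’s code: the helper names (split_heads, head_score, per_head_scores) are hypothetical, and the thread pool only illustrates that each head is an independent work item. An optimized implementation would run this in native code across CPU cores. The dimensions are those of BERT-Large (hidden size 1024, 16 heads).

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_heads(vector, num_heads):
    """Partition one hidden-size vector into equal per-head chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]


def head_score(pair):
    """Scaled dot product of one head's query chunk and key chunk."""
    q_h, k_h = pair
    return sum(a * b for a, b in zip(q_h, k_h)) / math.sqrt(len(q_h))


def per_head_scores(query, key, num_heads):
    """Compute one attention score per head; heads do not depend on each other."""
    pairs = zip(split_heads(query, num_heads), split_heads(key, num_heads))
    # Illustration only: each head is submitted as a separate task.
    with ThreadPoolExecutor(max_workers=num_heads) as pool:
        return list(pool.map(head_score, pairs))


# BERT-Large: hidden size 1024 split across 16 heads gives 64 values per head.
query = [1.0] * 1024
key = [1.0] * 1024
scores = per_head_scores(query, key, num_heads=16)
print(len(scores), scores[0])  # 16 8.0
```

Because no head reads another head’s chunk, the work divides cleanly across however many cores are available.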

Reducing model compute requirements solves both of these problems. We’ll show you how it works in this implementation with a BERT-Large Hugging Face transformer model.

How to fix it

There are multiple ways to optimize a model; one of them is quantization. Intel provides a tool for that called Intel® Neural Compressor (INC), which you can run yourself on your models, or you can adopt what we’ve done with this implementation. As for “precision reduction,” you’ve probably heard about INT8, FP32 or bfloat16, but what does it mean when applied to a model?

Bfloat16 is a numerical data format that takes up less memory than 32-bit formats such as FP32 and is designed to provide greater efficiency in machine learning apps and the mountains of data they process. It uses 16 bits to represent a floating-point number.

From Wikipedia, Creative Commons Attribution-ShareAlike License 3.0
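
As a rough back-of-the-envelope check, the sketch below estimates how much memory the weights alone would need at each precision. The parameter count is the approximate 340 million quoted earlier for BERT; the helper name weight_bytes is ours, and the estimate ignores activations and any runtime overhead.

```python
def weight_bytes(num_parameters: int, bits_per_value: int) -> int:
    """Bytes needed to store every parameter at the given bit width."""
    return num_parameters * bits_per_value // 8


BERT_PARAMS = 340_000_000  # approximate figure quoted in this post

fp32_bytes = weight_bytes(BERT_PARAMS, 32)
bf16_bytes = weight_bytes(BERT_PARAMS, 16)

print(f"FP32 weights:     {fp32_bytes / 1e9:.2f} GB")  # 1.36 GB
print(f"bfloat16 weights: {bf16_bytes / 1e9:.2f} GB")  # 0.68 GB
```

Halving the bits per value halves the bytes that have to move between memory and the cores for every multiplication.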

In deep learning, and particularly in transformer models, computations rely heavily on thousands of vector and matrix multiplications. The data format determines how precisely each value in these vectors and matrices is represented. For instance, for the number 1.23456789, the float32 data format uses more bits and stores a value close to the original number, whereas the float16 data format uses fewer bits and rounds the value to 1.234375. These values are learned during model training; the goal is to convert them to less precise formats without significantly impacting the model’s accuracy while improving processing speed.
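
You can reproduce that rounding with nothing but the Python standard library. The sketch below is only an illustration, with helper names of our own choosing. It relies on the standard bit layouts: float32 has 1 sign, 8 exponent and 23 mantissa bits, float16 has 1, 5 and 10, and bfloat16 has 1, 8 and 7. The bfloat16 helper emulates the format by keeping the upper 16 bits of a float32 with round-to-nearest-even, and it assumes finite inputs.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_float16(x: float) -> float:
    """Round a Python float to IEEE half precision (float16)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the upper 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the lower 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 1.23456789
print(to_float32(value))   # close to the original value
print(to_float16(value))   # 1.234375
print(to_bfloat16(value))  # 1.234375

# With only 7 mantissa bits, bfloat16 is coarser than float16 near 1.0.
print(to_float16(1.001))   # 1.0009765625
print(to_bfloat16(1.001))  # 1.0
```

For this particular number both 16-bit formats land on the same value. Bfloat16 gives up mantissa bits in exchange for keeping the same 8-bit exponent as float32, so it covers the same range of magnitudes.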

The good news is that you can add additional optimizations that can take advantage of the hardware, in this case CPUs. This implementation uses Intel® oneAPI Deep Neural Network Library (oneDNN). Another advantage? It requires no extra coding thanks to working with Intel® Extension for TensorFlow*.

How it works

To optimize the BERT large (uncased) model for your app, follow the steps below.

Graphic: Rafik Saliev

The fused computation code performs the entire BERT encoder computation, including the self-attention layer. It can be applied to any self-attention layer, so any BERT model can be optimized using the same operator. The single fused operator is exposed as a TensorFlow operator, which replaces the sub-graph operators. This reduces computation and memory access.

Graphic: Bfloat16 Optimization Boosts Alibaba Cloud BERT Model Performance on 3rd Gen Intel® Xeon® Scalable Processors
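
To see why fusing operators cuts memory traffic, consider this toy sketch. It is not the library’s kernel: it fuses only a bias add with a GELU activation, in plain Python, and the function names are hypothetical. The unfused version writes an intermediate list and reads it back; the fused version produces each output in a single pass.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def bias_gelu_unfused(values, bias):
    """Two separate operators with an intermediate buffer between them."""
    biased = [v + b for v, b in zip(values, bias)]  # written out to memory
    return [gelu(v) for v in biased]                # then read back in


def bias_gelu_fused(values, bias):
    """One pass: add the bias and apply GELU before moving to the next value."""
    return [gelu(v + b) for v, b in zip(values, bias)]


values = [-1.0, 0.0, 0.5, 2.0]
bias = [0.1, 0.0, -0.5, 0.25]

# Same results, but the fused version never materializes the intermediate list.
assert bias_gelu_unfused(values, bias) == bias_gelu_fused(values, bias)
```

A real fused encoder operator applies the same idea across many more steps, so far fewer intermediate tensors travel through memory.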

Note: This optimization process will work on any model; here are further instructions.

Test it out

To try it, install the prerequisites:

  1. Build it from source:
    git clone https://github.com/intel/light-model-transformer
    cd light-model-transformer/BERT
    mkdir build
    cd build
    source /opt/intel/oneapi/setvars.sh # Make sure CMake can find oneDNN
    cmake ..
    cmake --build . -j 8

  2. Run the benchmark:



The results below can be verified on the latest generation of Intel® Xeon® Scalable Processors now available on Amazon Web Services*. Here, we tested it on Amazon R6i instances and Amazon R7i instances.

Graphic: Imtiaz Sajwani

System configuration

  • AWS instance Sapphire Rapids: r7iz.12xlarge, 48 vCPU (SPR - 143), 348 GB total memory, BIOS: Amazon EC2 1.0, microcode: 0x2a000080, Ubuntu 11.3.0-1ubuntu1~22.04, 5.4.0-1068-aws, workload, benchmark: OpenMP threads, Python 3.9.13, TensorFlow 2.9, oneDNN 2.7, batch size: 1, 2, 4, 5, precision: FP32.
  • AWS instance Ice Lake: r6i.12xlarge, 48 vCPU (Ice Lake, Intel(R) Xeon(R) Platinum 8375C), 348 GB total memory, BIOS: Amazon EC2 1.0, microcode: 0xd000363, Ubuntu 11.3.0-1ubuntu1~22.04, 5.4.0-1068-aws, workload, benchmark: OpenMP threads, Python 3.9.13, TensorFlow 2.9, oneDNN 2.7, batch size: 1, 2, 4, 5, precision: FP32.
    Tested by Intel in September, 2022.


We’ve used this optimization to get better performance for our customers with their NLP apps running transformer architectures. Enabling them to use CPU instances can help reduce the cost of buying expensive GPUs. Make sure to pin or star the light-model-transformer GitHub* repo for notifications or to contribute to the project.

The authors thank Krzysztof Piotr Chutkiewicz, Rafik Saliev and Mikolaj Zyczynski for their contributions.

About the authors

Ezequiel Lanza is an open source evangelist on Intel’s Open Ecosystem Team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools like TensorFlow* and Hugging Face*. Find him on Twitter at @eze_lanza.

Imtiaz Sajwani is a Cloud AI/ML Software Architect at Intel with expertise in embedded systems and design experience in Application-Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA) and Register Transfer Level (RTL) design.

Notices & Disclaimers:

Intel technologies may require enabled hardware, software or service activation.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.


