Run BERT ONNX model in FP16

David Cochard
axinc-ai
Dec 11, 2023

When exporting BERT to ONNX, there are cases where inference fails in FP16. This article explains how to investigate the cause of such failures and how to fix the model so that FP16 inference produces correct results.

Overview

BERT is a model for natural language processing (NLP). By applying transfer learning to the original model, which was pre-trained on a large corpus, a wide variety of natural language tasks can be handled with small amounts of data.

Convert BERT to ONNX

BERT can be converted to ONNX by using convert_graph_to_onnx.py included in Hugging Face’s Transformers repository.

python3 convert_graph_to_onnx.py --framework pt --model cl-tohoku/bert-base-japanese-whole-word-masking ./work/bert-base-japanese-whole-word-masking.onnx --pipeline fill-mask
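As a quick sanity check of the export, the graph can be loaded and validated with the onnx package. This is a minimal sketch, assuming the output path from the command above:

import onnx

# Load the exported graph and run ONNX's structural checker.
model = onnx.load("./work/bert-base-japanese-whole-word-masking.onnx")
onnx.checker.check_model(model)

# List the expected inputs (typically input_ids, attention_mask,
# token_type_ids for a BERT fill-mask export).
for inp in model.graph.input:
    print(inp.name)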

Conversion Issues

Converting BERT to ONNX with Transformers 4.29.2 results in a model that runs correctly in FP32 but produces incorrect output in FP16.
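For reference, one common way to produce an FP16 copy of the exported graph is the float16 converter in onnxconverter-common. This is a sketch under that assumption; any FP16 conversion path runs into the same issue described below.

import onnx
from onnxconverter_common import float16

# Convert FP32 initializers and operations to FP16. keep_io_types
# leaves the model inputs/outputs in FP32 so calling code is unchanged.
model = onnx.load("./work/bert-base-japanese-whole-word-masking.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "./work/bert-base-japanese-whole-word-masking.fp16.onnx")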

FP32 inference results:

INFO bert_maskedlm.py (82) : Input text : I [MASK] money to live.
INFO bert_maskedlm.py (85) : Tokenized text : ['I', '[MASK]', 'money', 'to', 'live', '.']
INFO bert_maskedlm.py (97) : Indexed tokens : [1007, 4, 87, 352, 17, 167, 66, 10]
INFO bert_maskedlm.py (112) : Predicting...
INFO bert_maskedlm.py (126) : Predictions :
INFO bert_maskedlm.py (130) : 0 need
INFO bert_maskedlm.py (130) : 1 use
INFO bert_maskedlm.py (130) : 2 make
INFO bert_maskedlm.py (130) : 3 have
INFO bert_maskedlm.py (130) : 4 earn

FP16 inference results:

INFO bert_maskedlm.py (82) : Input text : I [MASK] money to live.
INFO bert_maskedlm.py (85) : Tokenized text : ['I', '[MASK]', 'money', 'to', 'live', '.']
INFO bert_maskedlm.py (97) : Indexed tokens : [1007, 4, 87, 352, 17, 167, 66, 10]
INFO bert_maskedlm.py (112) : Predicting...
INFO bert_maskedlm.py (126) : Predictions :
INFO bert_maskedlm.py (130) : 0 [CLS]
INFO bert_maskedlm.py (130) : 1 [MASK]
INFO bert_maskedlm.py (130) : 2 [PAD]
INFO bert_maskedlm.py (130) : 3 [UNK]
INFO bert_maskedlm.py (130) : 4 [SEP]

Identifying the cause

ailia SDK allows you to dump all tensor values, not just those used as model input or output. To investigate this kind of issue, print all blob values to see at which layer the output becomes incorrect. Since get_blob_data raises an error for tensors that were removed by graph optimization, those blobs are skipped by catching the exception.

# Dump every blob (intermediate tensor) held by the runtime.
for i in range(ailia_model.get_blob_count()):
    try:
        data = ailia_model.get_blob_data(i)
        print("Idx", i, ailia_model.get_blob_name(i), data)
    except Exception:
        # Blob removed by graph optimization; skip it.
        continue

This reveals that starting at index 7 (blob 212), the values turn into NaN (Not a Number).

Idx.6 211 -3.4028235e+38
Idx.7 212 [[[[nan nan nan nan nan nan]]]]

Checking the graph with Netron shows that the B (bias) input of the Mul operator applied to the attention_mask is -3.40e+38, a value that cannot be represented in FP16, which results in the NaN output.

(Netron view of blob 212)
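The failure mode can be reproduced in isolation with numpy (illustrative only, not the actual graph): the FP32 minimum overflows to -inf when cast to FP16, and the mask Mul then computes 0 * -inf for every unmasked position, which is NaN under IEEE 754 rules.

import numpy as np

# The FP32 minimum overflows to -inf when cast to FP16 ...
mask_value = np.float16(np.float32(-3.4028235e38))
print(mask_value)                             # -inf

# ... so (1.0 - attention_mask) * mask_value multiplies 0 by -inf
# at every unmasked position, producing NaN.
attention_mask = np.float16([1.0, 1.0, 0.0])
print((1.0 - attention_mask) * mask_value)    # [nan nan -inf]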

Searching the Transformers issues on GitHub reveals that in a PR dated January 20, 2022, the value previously hard-coded as -10000 (-1e4) was changed to the minimum representable value of the compute dtype, FP16 or FP32 (torch.finfo(dtype).min).

This change appears to have been made to improve fine-tuning accuracy in FP16.

The original value of -10000 seems to originate from Google's original implementation of BERT, as a maintainer notes:

IMO, the -10e4 comes from the original Google implementation of BERT and we just copied it everywhere. However I've now seen a couple of issues related to this.
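Since the export above runs the model in FP32, torch.finfo(dtype).min resolves to the FP32 minimum, which is exactly the -3.40e+38 constant baked into the graph. A quick check confirms the two values:

import torch

# The minimum representable value of each compute dtype.
print(torch.finfo(torch.float32).min)   # -3.4028234663852886e+38
print(torch.finfo(torch.float16).min)   # -65504.0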

Fix the issue

Having identified the cause, we can modify modeling_utils.py so that the constant fed to the Mul operator fits within the FP16 range.

extended_attention_mask = (1.0 - extended_attention_mask) * -1e4  # was: torch.finfo(dtype).min

After making this change and re-exporting, the model now produces correct inference results in FP16 as well.
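The value -1e4 fits comfortably within FP16's finite range (whose largest magnitude is 65504), yet still acts as minus infinity as far as softmax is concerned, so masked positions receive essentially zero attention weight. A quick sanity check:

import numpy as np

# -1e4 is representable in FP16 (max finite magnitude ~65504) ...
print(np.float16(-1e4))                  # -10000.0

# ... while still zeroing out masked positions after softmax:
scores = np.float32([2.0, -1e4, 1.0])    # middle position is masked
e = np.exp(scores - scores.max())
print(e / e.sum())                       # [0.731 0.    0.269]

The trade-off is the one the PR was addressing: torch.finfo(dtype).min gives better fine-tuning behavior in FP16, while -1e4 keeps the exported graph runnable in FP16.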

ax Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ax Inc. provides a wide range of services from consulting and model creation, to the development of AI-based applications and SDKs. Feel free to contact us for any inquiry.
