ONNX — Optimization of Sentence Transformer (PyTorch) Models

M. Haseeb Hassan
3 min read · Jun 19, 2022


ONNX Optimization of Sentence Transformer (PyTorch) Models to Minimize Computational Time

As Machine Learning advances, models are becoming more complex, demanding more hardware and longer computation times. There is therefore a need to optimize these models for computational time across different devices. ONNX offers a bag of options for handling complex Machine Learning/Deep Learning models (PyTorch and TensorFlow). In this post, I experiment with PyTorch to ONNX conversion for Sentence Transformer models to minimize the computational time on CPU machines.

ONNX and Sentence Transformers

I recently ran into hardware requirements and computational time problems with a Sentence Transformer model. After researching, exploring and experimenting, I decided to write up my findings so that they might help someone else.

There are plenty of options for performing the conversion; a few are listed below:

  • convert script from hugging-face
  • transformers.onnx
  • HFOnnx
  • Optimum

In my experience, the HFOnnx and Optimum libraries are the easiest and most convenient to use. I worked with HFOnnx (provided by txtai), and all the results below were produced with it. After conversion, an accelerator is needed to run the model, and ONNX Runtime is a great fit for running ONNX models. The HFOnnx module is integrated with ONNX Runtime, which makes it more convenient than the other options, where the accelerator has to be handled separately.

Convert PyTorch Models with HFOnnx

First of all, we need to install txtai (which provides HFOnnx), preferably in a virtual environment, using pip. Follow these steps:

  • Create Virtual Environment
$ python -m venv <virtual_env_name>
  • Activate Environment
$ source <virtual_env_name>/bin/activate
  • Upgrade pip
$ pip install -U pip
  • Install txtai (Library offering HFOnnx)
$ pip install txtai
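
To quickly verify that the install worked, you can check that the HFOnnx pipeline imports cleanly (optional):

$ python -c "from txtai.pipeline import HFOnnx; print('HFOnnx is available')"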

Now, we can convert the model using HFOnnx. I started with one of the most popular models, BERT.

BERT

Converting BERT is straightforward. Besides the conversion itself, an extra step is needed to pre-process the input, i.e. the tokenizer. Here is the code required for the conversion and for running the converted model:

# ====================================
# Importing Modules
# ====================================
import onnxruntime
from txtai.pipeline import HFOnnx
from transformers import BertTokenizerFast
# ====================================
# Tokenizer
# ====================================
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
# Preparing Input (any example sentence works)
input_sentence = "ONNX optimization reduces inference time on CPU."
model_inputs = tokenizer(input_sentence, return_tensors="np")
# ====================================
# HFOnnx
# ====================================
# Initializing HFOnnx
onnx = HFOnnx()
# Converting Model (returns the path of the exported .onnx file)
onnx_model = onnx(
    "bert-base-cased",
    "pooling",
    "embeddings.onnx",
    quantize=True)
# ====================================
# Implementation
# ====================================
# Run the exported model with ONNX Runtime
session = onnxruntime.InferenceSession(onnx_model)
output_onnx = session.run(None, dict(model_inputs))

The model was converted and tested on a CPU (AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx). Following is the comparison between PyTorch and ONNX for bert-base-cased (Sentence Transformer):

Figure 1 — PyTorch and ONNX Computational Time Comparison for BERT (bert-base-cased)
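
For reference, here is a minimal sketch of how such a per-sentence timing comparison can be made. The model names follow the article, while the sentence and variable names are illustrative; averaging over several runs after a warm-up call gives fairer numbers than a single run:

import time
import torch
import onnxruntime
from transformers import BertModel, BertTokenizerFast

sentence = "ONNX optimization can reduce inference time on CPU."
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# PyTorch baseline
pt_model = BertModel.from_pretrained("bert-base-cased")
pt_model.eval()
pt_inputs = tokenizer(sentence, return_tensors="pt")
start = time.perf_counter()
with torch.no_grad():
    pt_model(**pt_inputs)
pt_time = time.perf_counter() - start

# ONNX Runtime, using the embeddings.onnx exported above
session = onnxruntime.InferenceSession("embeddings.onnx")
onnx_inputs = dict(tokenizer(sentence, return_tensors="np"))
# Only pass the inputs the exported graph actually expects
expected = {i.name for i in session.get_inputs()}
onnx_inputs = {k: v for k, v in onnx_inputs.items() if k in expected}
start = time.perf_counter()
session.run(None, onnx_inputs)
onnx_time = time.perf_counter() - start

print(f"PyTorch: {pt_time:.4f}s | ONNX: {onnx_time:.4f}s")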

Sentence Transformer (all-MiniLM-L6-v2)

Now, let’s try a sentence transformer model (all-MiniLM-L6-v2). This model produces embeddings for sentences. The conversion is the same as for BERT except for the model path. The modified lines are:

from transformers import AutoTokenizer
# The model path changes from "bert-base-cased" to the sentence-transformers model
onnx_model = onnx(
    "sentence-transformers/all-MiniLM-L6-v2",
    "pooling",
    "embeddings.onnx",
    quantize=True)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
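
Putting it together, here is a minimal sketch of embedding a sentence with the exported model (the sentence and variable names are illustrative):

import onnxruntime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = onnxruntime.InferenceSession("embeddings.onnx")

inputs = dict(tokenizer("ONNX Runtime keeps CPU inference fast.", return_tensors="np"))
# Only pass the inputs the exported graph actually expects
expected = {i.name for i in session.get_inputs()}
inputs = {k: v for k, v in inputs.items() if k in expected}

embedding = session.run(None, inputs)[0]
print(embedding.shape)  # one pooled sentence embedding, e.g. (1, 384) for all-MiniLM-L6-v2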

Hugging Face provides different tokenizers that can be used as required; here AutoTokenizer picks the right one for the checkpoint. The comparison results for all-MiniLM-L6-v2 follow:

Figure 2 — PyTorch and ONNX Computational Time Comparison for all-MiniLM-L6-v2

The computational time is significantly reduced with ONNX on the CPU. GPU runs may show a significant improvement as well (feel free to experiment with a GPU too).
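
Besides speed, it is worth sanity-checking that the exported model still produces (nearly) the same embeddings as the original. Below is a small sketch, assuming sentence-transformers is installed; minor numerical differences are expected because of quantization:

import numpy as np
import onnxruntime
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

sentence = "The quick brown fox jumps over the lazy dog."

# Original PyTorch sentence-transformers embedding
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
pt_embedding = st_model.encode(sentence)

# ONNX embedding, using the embeddings.onnx exported above
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
session = onnxruntime.InferenceSession("embeddings.onnx")
inputs = dict(tokenizer(sentence, return_tensors="np"))
expected = {i.name for i in session.get_inputs()}
inputs = {k: v for k, v in inputs.items() if k in expected}
onnx_embedding = session.run(None, inputs)[0][0]

# Cosine similarity should be close to 1.0
cos = np.dot(pt_embedding, onnx_embedding) / (
    np.linalg.norm(pt_embedding) * np.linalg.norm(onnx_embedding))
print(f"Cosine similarity: {cos:.4f}")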

Conclusion

HFOnnx provides a great way to convert models into the .onnx format and run them with ONNX Runtime. The ONNX models are optimized for computational time while giving the same results as PyTorch. My hypothesis is that the output of complex models with pooling layers and deep architectures may not be exactly the same because of approximations during the PyTorch to ONNX conversion. I will explore this aspect and write another blog post, so stay tuned. If you have any questions, reach out to me on LinkedIn.

