How to compute LLM embeddings 3X faster with model quantization

Roman Grebennikov · Published in Nixiesearch · 10 min read · Nov 13, 2023

Running LLM embedding models is slow on CPU and expensive on GPU. We will make it up to 3X faster with ONNX model quantization, see how different int8 formats affect performance on new and old hardware, and go even further by applying ONNX transformer optimization on top of the quantized model.

The vector search problem nobody talks about

To perform a semantic search with LLM embeddings, you must first compute these embeddings. With many vector search databases on the market, computing embeddings is unfortunately treated as an afterthought and an out-of-scope problem.

Embedding inference is a primary step of any semantic search system. Image by author.

We are working on a new open-source search engine, Nixiesearch, which can fine-tune embeddings to your data. As we handle embeddings on the server side, we were not surprised to see the tremendous performance impact of running embeddings on a CPU.

A flame graph of indexing process in Nixiesearch with the e5-small-v2 embedding model. Image by author.

The flame graph above shows that 95% of CPU time is spent on computing embeddings. For sure, you can make it much faster by switching the indexing to GPU, but it’s still a major bummer for people wanting to try Nixiesearch. Can we make it faster while still staying on CPU?

Model quantization

All present-day deep neural networks are just glorified combinations of matrix operations.

Attention layer matrix operations. Image from ‘Attention Is All You Need’ by Vaswani et al.

As shown in the diagram above, the attention layer from a transformer network is just a combination of trivial algebraic transformations on top of matrices. In a classical implementation, these matrices contain 32-bit float values. What if we are ready to trade a bit of precision for better performance, reducing the value size from 32 bits down to 8?

32-bit float matrix multiplication. Image by author.

This approach is called quantization: reducing the numerical precision of matrices used to store model weights and neuron activation values.

Going from 32 to 8 bits of precision reduces the RAM needed to store model weights by 4x and will hopefully make inference faster on modern CPUs.

But you cannot just fit a 32-bit float into an 8-bit integer without loss: with only 8 bits of storage per number, you can encode only 256 distinct values!

A weight distribution in trained TinyBERT. Image from KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization by J. Jin et al.

But weights and activation values inside a neural network are not random numbers! As you can see from the histogram above, they are usually close to zero and normally distributed. With this observation, we can replace all the network operations with quantization-aware ones, which also track the zero_point and scale of the underlying numerical distribution.
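
To make the idea concrete, here is a toy sketch of the affine int8 mapping (not the ONNX implementation): every float is encoded as an 8-bit integer plus a shared scale and zero_point, and decoded back with a small rounding error.

```python
import numpy as np

def quantize_uint8(x: np.ndarray):
    # Map the observed float range [min, max] onto the 256 available uint8 values.
    scale = (x.max() - x.min()) / 255.0
    zero_point = int(round(-x.min() / scale))          # the uint8 value that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

# Weights in a trained network cluster around zero, so the reconstruction error stays small.
w = np.random.normal(loc=0.0, scale=0.05, size=(768, 768)).astype(np.float32)
q, scale, zp = quantize_uint8(w)
print("max quantization error:", np.abs(w - dequantize(q, scale, zp)).max())
```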

Operator-based quantization in ONNX. Image by author.

This is called operator quantization and is considered the most straightforward approach to the problem. Another option is QDQ quantization, where you keep the same 32-bit operators but inject a quantize-dequantize pair of operators before each regular one, which is usually much slower.

Operator vs QDQ quantization. An image from ONNX documentation — Quantize ONNX Models.

In the image above, you can see extra nodes injected into the graph in the QDQ mode, which usually results in worse performance:

  • Additional Quantize-Dequantize operators are not free.
  • The primary operator runs on Float32 data as before, so you only save on RAM/VRAM usage, not performance.
  • QDQ is only supported for static quantization in ONNX runtime — see the next chapter for details.

For the sake of simplicity we will target only operator quantization in this article.

Note that model quantization differs from embedding quantization supported in all major vector search engines like Elasticsearch, Qdrant, Vespa, and Weaviate. The quantized model still emits Float32 embeddings as before — it just uses a more compact layout for weights and activations.

Dynamic vs static quantization

There are two options for computing zero_point and scale quantization parameters for each operation in the graph:

  • Dynamic: each operator re-computes these parameters at runtime for every batch. It is more resource-heavy but also more precise, since the parameters adapt to whatever drift happens between batches.
  • Static: instead of re-computing these parameters for every batch, we do an offline calibration: feed-forward the network with a tiny dataset and record the quantization parameters statically, based on the observed distributions. There is no extra runtime overhead, but as the quantization parameters are fixed, they may be imperfect if the distributions drift between batches.

The main drawback of static quantization is the need to perform calibration, which is an extra manual step. Since static quantization typically leads to worse precision, we will focus on the dynamic quantization approach in this article.
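
As a concrete illustration, dynamic operator quantization is a one-liner in the onnxruntime Python API (the onnx-convert tool described below bundles this step with the export). A minimal sketch, assuming you already have an exported model.onnx; the file names are hypothetical, and per_channel/reduce_range correspond to the per-channel and 7-bit scale parameters discussed later in the article:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",          # exported Float32 ONNX model (hypothetical path)
    model_output="model-qint8.onnx",   # quantized output
    weight_type=QuantType.QInt8,       # or QuantType.QUInt8
    per_channel=True,                  # per-channel scale/zero_point instead of per-tensor
    reduce_range=True,                 # "7-bit scale": protects against overflow on non-VNNI CPUs
)
```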

LLM inference runtime

Model files you find on the HuggingFace Hub contain only a definition of the matrix operations (an execution graph) to be executed on top of the model weights.

Execution runtime is the thing doing the actual matrix multiplication. Image by author.

It is the execution runtime that interprets and executes this graph on top of your hardware:

  • PyTorch and TensorFlow — used for model development, very Python-centric, but there are low-level bindings for other languages like C/C++. In practice, both are more optimized for training and batch processing on GPU.
  • OpenVINO — a CPU-focused runtime from Intel, Python/C++ only.
  • ONNX Runtime — an open multi-language (Java/JS/WASM/C++/Python) and multi-backend (CPU/GPU/TPU) runtime.
  • TensorRT — an ONNX-compatible runtime from Nvidia specialized only in GPU execution.

Tied to Apache Lucene, Nixiesearch is implemented as a JVM application and, like other open-source search engines, chose ONNX as the primary execution runtime for neural networks.
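
Nixiesearch calls ONNX Runtime through its Java bindings; the Python API below is equivalent and shorter to show. A rough sketch of a single forward pass through an exported E5 model; the model path is hypothetical, and input/output names may differ depending on how the model was exported:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

tokens = tokenizer(["query: what is quantization?"], padding=True, return_tensors="np")
outputs = session.run(
    None,  # return all model outputs
    {
        "input_ids": tokens["input_ids"].astype(np.int64),
        "attention_mask": tokens["attention_mask"].astype(np.int64),
        "token_type_ids": tokens["token_type_ids"].astype(np.int64),
    },
)
hidden = outputs[0]                           # [batch, seq_len, 384] for e5-small-v2
mask = tokens["attention_mask"][..., None]    # mean-pool over non-padding tokens
embedding = (hidden * mask).sum(axis=1) / mask.sum(axis=1)
print(embedding.shape)                        # (1, 384)
```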

Converting models to ONNX

To execute a model inside the ONNX runtime, you first need to convert it to the ONNX format.

ONNX model conversion flow. Image by author.

For this conversion, we developed the nixiesearch/onnx-convert tool, an extended version of the conversion script from the original xenova/transformers.js project. The converted model files are compatible with any ONNX-flavored search engine and can be used in the Elasticsearch Inference Processor, the Vespa Embedder, and directly in Nixiesearch.

nixiesearch/onnx-convert conversion tool. Image by author.
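
If you prefer to script the PyTorch-to-ONNX export step yourself rather than use the tool, Hugging Face Optimum can do it; a minimal sketch (the output directory name is hypothetical, and quantization/optimization are applied separately, as shown earlier):

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "intfloat/e5-small-v2"
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)  # PyTorch -> ONNX
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("e5-small-v2-onnx")      # writes model.onnx + config
tokenizer.save_pretrained("e5-small-v2-onnx")  # keep the tokenizer next to the model
```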

The ONNX conversion+quantization process has multiple important tunable parameters affecting both performance and quality:

  • Underlying quantization format: it can be a signed or unsigned 8-bit integer, or a 16-bit float. How does this affect performance and precision?
  • ONNX transformer optimizer: ONNX can fuse multiple typical operators into a single optimized kernel for things like attention blocks. Does it really matter, and how does the optimization level affect end-to-end inference latency?

Apart from these variable parameters, we fix other less important ones to recommended constant values:

  • ONNX opset: ONNX has multiple versions with different support for data types and operators. We chose opset=17, the latest supported by the Python onnx package.
  • Per-channel quantization: should the graph track a single scale and zero_point per tensor, or one per channel (a slice of the tensor)? Turning it off may slightly improve performance but decreases precision, so at the cost of a bit of extra memory we chose to enable it by default.
  • 7-bit scale: as extra protection against numeric overflow, should 8-bit values be shrunk to a 7-bit range? The ONNX docs state that it may improve precision on older hardware without AVX-VNNI.

We will take the E5-v2 family of embedding models, using the small, base, and large variants to see the impact of each change on models of different sizes.

QUInt8/QInt8/Float16 and inference latency

We convert e5-small-v2, e5-base-v2 and e5-large-v2 to the QUInt8, QInt8 and Float16 numerical types, without any optimization yet:

ONNX conversion tool. Image by author.

And run an embedding-benchmark suite:

  • For each model of each size, compute embedding latency using onnxruntime on the JVM (a simplified Python sketch of the measurement is shown below).
  • The suite is run on a latest-generation AWS m7i.2xlarge instance with 8 vCPUs supporting AVX-VNNI.
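
The actual suite runs through the JVM bindings; below is a rough Python equivalent of what is being measured, timing a forward pass at different input lengths. The model path and token counts are illustrative, and the input names must match your exported model:

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
session = ort.InferenceSession("model-qint8.onnx", providers=["CPUExecutionProvider"])

for target_tokens in (4, 64, 512):
    text = "passage: " + " ".join(["word"] * target_tokens)   # crude way to control input length
    tokens = tokenizer(text, truncation=True, max_length=512, return_tensors="np")
    feed = {name: tokens[name].astype(np.int64) for name in tokens}
    session.run(None, feed)                                    # warm-up run
    start = time.perf_counter()
    for _ in range(50):
        session.run(None, feed)
    latency_ms = (time.perf_counter() - start) / 50 * 1000
    print(f"{target_tokens} tokens: {latency_ms:.1f} ms")
```
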
Relative inference time improvement between Float32 and other formats. Higher is better. Image by author.

The table above reads as “for the e5-small-v2 model, the QUInt8 format with a 4-token input is 1.38X faster than Float32.” And the results we got are pretty surprising:

  • The good news is that we got more than 3X faster inference for base and large models!
  • There is little difference between QUInt8 and QInt8 on VNNI hardware.
  • The mixed-precision Float16 format was unexpectedly 2x-7x SLOWER than the baseline. This surprising result comes from the lack of FP16/BF16 support in modern CPUs: only the latest Intel Sapphire Rapids Xeon CPUs with AMX support can handle these data types natively. CPUs without AMX perform downcast-upcast conversions every time they encounter a Float16 type.

So you obviously should quantize your model, but what if your hardware has no VNNI support?

Impact of AVX-VNNI

We explicitly mention AVX-VNNI support in the CPU for a reason. While writing this article, the author was stuck being unable to replicate the model quantization performance numbers from the article “Accelerating Transformer-based Embedding Retrieval with Vespa”. The root cause was an AMD Zen2 CPU not supporting the VNNI instruction set.

VNNI is an AVX extension (Vector Neural Network Instructions) supported on Intel CPUs made since 2019 and on AMD Zen 4+:

VNNI CPU support. Image from https://en.wikichip.org/wiki/x86/avx512_vnni.

On a non-VNNI AMD Zen2 CPU, the results are quite different:

Non-VNNI CPU relative inference difference with Float32 baseline. Lower is better. Image by author.

There are two important observations:

  • For QInt8/QUInt8, the speedup of 1.2x-1.6x is less dramatic than on a VNNI CPU.
  • Unlike on a VNNI CPU, there is a difference between QInt8 and QUInt8: signed QInt8 is almost always around 15% faster than unsigned QUInt8.

So, even without VNNI support, the latency improvement of quantization is still worth it.

Optimizing the model

All the results above were obtained with models translated 1:1 from PyTorch to ONNX. But ONNX has a set of specially fused kernels for transformer LLMs, like QAttention. A fused kernel is a group of matrix operations performed together, usually in a heavily optimized way.

A chain of matrix ops replaced with a QAttention block implemented as a single BLAS GEMM method. Image by author.

So instead of performing multiple separate operations like MatMul-Scale-Mask-etc with many intermediate tensors, you can just replace them with a single QAttention block, which is technically a single BLAS GEMM function call.

ONNX has 4 optimizer levels:

  • 0: perform 1–1 translation without any optimization.
  • 1: do graph-only fusion like QAttention described above.
  • 2: as 1, but also apply fusion to Python-only blocks.
  • 99: as 2, but with GELU approximation.

Stacked relative improvement, on top of each level. Image by author.

The table above can be read as “for the QInt8 format with a 64-token input, optimization level 1 gives 33% faster inference, and level 2 adds an extra 4% on top of level 1.”

The main observations are:

  • ONNX optimization makes a difference: QInt8/QUInt8 became ~30% faster! There is still ~10% improvement, even for the non-quantized model.
  • Level 1 significantly improves latency, and level 2 almost always offers an extra 3–4% on top.
  • Level 99 slightly degrades performance by 1–2% compared to level 2.

So, you should also perform ONNX optimization as it makes your quantized model even faster with no extra cost.
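
For reference, the fusion step can also be run standalone through the onnxruntime transformers optimizer in Python. A minimal sketch, assuming a BERT-style graph like E5; the model paths are hypothetical, and num_heads/hidden_size are the values for e5-small-v2:

```python
from onnxruntime.transformers import optimizer

# Fuse attention/LayerNorm/GELU subgraphs into single optimized kernels.
optimized = optimizer.optimize_model(
    "model.onnx",        # non-optimized export (hypothetical path)
    model_type="bert",   # E5 models are BERT-style encoders
    num_heads=12,        # e5-small-v2: 12 heads, hidden size 384
    hidden_size=384,
    opt_level=1,         # the optimization level discussed above (0/1/2/99)
)
optimized.save_model_to_file("model-opt.onnx")
```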

Embedding quality on BEIR/MTEB

As the quantization process is not lossless, how does it affect embedding quality? We will use a subset of the BEIR/MTEB reference benchmark to see how much the quantized models degrade in quality.

NDCG@10 on 3 BEIR datasets. Float32 is the baseline without any quantization. Only the small model and only three datasets due to the long running time of quantized models on CPU. Image by author.

There are a couple of surprising conclusions we can make:

  • Float16 quantization shows almost no degradation on these three datasets. But considering its severe performance implications when running on CPU, it is not a good option.
  • QUint8 and QInt8 give slightly worse results, but the drop can still be considered tolerable, given how fast they are.
  • Per-channel quantization and 7-bit scaling are two essential parameters: not using them during quantization often results in a significant drop in precision.

So you should be extra careful when choosing quantization parameters: doing it wrong may result in a severe quality penalty. We advise using the onnx-convert tool to save yourself from shooting yourself in the foot.
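
If you want to run a similar check on your own datasets, the MTEB harness only needs an object with an encode() method. A rough sketch wrapping a quantized ONNX model via Optimum: the model directory name is hypothetical, task selection and API details vary between mteb versions, and the e5 query:/passage: prefixes are omitted for brevity.

```python
import numpy as np
from mteb import MTEB
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

class OnnxEncoder:
    """Minimal encode() wrapper so MTEB can drive an ONNX embedding model."""

    def __init__(self, model_dir: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
        self.model = ORTModelForFeatureExtraction.from_pretrained(model_dir)

    def encode(self, sentences, batch_size: int = 32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch = self.tokenizer(list(sentences[i:i + batch_size]), padding=True,
                                   truncation=True, max_length=512, return_tensors="pt")
            hidden = self.model(**batch).last_hidden_state      # [batch, seq, dim]
            mask = batch["attention_mask"].unsqueeze(-1)        # mean-pool over real tokens
            embeddings.append(((hidden * mask).sum(1) / mask.sum(1)).detach().numpy())
        return np.concatenate(embeddings)

MTEB(tasks=["SciFact"]).run(OnnxEncoder("e5-small-v2-qint8"), output_folder="mteb-results")
```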

Conclusion

So if you’re looking for a way to improve the inference latency of your embedding models, you should definitely consider the ONNX quantization and optimization combo. Here are the absolute before-and-after numbers we got for the e5-base-v2 model:

Absolute inference numbers in milliseconds, Float32 with optimize=0, QInt8 with optimize=2. Image by author.

With such an approach, we could go down from 50ms to 15ms for long documents on the e5-base-v2 model, which is more than a 3x improvement in latency!

Model quantization is an excellent way of improving the inference latency of embedding models:

  • A 2x-4x inference improvement is possible, which comes at the price of minor quality degradation.
  • Prefer the QInt8 format due to better performance on non-VNNI CPUs.
  • Stick to ONNX optimization levels 1 and 2: they are a reasonable choice with no extra cost.
  • Prefer per-channel quantization with 7-bit scaling: this results in the least model quality degradation.

All the E5 models tested in this article are published on https://huggingface.co/nixiesearch in three flavors: the raw ONNX non-optimized version, optimized version, and optimized+quantized version.

huggingface.co/nixiesearch namespace. Image by author.

But with the onnx-convert tool, you can convert both open-source and proprietary models to the optimized and quantized ONNX format yourself.
