Speeding up BERT Inference: Quantization vs Sparsity

Photo by Joshua Hoehne on Unsplash

Intro

Recently, Transformers and Transformer-like architectures have taken over as the de facto state of the art in NLP. A great example is BERT. BERT and its various cousins such as RoBERTa and ALBERT produce an embedding from a sequence of text. The embedding can then be used in a variety of downstream tasks, such as classification, semantic similarity, or Q&A, achieving near human-level performance on some of them.

A big problem with BERT (and state-of-the-art NLP in general) is that this great human-level performance does not come for free. The cost typically takes the form of long latencies for your customers and a massive AWS bill each month.

Numerous efforts have tried to address this challenge. Batching queries, allowing flexible sequence lengths, and smart client/server work partitioning can go a long way. But are there ways to speed up the actual BERT inference itself? In this post I am going to assume that we are dealing with a CPU backend, which is by far the most common scenario.

Using the right library

The first step might be to switch from TensorFlow or PyTorch to a faster free runtime such as ONNX Runtime or OpenVINO. Depending on your TensorFlow/PyTorch version and your particular hardware, this step alone could deliver the largest savings of everything I talk about here. The popular Hugging Face library is continuously improving its ONNX integration, so check out the best practices there.

This also means you should be wary of commercial tools that claim to improve inference speed over TensorFlow/PyTorch but don't show ONNX Runtime or OpenVINO benchmarks! Ideally, you also want to check which ONNX Runtime/OpenVINO versions they compare against, as only the later versions include Transformer-targeted optimizations.
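To give a concrete starting point, here is a minimal sketch (not the exact setup benchmarked in this post) of exporting a Hugging Face BERT model to ONNX and running it with ONNX Runtime on CPU. The model name, output file path, and opset version are illustrative choices:

```python
# A minimal sketch: export a Hugging Face BERT model to ONNX and run it with
# ONNX Runtime on CPU. Model name, file path, and opset version are assumptions.
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", return_dict=False)
model.eval()

inputs = tokenizer("An example sentence.", return_tensors="pt")

# Export to ONNX with dynamic batch and sequence-length dimensions.
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert-base.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
    opset_version=12,
)

# Run the exported graph on the CPU execution provider.
session = ort.InferenceSession("bert-base.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"].numpy(),
        "attention_mask": inputs["attention_mask"].numpy(),
    },
)
```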

Quantization

Assuming you are now running ONNX Runtime or OpenVINO, how do you push performance further? The first thing to try might be quantization, which simply means replacing the floating-point weights in your model with int8 weights. This typically saves a lot of memory, but it might not save you much execution time!

This unfortunate fact is because, until the introduction of AVX512-VNNI, Intel (and AMD) CPUs' vector units could not operate natively on int8 data, at least not in a way that is useful for deep learning inference. The vast majority of cloud CPUs on AWS currently do not support AVX512-VNNI. The only instances that do start at c5.12xlarge, which might not offer you a lot of flexibility in terms of cost planning.
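If you are not sure whether the machine you are renting supports AVX512-VNNI, a quick check of the CPU flags will tell you (Linux only; a minimal sketch):

```python
# Minimal sketch (Linux only): check /proc/cpuinfo for the AVX512-VNNI flag.
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()

print("AVX512-VNNI supported:", "avx512_vnni" in cpu_flags)
```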

For example, when executing BERT-base on a single core of a c5.2xlarge, quantization only resulted in a 25% speedup with ONNX Runtime. Contrast this with an AVX512-VNNI core on a c5.12xlarge, where the speedup was around 250%.

A benefit of quantization is that it typically costs you less than 1% in accuracy. It is also well integrated into most deep learning frameworks, so it is easy to try: https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/dynamic_quantization_bert_tutorial.ipynb.
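As a taste of the linked tutorial, the core of dynamic quantization in PyTorch looks roughly like this (the model name is a placeholder; the tutorial covers the full fine-tuning and evaluation workflow):

```python
# A minimal sketch of dynamic quantization in PyTorch, loosely following the
# linked tutorial. Only the weights of nn.Linear layers are converted to int8;
# activations are quantized on the fly at inference time.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```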

Pruning

An alternative to quantization is pruning. Pruning introduces zeros (a.k.a. sparsity) into the weight matrices, promising both memory and compute savings. For example, a recent work by Hugging Face, pruneBERT, was able to achieve 95% sparsity on BERT while fine-tuning for downstream tasks. Another promising work, from the lottery ticket hypothesis team at MIT, shows that one can obtain 70%-sparse pre-trained BERTs that achieve performance similar to the dense model when fine-tuned on downstream tasks. TensorFlow and PyTorch both offer support for experimenting with pruning.
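As a rough illustration of what that looks like in PyTorch (the 90% sparsity level here is an arbitrary choice, not the pruneBERT recipe):

```python
# A minimal sketch of unstructured magnitude pruning with torch.nn.utils.prune.
# This zeros out the 90% smallest-magnitude weights in every linear layer;
# in practice you would prune gradually while fine-tuning to preserve accuracy.
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor
```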

However, getting a speedup from pruning is even more challenging than with quantization, since CPUs do not handle sparse computation very well. Indeed, last time I checked, PyTorch's sparse-matrix-times-dense-matrix multiplication is only faster than the dense-dense version if the sparse matrix contains more than 98% zeros! Typically, one can afford at most 90% or maybe 95% sparsity without losing too much accuracy.
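You can get a feel for this yourself with a quick micro-benchmark (the shapes and sparsity level below are illustrative; results will vary by CPU and PyTorch version):

```python
# A rough micro-benchmark sketch: dense matmul vs. PyTorch sparse-dense matmul
# at ~95% sparsity. On most CPUs the sparse version is still slower.
import time
import torch

m, k, n = 768, 768, 128
a = torch.randn(m, k)
a[torch.rand(m, k) < 0.95] = 0.0      # make ~95% of the entries zero
a_sparse = a.to_sparse()               # COO sparse tensor
b = torch.randn(k, n)

def bench(fn, iters=200):
    fn()                               # warm-up
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

print("dense  :", bench(lambda: a @ b))
print("sparse :", bench(lambda: torch.sparse.mm(a_sparse, b)))
```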

Recent solutions, such as OctoML's TVM, have started to tackle the sparse inference problem: https://medium.com/octoml/using-sparsity-in-apache-tvm-to-halve-your-cloud-bill-for-nlp-4964eb1ce4f2. Although only a comparison against TensorFlow was given, a near-2x speedup on pruneBERT seems fairly promising. Unfortunately, it only seems to work well on AMD CPUs, possibly because it is not optimized for the AVX512 instructions specific to Intel CPUs.

Neural Magic is an MIT startup that specifically accelerates sparse neural networks. While their reported performance is great, unfortunately they currently only support computer vision models.

I am going to add an advertisement here for my own library, SparseDNN, which (I believe) offers the best sparse inference performance currently available for BERT-like models: https://arxiv.org/abs/2101.07948. SparseDNN offers a 5x speedup for pruneBERT and works on both Intel and AMD CPUs. SparseDNN also offers speedups for popular computer vision networks like ResNet and MobileNet.

Of note, currently no library can take advantage of both quantization and pruning at the same time. (Please comment if you know of one.) SparseDNN offers experimental support, but its sparse int8 kernels are only marginally faster than the floating-point ones.

The Bottom Line

In this article, I covered several ways to improve the performance of neural networks, in order of increasing difficulty, using BERT as an example. How should one decide which method to employ in practice? It all depends on the accuracy-speedup tradeoff of the particular application. Intuitively, the more accuracy you are willing to sacrifice, the more you can speed up your neural net.

The accuracy-speedup tradeoffs of several of the methods mentioned in this article, as applied to BERT, are plotted above. The setup assumes a single CPU core without AVX512-VNNI. Ideally, you want to sit in the lower-right corner, with low accuracy loss and high speedup. The green line is the Pareto-optimal frontier of the optimization options.

This article is by no means an exhaustive guide to neural network optimization. For example, quantization is not limited to int8, and I did not even cover structured pruning. New hardware options such as AWS Graviton and Inferentia also offer interesting architecture-dependent tradeoffs. But hopefully this gives you some starter ideas and a mental framework for comparing different optimization methods.

GitHub repo for SparseDNN: https://github.com/marsupialtail/sparsednn
