Model compression and optimization: Why think bigger when you can think smaller?

David Williams
Data Science at Microsoft
Sep 28, 2021


By Daniel Ferguson and David Williams

Models don’t need to take so long to run.

In recent years the push for better neural network models has resulted in larger model sizes. We’ve seen model parameter counts grow from the millions to the billions, with associated growth in deployment costs. While these models are “better” in that they beat previous benchmarks on various natural language processing (NLP) tasks, it’s worth asking: What if a better model means a smaller model that performs just as well?

Creating smaller models that perform just as well falls under the blanket term model compression. Compression has several benefits for a large class of neural network models, but its primary goal is to apply Machine Learning (ML) techniques that shrink a model’s size while maintaining its accuracy and reducing its inference time. In this article, we explore various techniques of model compression — specifically ONNX conversion, quantization, and distillation — and additionally address where using these techniques fits within the Machine Learning lifecycle.

When to consider model compression

Before blindly implementing model compression techniques, it’s a good idea to check whether they will accomplish your goals. Some key indicators that compression techniques may be of benefit include:

  • Models must run in real time
  • Initial models are large
  • Server costs for model inference are high

If any of the above factors are bottlenecks to utilizing a neural network for a task, model compression may resolve them.

First things first

To start, we must understand where within the lifecycle of Machine Learning the model compression step is to occur. Once this is identified, we can implement various model compression techniques, including ONNX conversion, quantization, and distillation.

The ML lifecycle

The classic ML lifecycle focuses on the cycle of data acquisition, model training, and model deployment. In between steps 5 and 6 as shown in Figure 1 (step 5.5 perhaps) is the opportunity to add a step for implementing model compression techniques. Let’s explore the costs and benefits of this extra step.

Figure 1: The Machine Learning lifecycle.

Costs of compression

Even when it’s beneficial, compression is not free. Costs of implementing it include:

  • Increased deployment complexity: After implementing various model compression techniques there is more to keep track of, namely the original trained model and the compressed models. We must choose the model to deploy and spend time making this choice.
  • Decreased accuracy: Some model compression techniques result in a loss of accuracy (however this is measured). This cost has an obvious counterpart in that the benefits of the model compression technique may outweigh the accuracy loss.
  • Compute cost: While model compression reduces the compute resources required for inference, the compression itself may be computationally expensive to perform. Notably, distillation introduces an additional iterative training step.
  • Your time: Adding a step to the lifecycle requires an investment of your time.

Benefits of compression

The primary benefit of compression is reduced compute cost during inference, and this reduction is the main motivator for performing it. Model compression reduces CPU/GPU time, memory usage, and disk storage. It can make a model suitable for production that would previously have been too expensive, too slow, or too large.

Our motivation and problem

Every model compression approach that we employ in our work generalizes to most deep neural net models and across GPU/CPU architectures. Within this article, however, we focus on model compression techniques for a single NLP model operating on Tesla V100 GPUs as an example system.

Problem statement

The problem that motivated us stems from information retrieval in clinical documents. Given a patient’s electronic medical record, we want to suggest possible answers to a set of clinical questions by highlighting parts of the record. This small amount of pre-markup assists the clinician in answer extraction. Our goal for this work is to answer five questions about each of 1,000 twenty-page documents every 30 minutes. Because this task can be framed as a classic “QnA” problem, we’ll work with the SQuAD dataset as an example task. We’ll choose a popular member of the BERT family, roBERTa-base-uncased-squad as hosted on HuggingFace, as our example model. This model takes as input a question/paragraph pair and outputs the section of the paragraph potentially containing the question’s answer.

The out-of-box Torch model runs on a GPU at an average speed of 0.0175 seconds per question/paragraph pair. In our motivating problem, a document consists of 20 pages and we ask five questions of each document. Supposing that there are approximately ten paragraphs per page, a single document generates 20 × 10 × 5 = 1,000 pairs, for a total inference time of 17.5 seconds per document. To reach our target of 1,000 documents every 30 minutes we would need, on average, ten GPUs, each running the same model in parallel.
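For those who like to see the arithmetic spelled out, here is the back-of-the-envelope calculation using the figures quoted above:

seconds_per_pair = 0.0175          # measured average latency per question/paragraph pair
pairs_per_document = 20 * 10 * 5   # 20 pages x ~10 paragraphs per page x 5 questions
seconds_per_document = seconds_per_pair * pairs_per_document  # ~17.5 s

documents = 1_000
window_seconds = 30 * 60           # 30-minute processing window

gpus_required = documents * seconds_per_document / window_seconds
print(f"{seconds_per_document:.1f} s per document, ~{gpus_required:.1f} GPUs required")
# -> 17.5 s per document, ~9.7 GPUs required (so roughly ten V100s in parallel)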

We can do better than this! Let’s see how few GPUs we need after applying some model compression techniques.

Techniques of model compression

This isn’t an exhaustive list of model compression techniques but a starting point with linked examples.

ONNX conversion and ONNX Runtime

ONNX is an open format that is used to represent various Machine Learning models. It works by defining a common set of operators and a common file format to enable data scientists to use models in a wide variety of frameworks. The conversion process for natural language models from (insert your favorite neural network library here) to ONNX additionally functions as a model compression technique. This is because the operators defined by ONNX have been optimized for specific types of hardware, resulting in slightly smaller models.

Figure 2: Converting a model from PyTorch to ONNX.
Figure 3: Impact of ONNX/ONNX Runtime on model size, average runtime, and accuracy.

The true utility of ONNX comes in the form of the ONNX Runtime backend. One of the highest-impact optimizations that ONNX Runtime implements is the ability to “fuse” operations and activations within a model. The result of this fusion is a significant reduction in memory footprint and in calculations per inference. For popular NLP model families, there is customized logic to identify the operations within the models that can be fused. And now for the crowning accomplishment: Converting a model takes only a few lines of code (see the sketch below).
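Here is a minimal sketch of such a conversion and of running the result through ONNX Runtime. The checkpoint name, file names, and opset version are illustrative assumptions rather than the exact values used in this work:

# A minimal sketch of PyTorch -> ONNX export for a SQuAD-style QA model,
# followed by inference through ONNX Runtime.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

checkpoint = "deepset/roberta-base-squad2"  # stand-in for the RoBERTa QA model discussed here
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
model.eval()
model.config.return_dict = False  # export plain tuples rather than a ModelOutput

# Dummy question/paragraph pair used only to trace the graph.
encoded = tokenizer("How many GPUs do we need?",
                    "We currently need ten GPUs to hit our target.",
                    return_tensors="pt")

torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "roberta-qa.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={name: {0: "batch", 1: "sequence"}
                  for name in ["input_ids", "attention_mask"]},
    opset_version=13,
)

# Inference now runs through ONNX Runtime instead of PyTorch.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "roberta-qa.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
start_logits, end_logits = session.run(
    None,
    {"input_ids": encoded["input_ids"].numpy(),
     "attention_mask": encoded["attention_mask"].numpy()},
)
# Naive span decoding, for illustration only.
answer_ids = encoded["input_ids"][0, np.argmax(start_logits): np.argmax(end_logits) + 1]
print(tokenizer.decode(answer_ids))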

The effect of this conversion is a significant speed increase for our model with no impact on our target accuracy metric. However, the reduction in model size due to ONNX is negligible.

Quantization

Our next approach, quantization, is the process of mapping values from a large set to a smaller set. Rounding and truncation are both basic examples of quantization but aren’t how quantization manifests in the realm of neural networks.

Neural nets, in most default configurations, have weights stored as 32-bit floating point numbers (fp32). Arithmetic on fp32 numbers is relatively expensive, and much inference hardware is not optimized for it.

The most common quantization process takes fp32 numbers and reduces them to 8-bit integers (int8). The result is a model a quarter of the original size that can perform inference at nearly four times the original speed. These benefits come at the cost of a loss of precision in the model’s output. Whether this loss in precision affects the target metric for the model is task and model dependent. Typically, when models have discrete outputs, such as identification of a handwritten digit, this precision loss has less effect.
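As a concrete example, ONNX Runtime ships a post-training dynamic quantization utility for exactly this fp32-to-int8 conversion. A minimal sketch, with placeholder file names:

# A minimal sketch of post-training dynamic quantization (fp32 -> int8 weights)
# with ONNX Runtime. File names are placeholders.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="roberta-qa.onnx",
    model_output="roberta-qa-int8.onnx",
    weight_type=QuantType.QInt8,  # store weights as 8-bit integers
)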

This form of quantization comes with a catch. Moving from fp32 to int8 is most beneficial for models inferencing on the CPU. But what if, as we do, you run your inference on a GPU?

In the case of models running on a GPU, int8 quantization can still be performed, though it is not widely supported. As a consequence, models on a GPU are usually not quantized! Not all hope is lost, however, because there is support within the ONNX library to convert a model from fp32 to fp16, which is a form of quantization.
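A minimal sketch of that fp32-to-fp16 conversion using ONNX Runtime’s transformer optimizer, which also applies the operator fusion described earlier. The file names are placeholders, and num_heads/hidden_size are set to match a roBERTa-base architecture:

# A minimal sketch of converting an ONNX model from fp32 to fp16 with the
# ONNX Runtime transformer optimizer. File names are placeholders.
from onnxruntime.transformers import optimizer

opt_model = optimizer.optimize_model(
    "roberta-qa.onnx",
    model_type="bert",   # roBERTa reuses the BERT fusion patterns
    num_heads=12,
    hidden_size=768,
)
opt_model.convert_float_to_float16()               # fp32 -> fp16
opt_model.save_model_to_file("roberta-qa-fp16.onnx")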

Figure 4: Impact of quantizing an ONNX model (fp32 to fp16) on model size, average runtime, and accuracy.

Representing models with fp16 numbers has the effect of halving the model’s size while (approximately) doubling the inferencing speed. This is similarly at the cost of a loss of precision which, as discussed earlier, may or may not affect the model’s metric for accuracy. As an added benefit, most GPUs are optimized to operate with fp16 numbers very efficiently.

Distillation

Distillation is one of the most powerful approaches when it comes to model compression. Implementing a state-of-the-art distillation process can cut your model size down by a factor of seven, increase inference speeds by a factor of ten, and have almost no effect on the model’s accuracy metric (see tinyBERT).

Here is more good news: Distillation is still fairly young! There are likely many improvements to come. Now for some bad news: Distillation is still fairly young! This means that the process is not yet widely implemented in standard libraries. Research code does exist that can be used to distill various model architectures (such as BERT, GPT2, and BART), though to implement distillation on a custom model it is necessary to understand the full process.

The student-teacher paradigm

The idea of having a teacher to help a student learn is perhaps the most popular knowledge transfer paradigm that exists. It works under a simple principle: “As a student it is easier to learn a new subject given some guidance from a knowledgeable teacher.” This is how nearly all knowledge is imparted to students in any education system. Notably, however, this is not the only way to learn a new subject. Within scientific research, for example, new information is learned through research every day, even though there is no “teacher” to impart the knowledge.

To set up this paradigm for neural networks, we must identify a Teacher model and a Student model. For our case the Teacher model will be the roBERTa model we have been using. The Student model will be identical to the Teacher model but with some hidden layers removed. In our case, we will have a Teacher with 12 hidden layers and a Student with six hidden layers. Now we must connect the Student and Teacher models to train the one from the other. To do this we must first understand what information is captured by the Teacher that is not quantified by the dataset.
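One common way to construct such a Student, sketched below under the assumption of a 12-layer roBERTa-base-style Teacher (the checkpoint name is a stand-in), is to copy the Teacher’s embeddings and every other encoder layer as the Student’s starting point:

# A minimal sketch of building a 6-layer Student from a 12-layer Teacher by
# copying every other encoder layer. The checkpoint name is illustrative.
from transformers import AutoConfig, AutoModelForQuestionAnswering

checkpoint = "deepset/roberta-base-squad2"
teacher = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

student_config = AutoConfig.from_pretrained(checkpoint, num_hidden_layers=6)
student = AutoModelForQuestionAnswering.from_config(student_config)

# Initialize the Student from the Teacher: embeddings plus layers 0, 2, 4, 6, 8, 10.
# Remaining modules keep their fresh initialization and are learned during distillation.
student.roberta.embeddings.load_state_dict(teacher.roberta.embeddings.state_dict())
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.roberta.encoder.layer[student_idx].load_state_dict(
        teacher.roberta.encoder.layer[teacher_idx].state_dict()
    )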

In the context of NLP models, distillation is typically done before fine tuning. This means we must impart the knowledge that the Teacher model gathers during the pretraining phase, in which the roBERTa model is trained on the masked language modeling task. Let’s examine the output from the roBERTa model for this task. Consider the example input “There are 104,185 square [MASK] in Colorado.” The Teacher’s job is to figure out which word from its vocabulary best fills in the blank. It assigns a probability to each word, and the word with the maximum probability is the Teacher’s answer to the problem. Below is what this looks like graphically.
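You can inspect these Teacher probabilities yourself with a fill-mask pipeline. Note that roBERTa tokenizers expect “<mask>” rather than “[MASK]”, and the pretrained checkpoint below is illustrative:

# A quick way to inspect the Teacher's probabilities for the masked word.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")
for candidate in fill_mask("There are 104,185 square <mask> in Colorado."):
    print(f'{candidate["token_str"]:>10}  {candidate["score"]:.3f}')
# Expect "miles" to dominate, with units such as "feet" and "meters"
# taking most of the remaining probability mass.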

Figure 5 shows a basic plot of the output from the roBERTa model for the masked language modeling task. Displayed here are the probabilities that the Teacher provides for each word in its vocabulary. What is notable is that in addition to getting the correct answer, the Teacher also knows that the words feet, meters, and inches make sense, while all the other words in the vocabulary have probabilities that are nearly zero. This information is not captured by the dataset! The dataset specifies only the correct answer but not which other answers are good. The knowledge of the good answers determined by the Teacher is what we would like to impart to the Student model.

Figure 5: Knowledge captured by the Teacher model during pretraining.

The way the knowledge of the good answers is transferred to the Student is through the loss function. Essentially, we want to train the Student so that it mimics the same distribution that the Teacher provides. To do this, we must also understand what the Student outputs are before it is even trained.

Figure 6 shows the probabilities that the Student assigns to each word before training. The task now is to penalize the difference between the red curve of the Student and the blue curve of the Teacher. This is accomplished via a measure of how “different” the two distributions are from one another.

Figure 6: Knowledge captured by the Student prior to any training.

This measure is the Kullback-Leibler, or KL, divergence, which informally quantifies how much information is lost when the red curve is used to approximate the blue curve. The result is a loss function with a term measuring the KL divergence between the Student distribution and the Teacher distribution.
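In code, such a loss often looks like the following minimal sketch; the temperature T and mixing weight alpha are illustrative hyperparameters, not values from this work:

# A minimal sketch of a distillation loss: a KL term that pulls the Student's
# distribution toward the Teacher's, plus the ordinary loss on the hard labels.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # KL divergence between temperature-softened Student and Teacher distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the dataset's hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss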

The specifics here can get messy, and there are many hyperparameters to adjust when actually performing the training, but the key ideas are given as:

  • Train a Teacher model
  • Identify a Student model to train
  • Train the Student with a KL term in the loss function

This is the most basic form of distillation, but the technique can be expanded upon in many ways. For example, tinyBERT reports improved performance over other distilled models by imparting the Teacher’s knowledge both before and after fine tuning.

Figure 7: Impact of distillation on model size, runtime, and accuracy.

The effect of distillation is primarily a function of the Student’s chosen architecture. Cutting more hidden layers from the Teacher model to form the Student results in even smaller model sizes and faster run times, but the accuracy on the downstream task might see more of an impact. These tradeoffs must be balanced on a case-by-case basis. As a rule of thumb, half as many layers implies twice as fast a model (with a small impact on accuracy), as seen in Figure 7.

Net benefit

While each of these model compression techniques can be implemented independently, they complement each other when used together. Below is the total impact of performing distillation, ONNX conversion, and quantization.

Size

Both distillation and quantization cut our model size approximately in half. The effect on model size from distillation is determined by the choice of the number of layers to omit when selecting the Student model, so this will have a high degree of variance depending on the specific implementation. Quantization, on the other hand, consistently reduces model size by a factor of two and, as can be seen here, combining distillation with quantization results in a model that is a quarter of the original size.

Figure 8: Impact on model size when performing distillation, ONNX conversion, and quantization on an initial PyTorch model.

Runtime

The most notable impact is on average inference speed after each model compression technique is applied. Distillation and quantization each significantly improve the average inference speed of our model, and the conversion to ONNX increases it further. The net speedup is approximately 8.5×: the compressed model processes roughly 8.5 times as many question/paragraph pairs per second as the original.

Figure 9: Impact on runtime when performing distillation, ONNX conversion, and quantization on an initial PyTorch model.

Accuracy

The only drop in accuracy comes from distilling the model. This drop is sensitive to the specific techniques used to train the Student model and to the task on which the distilled model is fine tuned. In many cases distillation produces less than a one-point change in the accuracy metric (distilBERT retains 95 percent of accuracy), though here we see an eight-point drop in ours.

Figure 10: Impact on accuracy when performing distillation, ONNX conversion, and quantization on an initial PyTorch model.

Conclusions

So how does this relate to our goal of processing 1,000 documents every 30 minutes? First, we brought our average runtime per document down from 17.5 seconds to 2.05 seconds. That means that to process 1,000 documents every 30 minutes we now need only two Tesla V100 GPUs operating in parallel rather than ten. The corresponding operating cost drops from an hourly rate of $30.60 to $6.12. In terms of yearly expenses, this means a reduction from $61,200 per year to $12,240 per year, a net savings of $48,960 per year, and as document volume grows, so too do the savings. Adding a model compression step into our training, testing, and model deployment pipeline has a desirable cost-to-benefit ratio in this case and, perhaps, in one that you’ll encounter.
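For the curious, here is the arithmetic behind those numbers; the per-GPU rate is inferred from the hourly totals quoted above rather than from a price sheet:

# Back-of-the-envelope check of the GPU counts and hourly costs quoted above.
import math

window_seconds = 30 * 60          # 30-minute processing window
documents = 1_000
rate_per_gpu_hour = 30.60 / 10    # inferred from the quoted $30.60/hour for ten V100s

for label, seconds_per_document in [("original", 17.5), ("compressed", 2.05)]:
    gpus = math.ceil(documents * seconds_per_document / window_seconds)
    print(f"{label}: {gpus} GPUs, ${gpus * rate_per_gpu_hour:.2f}/hour")
# -> original: 10 GPUs, $30.60/hour
# -> compressed: 2 GPUs, $6.12/hour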
