CNN Model Compression via Pruning

Natthasit Wongsirikul
5 min read · Nov 27, 2021


With advances in deep neural networks (DNNs) and their applications in computer vision (CV), such as image classification, object detection, and semantic segmentation, models tend to keep getting bigger, deeper, and more complex. However, embedded systems are not getting faster, gaining memory, or becoming cheaper at the same pace. Yet many CV applications need to run on the edge and operate in real time.

Most state-of-the-art models run significantly slower on edge devices. One strategy to tackle this problem is to compress the model via pruning. But before going into detail, let's refresh some basic concepts and terms about convolutional operations in the context of convolutional neural networks (CNNs).

Basic Concepts of 2D Convolution

At its most fundamental, a convolutional pass in a DNN is just an image filter with a fixed-size kernel, very similar to a Sobel filter that extracts lines in a certain direction, except that the kernel weights in a CNN are adjustable via back-propagation.

The input to a CNN model is often a 3-channel RGB image. If the image is convolved with a single filter made up of a 3x3 kernel with 3 channels, it will yield a single feature map.

A feature map can be viewed as a block of size [wf, hf, cf]. Similarly, a filter can be viewed as a block of kernels with dimensions [wk, hk, ck].

In a 2D-convolutional forward pass, the feature map is convolved with a filter block that has a matching number of channels, which yields a single-channel feature map.

The width and height of the output feature map will usually be smaller than or equal to those of the input image, but the number of channels of the output feature map will equal the number of filters.
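For a concrete picture, here is a minimal PyTorch sketch (the layer sizes and image resolution are arbitrary) showing that the number of output channels equals the number of filters:

```python
import torch
import torch.nn as nn

# One 2D convolution over a 3-channel RGB image:
# 16 filters, each a 3x3 kernel spanning all 3 input channels.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

image = torch.randn(1, 3, 224, 224)   # [batch, channels, height, width]
feature_map = conv(image)

print(feature_map.shape)  # torch.Size([1, 16, 222, 222])
# Width and height shrink slightly (no padding), and the number of
# output channels equals the number of filters (16).
```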

CNN Compression via Pruning

There are many decisions to make when pruning a network. These are some of the questions asked when designing a pruning protocol: which element of the network to prune, how to choose which items within that element to prune, how re-training will be done, and whether pruning will be done layer-wise or across multiple layers at once. There is no consensus yet on a single approach that works best; the best pruning strategy often depends on the network's architecture, the amount of training data available, and the characteristics of the data.

What Element to Prune?

1. Pruning individual weights or neurons is already standard practice in model training, in the form of dropout and regularization. In this context, we can set individual weights within a filter that are very small to zero. However, this type of pruning creates sparse feature maps, kernels, or filters, which doesn't automatically translate into system-wide speed-up.

2. Another method is to prune the least significant channels. Channels that are similar or redundant within a filter can be eliminated without affecting the overall generalizability of the network. This also means that the corresponding feature-map channels in the current and next layers can be eliminated, reducing the computation of the network even more.

3. The third approach is to prune filters within a layer of the network. The general steps are as follows: identify which filters are the least important, prune those filters, retrain the model, and repeat. There are many metrics for determining the importance of a filter. The simplest is calculating the sum of the absolute kernel weights for each filter and then sorting them (see the sketch after this list).

4. The last approach is a combination of individual weights, channels, and filters. This approach is obviously the most flexible and can handle different scenarios; however, the added degrees of freedom also come with higher complexity, which makes the whole pruning process harder to automate.
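To make the filter-pruning idea in point 3 concrete, here is a rough PyTorch sketch. The function name, the choice of the L1 score, and the assumption that the pruned convolution is directly followed by another convolution (with no batch norm in between) are all mine for illustration:

```python
import torch
import torch.nn as nn

def prune_filters(conv: nn.Conv2d, next_conv: nn.Conv2d, n_prune: int):
    """Drop the n_prune filters of `conv` with the smallest sum of absolute
    weights, and the matching input channels of `next_conv`."""
    # Importance score per filter: sum of absolute kernel weights.
    scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep = torch.argsort(scores, descending=True)[: conv.out_channels - n_prune]
    keep, _ = torch.sort(keep)

    # New layer holding only the surviving filters.
    pruned = nn.Conv2d(conv.in_channels, len(keep), conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep].clone()
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep].clone()

    # The following layer loses the corresponding input channels.
    next_pruned = nn.Conv2d(len(keep), next_conv.out_channels,
                            next_conv.kernel_size, stride=next_conv.stride,
                            padding=next_conv.padding,
                            bias=next_conv.bias is not None)
    next_pruned.weight.data = next_conv.weight.data[:, keep].clone()
    if next_conv.bias is not None:
        next_pruned.bias.data = next_conv.bias.data.clone()
    return pruned, next_pruned
```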

How to Pick What to Prune?

Once we have picked which element of the model to prune, the next question is which items to prune. There needs to be a metric to guide this selection process. Ideally, this metric should reflect the usefulness or importance of a given weight, channel, or filter. A good baseline is random pruning, i.e., randomly selecting which weight, channel, or filter to prune. The easiest selection metric is the sum of absolute weights. To account for variance within each layer, the L2 norm can be used instead. Another is the average percentage of zeros (APoZ) in a channel or filter. A more advanced method is to use a Taylor expansion to predict the loss that would be incurred if a selected weight, channel, or filter were pruned.
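The snippet below sketches how these selection metrics might be computed in PyTorch. The function names are hypothetical, and the Taylor-expansion criterion is omitted since it also needs gradient information from a backward pass:

```python
import torch

def l1_filter_scores(conv_weight: torch.Tensor) -> torch.Tensor:
    # Sum of absolute kernel weights per filter;
    # weight shape is [num_filters, channels, kH, kW].
    return conv_weight.abs().sum(dim=(1, 2, 3))

def l2_filter_scores(conv_weight: torch.Tensor) -> torch.Tensor:
    # L2 norm per filter; less dominated by a few large weights.
    return conv_weight.pow(2).sum(dim=(1, 2, 3)).sqrt()

def apoz_scores(feature_map: torch.Tensor) -> torch.Tensor:
    # Average fraction of zeros per channel of a post-ReLU feature map
    # with shape [batch, channels, height, width]; a higher score
    # suggests a less useful filter.
    return (feature_map == 0).float().mean(dim=(0, 2, 3))
```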

Re-Training Strategy

Retraining pruned networks to regain accuracy is another important step. There are generally two approaches. The first is to prune multiple layers at once and then retrain until the desired accuracy is restored. The second is to prune a single layer and retrain until accuracy is restored, so that the weights of the following layers adapt, after which the next layer can be pruned. Iterative pruning and retraining often gives better results, but takes longer. One additional note: make the pruning and retraining cycle as automated as possible. Modern DNNs can be hundreds of layers deep, and pruning and retraining them iteratively by hand would be a nightmare.
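A skeleton of the layer-by-layer loop might look like the following; prune_least_important_filters, finetune, and evaluate are placeholders standing in for your own pruning, training, and validation code:

```python
def iterative_pruning(model, layers_to_prune, target_accuracy,
                      prune_fraction=0.1, max_finetune_epochs=10):
    """Prune one layer at a time and retrain until accuracy recovers
    (or a fine-tuning budget runs out) before touching the next layer."""
    for layer_name in layers_to_prune:
        # Placeholder: remove the least important filters in this layer.
        prune_least_important_filters(model, layer_name, prune_fraction)
        for _ in range(max_finetune_epochs):
            finetune(model, epochs=1)          # placeholder training step
            if evaluate(model) >= target_accuracy:  # placeholder validation
                break
    return model
```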

Conclusion

In the next post, I will share my experience of pruning a YOLO model with the filter-pruning technique to reduce the model size by 90% without losing much performance, and then porting the model to run on a Raspberry Pi with Intel's Movidius Neural Compute Stick using the OpenVINO toolkit.


Natthasit Wongsirikul

I'm a computer vision engineer. My interests span from UAV imaging to AI CCTV applications.