Gate Decorator: Global Filter Pruning

Aakash Nain
Nov 7, 2019 · 9 min read


In recent years, we have witnessed the remarkable achievements of CNNs. Iterative improvements on a task typically require bigger models and more computation. However, the huge memory footprints and compute requirements of these models prevent their deployment on mobiles and edge devices. This paper aims to address that problem.

Why is deployment on mobiles and edge devices hard?

Mobiles and edge devices are resource-constrained. The constraints mainly come from three aspects:

  • Model size
  • Run-time memory
  • Number of computing operations (FLOPs)

To get an idea of “why”, let’s take VGG-16 as an example. The model has 138 million parameters, takes up more than 500 MB of storage, and requires more than 16 billion FLOPs plus an extra 93 MB of run-time memory to run inference on a single image. Let’s take another example, this time from image segmentation: DeepLabv3 with a MobileNet backbone takes almost a second on a very powerful CPU for single-image inference.

These scenarios show that deploying SOTA models and running real-time inference on mobiles and edge devices is very hard. This is why network compression and acceleration methods have attracted so much research interest.

Compression and acceleration can be divided into four categories:

  1. Quantization: This is a very popular method for network compression. TensorFlow has been using quantization techniques for a long time, making it possible to deploy models on mobiles and edge devices. Quantization involves converting weight values from fp32 to int8, or in some cases restricting them to binary {-1, 1} or ternary {-1, 0, 1} values. Although this method works, performance almost always degrades.
  2. Fast Convolution: Convolution layers are very expensive in terms of computation. To reduce this complexity, people have come up with new ideas like factorized convolutions, e.g. OctConv. OctConv works well; the only downside is that you have to retrain pre-trained models with it if you plan to use them in your workflow.
  3. Low-rank approximation: Low-rank decomposition methods approximate network weights with multiple lower-rank matrices using techniques like SVD. This method works especially well on fully-connected layers, yielding ∼3x model-size compression but no notable speedup, since the computing operations in a CNN mainly come from the convolutional layers.
  4. Filter pruning: This is by far the most popular technique for model compression. It has received a lot of attention in recent years because it doesn’t require any changes to the model design philosophy, can be applied to different types of CNNs, and requires no specialized hardware or software for acceleration.

This paper extends the idea of filter pruning to global filter pruning. We will dive into the details shortly but let’s talk a little bit about pruning first.

What really is pruning?

Neural pruning was first introduced in Optimal Brain Damage, in which LeCun et al. found that some neurons could be deleted without noticeable accuracy loss. When it comes to CNNs, the same idea is applied at the filter level, hence the name Filter Pruning.

Filter pruning can be divided into two categories:

  1. Layer-by-layer pruning: Remove filters from one layer at a time until certain conditions are met; once they are satisfied, the feature reconstruction error of the next layer is minimized before moving on. Though it looks straightforward, there are two problems with this approach. First, if your model has many layers, the process is very time-consuming. Second, it requires a pre-defined pruning ratio for each layer, which limits its usefulness in neural architecture search.
  2. Global pruning: As the name suggests, this technique prunes unwanted filters globally, irrespective of which layer they belong to. The advantage is that you no longer have to set a pre-defined pruning ratio for each layer separately. Instead, the problem becomes one of Global Filter Importance Ranking (GFIR).

Extending this idea of global pruning, the authors proposed a novel global filter pruning method which consists of two components:

  1. Gate Decorator: An algorithm to solve the GFIR problem.
  2. Tick-Tock: A pruning framework to boost pruning accuracy.

We will discuss both of them in detail. We will start with the problem definition and the role of the Gate Decorator in it. Afterward, we will look at how Tick-Tock comes into the picture.

Problem definition

Let X be the input data, Y the corresponding labels, and θ the parameters of the model. K is the set of all filters in the network. The goal of pruning is to choose a subset of filters k ⊆ K and remove the corresponding parameters. The optimal k is chosen by solving the following optimization problem:
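Written out (my reconstruction from the definitions above; the paper’s notation may differ slightly), the objective is:

$$\min_{k \subseteq K}\; \big|\, \mathcal{L}(X, Y;\, \theta) - \mathcal{L}(X, Y;\, \theta \setminus \theta_k)\,\big|$$

where θ∖θₖ denotes the parameters that remain after removing the filters in k, subject to whatever filter or FLOPs budget you are pruning towards.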

In other words, we want to minimize the increase in loss caused by removing these filters from the original network.

Why do you write these stupid equations and make a simple task difficult? The above task seems simple to me. Check every possible value of k and choose the best one, the one that has the least effect on the loss. 🤦‍♂

Although you could do that, you would need to calculate the loss difference ||K||₀ times to do just one iteration of pruning. This becomes infeasible when your network has tens of thousands of filters. This is why the authors proposed the Gate Decorator, which calculates the importance of filters efficiently.

Gate Decorator

The authors introduced a trainable scaling factor ϕ ∈ ℝ. Assuming that the feature map z is the output of filter k, the output is now transformed as z’ = ϕz. If ϕ is zero, the filter contributes nothing to the network, so setting ϕ = 0 is equivalent to pruning that filter. The loss difference (ΔL) can be approximated using a Taylor expansion. Denoting X, Y, and all parameters other than ϕ by ω, we can rewrite the above equation as:
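Spelling this out (my reconstruction; the paper may group the terms slightly differently), expanding the loss around the current value of ϕᵢ and setting ϕᵢ to zero gives:

$$\mathcal{L}(\omega, \phi_i = 0) \;=\; \mathcal{L}(\omega, \phi_i) \;-\; \frac{\partial \mathcal{L}}{\partial \phi_i}\,\phi_i \;+\; R_1(\phi_i)$$

so the loss difference becomes:

$$\Delta\mathcal{L}(\phi_i) \;=\; \big|\,\mathcal{L}(\omega, \phi_i) - \mathcal{L}(\omega, \phi_i = 0)\,\big| \;=\; \Big|\frac{\partial \mathcal{L}}{\partial \phi_i}\,\phi_i - R_1(\phi_i)\Big|$$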

R₁ is the Lagrange remainder, which we can ignore since computing it would require a massive amount of calculation. The Global Filter Importance Ranking (GFIR) problem can now be solved using this equation:
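The score is simply the first-order term that is left over (my reconstruction; in practice it is accumulated over mini-batches of training data):

$$\Theta(\phi_i) \;=\; \Big|\frac{\partial \mathcal{L}}{\partial \phi_i}\,\phi_i\Big|$$

and all filters in the network are ranked globally by this score.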

The term Θ(ϕᵢ) is the importance score for filter kᵢ ∈ K. If the network contains Batch Normalization (BN) layers, the Gate Decorator is applied directly to BN, converting it into Gated Batch Normalization (GBN). If there is no BN layer, the Gate Decorator can be applied directly to the convolution, converting it into a Gated Convolution.

Woah woah! Hold on for a second! Transforming the output of a convolution makes sense, but why apply the gate to BN instead of the convolution feature map?

That’s a great question. BN already includes a scaling parameter γ, so why not leverage it for our purpose? Also, if we added a separate scaling layer, its effect could simply be cancelled out by the BN layer. So, how do we turn a BN layer into a GBN layer for calculating the GFIR? Like this:

Because γ can provide clues about filter importance, we make the following changes to convert a BN layer to GBN.
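Concretely (my reconstruction, consistent with the merging step described next), a BN layer computes

$$\hat{z} = \frac{z_{in} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad z_{out} = \gamma\,\hat{z} + \beta$$

and the GBN version simply scales that output with the gate:

$$z_{out} = \phi \odot (\gamma\,\hat{z} + \beta)$$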

During the pruning procedure, γ is fixed and non-trainable. When the pruning is finished, GBN is converted back to BN by merging ϕ into γ and β.
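To make this concrete, here is a minimal PyTorch sketch of what a GBN layer could look like. The class and method names are mine and this is not the authors’ code; the score accumulation just follows the Θ(ϕ) = |∂L/∂ϕ · ϕ| estimate from above.

```python
import torch
import torch.nn as nn

class GatedBatchNorm2d(nn.Module):
    """Minimal sketch of a GBN layer: a standard (affine) BN whose output is
    scaled by a per-filter gate phi. Illustrative only, not the paper's code."""

    def __init__(self, bn: nn.BatchNorm2d):
        super().__init__()
        self.bn = bn                                          # existing BN layer (holds gamma, beta)
        self.bn.weight.requires_grad_(False)                  # gamma is kept fixed during pruning
        self.phi = nn.Parameter(torch.ones(bn.num_features))  # one gate per filter/channel
        self.register_buffer("score", torch.zeros(bn.num_features))  # accumulated Theta

    def forward(self, x):
        # z_out = phi * (gamma * z_hat + beta)
        return self.bn(x) * self.phi.view(1, -1, 1, 1)

    def accumulate_score(self):
        # Call right after loss.backward(): Theta(phi) = |dL/dphi * phi|.
        if self.phi.grad is not None:
            self.score += (self.phi.grad.detach() * self.phi.detach()).abs()

    def to_bn(self) -> nn.BatchNorm2d:
        # Once pruning is done, fold the gate back: gamma' = phi * gamma, beta' = phi * beta.
        with torch.no_grad():
            self.bn.weight.mul_(self.phi)
            self.bn.bias.mul_(self.phi)
        return self.bn
```

During pruning you would rank all gates in the network by their accumulated `score` and remove the filters with the smallest values.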

Hmm. What about networks that don’t have BN layers? Do we need to make some extra changes for them too?

Pure convolution is just a linear transformation. If you simply added another scaling layer on top of it, the scaling values wouldn’t be meaningful for pruning: one can obtain exactly the same output by decreasing the scaling factor values while amplifying the weight values in the convolution layer.

The convolution operation can be denoted as Y = X ∗ W, where X is the input feature map, W the filter weights, and ∗ the convolution operator.

To turn a convolution layer into a Gated Convolution, the authors propose the following changes:

The convolution can now be written as: Y = ϕ(X ∗ W)
When pruning is finished, ϕ is merged into the weights as: W′ = ϕW
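For completeness, here is a tiny PyTorch sketch of that merge step (function and variable names are mine, for illustration only):

```python
import torch
import torch.nn as nn

def merge_gate_into_conv(conv: nn.Conv2d, phi: torch.Tensor) -> nn.Conv2d:
    """Fold a per-filter gate phi (shape [out_channels]) into the conv weights,
    i.e. W' = phi * W, so the explicit gate can be dropped. Illustrative sketch."""
    with torch.no_grad():
        conv.weight.mul_(phi.view(-1, 1, 1, 1))  # scale each output filter by its gate
        if conv.bias is not None:
            conv.bias.mul_(phi)  # if the gate scales the whole output, any bias scales too
    return conv
```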

We have now formulated the GFIR problem. To solve the optimization, we need one last piece: a pruning framework.

Tick-Tock Pruning Framework

The authors introduced Tick-Tock, an iterative pruning framework to improve pruning accuracy. Tick and Tock are two different phases.

The Tick step is designed to achieve the following goals:

  1. Speed up the pruning process.
  2. Calculate the importance score Θ for each filter.
  3. Fix the internal covariate shift problem caused by previous pruning.

During the Tick phase, the model is trained on a small subset of the training data for one epoch. Because it is trained on a subset, only the gate ϕ and the final linear layer are made trainable in order to avoid overfitting. Once the training is finished, the filters are sorted by their importance score Θ and the least important filters are removed. This process can be repeated T times for different subsets.

The Tock phase is then used to fine-tune the network to reduce the accumulation of errors caused by removing the filters. In addition to the original loss function L, we need to add a sparsity constraint. This constraint will help in estimating Θ, the importance score for the filters, more accurately.

Loss function for the Tock phase
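The exact loss isn’t reproduced here, but it is essentially the original task loss plus a sparsity penalty on the gates; my reading is an ℓ1 term of the form:

$$\mathcal{L}_{tock} \;=\; \mathcal{L}(X, Y;\, \theta) \;+\; \lambda \sum_{\phi \in \Phi} |\phi|$$

where Φ is the set of all gates and λ controls how strongly the gates are pushed towards zero.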

Once the Tock phase is complete, the pruned network is fine-tuned again.

Again? Why are you fine-tuning it twice? You’re burning my GPUs like this!

The final fine-tuning is different from the Tock phase in two ways:

  1. No sparsity constraint is added in the final fine-tuning.
  2. It is trained for many more epochs than the Tock phase.
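Putting the whole schedule together, here is a rough Python sketch of the control flow. Every callable here (sample_subset, train_gates_only, and so on) is a placeholder I made up to keep the sketch short; only the loop structure reflects the framework described above.

```python
from typing import Callable

def tick_tock_prune(
    model,
    full_data,                          # full training set
    sample_subset: Callable,            # returns a small, freshly sampled subset of the training data
    train_gates_only: Callable,         # Tick: 1 epoch where only the gates phi + final linear layer train
    remove_least_important: Callable,   # rank ALL filters globally by Theta, drop the bottom ones
    finetune_with_sparsity: Callable,   # Tock: fine-tune with the sparsity penalty on the gates
    finetune: Callable,                 # final fine-tune: no sparsity, many more epochs
    num_ticks: int = 10,                # T in the text; the value here is arbitrary
    rounds: int = 3,                    # Tick-Tock rounds before the final fine-tune; also arbitrary
):
    """Rough sketch of the Tick-Tock schedule; only the control flow is the point."""
    for _ in range(rounds):
        for _ in range(num_ticks):                 # Tick, repeated T times on different subsets
            train_gates_only(model, sample_subset())
            remove_least_important(model)
        finetune_with_sparsity(model, full_data)   # Tock
    finetune(model, full_data)                     # final fine-tune
    return model
```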

Aah. Makes sense. Let’s prune our ResNet101 then. Once done, let’s go and sell the pruned version, but we won’t tell anyone that it’s the same model. We need to make money! Wait… what’s the problem now? 🙄

Remember the skip connections in ResNets? A skip connection applies element-wise addition to the feature maps produced by the residual blocks. If we prune the filters of each block independently, the feature maps in the shortcut/skip connection may become misaligned.

Nice catch. So, what solution do the authors propose for this?

Group Pruning for the Constrained Pruning Problem

To solve the misalignment problem, the authors proposed Group Pruning. In group pruning, we assign the GBNs connected by a pure shortcut connection (a shortcut without a convolution layer) to the same group. Check the figure below to get a better picture of what this actually means.

What does Grouping ensure? And how is the importance score calculated for the groups in this case?

Grouping ensures that the pruning pattern for all layers in a group is the same, hence no misalignment. The importance score for a group is calculated by summing up the scores for its members as shown below:
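Written out (my reconstruction of the formula), the score of the jth filter position in a group 𝒢 is:

$$\Theta_{\mathcal{G}}(j) \;=\; \sum_{\phi \in \mathcal{G}} \Theta(\phi_j)$$

i.e. the jth gates of all member GBNs in the group are scored together, and therefore pruned together.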

Θ(ϕⱼ) is the importance score of the jth filter of each member in group G. And that’s it! We have all the pieces we need to prune our CNNs.

Experimental Results

To check how effective this technique is, let us take a look at some of the experiments the authors carried out.

Experiment 1
Experiment 2

Conclusion

This is certainly a very good paper; in fact, a very well-written, detailed paper on this topic. I saw a similar paper from Nvidia at ICCV 2019; IIRC, they exploit the variance of the activations, but the idea is very much the same. I liked this approach very much. In my opinion, these are some of the main pros and cons of this approach:

Pros

  1. Simple and effective.
  2. No modification to the model design philosophy.
  3. Can be applied to all kinds of CNNs.
  4. We can design even bigger models, assured that we can produce a compressed version of them for deployment on mobiles/edge devices.

Cons

  1. It involves three phases of training, two of which are fine-tuning, so it is more expensive than other approaches.
  2. Sampling for the Tick phase needs to be done carefully to make sure no bias is introduced into the pruning. In fact, it has to be done over quite a few subsets to ensure an unbiased estimate.
