“Mobile friendly” deep convolutional neural networks — Part 3

Siddarth Malreddy
2 min read · Jun 22, 2016

Read part one here and part two here.

The amount of research being done on improving the run time of CNNs and compressing their models is staggering. In this article I'm going to cover two major techniques used for these purposes.

  • Hardware Accelerators:

Because CNNs operate on images and each kernel in a layer is independent of the other kernels, they are highly parallelizable and are perfect candidates for hardware acceleration: each kernel can be run on a different core or even a different machine. Dedicated hardware brings improvements in power and speed at the cost of area and price, but the effectiveness of deep learning is motivating hardware manufacturers to invest in developing such hardware. Qualcomm Zeroth and NVIDIA DIGITS are examples of such accelerators.

  • Mathematical Techniques:

Since a convolutional neural network is basically a series of tensor operations, we can use tensor rank decomposition techniques to decrease the number of operations that need to be done for each layer. The paper “Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications” uses Variational Bayesian Matrix Factorization for rank selection and Tucker-2 decomposition to split each layer into three layers.

For a convolutional layer with a T x S x K x K kernel tensor (T output channels, S input channels, and a K x K spatial support), rank selection gives ranks P and Q for the input-channel and output-channel axes. The layer is then decomposed into three layers of sizes P x S x 1 x 1, Q x P x K x K, and T x Q x 1 x 1.
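As a concrete illustration, here is a minimal PyTorch sketch of the resulting architecture. This is my own sketch rather than code from the paper, and the sizes S, T, K and the ranks P, Q below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: S input channels, T output channels, K x K kernels,
# with Tucker-2 ranks P (input-channel side) and Q (output-channel side).
S, T, K = 256, 256, 3
P, Q = 64, 64

# Original layer: a single T x S x K x K convolution.
original = nn.Conv2d(S, T, kernel_size=K, padding=K // 2)

# Tucker-2 style replacement: 1x1 reduce -> KxK core -> 1x1 expand.
decomposed = nn.Sequential(
    nn.Conv2d(S, P, kernel_size=1),                   # P x S x 1 x 1
    nn.Conv2d(P, Q, kernel_size=K, padding=K // 2),   # Q x P x K x K
    nn.Conv2d(Q, T, kernel_size=1),                   # T x Q x 1 x 1
)

# Both map S input channels to T output channels at the same spatial size.
x = torch.randn(1, S, 32, 32)
assert original(x).shape == decomposed(x).shape

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(original), num_params(decomposed))  # ~590k vs ~70k here
```

The parameter count drops from roughly T·S·K² to S·P + P·Q·K² + Q·T, which is where the speed and memory savings come from. In the paper the weights of the three layers are obtained by actually decomposing the trained kernel tensor (with ranks chosen by VBMF) and then fine-tuning; the sketch above only shows the resulting layer shapes.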

This kind of bottleneck architecture is also found in ResNet, SqueezeNet and the Inception modules in GoogLeNet. It can be intuitively justified by noting that the input channels are correlated, so their redundancy can be removed by combining them with a 1 x 1 convolution; after the core convolution, the channels are expanded again for the next layer. The loss in accuracy caused by the decomposition is compensated for by fine-tuning.

Alternatively, techniques like pruning and weight sharing can be used to compress the model itself, as detailed in Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. The authors argue that the network learns redundant connections during training, so they propose removing such connections and keeping only the most informative ones: connections whose weights fall below a certain threshold are removed, and the remaining weights are fine-tuned. This technique alone gives a 9x reduction in parameters for AlexNet. They also use k-means clustering to identify weights that can be shared within a single layer.
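A rough Python sketch of these two steps is shown below. This is my own illustration rather than the authors' pipeline, and the weight matrix, the 0.05 threshold, and the 16 clusters are arbitrary placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for one layer's trained weight matrix.
weights = rng.normal(scale=0.1, size=(256, 256))

# 1. Pruning: drop connections whose magnitude is below a threshold.
#    (In the paper the surviving weights are then fine-tuned.)
threshold = 0.05
mask = np.abs(weights) >= threshold
pruned = weights * mask
print("fraction of weights kept:", mask.mean())

# 2. Weight sharing: cluster the surviving weights with k-means and
#    replace each one by its cluster centroid, so the layer only needs
#    to store the centroids plus a small index per surviving weight.
surviving = pruned[mask].reshape(-1, 1)
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(surviving)
shared = np.zeros_like(pruned)
shared[mask] = kmeans.cluster_centers_[kmeans.labels_].ravel()
```

After this step each surviving weight takes one of only 16 values, so the layer can be stored as a small table of centroids plus short indices; the paper additionally Huffman-codes the indices and fine-tunes the shared centroids to recover accuracy.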

With this, I conclude my three-part series on making deep learning “mobile friendly”. Please let me know if you have any suggestions.
