“Mobile friendly” deep convolutional neural networks — Part 2 — Making Deep Nets Shallow

Siddarth Malreddy
3 min read · Jun 14, 2016


Read part one here.

The first question that comes to mind when thinking about making deep nets shallow is: why do we need to make them deep in the first place?

The answer is that a network has to learn a complex function mapping inputs to output labels, and present learning methods cannot teach a shallow network that complex function directly. Prof. Rich Caruana discusses this in his paper “Do Deep Nets Really Need to be Deep?”. He shows that by training an ensemble of deep nets on a given dataset and then transferring the function learned by the ensemble into a shallow net, the shallow net can match the accuracy of the ensemble. Training the same shallow net directly on the original data, however, gives significantly lower accuracy. Thus, by transferring the function we can make the shallow network learn a much more complex function than it could have learned by training directly on the dataset.

But how do we transfer the learned function? We run the dataset through the ensemble and store the logits (the values fed into the SoftMax layer). Then we train the shallow network on these logits instead of the original labels. Intuitively, training on the logits teaches the model not only the right answer but also which answers are “almost right” and which are “not at all right”.
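To make this concrete, here is a minimal PyTorch sketch of the logit-matching step. The `teacher_ensemble` and `shallow_net` models, their layer sizes, and the learning rate are placeholder choices for illustration; the essential point is that the student is trained with a regression loss on the teacher’s logits rather than a classification loss on the hard labels.

```python
import torch
import torch.nn as nn

# Placeholder models for illustration: `teacher_ensemble` stands in for the
# trained deep ensemble, `shallow_net` for the small student we want to train.
teacher_ensemble = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512),
                                 nn.ReLU(), nn.Linear(512, 10))
shallow_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64),
                            nn.ReLU(), nn.Linear(64, 10))

optimizer = torch.optim.SGD(shallow_net.parameters(), lr=1e-3)
mse = nn.MSELoss()  # regress on the teacher's logits, not the class labels

def train_step(images):
    with torch.no_grad():
        target_logits = teacher_ensemble(images)  # logits = pre-SoftMax outputs
    student_logits = shallow_net(images)
    loss = mse(student_logits, target_logits)     # match the teacher's logits
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a random stand-in batch of 32 "images"
print(train_step(torch.randn(32, 3, 32, 32)))
```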

For example, given an image of a car, the ensemble’s logit for “truck” would be much higher than its logit for “flower”. Let’s assume the logits are car: 3, truck: 0.1, flower: 0.00001. If we train the shallow network on these values, we are telling it that the image has a significant probability of being a car or a truck, but is definitely not a flower. So during training we present the model with much more information than we would have if we had used only the labels.

Dr. Geoffrey Hinton took this a step further and proposed using soft targets (SoftMax outputs computed at a raised “temperature”) instead of raw logits. In his paper “Distilling the Knowledge in a Neural Network” he argues that this gives the shallow model more information to learn from.
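As a sketch of what that objective looks like in code (again PyTorch, with the temperature T, the weight alpha, and the tensor shapes as arbitrary placeholders), the student is trained against the teacher’s temperature-softened SoftMax outputs, optionally mixed with the usual loss on the hard labels:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: teacher's SoftMax computed at a raised temperature T.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    # Student's log-probabilities at the same temperature.
    student_log_probs = F.log_softmax(student_logits / T, dim=1)
    # Cross-entropy against the soft targets, scaled by T^2 so its gradients
    # keep a magnitude comparable to the hard-label term (as in the paper).
    soft_loss = -(soft_targets * student_log_probs).sum(dim=1).mean() * (T * T)
    # Ordinary cross-entropy on the true hard labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random stand-in values: a batch of 8 examples, 10 classes.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```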

Now that we know how to make our deep networks shallow (and thereby save processing time and power), in the next part we will look at techniques to make the network run faster.

P.S. Check out Prof. Caruana’s talks here and here.

P.P.S. Dr. Hinton also shows that the model can learn to recognise classes it has never seen during training, just by inferring their structure from the soft targets. He calls this dark knowledge. Check out his talk if you’re interested.

Update: Prof. Caruana published a new paper, “Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)?”, in which he concludes that shallow nets can emulate deep convolutional nets provided they themselves contain multiple convolutional layers.
