Linear Distillation Learning

Asadulaev Arip
7 min read · Jul 4, 2019


Highlights of https://arxiv.org/abs/1906.05431

Introduction

In our recent paper we present Linear Distillation Learning (LDL), a simple remedy to improve the performance of linear networks through distillation.

In deep learning, distillation often allows a smaller or shallower network to mimic a larger model much more accurately, while a network of the same size trained on one-hot targets cannot achieve results comparable to the cumbersome model. Our neural networks without activation functions achieved high classification scores on small amounts of data from the MNIST and Omniglot datasets.

Because even the first neural networks already used threshold gates [1], linear networks have long been uninteresting in practice. From a research perspective, however, the study of linear networks leads to new insights into the performance of deep learning methods and provides theoretical models of generalization dynamics. Linear networks allow studying the structure of error-function landscapes and applying this knowledge to nonlinear cases [2]. Our research seeks to create a mathematically tractable, scalable, and simple model that can help bridge the gap between the efficacy and lucidity of deep learning models.

The approach is based on using a linear function for each class in the dataset, trained to simulate the output of a teacher linear network on that class separately. Once the model is trained, we can perform classification by novelty detection for each class. Our framework distills randomized prior functions for the data; since the prior functions are linear, coupling them with bootstrap methods provides a Bayes posterior [3].

Model

One-to-Many Distillation

Consider a classification problem with object set 𝑋 = ℝᵈ and label set Y = {1, …, C}. We are given a labelled dataset D = {xⁱ, yⁱ}, where xⁱ ∈ 𝑋 and yⁱ ∈ Y.

Our idea is to create a linear predictor P(𝜽) for each class that simulates the behaviour of the target Q(𝜙) on that class. Every predictor is trained on only one class. Predictor training minimizes the mean squared error between the predictor outputs and the target outputs. We thus replace the classification problem with the problem of approximating one linear function with several linear functions, one per class, and call this method One-to-Many Distillation (O2MD). At evaluation time, we predict by taking the argmin of the distances between each predictor's output and the target network's output for a given sample.
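As a rough PyTorch sketch of O2MD under our own simplifying assumptions (the dimensions, optimizer, and learning rate below are placeholders, not the settings used in the paper):

```python
# Rough O2MD sketch. Dimensions, optimizer and learning rate are placeholder
# assumptions, not the paper's settings.
import torch
import torch.nn as nn

d, k, C = 784, 128, 10                 # input dim, target output dim, classes (assumed)

target = nn.Linear(d, k)               # Q(phi): randomly initialized, frozen
for p in target.parameters():
    p.requires_grad_(False)

predictors = [nn.Linear(d, k) for _ in range(C)]             # P(theta_c), one per class
optims = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in predictors]
mse = nn.MSELoss()

def train_step(x, y):
    """Train each predictor only on samples of its own class, to match the target."""
    for c in range(C):
        xc = x[y == c]
        if xc.shape[0] == 0:
            continue
        optims[c].zero_grad()
        loss = mse(predictors[c](xc), target(xc))
        loss.backward()
        optims[c].step()

@torch.no_grad()
def predict(x):
    """Classify by the predictor whose output is closest to the target's output."""
    t = target(x)                                            # (B, k)
    dists = torch.stack([((p(x) - t) ** 2).mean(dim=1)       # one row per class
                         for p in predictors])               # (C, B)
    return dists.argmin(dim=0)                               # (B,)
```

The argmin over per-class distances in predict is exactly the novelty-detection view of classification described above.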

Following the analogy with Random Network Distillation (RND) [8], our framework can be presented as randomized prior functions for the data D. Osband et al. [3] show that bootstrap approaches [4] combined with randomized prior functions provide a Bayes posterior in the linear case, at much lower computational cost than exact Bayesian inference.

In this setting, we investigate a distribution over functions G(𝜽) = P(𝜽) + Q(𝜙), where the parameters 𝜽 are specified by minimizing the expected prediction error with a regularizer R(𝜽) [3]. In our formulation, we have a specific distribution G(𝜽) for each class c.
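As a sketch, this per-class objective can be written in the style of the randomized prior formulation of Osband et al. [3] (the exact loss and regularizer used in the paper may differ; this reconstruction is ours):

$$
\theta_c \;=\; \arg\min_{\theta}\; \mathbb{E}_{(x^i, y^i) \in D_c}
\big\| \big(P(\theta; x^i) + Q(\phi; x^i)\big) - y^i \big\|_2^2 \;+\; R(\theta),
$$

where Dc denotes the subset of D with label c.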

Parameters 𝜙 are drawn from a prior q(𝜙) over the parameters of the mapping Q(𝜙), and after updating on the evidence we can extract them from the posterior. In our case, if we set yⁱ to 0 for every class, then, following RND, each distillation error quantifies the uncertainty in predicting the constant zero function.

By default, we interpret our model as an unbiased ensemble with shared parameters, but in practice the model can also be viewed as an ensemble of target–predictor pairs, one per class. In this setting, the prediction of each ensemble member is taken as the sum of the target and predictor functions.

During bootstrapping with a zero target, an ensemble without priors has almost zero predictive uncertainty as x becomes large and negative [5], which can lead to arbitrarily poor decisions [6].

Bidirectional Distillation

Our method has a limitation: a randomly initialized target can map different classes to similar regions of its output space. Because each predictor simulates the target on its corresponding class, comparing outputs distinguishes a predictor more clearly when the dissimilarity with the other classes is much larger. For example, if the target output for class 1 is very different from the outputs for the other classes, the trained predictor P(𝜽₁) for this class will be closer to the target than the others. This is only possible if the target outputs for the different classes are far from each other.

One way to choose a Q(𝜙) that maps classes to distinct regions is to train it to do so directly, for example by updating the target parameters 𝜙 via distillation from some teacher. To do this, we propose training the target network to predict the behaviour of many other networks, one per class in the dataset. This training procedure is the inverse of O2MD: we first train the target network to predict the behaviour of each class's predictor separately, and then train the predictors to simulate the target.

We call this method Bidirectional Distillation. Each training epoch consists of two stages (a code sketch follows the figure below):

  1. Train the target to predict the behaviour of the predictors,
  2. Train each predictor to simulate the target's behaviour on its class.
Bidirectional Distillation figure: we train the “Target Network” to simulate each “Linear Network for Class N”, and train each “Linear Network for Class N” to predict the “Target Network” outputs.
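A sketch of one Bidirectional Distillation epoch, reusing target, predictors, mse, C, and train_step from the O2MD sketch above (the alternation schedule and optimizer choices here are assumptions rather than the paper's exact recipe):

```python
# One Bidirectional Distillation epoch; the per-batch alternation and the
# target's optimizer are assumptions.
import torch

target_optim = torch.optim.Adam(target.parameters(), lr=1e-3)

def bidirectional_epoch(loader):
    # Stage 1: train the target to predict each class's predictor.
    for p in target.parameters():
        p.requires_grad_(True)
    for x, y in loader:
        for c in range(C):
            xc = x[y == c]
            if xc.shape[0] == 0:
                continue
            target_optim.zero_grad()
            loss = mse(target(xc), predictors[c](xc).detach())
            loss.backward()
            target_optim.step()

    # Stage 2: freeze the target again and run the usual O2MD predictor step.
    for p in target.parameters():
        p.requires_grad_(False)
    for x, y in loader:
        train_step(x, y)
```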

Experiments

MNIST

In this section, we describe the results of experiments comparing the One-to-Many and Bidirectional variants of the LDL model against a deep fully connected neural network, logistic regression, and a naive linear model. In the naive setting, the predictors are trained without a target: each predictor is trained like an autoencoder, but without any dimensionality reduction, and at prediction time we simply measure the distance between each predictor's output and the sample xⁱ.
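A minimal sketch of this naive baseline, under our reading of the description and reusing d, C, and mse from the O2MD sketch above:

```python
# Naive baseline: per-class linear "autoencoders" without a target network,
# classifying by reconstruction error.
import torch
import torch.nn as nn

naive = [nn.Linear(d, d) for _ in range(C)]                  # no bottleneck
naive_optims = [torch.optim.Adam(p.parameters(), lr=1e-3) for p in naive]

def naive_train_step(x, y):
    for c in range(C):
        xc = x[y == c]
        if xc.shape[0] == 0:
            continue
        naive_optims[c].zero_grad()
        loss = mse(naive[c](xc), xc)                         # reconstruct the input
        loss.backward()
        naive_optims[c].step()

@torch.no_grad()
def naive_predict(x):
    errs = torch.stack([((p(x) - x) ** 2).mean(dim=1) for p in naive])
    return errs.argmin(dim=0)
```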

All of our experiments are formulated within the few-shot problem framework. Each model was provided with a set of m labelled samples from each of C classes. In few-shot terminology, C is traditionally called the way and m the shot. The task is C-class classification. Unlike traditional few-shot learning models, our approach does not imply knowledge transfer between episodes, and is thus more similar to small-sample learning [7]. To avoid bias from the small number of samples, each model is trained for 100 independent trials and the average accuracy is reported.

Accuracy curves for the MLP, One-to-Many, and Bidirectional Distillation models. The models are trained for 10 epochs on the MNIST dataset with 1, 5, and 10 samples per class.

Both the Bidirectional Distillation model and the O2MD version have advantages over a classic deep fully connected network on small amounts of data. Bidirectional training converges much faster, reaching nearly the best capability of the model after the first epochs, because the target pre-trained on the predictors simplifies training.

OMNIGLOT

The Omniglot dataset consists of 1623 characters from 50 different alphabets, each drawn by 20 different people. We augmented the existing classes with rotations by multiples of 90 degrees, used 1200 characters for training and the remaining character classes for evaluation, resized the images to 28 × 28 pixels, and kept the same model settings as for the previous dataset.
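A hypothetical preprocessing sketch with torchvision (the exact augmentation pipeline used in the paper is an assumption here):

```python
# Hypothetical Omniglot preprocessing: resize to 28x28 and expand each character
# class with 90-degree rotations. The exact pipeline from the paper is assumed.
from torchvision import datasets, transforms
from torchvision.transforms import functional as TF

to_tensor_28 = transforms.Compose([
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

omniglot = datasets.Omniglot(root="data", background=True, download=True)

def rotated_variants(pil_img, label, num_classes=964):       # 964 background characters
    """Treat each 90-degree rotation of a character as a new class."""
    return [(to_tensor_28(TF.rotate(pil_img, angle)), label + k * num_classes)
            for k, angle in enumerate((0, 90, 180, 270))]

# Usage: img, label = omniglot[0]; samples = rotated_variants(img, label)
```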

We experimented with different learning rates (1e-3, 1e-4, 1e-5) and optimizer types (SGD, Adam, Adadelta) and report results from the most promising configurations. We tested our models with ways of sizes 3, 5, and 10 and shots m of sizes 1, 3, 5, and 10.

The model is sensitive to hyperparameter settings but is nevertheless able to learn classification with a proper configuration.

Conclusion

In this paper we present an architecture based on several methods of random function distillation using linear networks. The motivation for our work was to create an architecture consisting of linear functions capable of classification on small datasets. We tested our model on several datasets and showed results comparable to those of nonlinear models on small amounts of data.

For the Omniglot dataset, we tested our architecture in the few-shot learning paradigm. Our model lacks the paradigm's key ingredient of preserving knowledge across sub-datasets (episodes) during learning. Further studies will focus on applying our method to few-shot tasks at a higher level. There is abundant room for progress in using distillation as a learning method and for exploring the open questions in this area. In future work, we will also explore open questions around linear networks to fully reveal their potential and capabilities.

References

[1] Gualtiero Piccinini. The first computational theory of mind and brain: A close look at McCulloch and Pitts's “logical calculus of ideas immanent in nervous activity”. Synthese, 141(2):175–215, 2004.

[2] Alberto Bernacchia, Máté Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and its application to the nonlinear case. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 5945–5954, 2018.

[3] Ian Osband, John Aslanides, and Albin Cassirer. Randomized prior functions for deep reinforcement learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 8626–8638, 2018.

[4] Siva Sivaganesan. An introduction to the bootstrap (Bradley Efron and Robert J. Tibshirani). SIAM Review, 36(4):677–678, 1994.

[5] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 4026–4034, 2016.

[6] Ian Osband and Benjamin Van Roy. Bootstrapped Thompson sampling and deep exploration. CoRR, abs/1507.00300, 2015.

[7] Jun Shu, Zongben Xu, and Deyu Meng. Small sample learning in big data era. CoRR, abs/1808.04572, 2018.

[8] Yuri Burda, Harrison Edwards, Amos J. Storkey, and Oleg Klimov. Exploration by random network distillation. CoRR, abs/1810.12894, 2018.
