Generalization in multitask deep neural classifiers: a statistical physics approach — Read a paper

Vigneswaran C · Read a Paper · Nov 7, 2020

Training multitask deep learning models has become a growing interest in the scientific community, since it promises deep learning models that extend to more generalized and intuitive systems. The most prevalent way to train a multitask model is to add an auxiliary task to the primary task, so that the model learns features good enough to solve both objectives simultaneously. However, no study has clearly stated how the primary task should be “related” to the auxiliary tasks to gain an improved multitasking benefit. In this work, the authors give a succinct way of quantifying this “relatedness”, with solid theoretical and experimental underpinnings.

Student-Teacher Networks

During training, the teacher gives out noisy output vectors (noise is added to the network weights), while the student network is trained on these noisy labels and tested against the noise-free ones. In the multitask setup, the student network consists of shared hidden layers followed by task-specialized branches. The loss function is a weighted sum of the two task losses (primary and auxiliary): loss_overall = Wa * loss_a + Wb * loss_b, where Wa and Wb are the weights for loss_a and loss_b respectively.
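As a concrete illustration, here is a minimal sketch of this shared-trunk setup in PyTorch. The layer sizes, task weights, and random labels are placeholders of my own, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the paper's dimensions differ.
IN_DIM, HIDDEN, OUT_A, OUT_B = 30, 50, 10, 10

class MultitaskStudent(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(IN_DIM, HIDDEN)   # shared hidden layer
        self.head_a = nn.Linear(HIDDEN, OUT_A)    # primary-task branch
        self.head_b = nn.Linear(HIDDEN, OUT_B)    # auxiliary-task branch

    def forward(self, x):
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

student = MultitaskStudent()
mse = nn.MSELoss()
w_a, w_b = 1.0, 0.5                               # loss weights Wa, Wb

x = torch.randn(64, IN_DIM)                       # a batch of inputs
y_a = torch.randn(64, OUT_A)                      # stand-ins for the noisy teacher outputs
y_b = torch.randn(64, OUT_B)

pred_a, pred_b = student(x)
loss_overall = w_a * mse(pred_a, y_a) + w_b * mse(pred_b, y_b)
loss_overall.backward()
```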

Multitask Dynamics

In [Ref1], the authors proposed measuring task-relatedness in multitask learning as a function of the angles between the singular vectors of the mappings implicitly learned by the network. (Note: you can refer to this link to learn about Singular Value Decomposition and singular vectors.) Taking the main task as “A” and the auxiliary task as “B”, whose purpose is to improve performance on task “A”, the relatedness between them can be written as,

relatedness(A, B) = v_Aᵀ v_B, the cosine of the angle between the teachers' input singular vectors (1 when the input modes coincide, 0 when they are orthogonal),

where W_A and W_B are the weight matrices of the teacher networks for tasks A and B respectively, and their SVDs are,

W_A = U_A S_A V_Aᵀ and W_B = U_B S_B V_Bᵀ
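As a rough sketch of how this quantity could be computed, here is my own NumPy illustration with rank-1 teachers (the dimensions, signal strength, and random modes are assumptions, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, out_dim = 30, 10

def rank1_teacher(v, s=10.0):
    """Rank-1 teacher W = s * u v^T with a random unit output mode u."""
    u = rng.standard_normal(out_dim)
    u /= np.linalg.norm(u)
    return s * np.outer(u, v)

# Two unit input modes v_a, v_b define the teachers for tasks A and B
v_a = rng.standard_normal(in_dim); v_a /= np.linalg.norm(v_a)
v_b = rng.standard_normal(in_dim); v_b /= np.linalg.norm(v_b)
W_A, W_B = rank1_teacher(v_a), rank1_teacher(v_b)

# SVDs W = U S V^T; the rows of Vh are the input singular vectors
_, _, Vh_A = np.linalg.svd(W_A)
_, _, Vh_B = np.linalg.svd(W_B)

# Relatedness as the |cosine| of the angle between the top input modes
relatedness = abs(Vh_A[0] @ Vh_B[0])
print(f"relatedness r = {relatedness:.3f}")
```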

Multitask Benefit

  1. Increasing relatedness and the signal-to-noise ratio (SNR) directly improves the multitask benefit
  2. Conversely, unrelated tasks and low SNR deteriorate learning
  3. Reaching a good baseline performance on the main task before multitask learning helps improve the multitask benefit

SNR is increased by raising the singular values of teacher A (from 0.1 to 10), since the singular values set the strength of the teacher's signal relative to the label noise. The multitask benefit is directly correlated with task relatedness and with the SNR of the related task. The multitask benefit is likewise contingent on the number of training points.

[Figure: multitask benefit as a function of task relatedness and SNR]

Increasing the number of auxiliary-task data points and the relatedness, together with a high-SNR main task, gives the maximum multitask benefit, as sketched in the toy example below. Increasing the depth of the nonlinear student network affects the benefit in the same way.

[Figure: multitask benefit as a function of auxiliary-task data points and student network depth]
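To make these factors concrete, below is a small toy experiment of my own construction (not the paper's setup): two rank-1 linear teachers share an input space, a linear student with a shared feature vector and per-task heads is trained by gradient descent with and without the auxiliary task, and the multitask benefit is measured as the drop in noise-free task-A test error. All settings here are assumptions for illustration:

```python
import numpy as np

in_dim, n_a, n_b, snr = 30, 20, 200, 10.0   # few main-task points, many auxiliary

def run(relatedness, use_aux, steps=2000, lr=0.01):
    rng = np.random.default_rng(0)          # same data for both runs
    # Teacher input modes with overlap v_a . v_b = relatedness
    v_a = np.zeros(in_dim); v_a[0] = 1.0
    v_b = np.zeros(in_dim); v_b[0] = relatedness
    v_b[1] = np.sqrt(1.0 - relatedness**2)

    # Noisy teacher labels: y = snr * (v . x) + unit-variance noise
    x_a = rng.standard_normal((n_a, in_dim))
    y_a = snr * x_a @ v_a + rng.standard_normal(n_a)
    x_b = rng.standard_normal((n_b, in_dim))
    y_b = snr * x_b @ v_b + rng.standard_normal(n_b)

    # Student: shared feature vector c, scalar heads w_a, w_b (Wa = Wb = 1)
    c = 0.01 * rng.standard_normal(in_dim)
    w_a, w_b = 0.1, 0.1
    for _ in range(steps):
        e_a = w_a * (x_a @ c) - y_a          # task-A residuals
        g_c = w_a * (x_a.T @ e_a) / n_a      # MSE gradient (constant factor folded into lr)
        g_wa = (x_a @ c) @ e_a / n_a
        g_wb = 0.0
        if use_aux:
            e_b = w_b * (x_b @ c) - y_b
            g_c += w_b * (x_b.T @ e_b) / n_b
            g_wb = (x_b @ c) @ e_b / n_b
        c -= lr * g_c
        w_a -= lr * g_wa
        w_b -= lr * g_wb

    # Noise-free task-A test error for isotropic inputs: ||snr*v_a - w_a*c||^2
    return np.sum((snr * v_a - w_a * c) ** 2)

for r in (0.0, 0.5, 1.0):
    benefit = run(r, use_aux=False) - run(r, use_aux=True)
    print(f"relatedness {r:.1f}: multitask benefit = {benefit:+.2f}")
```

In this toy setting the benefit grows with relatedness: a highly related auxiliary task effectively supplies extra data for the shared input mode, while an unrelated one pulls the shared features away from task A.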

The work supports its claims about the multitask benefit with thorough theoretical and experimental results. Defining the concept of task relatedness, deriving it theoretically from the dynamics of learning, and identifying the factors that affect the benefit are the highlights of the work.

Hope you enjoyed reading :)

Reference

  1. Lampinen, Andrew K., and Surya Ganguli. “An analytic theory of generalization dynamics and transfer learning in deep linear networks.” arXiv preprint arXiv:1809.10374 (2018).
