A Primer on Multi-task Learning — Part 2

Neeraj Varshney
Published in Analytics Vidhya
Oct 9, 2020 · 3 min read

Towards building a “Generalist” model

This is part 2 of the article series on Multi-task learning (MTL) and covers the basic approaches for MTL.

Part 1 of this article series is available here and gives an introduction to Multi-task learning.

Outline Part 2:

  1. Approaches for Multi-task learning
    — Hard Parameter Sharing
    — Soft Parameter Sharing
  2. Basic Training Strategies for Multi-task Learning
    — Instance Sampling Approaches
    — Epoch Sampling Approaches

Part 3 of this article series is now available here.

Approaches for Multi-task Learning

In this section, we will look at the common ways to perform multi-task learning in deep neural networks.

Hard Parameter Sharing

Figure 1: Typical architecture for hard parameter sharing of hidden layers in MTL.

In the hard parameter sharing approach, the hidden layers are shared across all tasks, while a few task-specific output layers on top of them let the model specialize in each task.
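
A minimal PyTorch-style sketch of this idea is shown below (the layer sizes, task names, and head structure are illustrative assumptions, not a fixed recipe): a shared trunk of hidden layers feeds one small task-specific head per task.

```python
import torch
import torch.nn as nn

class HardSharingModel(nn.Module):
    """A shared trunk of hidden layers with one small head per task."""
    def __init__(self, input_dim, hidden_dim, task_output_dims):
        super().__init__()
        # Hidden layers shared by every task.
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One task-specific output layer per task.
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_dim, out_dim)
            for name, out_dim in task_output_dims.items()
        })

    def forward(self, x, task):
        return self.heads[task](self.shared(x))

# Example with two tasks that have different output sizes.
model = HardSharingModel(input_dim=128, hidden_dim=64,
                         task_output_dims={"task_a": 3, "task_b": 2})
logits = model(torch.randn(8, 128), task="task_a")  # shape: (8, 3)
```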

Soft Parameter Sharing

Figure 2: Typical architecture for soft parameter sharing of hidden layers in MTL.

In the soft parameter sharing approach, each task has its own model with its own set of parameters. The distance between the corresponding parameters of these models is then regularized during training to encourage them to stay close. This pushes corresponding layers toward similar weights while still allowing each task to specialize where it needs to.
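
A minimal sketch of this idea, assuming two tasks with identically shaped models (PyTorch; the sizes and names are illustrative): each task keeps its own parameters, and a penalty on the distance between corresponding parameters is added to the combined loss to keep the models close.

```python
import torch
import torch.nn as nn

def make_task_model(input_dim=128, hidden_dim=64, output_dim=3):
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, output_dim),
    )

# Each task gets its own full set of parameters.
model_a = make_task_model()
model_b = make_task_model()

def soft_sharing_penalty(m1, m2):
    # Squared L2 distance between corresponding parameters of the two models.
    return sum(((p1 - p2) ** 2).sum()
               for p1, p2 in zip(m1.parameters(), m2.parameters()))

# During training, add the penalty to the sum of the task losses, e.g.:
# total_loss = loss_a + loss_b + reg_weight * soft_sharing_penalty(model_a, model_b)
```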

Basic Training Strategies for Multi-task Learning

In this section, we will go over the basic training strategies for the MTL problem where the output space for all tasks is the same.

Instance Sampling Approaches:

To determine the number of instances to draw from each task's dataset in every epoch (a small code sketch of these strategies follows the list):

  1. Uniform
    — Uniformly sample instances for each task.
    — The number of instances drawn per task is bottlenecked by the smallest dataset.
    — Tasks with large datasets suffer constrained learning, since they cannot use their entire dataset for training.
  2. Size-dependent
    — Sample instances in proportion to their dataset size.
    — Favors tasks with large datasets.
    — This can result in underfitting the tasks with small datasets and overfitting the tasks with larger datasets.
  3. Uniform → Size
    — Uniformly for the first half of training and based on dataset size for the second half.
  4. Dynamic
    — Sample instances based on the gap between the multi-task model's performance in the current epoch and the performance of its single-task counterpart.
    — The number of instances sampled for each task changes after every epoch: more instances are drawn for tasks that still need training (a large performance gap from the single-task counterpart model) and fewer for tasks that have converged (a marginal gap from the single-task counterpart model).
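
As referenced above, here is a small sketch of how per-epoch instance counts could be computed under these strategies (plain Python; the task names, dataset sizes, and the exact dynamic weighting are assumptions made for illustration):

```python
def instances_per_task(dataset_sizes, strategy, perf_gaps=None):
    """Per-epoch sample counts for each task.
    dataset_sizes: {task: number of available examples}
    perf_gaps:     {task: single-task score minus current multi-task score},
                   used only by the "dynamic" strategy.
    """
    tasks = list(dataset_sizes)
    if strategy == "uniform":
        # Same count for every task, bottlenecked by the smallest dataset.
        n = min(dataset_sizes.values())
        return {t: n for t in tasks}
    if strategy == "size":
        # Use every instance, so contributions are proportional to dataset size.
        return dict(dataset_sizes)
    if strategy == "dynamic":
        # Keep the total budget fixed, but give a larger share to tasks that
        # still lag their single-task counterpart (larger performance gap).
        budget = sum(dataset_sizes.values())
        gaps = {t: max(perf_gaps[t], 0.0) + 1e-6 for t in tasks}
        total_gap = sum(gaps.values())
        # Capped at the available data for simplicity; one could also
        # sample with replacement instead.
        return {t: min(int(budget * g / total_gap), dataset_sizes[t])
                for t, g in gaps.items()}
    raise ValueError(f"unknown strategy: {strategy}")

# Example with three tasks of very different sizes (numbers are made up).
sizes = {"task_a": 90000, "task_b": 9000, "task_c": 500}
print(instances_per_task(sizes, "uniform"))  # every task gets 500
print(instances_per_task(sizes, "size"))     # every task uses its full dataset
```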

Epoch Sampling Approaches:

To determine the order of instances within an epoch (a small code sketch of these batching schemes follows the list):

  1. Partitioned Batches
    — Train on tasks sequentially, i.e., train on all the instances of one task before beginning training on the next task.
    — This is bound to lead to catastrophic forgetting (forgetting the previously learned tasks as the model learns the new ones).
  2. Homogeneous Batches
    — Each batch contains instances of only one task, but the batches themselves are shuffled, i.e., the model still learns all the tasks together; it just sees samples from a single task in any one batch.
  3. Heterogeneous Batches
    — Combine the datasets for all tasks and shuffle the entire data collection.
    — Each batch can contain instances of many tasks.
  4. Uniform Batches (Forced heterogeneity)
    — Equal number of instances of each task in every batch.
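
As mentioned above, here is a minimal sketch of how these four batch-ordering schemes could be implemented (plain Python; the function name and the representation of the sampled instances are assumptions made for illustration):

```python
import random

def make_batches(task_data, strategy, batch_size):
    """Order the instances sampled for one epoch into batches.
    task_data: {task: list of instances sampled for this epoch}
    Returns a list of batches; each instance is tagged with its task name.
    """
    tagged = {t: [(t, x) for x in xs] for t, xs in task_data.items()}
    if strategy == "partitioned":
        # All batches of one task, then all batches of the next task, ...
        batches = []
        for items in tagged.values():
            batches += [items[i:i + batch_size]
                        for i in range(0, len(items), batch_size)]
        return batches
    if strategy == "homogeneous":
        # Single-task batches, but the batch order is shuffled across tasks.
        batches = make_batches(task_data, "partitioned", batch_size)
        random.shuffle(batches)
        return batches
    if strategy == "heterogeneous":
        # Pool and shuffle everything; a batch may mix several tasks.
        pool = [item for items in tagged.values() for item in items]
        random.shuffle(pool)
        return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
    if strategy == "uniform":
        # Forced heterogeneity: an equal share of each task in every batch.
        per_task = batch_size // len(tagged)
        for items in tagged.values():
            random.shuffle(items)
        n_batches = min(len(items) for items in tagged.values()) // per_task
        return [[item for items in tagged.values()
                 for item in items[b * per_task:(b + 1) * per_task]]
                for b in range(n_batches)]
    raise ValueError(f"unknown strategy: {strategy}")
```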

Part 3 of this article series is now available here.

References:

  • Ruder, Sebastian. “An overview of multi-task learning in deep neural networks.” arXiv preprint arXiv:1706.05098 (2017).
  • Worsham, Joseph, and Jugal Kalita. “Multi-task learning for natural language processing in the 2020s: where are we going?.” Pattern Recognition Letters (2020).
  • Stanford CS330: Multi-Task and Meta-Learning, 2019.
  • Gottumukkala, Ananth, et al. “Dynamic Sampling Strategies for Multi-Task Reading Comprehension.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
