The Lottery Ticket Hypothesis
Making neural networks incrementally faster and easier to train has been an active area of study during the last decade. Many methods and techniques have come out of this work (batch normalisation, Adam and residual connections, to name a few), but what if the true answer to more efficient models lies in their random initialisation?
This hypothesis is strengthened by some key observations. In particular, researchers studying pruning have found that it’s possible to train relatively small networks successfully as long as their parameters are initialised properly. Otherwise, such small models often struggle during training, which in many cases is attributed to their limited capacity. Below are two quotes from researchers which highlight these findings and the role initialisation plays.
“Training a pruned model from scratch performs worse than retraining a pruned model, which may indicate the difficulty of training a network with a small capacity.” (Li et al.)
“During retraining, it is better to retain the weights from the initial training phase for the connections that survived pruning than it is to reinitialise the pruned layers… gradient descent is able to find good solutions when the network is initially trained but not after re-initialising some layers and retraining them.” (Han et al.)
It seems that if we train a model, prune the weights of smallest magnitude and then reset the surviving weights to their original initial values, we end up with a small model that is trainable. If, on the other hand, the same pruned model is reinitialised with random parameters, the performance achieved is often significantly lower, if training succeeds at all.
It can, therefore, be assumed that the initialisation uncovered by pruning holds some property that allows the subnetwork to be trained effectively. But there is actually more to it than that! In this article, we explore the findings of Frankle and Carbin in their paper The Lottery Ticket Hypothesis, where they thoroughly study this phenomenon and discuss its implications.
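The train, prune and reset procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the helper name `magnitude_mask`, the 4×5 weight matrix and the 80% pruning fraction are assumptions made for the example, and real training is replaced by a random perturbation of the initial weights.

```python
import numpy as np

def magnitude_mask(trained, prune_fraction):
    """Return a boolean mask that drops the smallest-magnitude weights."""
    n_prune = int(round(prune_fraction * trained.size))
    mask = np.ones(trained.size, dtype=bool)
    if n_prune > 0:
        # Flat indices of the n_prune weights with the smallest absolute value.
        smallest = np.argsort(np.abs(trained), axis=None)[:n_prune]
        mask[smallest] = False
    return mask.reshape(trained.shape)

rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(4, 5))  # original random initialisation
# Stand-in for training (assumption): perturb the initial weights.
theta_trained = theta_0 + rng.normal(scale=0.5, size=theta_0.shape)

mask = magnitude_mask(theta_trained, prune_fraction=0.8)

# Candidate winning ticket: surviving weights reset to their ORIGINAL values.
ticket_init = mask * theta_0
# Control: same sparsity pattern, but freshly sampled random weights.
random_init = mask * rng.normal(size=theta_0.shape)
```

Both `ticket_init` and `random_init` share the same sparsity pattern. The observation discussed above is that training from the first tends to work well, while training from the second often does not.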
The hypothesis is postulated as follows:
A randomly-initialised, dense neural network contains a subnetwork that is initialised such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations
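The same statement can be written a little more formally. The block below is a sketch in notation similar to the paper's, so the exact symbols should be checked against the original: $f(x; \theta)$ is the dense network, $\theta_0$ its random initialisation, $m$ a fixed binary mask over the parameters, $j$ and $j'$ the iterations at which the dense network and the subnetwork reach minimum validation loss, and $a$ and $a'$ their test accuracies.

```latex
% Dense network f(x; \theta_0): minimum validation loss at iteration j, test accuracy a.
% Subnetwork f(x; m \odot \theta_0): minimum validation loss at iteration j', test accuracy a'.
\exists\, m \in \{0,1\}^{|\theta|} \quad \text{such that} \quad
j' \le j, \qquad a' \ge a, \qquad \lVert m \rVert_0 \ll |\theta|
```

The three conditions correspond to comparable training time, comparable accuracy and far fewer parameters.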
Read literally, the hypothesis states that the initialisation of this subnetwork allows us to train the smaller network for no more iterations than the original while matching its test accuracy. In the authors’ experiments the subnetworks often do better than that, needing fewer iterations and reaching a higher test accuracy! This almost sounds too good to be true. Fortunately, the authors back up their claims with a thorough analysis of the phenomenon.
To illustrate the process of finding the subnetworks referred to in the hypothesis, please have a look at the animation below.
Experimental results are presented for a variety of models (LeNet, both scaled-down and more traditionally sized VGG variants, and ResNet) to test the two aspects of the hypothesis: training speed, measured as the number of iterations each network needs to reach minimum validation loss, and performance.
In general, the authors were consistently able to find subnetworks with fortunate initialisations, so-called winning tickets, through both one-shot pruning and iterative pruning. One-shot pruning trains the network once and removes the desired number of parameters in a single step. Iterative pruning instead removes only a fraction of them in each round: the network is trained, the smallest-magnitude weights are pruned, and the surviving parameters are reset to their original initial values before the next round of training and pruning.
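To make the iterative variant concrete, here is a small sketch on a toy problem. Everything in it is an assumption for illustration: the function names `train` and `iterative_prune`, the least-squares task standing in for a neural network, the flat weight vector (the paper prunes per layer), and the 20% per-round rate.

```python
import numpy as np

def train(theta, mask, X, y, steps=200, lr=0.1):
    """Toy stand-in for training: masked gradient descent on least squares."""
    theta = theta * mask
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / len(y)
        theta = theta - lr * grad * mask  # pruned weights stay at zero
    return theta

def iterative_prune(theta_0, X, y, rounds, rate):
    """Repeat: train from theta_0, prune a fraction of survivors, reset."""
    mask = np.ones_like(theta_0)
    for _ in range(rounds):
        trained = train(theta_0, mask, X, y)  # always restart from theta_0
        alive = np.flatnonzero(mask)
        n_prune = int(rate * alive.size)
        # Among the surviving weights, drop those of smallest magnitude.
        drop = alive[np.argsort(np.abs(trained[alive]))[:n_prune]]
        mask[drop] = 0.0
    return mask

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 50))
true_w = np.zeros(50)
true_w[:5] = 3.0                          # only five inputs matter
y = X @ true_w
theta_0 = rng.normal(scale=0.1, size=50)  # original initialisation

mask = iterative_prune(theta_0, X, y, rounds=5, rate=0.2)
ticket = mask * theta_0                   # sparse network at its original init
```

One-shot pruning corresponds to calling `iterative_prune` with `rounds=1` and a larger `rate`. The important detail in both cases is that every round restarts from `theta_0` instead of continuing from the trained weights.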
For the fully connected, two-layered LeNet 300–100, pruned models 20% the size of their original model were able to achieve slightly higher performance (0.3 percentage points) with only 62% of the training steps.
Pruning the model below 20% of its original size impacts performance negatively. However, the decrease is small enough that a model with only 3.6% of the original model's parameters still matches its performance!
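A quick bit of arithmetic shows how iterative pruning arrives at sizes like these. The sketch below assumes that every round removes 20% of the surviving weights. That rate is an assumption for this example and is not stated above, and the helper names are made up.

```python
import math

def remaining_after(rounds, rate_per_round):
    """Fraction of weights left after pruning `rate_per_round` each round."""
    return (1.0 - rate_per_round) ** rounds

def rounds_to_reach(target_fraction, rate_per_round):
    """Smallest number of rounds after which at most target_fraction remains."""
    return math.ceil(math.log(target_fraction) / math.log(1.0 - rate_per_round))

for n in (7, 8, 15):
    print(n, round(100 * remaining_after(n, 0.2), 1))
```

Under this assumption, about 21% of the weights remain after 7 rounds and about 3.5% after 15. That is close to, though not exactly, the 3.6% quoted above. The small gap may come from details such as different pruning rates per layer.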
Common to pruned models of all sizes is that randomly re-initialising their parameters drastically reduces their performance, in line with previous findings. The initialisation uncovered by pruning is therefore of great importance.
It is also worth noting that the gap between training and test performance is smaller for pruned models than for the original one. This indicates better generalisation, which the authors speculate is a result of the pruned model being less complex and thus less prone to overfitting.
Scaled-down VGG — A Simple Convolutional Model
The phenomenon observed for LeNet emerges for the convolutional, scaled-down VGG network too. Here, however, the difference is even more pronounced! Training time is reduced by a factor of three (33% of original training time, see left graph in the figure below) while at the same time achieving a three percentage point improvement (right graph in figure below). Furthermore, all three VGG size-variations (2, 4 and 6 convolutional layers) are able to outperform their original models when pruned to 2% in size (also the right graph in the figure below)!
Randomly reinitialising these models shows a smaller decrease in performance than was observed for LeNet. This leads the authors to conclude that both initialisation and network architecture play important roles in finding winning tickets in this case.
Dropout is a common training aid, described by its inventors as a way to train an ensemble of all subnetworks. It’s therefore interesting to find out how it interacts with the pruning techniques for finding winning tickets. Experiments show that winning tickets found in networks trained with dropout perform even better than those found without it. The authors speculate that dropout favours sparser networks, which primes them for pruning.
Larger Models — VGG-19 and ResNet-18
Motivated by studying models actually used for real problems, the authors repeat their previous analysis for VGG-19 (20M parameters) and ResNet-18 (271k parameters).
Winning tickets can still be found for both these networks. However, the process is less stable and is especially sensitive to the learning rate. The authors suggest learning rate warmup as a possible remedy; with it, the pruned networks match or slightly outperform the originals when pruned to 11.8% and 1.5% of their size for ResNet and VGG respectively. In a follow-up paper, Frankle and Carbin find that resetting the pruned network to its state a few steps into training, rather than to its initialisation, might be an even better solution.
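For readers unfamiliar with learning rate warmup, the idea is to start training with a very small learning rate and increase it gradually to its target value. Below is a minimal sketch of a linear schedule. The function name, the base rate of 0.1 and the 10,000 warmup steps are placeholder assumptions, not the settings used in the paper.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate from 0 to base_lr, then hold it."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps

# Learning rate at a few points during training.
schedule = [warmup_lr(s, base_lr=0.1, warmup_steps=10_000)
            for s in (0, 2_500, 5_000, 10_000, 20_000)]
```

The rate starts at zero, reaches half of `base_lr` halfway through the warmup period and stays at `base_lr` afterwards.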
Properties of a Winning Ticket
Since the parameter initialisation of the pruned networks is key to enabling them to achieve better performance, a fair question to ask is whether these parameters are “already trained”. The authors answer this question through a quantitative analysis and find that this is not necessarily the case. Rather, the values of the uncovered parameters often differ widely between initialisation and the end of training. The authors therefore speculate that the initialisation found is tailored to the particular optimisation algorithm, model architecture and dataset.
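One simple way to check whether weights are “already trained” is to measure how far they move during training, separately for surviving and pruned weights. The sketch below shows the idea only. The function name is hypothetical and the four weight values are made up for illustration, so they are not results from the paper.

```python
import numpy as np

def mean_movement(theta_0, theta_final, mask):
    """Mean absolute change of kept and of pruned weights during training."""
    moved = np.abs(theta_final - theta_0)
    return moved[mask].mean(), moved[~mask].mean()

# Made-up numbers, purely to show how the function is used.
theta_0 = np.array([0.1, -0.2, 0.05, 0.3])
theta_final = np.array([0.9, -0.25, 0.04, -0.8])
mask = np.array([True, False, False, True])  # True = weight survives pruning

kept, pruned = mean_movement(theta_0, theta_final, mask)
```

If the surviving weights had started out close to their final values, `kept` would be small. A large value instead suggests they still had to be learned.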
The Lottery Ticket Hypothesis postulates that there exists a subnetwork within a dense network which is able to match or outperform its “parent” in training time, performance and generalisability. These subnetworks won the initialisation lottery, and their initialisation is shown to be especially important for their success. They can be uncovered through pruning: after training, the parameters of smallest magnitude are removed and the survivors are reset to their original initial values.
As model size grows, so does the difficulty of finding these winning tickets. The authors suggest a reduced learning rate or learning rate warmup as possible remedies, which in some cases enable models of roughly 1.5% the size of the original model to achieve higher test performance.
Creating smaller and more efficient models will be of increasing importance in the future, especially in the field of NLP. Here, model size is heavily correlated with performance, which leads many researchers to innovate by building ever larger models. While the results from this line of work are impressive (just look at what GPT-3 has enabled), their practicality for many use cases can be questioned due to the hefty computational requirements.