Pruning in Brains and Machines
Pruning is central to the development of neural networks in the brain and in artificial neural networks
By Max Langenkamp, Alexander Fichtl, and Jacob Phillips
At one point, you and I were just a fertilized egg cell inside our mother’s womb. Over the next eight months, we grew a brain. When our mother gave birth to us, we were wet and confused.
Eventually, we learned to read Harry Potter and the Chamber of Secrets, bike without hands on our handlebars, and peel a boiled egg. However, the connections in our brain were never more dense than when we were that tiny bundle of flesh, cradled in our mother’s arms. Despite all the skills we acquired, we never had more neural connections than the freshly born baby.
The big question in neural pruning is simple: how can we have more from less? How is it that fewer connections are associated with better skills?
In deep learning, we see that artificial neural networks (ANNs) behave similarly. By removing connections, not only can we speed up the time it takes to make a prediction, but we can sometimes also improve the accuracy of the network.
For this essay, we’ll use this simple question ‘how can we have more from less?’ as a lens to examine pruning in human brains and deep neural networks. We’ll begin with an overview of pruning in the brain, look at the how and what of pruning in deep learning before tying it together in the final section with an examination of why pruning might work.
I. What we know about pruning in the brain
Let’s break up what we know about pruning in the brain into three parts. First, we’ll start with the conditions associated with pruning. Then we’ll talk about the mechanisms behind pruning, before finishing with some open questions about pruning.
I.I Pruning is associated with childhood brain development and mental disease
Pruning is most rapid during the earliest stages of human growth. In other words, we start with many connections and gradually lose them over the course of our life. The figure you see above shows neuron density versus human age. We begin with an astonishingly high density of neurons (and synapses) that rapidly decays between birth and the first year of age.
This is a striking fact: we know that higher synaptic density is correlated with greater intelligence (Dicke & Roth, 2016), and adults are clearly better at complex problem-solving tasks than children. Yet, somehow, the newborn has a neuronal density more than five times that of the adult. We’ll consider possible explanations towards the end of this essay.
Besides the significant pruning that occurs in early childhood, we also know that there is a complex relationship between pruning and mental disease. In one of the first studies of synaptic pruning, Feinberg (1982) hypothesized that schizophrenia was caused by abnormal pruning — either too much or too little — during adolescence. We have subsequently learned that schizophrenic individuals have fewer synapses than typical individuals. Other research has suggested that abnormal pruning is also linked to autism. However, there isn’t a consensus on whether excessive, insufficient, or region-specific pruning is linked to autism.
Before moving to the next segment, we’d like to emphasize one point: pruning is not restricted to humans. Multiple studies have examined pruning in houseflies, zebrafish, and mice. In other words, pruning appears to be a deep evolutionary strategy for brain development. The patterns associated with pruning may differ across species, but systematic synapse removal seems to be present across the animal kingdom.
Now, let’s explore how pruning works.
I.II Two levels of mechanisms behind pruning
We can think of the mechanisms at two levels of precision: at the cell level and the organism level.
At the cell level, one of the major findings about the mechanism behind pruning is its close association with the brain’s immune system. As a reminder, there are three main types of glial (i.e., non-neuron) cells in the human brain:
- Microglia eat debris within the brain (e.g. dead neurons), as phagocytes do in the rest of the immune system, and so play a key part in the brain’s immune system
- Oligodendrocytes make the conductive tubing (myelin sheath), which allows neurons to fire faster
- Astrocytes regulate the chemical environment around the neurons
A series of studies in the last decade has provided strong evidence that microglia are heavily involved in synaptic pruning. Although the precise details remain unclear, two popular proposed mechanisms are neuron-intrinsic pruning (where the neuron itself signals that it should be pruned) and microglia-dependent pruning (where the microglia produce tagging proteins).
At the organism level, we know that neural activity plays a huge role in determining which synapses get pruned. In an informative (if cruel) experiment (Hubel & Wiesel, 1962), two scientists stitched shut one eye of an anesthetized kitten and undid the stitches only once it had reached adulthood. They found that the neurons and synapses in the part of the cat’s visual cortex corresponding to the shut eye had vastly diminished. In contrast, there was an abnormally high concentration of synapses in the part of the visual cortex corresponding to the eye that remained open.
We’ll finish this section by considering two open questions that we think are particularly interesting:
- How can we move beyond ‘over pruning’ and ‘under pruning’ in the brain to instead describe the structures behind pruning?
- How much of a role does pruning play in saving energy as opposed to allowing the organism to learn and adapt to its environment?
II. Pruning in Deep Learning
Pruning plays a vital role in our brains, even if we do not fully understand the why and how. Computer scientists appear to have been inspired by neuroscience to apply pruning to artificial neural networks. Especially in the last decade, there has been a surge of publications about the strategic removal of connections in ANNs (Hoefler et al. 2021). We prune ANNs largely because of efficiency; a pruned network needs less memory to be stored, less time to compute, and less power to operate. However, as we will see later, pruning networks can make them generalize better too.
While we can make this clear distinction between efficiency and generalization benefits in deep learning, and directly assess the effects of different methods, things are a bit more involved on the biological side. It is easy to see that, for example, the pruning of motor neurons (Meirovitch et al., 2021) can improve energy efficiency in the body, but how pruning affects generalization or learning in the brain still requires more research.
II.II What can we prune in deep learning?
If we accept the analogy between the artificial neural network and our brain, the most obvious components to remove are the edge weights (‘synapses’) in an ANN. However, inputs and neurons as a whole are possible pruning candidates as well, and even gradients can be pruned using more advanced methods.
In general, research has shown (Denil et al., 2013) that up to 95% of the parameters in trained models can be removed without a significant drop in accuracy. We call the removal procedures ‘sparsification schemes’ and the model that remains the ‘sparsified model’.
The simplest and often most effective pruning techniques remove elements according to how much they contribute to the inference of the network. This removal can happen in either a data-free or a data-driven way.
The most common data-free approach is called ‘magnitude pruning’. Here we prune weights that are (close to) zero. In contrast, data-driven methods usually remove neurons or inputs based on the change of activation or feature values. If the neuron activation does not change for varying inputs, the neuron and its weights can be pruned.
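Both ideas can be sketched in a few lines. The snippet below is a minimal illustration in plain Python lists, not a reference implementation: the function names `magnitude_prune` and `constant_neurons`, the `sparsity` and `tol` parameters, and the example numbers are all our own assumptions.

```python
from statistics import pvariance

def magnitude_prune(weights, sparsity):
    """Data-free: zero out the `sparsity` fraction of weights with the smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

def constant_neurons(activations, tol=1e-6):
    """Data-driven: indices of neurons whose activation barely varies across inputs.

    activations[k][j] is the activation of neuron j on input k.
    """
    n_neurons = len(activations[0])
    return [j for j in range(n_neurons)
            if pvariance([row[j] for row in activations]) < tol]

weights = [0.8, -0.01, 0.03, -1.2, 0.002, 0.5]
print(magnitude_prune(weights, sparsity=0.5))  # [0.8, 0.0, 0.0, -1.2, 0.0, 0.5]

# Three inputs, three neurons: neurons 1 and 2 never change, so they are candidates.
activations = [[0.1, 0.5, 0.0],
               [0.9, 0.5, 0.0],
               [0.4, 0.5, 0.0]]
print(constant_neurons(activations))  # [1, 2]
```

Note that the data-free variant needs only the weights, while the data-driven variant needs a batch of recorded activations.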
There is also an important distinction to be made between model sparsity, where structural elements, like weights and neurons, are pruned, and ephemeral sparsity, where the flow of information through the network is sparsified (also known as activation sparsity). The flow is regulated by techniques such as dropout or with sparse activation functions. For example, ReLU has been shown to zero out up to 90% of activations (Rhu et al., 2018).
We can make the same distinction between model sparsity and ephemeral sparsity in vivo. The latter translates to the sparsity of action potentials: At any given time, only about 10% of neurons are active in the brain (Kerr et al., 2005).
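To make ephemeral sparsity concrete, here is a small sketch of a dense layer followed by ReLU. The weights and inputs are made-up numbers and `relu_layer` and `zeroed` are hypothetical helper names. No weight is zero, so there is no model sparsity, yet half of the activations are zero, and which neurons are silent depends on the input.

```python
def relu_layer(weights, x):
    """One dense layer followed by ReLU; weights[j] is the weight row of neuron j."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]

def zeroed(activations):
    """Indices of neurons that are silent for this particular input."""
    return {j for j, a in enumerate(activations) if a == 0.0}

# Every weight is non-zero: the model itself is fully dense.
weights = [[1.0, -1.0], [-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]

a1 = relu_layer(weights, [2.0, 1.0])
a2 = relu_layer(weights, [1.0, 2.0])

print(a1, zeroed(a1))  # [1.0, 0.0, 3.0, 0.0] {1, 3}
print(a2, zeroed(a2))  # [0.0, 1.0, 3.0, 0.0] {0, 3}
```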
II.III Structured vs. unstructured pruning
The removal of individual weights (based on magnitude) is considered unstructured pruning, which “requires storing the offsets of non-zero elements and handling the structure explicitly during processing” (Hoefler et al., 2021). In structured pruning, we strategically remove whole blocks of weights, or entire neurons, filters, or channels, which allows for more efficient index encoding techniques. For example, if we always prune weights in rows of 5 or 2x2 squares, we only have to store a single index representing a pruned segment.
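The bookkeeping difference can be illustrated with a toy 4x4 weight matrix. This is a sketch under our own assumptions: the helper names, the 2x2 block size, and the choice to record surviving blocks (rather than pruned ones) are illustrative, and the matrix dimensions are assumed to be divisible by the block size.

```python
def unstructured_indices(matrix):
    """(row, col) offset of every surviving non-zero weight."""
    return [(r, c) for r, row in enumerate(matrix)
            for c, w in enumerate(row) if w != 0.0]

def block_indices(matrix, bs=2):
    """Top-left corner of every bs x bs block that still holds a non-zero weight."""
    kept = []
    for r in range(0, len(matrix), bs):
        for c in range(0, len(matrix[0]), bs):
            block = [matrix[i][j] for i in range(r, r + bs) for j in range(c, c + bs)]
            if any(w != 0.0 for w in block):
                kept.append((r, c))
    return kept

# Weights pruned in whole 2x2 squares: two blocks survive, two are gone.
matrix = [
    [0.5, -0.2, 0.0, 0.0],
    [0.1,  0.9, 0.0, 0.0],
    [0.0,  0.0, 0.3, -0.7],
    [0.0,  0.0, 0.4, 0.6],
]

print(len(unstructured_indices(matrix)))  # 8 offsets, one per non-zero weight
print(block_indices(matrix))              # [(0, 0), (2, 2)], one index per block
```

With the same eight surviving weights, the block-structured encoding needs two indices where the unstructured one needs eight.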
Structured sparsity can also be designed from the start. The best examples for this are convolutional neural networks, where we re-use weights as shown in figure 5.
The possibility of pruning after training or designing neural networks sparsely from the beginning leads us to a new question in the next chapter.
II.IV When should we prune?
There are three points in the lifetime of a neural net where we might prune: before training, during training, and after training.
Pruning a network before training (‘fully sparse training’) means designing the neural network sparsely from the start. This approach makes training faster and makes it feasible even on small devices. The models can still be pruned further, but in this case ‘regrowth’ (the process of re-adding certain pruned elements) is necessary to ensure that the sparsified models stay roughly the same size (Hoefler et al., 2021).
We can also prune during training, which is commonly done iteratively. According to Hoefler et al. (2021), “sparsifying during training already reaps potential performance benefits of sparsity early on but [can] lead to less efficient convergence”. Regrowth of elements can also be implemented in this scheme.
Lastly, we can prune a network after it has been fully trained. The standard approach is a two-step process, where a trained model is pruned first, and the sparse model is then re-trained or ‘fine-tuned’. The best results are usually seen when this process is repeated multiple times.
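The repeated prune-then-fine-tune loop can be sketched on a toy problem. Everything here is an assumption made for illustration: a four-weight linear model, noise-free synthetic data whose true weights are `[2, 0, -3, 0]`, plain gradient descent, and the helper names `train` and `prune_smallest`. Real networks would use a deep learning framework and prune many weights per round.

```python
import random

def train(w, mask, data, lr=0.1, steps=500):
    """Gradient descent on mean squared error; masked (pruned) weights stay at zero."""
    w = [wi * m for wi, m in zip(w, mask)]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grad, mask)]
    return w

def prune_smallest(w, mask):
    """Prune the surviving weight with the smallest magnitude."""
    alive = [i for i, m in enumerate(mask) if m]
    victim = min(alive, key=lambda i: abs(w[i]))
    return [0 if i == victim else m for i, m in enumerate(mask)]

random.seed(0)
true_w = [2.0, 0.0, -3.0, 0.0]  # only two weights actually matter
data = []
for _ in range(50):
    x = [random.uniform(-1, 1) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

mask = [1, 1, 1, 1]
w = train([0.0] * 4, mask, data)   # step 0: train the dense model
for _ in range(2):                 # then repeat: prune one weight, fine-tune
    mask = prune_smallest(w, mask)
    w = train(w, mask, data)

print(mask)  # [1, 0, 1, 0]
```

Because fine-tuning starts from the already trained weights, each round only has to repair the damage done by the most recent pruning step.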
So, which of the three schemes translates best to pruning in the brain?
Since neurons and synapses in the brain are pruned and regrown throughout our lives, the most biologically plausible scheme is the “sparsify during training” approach with regrowth of elements. Specifically, regrowth by largest gradient has been argued to correspond to adding new synapses between highly active and correlated neurons in the brain (Dai et al., 2017), which relates it to Hebbian learning.
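A rough sketch of gradient-based regrowth follows. It assumes a single linear layer, for which the gradient of the loss with respect to a connection is the product of the presynaptic activation and the error signal at the postsynaptic neuron. The function name `regrow_by_gradient`, the parameter `k`, and all numbers are hypothetical.

```python
def regrow_by_gradient(mask, grads, k=1):
    """Re-enable the k pruned connections whose loss gradient is largest in magnitude."""
    pruned = [i for i, m in enumerate(mask) if m == 0]
    revive = set(sorted(pruned, key=lambda i: abs(grads[i]), reverse=True)[:k])
    return [1 if i in revive else m for i, m in enumerate(mask)]

pre = [0.9, 0.1, 0.0]   # activations of three presynaptic neurons
delta = [0.5, -0.05]    # error signals at two postsynaptic neurons

# Flattened gradient for connection (i, j): pre[i] * delta[j].
grads = [p * d for p in pre for d in delta]

mask = [0, 1, 0, 1, 1, 0]  # connections 0, 2 and 5 are currently pruned
new_mask = regrow_by_gradient(mask, grads, k=1)
print(new_mask)  # [1, 1, 0, 1, 1, 0]
```

The connection that comes back is the one between the most active input and the output with the largest error signal, which is where the "fire together, wire together" intuition enters.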
II.V Can pruning ANNs improve accuracy?
We have observed several instances where pruning improves prediction accuracy (Sun et al., 2016; Hoefler et al., 2021). Figure 7 shows a sketch for a common accuracy and performance curve given different sparsity levels. When beginning to prune the initially dense model, it slowly starts to generalize better (green curve). Within section B, we reach a “sweet spot” where the accuracy peaks, before dropping in section C where (almost) all parameters have been pruned.
Possible explanations could be given by the regularization effect of sparsity and the lottery ticket hypothesis, which we will explore later.
In terms of computational efficiency, the model’s performance steadily increases with growing sparsity (red curve). Storage and control overheads dampen the speed-up at the start, and performance levels off when getting close to 100% sparsity, following Amdahl’s Law. This effect occurs because not all layers of the network are compute-bound; some are memory-bound, so reducing the amount of computation does not improve performance indefinitely.
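The leveling-off can be made concrete with Amdahl’s Law. The sketch below assumes, purely for illustration, that 80% of the runtime is compute-bound and that pruning speeds up that part by an ideal factor of 1 / (1 - sparsity). Neither number comes from a measurement.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

compute_fraction = 0.8  # assumed share of runtime that benefits from sparsity

for sparsity in (0.5, 0.9, 0.99, 0.999):
    ideal = 1.0 / (1.0 - sparsity)  # assumed ideal speedup of the compute-bound part
    print(sparsity, round(amdahl_speedup(compute_fraction, ideal), 2))
```

However sparse the model gets, the overall speedup under these assumptions stays below 1 / (1 - 0.8) = 5, because the memory-bound 20% is never accelerated.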
II.VI Pruning makes optimization harder
Large neural networks work well in part because they are typically over-parameterized functions. Given the sizable multidimensional space the weight matrices provide, algorithms like stochastic gradient descent can usually find good minima that give accurate predictions for our desired task.
When we prune our models, however, the space becomes much smaller. It may not be possible to “route around hills in the loss landscape” in the pruned subspace (Hoefler et al., 2021). To illustrate this point, let’s look at a figure.
Above, the figure displays a model with two weights on the left and a pruned model with one weight on the right. s1 and s2 are two random initializations for each model. In the left plot, the optimizer can take advantage of both dimensions to route around local minima in black towards the optimal minimum in the yellow center. Meanwhile, in the right plot, the optimizer cannot find a direct route to the optimal minimum due to losing a degree of freedom in the loss landscape.
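The same effect can be reproduced numerically with a hypothetical loss of our own choosing (not the one in the figure): f(w1, w2) = 4 (r - 1)^2 - 0.5 w1, with r the distance from the origin. It has a hill at the origin surrounded by a ring-shaped valley that slopes gently towards positive w1. The starting points, learning rate, and step count are arbitrary.

```python
import math

def grad(w1, w2):
    """Gradient of f(w1, w2) = 4 * (r - 1)**2 - 0.5 * w1, with r = sqrt(w1**2 + w2**2)."""
    r = math.hypot(w1, w2)
    radial = 8.0 * (r - 1.0) / r
    return radial * w1 - 0.5, radial * w2

def descend(w1, w2, prune_w2, lr=0.05, steps=2000):
    """Plain gradient descent; if w2 is pruned, it is frozen at its starting value."""
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        w1 -= lr * g1
        if not prune_w2:
            w2 -= lr * g2
    return w1, w2

dense = descend(-1.5, 0.3, prune_w2=False)   # two weights: can walk around the hill
pruned = descend(-1.5, 0.0, prune_w2=True)   # one weight: confined to the line w2 = 0

print(dense)   # ends near w1 = 1.0625, the better minimum
print(pruned)  # ends near w1 = -0.9375, a worse local minimum behind the hill
```

With both weights available, the optimizer slides along the valley and around the hill. With w2 pruned, the only path to the better minimum leads straight over the hill, so gradient descent stays on the wrong side.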
III. Why Pruning Works
III.I The Lottery Ticket Hypothesis
So we know a lot about pruning in practice, but what about why it works? Apart from the general regularization effect introduced by sparsity, one particularly promising idea is the Lottery Ticket Hypothesis (Frankle & Carbin, 2019).
Let’s say we start with a large model like ResNet. If we prune it to 50%, we will have a subnetwork half the original size. Frankle & Carbin found that, by keeping track of the original weight initializations and resetting the surviving weights to those values, these subnetworks could be trained in isolation to reach accuracy comparable to that of the original network.
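The mechanics of this procedure (train, prune, rewind to the saved initialization, retrain) can be sketched on a toy linear model. This is only an illustration under our own assumptions: the data, the `train` and `loss` helpers, and the one-shot 50% pruning are made up, and because this toy problem is convex, any initialization would train well. It therefore shows the steps, not the empirical finding.

```python
import random

def loss(w, data):
    """Mean squared error of the linear model w on the data."""
    return sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data) / len(data)

def train(w, mask, data, lr=0.1, steps=500):
    """Masked gradient descent on mean squared error, starting from w * mask."""
    w = [wi * m for wi, m in zip(w, mask)]
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += 2 * err * xi / len(data)
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grad, mask)]
    return w

random.seed(1)
true_w = [2.0, 0.0, -3.0, 0.0]
data = []
for _ in range(50):
    x = [random.uniform(-1, 1) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

init_w = [random.gauss(0, 0.1) for _ in true_w]   # saved initialization (the "ticket")
dense_w = train(init_w, [1, 1, 1, 1], data)       # 1. train the dense model

# 2. prune 50% of the weights by magnitude of the trained values
keep = sorted(range(4), key=lambda i: abs(dense_w[i]), reverse=True)[:2]
mask = [1 if i in keep else 0 for i in range(4)]

# 3. rewind the survivors to their ORIGINAL initial values and retrain in isolation
ticket_w = train(init_w, mask, data)

print(mask)  # [1, 0, 1, 0]
print(loss(dense_w, data), loss(ticket_w, data))
```

The key difference from ordinary fine-tuning is step 3: the subnetwork restarts from `init_w`, not from the trained weights.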
‘Tickets’ refer to the weight initialization, and ‘lottery tickets’ are the initializations of the winning subnetworks. The fascinating idea here is that, for a given task, every large model has a much smaller submodel within it that does most, if not all, the work. This has been observed for natural language tasks and across different datasets — there appear to be lottery tickets for broad vision tasks and others for broad language tasks. Here we have a natural question: ‘Is there a universal ticket that performs well across datasets, optimizers, and domains?’
Beyond a ticket for datasets, we might also ask ‘what about tickets for human brains?’ Are there collections of synaptic circuits in our brains that are robust for a series of different tasks? This is an intriguing idea, but it’s helpful to point out some disanalogies between artificial neural networks and human brains. For one, ANNs are optimized for a very specific task — whether it be license plate identification or English speech recognition. Our brains are highly general-purpose and adapt well to radically different environments. This means that defining a ‘winning ticket’ is much less straightforward; what’s the objective function for life in a modern western society? Another disanalogy is that, unlike ANNs, our brains are not initialized with random weights. We have differentiated brain regions present at birth. A theory of ‘winning tickets’ for the brain thus also needs to account for that.
In this essay, we’ve:
- Explained how pruning relates to brain development and mental disease and explored how activity and the brain’s immune system can explain its mechanism
- Described how, when, and what we prune in deep learning models
- Discussed the lottery ticket hypothesis and how it may explain the efficacy of pruning
We hope that this essay has given you a taste of how a relatively simple procedure is deeply linked with the development of intelligent behavior.
By Max Langenkamp, Alexander Fichtl, and Jacob Phillips. This essay was written as a final paper for the MIT class 6.881: Tissue vs Silicon in Machine Learning.
Bahrini, I., Song, J., Diez, D., & Hanayama, R. (2015). Neuronal exosomes facilitate synaptic pruning by up-regulating complement factors in microglia. Scientific Reports, 5, 7989. https://doi.org/10.1038/srep07989
Dai, X., Yin, H., & Jha, N. K. (2017, November 6). NeST: A Neural Network Synthesis Tool Based on a Grow-and-Prune Paradigm. https://arxiv.org/pdf/1711.02017
Denil, M., Shakibi, B., Dinh, L., Ranzato, M., & Freitas, N. de. (2013, June 3). Predicting Parameters in Deep Learning. https://arxiv.org/pdf/1306.0543
Dicke, U., & Roth, G. (2016). Neuronal factors determining high intelligence. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 371(1685), 20150180. https://doi.org/10.1098/rstb.2015.0180
Feinberg, I. (1982). Schizophrenia: Caused by a fault in programmed synaptic elimination during adolescence? Journal of Psychiatric Research, 17(4), 319–334. https://doi.org/10.1016/0022-3956(82)90038-3
Frankle, J., & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. ICLR. https://arxiv.org/pdf/1803.03635
Gomez, A. N., Zhang, I., Kamalakara, S. R., Madaan, D., Swersky, K., Gal, Y., & Hinton, G. E. (2019, May 31). Learning Sparse Networks Using Targeted Dropout. https://arxiv.org/pdf/1905.13678
Han, S., Mao, H., & Dally, W. J. (2015, October 1). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. https://arxiv.org/pdf/1510.00149
Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., & Peste, A. (2021, January 31). Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. https://arxiv.org/pdf/2102.00554
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160, 106–154. https://doi.org/10.1113/jphysiol.1962.sp006837
Kerr, J. N. D., Greenberg, D., & Helmchen, F. (2005). Imaging input and output of neocortical networks in vivo. Proceedings of the National Academy of Sciences of the United States of America, 102(39), 14063–14068. https://doi.org/10.1073/pnas.0506029102
Louizos, C., Ullrich, K., & Welling, M. (2017, May 24). Bayesian Compression for Deep Learning. https://arxiv.org/pdf/1705.08665
Meirovitch, Y., Kang, K., Draft, R. W., Pavarino, E. C., Henao Echeverri, M. F., Yang, F., Turney, S. G., Berger, D. R., Peleg, A., Schalek, R. L., Lu, J., Livet, J., Tapia, J.‑C., & Lichtman, J. W. (2021). Neuromuscular connectomes across development reveal synaptic ordering rules. https://doi.org/10.1101/2021.09.20.460480
Morcos, A. S., Yu, H., Paganini, M., & Tian, Y. (2019, June 6). One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. https://arxiv.org/pdf/1906.02773
Paolicelli, R. C., Bolasco, G., Pagani, F., Maggi, L., Scianni, M., Panzanelli, P., Giustetto, M., Ferreira, T. A., Guiducci, E., Dumas, L., Ragozzino, D., & Gross, C. T. (2011). Synaptic pruning by microglia is necessary for normal brain development. Science (New York, N.Y.), 333(6048), 1456–1458. https://doi.org/10.1126/science.1202529
Huttenlocher, P. R. (1979). Synaptic density in human frontal cortex - Developmental changes and effects of aging. Brain Research, 163(2), 195–205. https://doi.org/10.1016/0006-8993(79)90349-4
Renda, A., Frankle, J., & Carbin, M. (2020). Comparing Rewinding and Fine-tuning in Neural Network Pruning. ICLR. https://arxiv.org/pdf/2003.02389
Rhu, M., O’Connor, M., Chatterjee, N., Pool, J., Kwon, Y., & Keckler, S. W. (2018, February 24–28). Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) (pp. 78–91). IEEE. https://doi.org/10.1109/HPCA.2018.00017
Salas, I. H., Burgado, J., & Allen, N. J. (2020). Glia: Victims or villains of the aging brain? Neurobiology of Disease, 143, 105008. https://doi.org/10.1016/j.nbd.2020.105008
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013, December 20). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. https://arxiv.org/pdf/1312.6120
Stevens, B., Allen, N. J., Vazquez, L. E., Howell, G. R., Christopherson, K. S., Nouri, N., Micheva, K. D., Mehalow, A. K., Huberman, A. D., Stafford, B., Sher, A., Litke, A. M., Lambris, J. D., Smith, S. J., John, S. W. M., & Barres, B. A. (2007). The classical complement cascade mediates CNS synapse elimination. Cell, 131(6), 1164–1178. https://doi.org/10.1016/j.cell.2007.10.036
Su, J., Chen, Y., Cai, T., Wu, T., Gao, R., Wang, L., & Lee, J. D. (2020, September 22). Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. https://arxiv.org/pdf/2009.11094
Lange, R. T. (2020). The Lottery Ticket Hypothesis: A Survey. https://roberttlange.github.io/posts/2020/06/lottery-ticket-hypothesis/
Wikipedia (Ed.). (2021). Studies of the Fetus in the Womb. https://en.wikipedia.org/w/index.php?title=Studies_of_the_Fetus_in_the_Womb&oldid=1005536417
You, H., Li, C., Xu, P., Fu, Y., Wang, Y., Chen, X., Baraniuk, R. G., Wang, Z., & Lin, Y. (2019, September 26). Drawing early-bird tickets: Towards more efficient training of deep networks. https://arxiv.org/pdf/1909.11957