DeepMind’s PathNet: A Modular Deep Learning Architecture for AGI
PathNet is a new Modular Deep Learning (DL) architecture, brought to you by who else but DeepMind, that highlights the latest trend in DL research: melding Modular Deep Learning, Meta-Learning and Reinforcement Learning into a solution that leads to more capable DL systems. The paper "PathNet: Evolution Channels Gradient Descent in Super Neural Networks" (Fernando et al.), submitted to arXiv on January 20th, 2017, has in its abstract the following interesting description of the work:
For artificial general intelligence (AGI) it would be efficient if multiple users trained the same giant neural network, permitting parameter reuse, without catastrophic forgetting. PathNet is a first step in this direction. It is a neural network algorithm that uses agents embedded in the neural network whose task is to discover which parts of the network to re-use for new tasks.
Unlike more traditional monolithic DL networks, PathNet reuses a network that consists of many neural networks and trains them to perform multiple tasks. In the authors' experiments, a network trained on a second task learns faster than a network trained from scratch. This indicates that transfer learning (or knowledge reuse) can be leveraged in this kind of network. PathNet includes aspects of transfer learning, continual learning and multitask learning, aspects that are essential for a more continuously adaptive network and thus an approach that may (speculatively) lead to AGI.
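The reuse mechanism can be sketched as follows: once the first task converges, the modules on the winning path are frozen so later tasks can read from them but not overwrite them, while every other module is re-initialized as fresh capacity. This is a minimal illustrative sketch; the dictionary-based module representation and `init_params` helper are hypothetical stand-ins, not the paper's implementation.

```python
import random

def init_params():
    # Hypothetical re-initialization of a module's weights.
    return [random.gauss(0.0, 0.1) for _ in range(4)]

def prepare_for_next_task(modules, winning_path):
    """After task A converges, freeze the modules on task A's winning path
    (so task B can reuse them without overwriting them) and re-initialize
    all remaining modules as free capacity for task B."""
    for l, layer in enumerate(modules):
        for m, module in enumerate(layer):
            if m in winning_path[l]:
                module["frozen"] = True       # reused; gradients blocked
            else:
                module["params"] = init_params()
                module["frozen"] = False
    return modules
```

Freezing rather than copying is what avoids catastrophic forgetting here: task A's knowledge stays intact in the shared network while task B trains.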
Let’s examine the architecture:
to get a better understanding of the techniques that were employed. A PathNet consists of layers of neural networks, where the interconnections between the networks in each layer are discovered through different search methods. In the figure above, the configurations are constrained to select 4 networks per layer at a time. The paper describes two discovery algorithms: one based on a genetic (evolutionary) algorithm and another based on A3C reinforcement learning.
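The evolutionary variant can be sketched roughly as a binary tournament over candidate paths. The layer/module counts, mutation rate and fitness function below are illustrative assumptions, not the paper's settings, and the gradient-descent training of a winning path's modules between tournaments is omitted.

```python
import random

L_LAYERS = 3    # layers in the super network
N_MODULES = 10  # candidate modules per layer
N_ACTIVE = 4    # modules a path may select per layer, as in the figure

def random_path():
    """A path genotype: up to N_ACTIVE module indices per layer."""
    return [random.sample(range(N_MODULES), N_ACTIVE) for _ in range(L_LAYERS)]

def mutate(path, rate=0.1):
    """Independently re-draw each selected module with small probability
    (duplicates within a layer are tolerated in this sketch)."""
    return [[random.randrange(N_MODULES) if random.random() < rate else m
             for m in layer] for layer in path]

def evolve(fitness, generations=100, population=64):
    """Binary tournament selection: the loser's genotype is overwritten
    by a mutated copy of the winner's."""
    paths = [random_path() for _ in range(population)]
    for _ in range(generations):
        a, b = random.sample(range(population), 2)
        w, l = (a, b) if fitness(paths[a]) >= fitness(paths[b]) else (b, a)
        paths[l] = mutate(paths[w])
    return max(paths, key=fitness)
```

In the paper, fitness is task performance after briefly training the path's modules; here any scoring function over a path genotype will do.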
A related architecture is the Outrageously Large Neural Network, described as follows:
achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers.
The key architectural component is depicted as follows:
This shows a conditional logic component that selects from a mixture of experts. We’ve discussed conditional logic in DL in a previous article, “Is Conditional Logic the New DL Hotness”. The basic idea is that if conditional components can be used, then much larger networks can be built that operate with different experts depending on context. A single context will therefore use only a small subset of the entire network.
Convolutional Neural Fabrics are an alternative approach to hyperparameter tuning:
That is, rather than running multiple different configurations and discovering what performs well, a fabric tries different paths through a much larger network, attempting to discover an optimal path while simultaneously reusing previously discovered optimal sub-paths. The figure above shows how networks are embedded within a fabric. One may also consider this approach a variation of a previously published meta-learning technique that searches over different network architectures.
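One way to see why a fabric is such an efficient search space: each node in the trellis can draw input from the previous layer at the same, coarser, or finer scale, so the number of distinct single-path architectures the fabric embeds grows rapidly with depth. A small counting sketch (the neighborhood rule here is a simplifying assumption about the fabric's connectivity):

```python
def count_paths(layers, scales):
    """Count distinct single paths through a trellis where each node
    (layer, scale) may receive input from the previous layer at scales
    s-1, s, or s+1. Every path is one embedded architecture."""
    paths = [1] * scales  # one trivial path into each scale at layer 0
    for _ in range(1, layers):
        paths = [sum(paths[max(0, s - 1):min(scales, s + 2)])
                 for s in range(scales)]
    return sum(paths)
```

Even this toy trellis shows the combinatorics: a fabric a few layers deep already embeds far more candidate architectures than one could ever train independently, which is exactly why sharing sub-paths pays off.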
In a previous article, “DL Predictions for 2017”, I pointed out four emerging trends: #3 Meta-Learning, #4 Reinforcement Learning, #5 Adversarial & Cooperative Learning and #7 Transfer Learning. It is fascinating how PathNet incorporates all four trends into a single larger framework. What we are seeing here is that fusing these different approaches may lead to novel and promising new architectures.
The PathNet architecture also provides a roadmap towards more adaptive DL architectures. Monolithic DL architectures are extremely rigid after training and remain fixed when deployed, unlike biological brains, which are continuously learning. PathNet allows new contexts to be learned over time, leveraging knowledge from training in other contexts to learn much faster.
These new architectures may also lead to new thinking on how Deep Learning hardware can be optimally architected. Large networks of sparsely interconnected neural networks imply an opportunity to architect more power-efficient silicon, by being able to power off many sub-networks at a time.