The Loose Coupling Principle and Modular Deep Learning
We seek universal guidelines for how best to design intelligent systems. Here we focus on Artificial Intuition (alternatively, intuition machines). Monica Anderson uses the term “Model-Free” as the universal guideline for building artificial intuition. Her argument runs roughly as follows: you cannot build intelligent systems out of intelligent components; doing so merely pushes the problem down to lower-level components. I agree with this: intelligent systems are composed of unintelligent, simple components, and intelligence arises from emergent behavior.
I would, however, like to explore the idea that components of an intelligent system should function in contexts of low information coupling (i.e., loose coupling). If we assume that brains are massively parallel systems consisting of diverse recognition components, then these components ought to be able to perform their jobs with the least possible amount of information. As a consequence, in the design of intelligent systems one should always favor mechanisms with low information coupling.
It turns out this principle appears to line up quite well with our understanding of how artificial intuition based systems work (note: Deep Learning is an artificial intuition machine). In a previous article, I introduced a collection of loose coupling mechanisms found in the world of distributed computing.
The different methods that lead to loose coupling fall into three general categories: mediation, decomposition, and late binding. These categories reduce signal dependency, computational dependency, and temporal dependency, respectively (recall that information dynamics consists of signaling, computation, and memory). Examining the methods listed in that article, each falls into at least one of these three categories. First, you can decompose a single component into multiple components, allowing each subcomponent to work on a different part of the problem. Second, you can place an intermediary component between two interacting components. Finally, there is the general method of late binding.
An example of decomposition is breaking the concept of a procedure into two components: the action and the continuation. This results in a computing model that naturally supports asynchronous invocation and thus loose coupling. In Deep Learning, an example of decomposition is the use of residual layers, which effectively decompose a single conventional layer into multiple incremental representation layers.
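The residual idea can be sketched in a few lines. This is a toy illustration, not a real network: the increment function `F` here is just a tanh of a linear map, standing in for the convolutional stack a real residual block would use.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W):
    """One residual layer: output = input + a small learned increment F(x).

    Here F is a toy tanh-of-linear map; in a real network it would be
    a stack of conv/ReLU layers."""
    return x + np.tanh(x @ W)

# Stacking blocks decomposes one big transformation into many
# incremental refinements of the same representation.
x = rng.normal(size=(4, 8))
weights = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(3)]
y = x
for W in weights:
    y = residual_block(y, W)

print(y.shape)  # (4, 8): same shape, incrementally refined
```

Because each block contributes only a small delta to a shared representation, later blocks are only loosely coupled to the exact behavior of earlier ones.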
Mediation, or adding an intermediary, is the most obvious way to ensure loose coupling. However, some forms of it are less obvious. For example, rather than having two events occur at the same time, you can allow them to occur at different times; an intermediary in the form of a correlation identifier preserves the original semantics.
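The correlation-identifier pattern from distributed computing can be sketched minimally as follows. The `Mediator` class and its methods are hypothetical names for illustration; the point is that requester and responder never hold a live connection to each other, only a shared id.

```python
import uuid

class Mediator:
    """Matches asynchronous replies to earlier requests by correlation id."""
    def __init__(self):
        self.pending = {}

    def send(self, payload):
        cid = str(uuid.uuid4())
        self.pending[cid] = payload
        return cid  # the requester keeps only the id, not a connection

    def reply(self, cid, result):
        # The reply may arrive much later; the id restores the pairing.
        request = self.pending.pop(cid)
        return request, result

broker = Mediator()
cid = broker.send({"ask": "classify image"})
# ... arbitrary time passes; the two parties never interact directly ...
request, result = broker.reply(cid, "cat")
print(request["ask"], "->", result)
```

The two events no longer need to coincide in time, yet the request/response semantics are preserved.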
Finally, late binding is the subtlest form of loose coupling: decoupling in time. For example, Prototype-Oriented programming is a form of late-bound classification. Delaying classification leads to more flexible systems. As we will see later, many forms of deferred commitment lead to more loosely coupled systems.
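A minimal sketch of prototype-style late-bound classification, using a hypothetical `Proto` class that delegates attribute lookup to a parent prototype at access time rather than fixing behavior at construction time:

```python
class Proto:
    """Minimal prototype object: unknown attributes are resolved,
    at access time, by delegating to a parent prototype."""
    def __init__(self, parent=None, **slots):
        self.__dict__["parent"] = parent
        self.__dict__.update(slots)

    def __getattr__(self, name):
        # Called only when normal lookup fails: the object's
        # "classification" is deferred until a slot is actually needed.
        if self.parent is not None:
            return getattr(self.parent, name)
        raise AttributeError(name)

animal = Proto(legs=4, sound="?")
dog = Proto(parent=animal, sound="woof")

print(dog.sound, dog.legs)  # woof 4

# The "class" of dog is never fixed; change the prototype and every
# delegating object picks up the change at the next lookup.
animal.legs = 3
print(dog.legs)  # 3
```

Nothing about `dog` is committed up front; its behavior is bound at the last possible moment, each time an attribute is read.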
Classical Deep Learning networks are highly coupled constructions. They are trained end-to-end, with the forward and back-propagation phases performed in a highly synchronized manner. The limitations of DL are also well documented. Compared to biological brains, DL systems are forgetful, non-adaptive, unable to learn continuously, unable to learn from a few training samples, and energy-hungry. These limitations should prompt researchers to explore variations on the current classical formulation. I am specifically interested in modular DL systems that can be composed into multi-agent-like networks. In these kinds of networks, loose coupling is an essential feature.
The question then is: can we use loose coupling principles to influence future Deep Learning designs? In this post I will show how the methods of decomposition, mediation, and late binding are used in the construction of some recently proposed Deep Learning networks.
Decoupled systems achieve greater generalization than monolithic DL systems. Evidence for this comes from the development of Generative Adversarial Networks (GANs), in which competing neural networks produce impressive demonstrations of realistic image generation. StackGAN has shown that two decoupled adversarial networks working in combination can achieve state-of-the-art photo-realistic image generation:
Notice that there are four DNNs in the architecture above. Observe how the system is trained in stages, where the first stage (in blue) receives training data with less information than the subsequent stage (in purple). This kind of decoupling, where the earlier stage is trained on partial information rather than the entire data, allows the system to scale up to photo-realistic image generation at much higher resolutions.
Maluuba have taken a decoupled approach to create a more scalable Deep RL system:
Decomposing a task into subtasks presents a number of advantages:
1. It can facilitate faster learning, because decomposition can split a complex task into a number of smaller tasks that are easier to solve. Deciding what coffee-shop to go to does not depend on whether there is currently a car approaching. However, deciding when to cross the road might depend on it. By splitting a task, each subtask can ignore information that is irrelevant to it.
2. It allows for the re-use of skills. The subtask of walking is a very general skill that can be re-used again and again.
3. It enables specialized learning methods for different types of subtasks. Controlling your muscles so that you walk in a straight line requires a different set of skills than deciding which coffee shop to go to. Having specialized methods for these subtasks can accelerate learning.
The interesting aspect of the Maluuba research is that the behavioral coupling of the participating subnetworks is controlled by a reward function. Depending on the context of the environment, the subnetworks either work in concert or work independently. Such a network learned its task three times faster than an equivalent tightly coupled network. A decoupled network has an attention-like mechanism, whereby the expert best trained for a specific kind of task can be called upon in different situations.
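The attention-like selection among experts can be sketched with a simple softmax gate. To be clear, this is not Maluuba's method; it is a generic mixture-of-experts sketch, with random linear maps standing in for trained subnetworks and an assumed gating matrix `gate_W`.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Three "expert" subnetworks, each just a random linear map here.
experts = [rng.normal(size=(8, 4)) for _ in range(3)]
gate_W = rng.normal(size=(8, 3))  # gating (attention) weights

def gated_forward(x):
    """Soft attention over experts: the input context decides how
    much each expert contributes to the output."""
    weights = softmax(x @ gate_W)                 # one weight per expert
    outputs = np.stack([x @ E for E in experts])  # shape (3, 4)
    return weights @ outputs, weights

x = rng.normal(size=8)
y, w = gated_forward(x)
print(y.shape, w.argmax())  # the gate favors one expert per input
```

Each expert only needs to handle the inputs routed to it, which is precisely the low-information-coupling property: irrelevant context never reaches the wrong specialist.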
The major difficulty in multi-agent systems like this is achieving convergence. This is particularly problematic if the sub-networks form a cyclic graph. The Maluuba research breaks cyclic dependencies by employing mediation (i.e., intermediaries called “trainer agents”):
Learning with trainer agents can occur in two ways. The easiest way is to pre-train agents with their respective trainer agents, then freeze their weights and train the rest of the agents. Alternatively, all agents could be learned in parallel, but the agents that are connected to a trainer agent use off-policy learning to learn values that correspond to the policy of the trainer agent, while the behaviour policy is generated by the regular agents.
Progressive Neural Networks are another example of using intermediaries to decouple a network. The motivation for the decoupling here, however, is to prevent catastrophic forgetting:
In the construction above, the mediators are neural networks that perform domain adaptation, enabling transfer learning by reusing previously learned features from the lower layers.
Late binding, sometimes referred to as lazy evaluation or, in process models, as deferred commitment, is a guideline that says a decision about an action can be deferred to the last possible moment. It is a kind of temporal decoupling in which a request for an action does not necessarily coincide with the action being executed immediately. In effect, it assumes asynchronous behavior on the part of the requester of the action.
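The core of deferred commitment is the thunk: a request for an action is recorded now, while execution is bound to an actual moment later. A minimal sketch (the `defer` helper is a hypothetical name for illustration):

```python
def defer(fn, *args):
    """Record a request for action without executing it (a thunk)."""
    def thunk():
        return fn(*args)
    return thunk

# The requester fires and forgets; nothing runs yet.
pending = [defer(sum, [1, 2, 3]), defer(max, [7, 2])]

# ... later, at the last responsible moment, the requests are bound
# to actual execution, possibly in a different order or context.
results = [t() for t in pending]
print(results)  # [6, 7]
```

The requester and the executor are decoupled in time: the request carries everything needed to act, so no synchronization between the two moments is required.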
An example of the use of late binding can be found in DeepMind's one-shot learning paper. The algorithm trains a network to discover an embedding in which samples of the same class are close and samples of different classes are distant. Matching Networks do this by defining a differentiable nearest-neighbor loss involving the cosine similarities of embeddings produced by a convolutional network:
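The core classification step can be sketched in numpy. This is a simplified illustration of the differentiable nearest-neighbor idea, not the paper's implementation: the convolutional embedding network is elided, and the support embeddings below are assumed, hand-picked values.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Support set: embeddings (assumed already produced by a conv net)
# paired with one-hot labels for two classes.
support = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),
           (np.array([0.0, 1.0]), np.array([0.0, 1.0]))]

def match(query):
    """Attention over the support set via cosine similarity:
    a differentiable nearest-neighbor classifier."""
    sims = np.array([cosine(query, x) for x, _ in support])
    attn = softmax(sims)
    return sum(a * y for a, (_, y) in zip(attn, support))

probs = match(np.array([0.9, 0.1]))
print(probs.argmax())  # 0: classified by its nearest support example
```

Note the late binding: the classes the network can predict are not baked into its weights; they are bound at inference time by whatever support set is supplied.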
One under-researched area of Deep Learning architectures is how inference is performed. In conventional networks, the assumption is that the network has already learned the classification, and therefore a single forward pass suffices for inference. This may be asking too much of a network, and some kind of late binding of contextual information may be needed for better generalization.
We find decomposition, mediation and late binding all working in concert in another of DeepMind's papers. The group proposes a “Synthetic Gradients” approach that decouples back-propagation (DeepMind has since updated this paper). The method essentially inserts a proxy neural network between layers to approximate the gradient:
This capability is valuable in complex systems of multiple networks acting in multiple environments at asynchronous and irregular timescales. Here is a high-level description of the method:
At a high level, this can be thought of as a communication protocol between two modules. One module sends a message (current activations), another one receives the message, and evaluates it using a model of utility (the synthetic gradient model). The model of utility allows the receiver to provide instant feedback (synthetic gradient) to the sender, rather than having to wait for the evaluation of the true utility of the message (via backpropagation).
The synthetic gradient model decomposes the otherwise synchronized layers of a monolithic Deep Learning network. It does this by introducing a mediator between the layers to perform back-propagation. This mediator (itself a neural network) immediately responds with an estimate; when the true error is received later, it updates its own estimate. One can thus view this as a late binding of the true back-propagation signal.
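The estimate-now, correct-later behavior can be sketched with a toy mediator. This is not DeepMind's architecture: the gradient model below is a bare linear map regressed toward a fixed true gradient, just to show the mechanism.

```python
import numpy as np

class SyntheticGradient:
    """Mediator between layers: answers instantly with an estimated
    gradient, then corrects its own model when the true one arrives."""
    def __init__(self, dim, lr=0.1):
        self.M = np.zeros((dim, dim))  # linear gradient model
        self.lr = lr

    def estimate(self, activations):
        # Instant feedback: the sending layer can update immediately,
        # without waiting for full back-propagation.
        return activations @ self.M

    def update(self, activations, true_grad):
        # Late binding: the true gradient arrives later and is used
        # to regress the estimator toward it.
        err = self.estimate(activations) - true_grad
        self.M -= self.lr * np.outer(activations, err)

sg = SyntheticGradient(dim=3)
h = np.array([1.0, -1.0, 0.5])       # activations sent "upward"
g_true = np.array([0.2, 0.0, -0.1])  # true gradient, arriving late
for _ in range(200):
    sg.update(h, g_true)
print(np.round(sg.estimate(h), 3))  # converges toward the true gradient
```

After enough corrections, the instant estimate becomes a good stand-in for the true signal, and the two layers no longer need to run in lockstep.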
In this post, we have begun to glimpse the value of understanding the known mechanisms for introducing loose coupling, or low information coupling, into DL architecture design. In fact, if we explore more advanced ‘model-free’ mechanisms, such as Discovery, Recognition, Learning, Abstraction, Adaptation, Evolution, Narrative, Consultation, Delegation and Markets, we find that the preferred implementation is one with low information coupling. It is only logical: the lower the information requirements for a mechanism to perform its job, the more likely that mechanism is to succeed.
Reference: “Optimization as a Model for Few-Shot Learning,” https://openreview.net/pdf?id=rJY0-Kcll: “We propose an LSTM based meta-learner model to learn the exact optimization algorithm used to train another learner neural network classifier in the few-shot regime.”