The Emergence of Modular Deep Learning
Deep Learning, compared to other Machine Learning methods, is remarkably modular. This modularity gives it capabilities that place it head and shoulders above any conventional Machine Learning approach. Recent research, however, points toward even greater modularity than we have seen so far. It is likely that, quite soon, monolithic Deep Learning systems will become a thing of the past.
Before I discuss what is coming in the future, let me first discuss the concept of modularity. It is a concept familiar to software engineering, but one not as commonly found in machine learning.
In computer science, we build complex systems from modules, each module built from simpler modules. This is what enables us to build our digital world from just NAND or NOR gates. Universal Boolean operators are necessary, but they are not sufficient to build complex systems. Complex computing systems require modularity so that we have a tractable way of managing complexity.
Here are six core design operators that any modular system must support:
- Splitting — Modules can be made independent.
- Substituting — Modules can be substituted and interchanged.
- Augmenting — New modules can be added to create new solutions.
- Inverting — The hierarchical dependencies between modules can be rearranged.
- Porting — Modules can be applied to different contexts.
- Excluding — Existing modules can be removed to build a usable solution.
These operators are general in nature and inherent in any modular design. They allow the modification of existing structures into new structures in well-defined ways. Applied to software, they can take the form of refactoring operators at the source-code level, language constructs at specification time, or component models at configuration time. These operators are complete in the sense that they can generate any structure in computer design.
The six-operator definition focuses on functional invariance in the presence of design transformations. Said more plainly: we can apply these operators without affecting the function of the whole.
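Functional invariance under the Substituting operator can be made concrete with a toy example. The module names below are hypothetical and chosen purely for illustration: two interchangeable modules honor the same interface, so swapping one for the other leaves the behavior of the whole unchanged.

```python
# Substituting: two modules with the same interface can be interchanged
# without changing the behavior of the system built on top of them.

def bubble_sort(xs):          # module A: simple but O(n^2)
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def builtin_sort(xs):         # module B: a drop-in substitute
    return sorted(xs)

def pipeline(sort_module, data):
    # the surrounding system depends only on the interface, not the module
    s = sort_module(data)
    return s[0], s[-1]

data = [5, 3, 8, 1]
# functional invariance: substituting module B for module A changes nothing
assert pipeline(bubble_sort, data) == pipeline(builtin_sort, data)
```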
In the context of deep learning, the modularity operators are enabled as follows:
- Splitting — Pre-trained autoencoders can be split and reused as layers in another network.
- Substituting — Through transfer learning (e.g. teacher-student distillation), student networks can serve as substitutes for teacher networks.
- Augmenting — New networks can be added later to improve accuracy, and networks can be jointly trained to improve generalization. Furthermore, the outputs of one neural network can be used as neural embeddings, i.e. representations that serve as inputs to other neural networks.
- Porting — A neural network can be “ported” to a different context by replacing its top layers. This works in cases where the domains are similar enough; more research in domain adaptation is needed to understand the boundaries of this method.
The two remaining modularity operators are not available in current monolithic DL systems.
- Inverting — Layers in the network cannot be rearranged without catastrophic consequences. The layers of a monolithic DL system are too tightly coupled to allow it.
- Excluding — There is no mechanism to “forget” or to exclude functionality from a monolithic DL system.
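The Splitting and Porting operators above can be sketched with a toy network. The “pretrained” weights here are random stand-ins; in practice they would come from training on a source task. The point is the structure: a frozen feature extractor is split out and reused, and the network is ported to a new task by swapping only the top layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Pretrained feature extractor (the part we split out and keep frozen).
W1 = rng.standard_normal((4, 8))

def features(x):
    return relu(x @ W1)

# Original top layer for task A, and a replacement head for task B.
W_task_a = rng.standard_normal((8, 3))   # 3-class source task
W_task_b = rng.standard_normal((8, 5))   # ported to a 5-class target task

x = rng.standard_normal((2, 4))
h = features(x)                 # shared, reused representation
out_a = h @ W_task_a            # shape (2, 3)
out_b = h @ W_task_b            # shape (2, 5): same features, new head
print(out_a.shape, out_b.shape)
```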
Nevertheless, despite these two shortcomings, DL systems hold an unmatched advantage over competing machine learning techniques.
One reason for the tight coupling of layers in a monolithic DL system can be traced back to Stochastic Gradient Descent (SGD). SGD operates in lock step during training: each update requires a full forward pass followed by a full backward pass, a highly synchronized mechanism that demands behavioral coordination across all layers.
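The coupling is visible in even a two-layer example. Via the chain rule, the gradient for layer 1 contains the weights of layer 2, so neither layer can update in isolation; the sketch below uses plain linear layers for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((1, 3))
y = np.array([[1.0]])
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((4, 1))

# forward pass (linear layers, squared-error loss)
h = x @ W1
y_hat = h @ W2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# backward pass: note that layer 1's gradient contains W2 explicitly
dy = y_hat - y
dW2 = h.T @ dy
dW1 = x.T @ (dy @ W2.T)   # <- depends on W2: the layers are coupled
```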
This monolithic construction is, however, being displaced by more modular designs. Here are two interesting developments in this direction.
DeepMind has researched a method called “Synthetic Gradients” that points the way toward more loosely coupled layers. The method essentially inserts a small proxy neural network between layers to approximate the error gradient, so that a layer can update without waiting for the full backward pass.
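A minimal sketch of the idea follows. A small linear proxy `M` learns to predict the gradient of the loss with respect to a layer's output from that output alone; the layer then updates using the predicted (synthetic) gradient instead of waiting for the true one. The architecture and training schedule here are deliberately simplified stand-ins, not DeepMind's exact setup.

```python
import numpy as np

rng = np.random.default_rng(2)
W1 = rng.standard_normal((3, 4)) * 0.1   # layer 1
W2 = rng.standard_normal((4, 1)) * 0.1   # layer 2
M = np.zeros((4, 4))                     # proxy: predicts dL/dh from h
lr = 0.05

for step in range(200):
    x = rng.standard_normal((1, 3))
    h = x @ W1
    y = x.sum(keepdims=True)             # toy regression target
    y_hat = h @ W2

    # layer 1 updates immediately using the *synthetic* gradient M(h)
    dh_syn = h @ M
    W1 -= lr * x.T @ dh_syn

    # later, the true gradient arrives and trains both W2 and the proxy
    dy = y_hat - y
    dh_true = dy @ W2.T
    W2 -= lr * h.T @ dy
    M -= lr * h.T @ (dh_syn - dh_true)   # regress M(h) toward dh_true
```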
The second development that can lead to greater modularity is the generative adversarial network (GAN). A typical GAN has two competing neural networks that are essentially decoupled but contribute to a global objective function. We are now seeing more complicated configurations emerge in research: for example, a ladder-like network of decoupled encoders, generators, and discriminators. The prevalent pattern is that every conventional function in a neural network is itself being replaced by a neural network. More specifically, the SGD algorithm and the objective function have themselves been replaced by neural networks. Gone are the analytic functions! This is what happens when you have Deep Meta-Learning.
Another very impressive recently published result, StackGAN, shows how effective multiple decoupled GANs can be:
The task here is to take a text description as input and generate an image corresponding to that description. Two GANs are staged one after the other: the second GAN refines the fuzzy image produced by the first into one of higher resolution. Modular networks can thus factorize capabilities that would otherwise be entangled in an end-to-end network.
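Structurally, the staging looks like the sketch below. The two generators are untrained placeholders (random weights, naive upsampling); only the decoupled, staged composition, with each stage conditioned on the text, is the point.

```python
import numpy as np

rng = np.random.default_rng(3)

def stage1_generator(text_embedding):
    # hypothetical stage I: text embedding -> coarse 8x8 "image"
    W = rng.standard_normal((text_embedding.size, 8 * 8)) * 0.1
    return (text_embedding @ W).reshape(8, 8)

def stage2_generator(coarse, text_embedding):
    # hypothetical stage II: refine the coarse output to 32x32
    up = np.kron(coarse, np.ones((4, 4)))         # naive 4x upsampling
    return up + 0.01 * text_embedding.mean()      # toy text conditioning

text = rng.standard_normal(16)                    # stand-in text embedding
coarse = stage1_generator(text)                   # (8, 8)  fuzzy image
fine = stage2_generator(coarse, text)             # (32, 32) refined image
print(coarse.shape, fine.shape)
```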
In software engineering we have the concept of an API: a restricted language that mediates communication between modules. In the scenarios above, neural networks that “learn to communicate” act as the bridging APIs between modules. More generally, we have networks that “learn how to interface”. From “Learning to Communicate with Deep Multi-Agent Reinforcement Learning”:
We consider the problem of multiple agents sensing and acting in environments with the goal of maximising their shared utility. In these environments, agents must learn communication protocols in order to share information that is needed to solve the tasks. By embracing deep neural networks, we are able to demonstrate end-to-end learning of protocols in complex environments inspired by communication riddles and multi-agent computer vision problems with partial observability.
Another recent paper, titled “Generative Adversarial Parallelism”, explores this further in relation to GANs. In this work, the authors attempt to address the difficulty of training GANs by extending the usual two-player generative adversarial game into a multi-player game. They train many GAN-like variants in parallel and, while doing so, periodically swap the discriminator and generator pairs. The motivation is to achieve better decoupling between the pairs. Much work remains to determine whether decoupled interfaces between networks lead to better generalization.
We argue that this (the swapping) reduces the tight coupling between generator and discriminator and show empirically that this has a beneficial effect on mode coverage, convergence, and quality of the model.
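The swap schedule itself is simple to sketch. The models below are string stand-ins and the training step is stubbed out (the paper's actual training is not reproduced here); only the periodic re-pairing of generators and discriminators is shown.

```python
import random

K = 4
generators = [f"G{i}" for i in range(K)]         # stand-ins for models
discriminators = [f"D{i}" for i in range(K)]
rng = random.Random(0)

pairs = list(zip(generators, discriminators))
for step in range(1, 101):
    for g, d in pairs:
        pass  # a real train_gan_step(g, d) would go here
    if step % 25 == 0:                           # periodic swap
        rng.shuffle(discriminators)              # re-match the pairs
        pairs = list(zip(generators, discriminators))

print(pairs)
```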
There is still a great deal of research to be done before we understand how to build APIs for DL modules. For now, however, we have the benefit of meta-learning techniques that can automatically learn the interface specification. This is a very interesting research topic: how do decoupled networks learn to interface?