The Meta-Model and Meta Meta-Model of Deep Learning
The model for deep learning consists of a computational graph that is most conveniently constructed by composing layers with other layers. Most introductory texts emphasize the individual neuron, but in practice it is the collective behavior of a layer of neurons that matters. So from an abstraction perspective, the layer is the right level to think at.
Underneath these layers is the computational graph. Its main purpose is to orchestrate the computation of the forward and backward phases of the network. From the perspective of optimizing performance, this is an important abstraction to have. However, it is not the ideal level at which to reason about how it all should work.
Deep Learning frameworks have evolved higher-level modeling APIs that ease the construction of DL architectures. Theano has Blocks, Lasagne and Keras. TensorFlow has Keras and TF-Slim. Keras was originally inspired by the simplicity of Torch, so by default it has a high-level modular API. Many other less popular frameworks like Nervana, CNTK, MXNet and Chainer also have high-level model APIs. All these APIs, however, describe models. What then is a Deep Learning meta-model? Is there even a meta meta-model?
Let’s first explore what a meta-model looks like. A good example comes from the UML domain of Object Oriented Design. This is the UML meta-model:
This makes it clear that Layers, Objectives, Activations, Optimizers, Metrics in the Keras APIs are the meta-models for Deep Learning. That’s not too difficult a concept to understand.
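To make the two levels concrete, here is a minimal sketch in plain Python rather than in Keras itself. The class names mirror the Keras concepts listed above, but the fields and the `Model` container are assumptions invented for illustration, not the real Keras classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Meta-model level: the concepts a model may be built from.
# These are illustrative stand-ins, not the real Keras classes.
@dataclass
class Layer:
    kind: str          # e.g. "dense" or "convolution"
    units: int
    activation: str    # name of an Activation

@dataclass
class Objective:
    name: str

@dataclass
class Optimizer:
    name: str
    learning_rate: float

@dataclass
class Model:
    layers: List[Layer] = field(default_factory=list)
    objective: Optional[Objective] = None
    optimizer: Optional[Optimizer] = None

# Model level: one particular architecture, built as instances of the
# meta-model concepts defined above.
mlp = Model(
    layers=[Layer("dense", 64, "relu"), Layer("dense", 10, "softmax")],
    objective=Objective("categorical_crossentropy"),
    optimizer=Optimizer("sgd", 0.01),
)
```

The classes play the role of the meta-model, and `mlp` is one model expressed in that vocabulary.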
Conventionally, an Objective is a function and an Optimizer is an algorithm. However, what if we think of them as models too? In that case we have the following:
This is definitely getting a whole lot more complicated. The objective function has become a neural network and the optimizer has also become a neural network. The first reaction to this is: has this kind of architecture been tested before? It’s possible someone is already writing this paper. That’s because an objective function that is a neural network is equivalent to the Discriminator in a Generative Adversarial Network (GAN), and an Optimizer that is a neural network is precisely what a meta-learner is about. So this idea is not far outside mainstream research.
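The "optimizer as a model" half of that idea can be sketched in a few lines. This is a toy and not a neural-network meta-learner: the optimizer here has a single learnable parameter (its step size `theta`), and an outer loop trains that parameter with finite differences. The quadratic task, the function names and all constants are assumptions chosen only so the example runs.

```python
def task_loss(w):
    # Toy inner problem: minimize (w - 3)^2.
    return (w - 3.0) ** 2

def task_grad(w):
    # Analytic gradient of task_loss.
    return 2.0 * (w - 3.0)

def run_inner(theta, steps=5, w0=0.0):
    """Train w with the update rule w <- w - theta * grad and
    return the loss reached after a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - theta * task_grad(w)
    return task_loss(w)

def learn_optimizer(theta=0.05, meta_lr=0.001, meta_steps=50, eps=1e-4):
    """Outer loop: treat the optimizer's own parameter theta as something
    to be learned, using gradient descent on the final inner loss.
    The meta-gradient is estimated with central finite differences."""
    for _ in range(meta_steps):
        meta_grad = (run_inner(theta + eps) - run_inner(theta - eps)) / (2 * eps)
        theta -= meta_lr * meta_grad
    return theta

initial_theta = 0.05
learned_theta = learn_optimizer(initial_theta)
```

After the outer loop, `run_inner(learned_theta)` reaches a lower loss than `run_inner(initial_theta)`. A real meta-learner would replace the single scalar with a network that maps gradients to updates, but the two-level structure is the same.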
The second reaction to this is: shouldn’t we make everything a neural network and be done? There are still boxes in the diagram that are functions and algorithms. The Objective’s optimizer is one, and there are 3 others. Once you replace those too, there’s nothing left that a designer needs to define! There are no functions; everything is learned from scratch!
So a meta-model where everything is a neural network looks like this:
Here the model is broken apart into 3 parts just for clarity. Alternatively, it looks like this:
What this makes abundantly clear, however, is that the kinds of layers that are available come from a fixed set (i.e. fully connected, convolution, LSTM etc.). There are in fact research papers that exploit this notion of selecting different kinds of layers to generate DL architectures (see: “The Unreasonable Effectiveness of Randomness”). A DL meta-model language serves as the Lego blocks of an exploratory RL-based system, which can generate multiple DL meta-model instances and search among them for the best architecture. That is a reflection of the importance of Deep Learning Patterns. Before you can generate architectures, you have to know what building blocks are available for exploitation.
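As a rough illustration of generating architectures from a fixed vocabulary, here is a small sketch. It uses plain random search as a simpler stand-in for the RL-based exploration mentioned above, and the `score` function is a placeholder assumption: a real system would train each candidate and measure its validation performance.

```python
import random

# The fixed vocabulary of building blocks (illustrative, not exhaustive).
LAYER_TYPES = ["fully_connected", "convolution", "lstm"]

def sample_architecture(rng, max_depth=4):
    """Draw one candidate architecture as a list of layer-type names."""
    depth = rng.randint(1, max_depth)
    return [rng.choice(LAYER_TYPES) for _ in range(depth)]

def score(architecture):
    # Placeholder for "train it and measure validation accuracy".
    # This stand-in simply prefers deeper stacks so the example runs.
    return len(architecture)

def search(num_candidates=20, seed=0):
    """Sample several candidates and keep the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_candidates)]
    return max(candidates, key=score)

best = search()
```

Whatever the search strategy, it can only ever return combinations of what is in `LAYER_TYPES`, which is the point: the vocabulary bounds what can be discovered.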
Now, suppose we make a quantum leap to the meta meta-model of Deep Learning. What should that look like?
Let’s look at how OMG’s UML specification describes the meta meta-model level (i.e. M3):
The M3 level has a simplified structure that only includes the class. Following an analogous prescription, we thus have the meta meta-model of Deep Learning defined by the following:
Despite the simplicity of the depiction, the interpretation of this is quite interesting. You see, this is a meta object, an instance of which is the conventional DL meta-model. These are the abstract concepts that define how to generate new DL architectures. More specifically, it is the language that defines the creation of new DL models such as a convolutional network or an autoregressive network. When you work at this level, you essentially generate new kinds of DL architectures. This is what many DL researchers actually do for a living: designing novel models.
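Python's own object model offers a loose analogy for these levels. It is only an analogy: the M3/M2/M1 labels in the comments are an assumption made for illustration and are not part of the UML specification. The `Layer` class and its fields are likewise hypothetical.

```python
# M3 (meta meta-model): in Python, `type` is the class of all classes,
# and it describes itself.
assert type(type) is type

def init(self, kind, units):
    self.kind = kind
    self.units = units

def describe(self):
    return f"{self.kind} layer with {self.units} units"

# M2 (meta-model): create a new concept, "Layer", by instantiating `type`.
Layer = type("Layer", (object,), {"__init__": init, "describe": describe})

# M1 (model): a concrete layer is an instance of the Layer concept.
conv = Layer("convolution", 32)
summary = conv.describe()
```

Working at the top level means calling `type` to invent new concepts, rather than instantiating the concepts that already exist.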
There is one important thing to remember here though: the distinctions between instance, model, meta-model and meta meta-model are concepts that we’ve invented to better understand the nature of language and specification. These distinctions are not essential and likely do not exist in separate form in reality. As an example, there are many programming languages that do not distinguish between instance data and model data. Lisp is like this: everything is just data, and there is no distinction between code and data.
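The "code is data" idea can be imitated in a few lines. The sketch below is in Python rather than Lisp, and the tiny evaluator supports only `+` and `*`; both choices are assumptions made to keep the example short.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(expr):
    """Evaluate a Lisp-style expression stored as nested Python lists."""
    if isinstance(expr, (int, float)):
        return expr
    op, *args = expr
    values = [evaluate(a) for a in args]
    result = values[0]
    for v in values[1:]:
        result = OPS[op](result, v)
    return result

program = ["+", 1, ["*", 2, 3]]   # the program is an ordinary list
value = evaluate(program)          # 1 + (2 * 3)

# Because the program is data, another program can rewrite it.
program[2][0] = "+"                # turn (* 2 3) into (+ 2 3)
rewritten_value = evaluate(program)
```

Here `value` is 7 and `rewritten_value` is 6. Nothing in the list itself marks where "model" ends and "instance" begins.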
The idea of “code is data” applied to DL is equivalent to saying that DL architectures are representations that can be learned. We as humans require the concept of a meta meta-model to get a better handle on the complex, recursive, self-describing nature of DL systems. It would be interesting to know what the language of the meta meta-model should look like. Unfortunately, if this language is one that is learned by a machine, then it may well be as inscrutable as any other learned representation. See: “The Only Way to Make DL Interpretable”.
It is my suspicion though that this meta meta-model approach, if pursued in greater detail, may be the key to unlocking “Unsupervised learning” or alternatively “Predictive learning”. Perhaps our limited human brains cannot figure this out. However, armed with meta-learning capabilities, it may be possible for machines to continually improve themselves. See “Meta-Unsupervised-Learning: A supervised approach to unsupervised learning” for an early take on this approach.
The one reason that this may not work, however, is that the vocabulary or language is limited (see: Canonical Patterns) and therefore “predictive learning” is not derivable from this bootstrapping method. Meta-learners today can discover only the weights, and the weights are just parameters of a fixed DL model. A discovery, even through evolutionary methods, can only happen if the genesis vocabulary is at the correct level. Evolution appears to be a Meta Meta-Model process.
There is plenty that is missing in our understanding of the language for the meta meta-model of DL. Perhaps we can discover this only if we work up the Capability levels of Deep Learning intelligence. DARPA has a program that is researching this topic; see “DARPA goes ‘Meta’ with Machine Learning for Machine Learning”. I hope to refine this idea over time.