Deconstructing Deep Meta Learning

This article explores in more detail the idea of Meta Learning that was previously introduced in the post “The Meta Model and Meta Meta Model of Deep Learning”. In this post, I explore “Learning to Learn” as a Meta Learning approach. We have to be very careful to distinguish between Learning to Learn and Hyper Parameter Optimization (HPO). HPO, and more generally architecture search, differs from “learning to learn” in that HPO explores the space of architectures while meta-learning explores the space of learning algorithms.

Meta-learning is all the rage in research these days. This is not unexpected, since such a capability is essentially a way for the automation we create to bootstrap its own capabilities. What I want to explore today is how we can better reason about meta-learning, and therefore get a sense of how to apply it in the construction of our solutions.

Meta-level constructs are ubiquitous in the languages we use for software development. It’s a concept that many computer scientists are comfortable with. However, when you transition into the world of deep learning, which is a world of information dynamics (aka computational mechanics), it becomes a lot murkier. An explanation may be that we don’t expect to encounter meta-level constructs in the inanimate world of physics. We don’t expect to find constructs that seem to be a specification or blueprint of other constructs.

Yet we do see a meta-level construct in biology. DNA is a meta-level construct: a specification or blueprint that guides the replication of cells. The computational mechanism exists for all to see, so it is not something that we can ignore. DNA is like a long-term memory that captures the instructions required for recreating biological systems, and it transcends their expiration. With brains and neurons, memories don’t transcend death.

The question I have is, what’s the meta-level construct for biological brains? What is acting as the specification or blueprint for the storage of behavior? Does such a construct need to exist? We may derive some inspiration from how different programming languages express meta-level concepts. In object-oriented languages like Java, C++ or Smalltalk, classes are meta-level. However, languages like Lisp and Scheme are unique in that there’s no need for separate meta-level constructs. Data and code are one and the same thing.

In short, so long as there is a mechanism for memory and a mechanism to alter behavior based on that memory, that is sufficient to serve as a meta-level construct. So, a Read-Eval-Print Loop (REPL) is all nature needs to go meta. The ability to interpret stored memories as instructions is the key requirement. Meta-level constructs are just intellectual affordances or conveniences that we as humans create to better understand the programs that we write.
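To make the idea concrete, here is a minimal Python sketch of a memory whose contents are interpreted as instructions. Everything in it (the function names, the tiny instruction set) is hypothetical and chosen only for illustration; it is a toy under those assumptions, not a model of any biological or deep learning system.

```python
# Hypothetical toy: a memory of stored symbols that an interpreter
# reads back as instructions. All names here are illustrative only.

def interpret(memory, value):
    """Apply each stored instruction, in order, to the value."""
    operations = {
        "double": lambda x: x * 2,
        "increment": lambda x: x + 1,
        "negate": lambda x: -x,
    }
    for instruction in memory:
        value = operations[instruction](value)
    return value


def remember(memory, instruction):
    """Altering memory alters future behavior; no separate meta-level is needed."""
    return memory + [instruction]


memory = ["double"]
before = interpret(memory, 3)            # behavior with the original memory
memory = remember(memory, "increment")   # the memory changes...
after = interpret(memory, 3)             # ...and so does the behavior
print(before, after)
```

The same list is both data (it can be stored and extended) and code (it is executed by `interpret`), which is the property the Lisp comparison points at.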

Now, we can have programs that interpret instructions but are not meta-level. When does meta-level processing occur? It happens when the outputs of one process are used as instructions for another process. Does that not obviously describe “communication”? Robin Milner wrote in “Turing, Computing and Communication”:

So we must find an elementary model which does for interaction what Turing’s logical machines do for computation. … there is a logic of informatic action, and in my view it is based upon two fundamental elements:
Synchronised action and Channel, or vocative name
I ask you to think of the term “information” actively, as the activity of informing. An atomic message, then, is not a passive datum but an action which synchronises two agents.

For communication to be successful, the receiving end needs to interpret the communication, and this implies the existence of an interpretative language. The communication must mean something to the receiver. That meaning can be as simple as a true or false statement, or as complex as a structured conveyance of knowledge. In short, what is communicated must be a language.

However, what do Deep Learning systems do? Is there language interpretation happening underneath the covers? Each layer in a Deep Learning system develops a distributed representation that is a kind of internal language. It learns a map from an input representation to an output representation. That is, it learns a translation. Is translation the same as interpretation? Translations are just intermediate steps in a sequence of steps that arrives at interpretation. Interpretation here means the process that takes information and arrives at a decision.

The utility of language arises primarily from its share-ability. However, the internal languages that Deep Learning systems create are uninterpretable and likely difficult to share. We can of course train other DL systems to learn these internal languages. Encoder-Decoder architectures perform this kind of work. Examples of this are cross-language translation and image captioning. So, we do see some kind of ‘language sharing’ between an encoder and a decoder. However, to build more complex systems, we perhaps need to understand how to build shareable DL representations.

Languages that are at the meta-level are like DNA. They provide instructions on how to recreate other machines. Meta-level machines use meta-level instructions to create other machines. So a machine that performs meta-learning is able to learn the language of creating other machines. To better understand how this works in the wild, let’s explore some recent research in the field.

“Learning to learn by gradient descent by gradient descent” trains an LSTM-based optimizer to learn a variant of the gradient descent method. In this research, the LSTM was trained on a single-layer network of 20 hidden units against the MNIST dataset. The trained optimizer was then tested on a larger network of 40 hidden units, a network with 2 layers, and one with ReLU instead of the original activation function. The trained optimizer performed well for the first two and not the third. The trained optimizer was also tested on different datasets (e.g. CIFAR), and in all cases performed well. The trained LSTM optimizer generalized to larger networks as well as to similar datasets. In the framework of meta-level languages, the LSTM learned a language that was applicable to different but similar enough problems, though not to radically different network architectures.
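The general shape of a learned optimizer can be sketched in a few lines of Python. This is a heavily simplified stand-in, not the paper's method: the LSTM is replaced by an update rule with a single learnable parameter `phi` (a log step size), the optimizees are random quadratics rather than MNIST networks, and the meta-gradient is estimated by finite differences and applied by its sign only. All names and constants are assumptions made for illustration.

```python
import numpy as np


def sample_task(rng, dim=5):
    """A random quadratic optimizee: f(theta) = 0.5 * sum(a * theta**2)."""
    a = rng.uniform(0.5, 2.0, size=dim)
    theta0 = rng.normal(size=dim)
    return a, theta0


def unrolled_loss(phi, task, steps=10):
    """Run the parameterized update rule; return the summed optimizee loss."""
    a, theta0 = task
    theta = theta0.copy()
    step_size = np.exp(phi)               # phi is the optimizer's only parameter
    total = 0.0
    for _ in range(steps):
        grad = a * theta                  # gradient of the quadratic
        theta = theta - step_size * grad  # stand-in for the learned update
        total += 0.5 * np.sum(a * theta ** 2)
    return total


def meta_loss(phi, tasks):
    """Average trajectory loss over a set of optimizee tasks."""
    return float(np.mean([unrolled_loss(phi, task) for task in tasks]))


def meta_train(phi, tasks, meta_lr=0.05, iters=200, eps=1e-4):
    """Sign descent on a finite-difference estimate of the meta-gradient."""
    for _ in range(iters):
        g = (meta_loss(phi + eps, tasks) - meta_loss(phi - eps, tasks)) / (2 * eps)
        phi = phi - meta_lr * np.sign(g)
    return float(phi)


rng = np.random.default_rng(0)
train_tasks = [sample_task(rng) for _ in range(8)]
test_tasks = [sample_task(rng) for _ in range(8)]  # unseen, same distribution

phi_init = float(np.log(0.01))
phi_trained = meta_train(phi_init, train_tasks)
print("step size before:", np.exp(phi_init), "after:", np.exp(phi_trained))
print("held-out meta-loss before:", meta_loss(phi_init, test_tasks))
print("held-out meta-loss after: ", meta_loss(phi_trained, test_tasks))
```

Even in this toy, the optimizer's parameter is tuned to the family of problems it was trained on. Held-out quadratics from the same distribution benefit, but nothing in the procedure promises good behavior on a radically different optimizee.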

One key limitation in training the meta-learner is that the input training data can be prohibitively large, because training potentially requires observing not only the optimizee’s data set but also all of its weights. So, to make this approach feasible, only a subset or approximation of the entire potential input space is used to train a meta-learner.

“Learning to reinforcement learn” trains an LSTM in the context of learning a Reinforcement Learning (RL) algorithm. A teacher RL algorithm was used to train the LSTM. The result was that the LSTM learned an algorithm that was (1) different from the teacher RL algorithm and (2) biased toward the environment in which it was trained. The consequence was that this LSTM-based algorithm was more efficient than a general-purpose algorithm:

Critically, this learned RL algorithm is tuned to the shared structure of the training tasks. In this sense, the learned algorithm builds in domain-appropriate biases, which can allow it to operate with greater efficiency than a general-purpose algorithm.

“Optimization as a model for few-shot learning” trains an LSTM in the context of a “few-shot” learning problem.

In the algorithm described in this paper, an LSTM is trained to be the update function for the optimizee network. The optimizer is updated after each batch, and its loss function is calculated against test data. This is an interesting configuration, in that training is usually kept isolated from evaluation against test data. Here, however, the optimizer is aware of its performance relative to test data.
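As I understand the paper, the key observation is that a gradient descent step has the same form as an LSTM cell-state update, with the parameters playing the role of the cell state, a forget gate scaling the old parameters, and an input gate acting as a learning rate on the negative gradient. The Python sketch below shows one such gated step. The gate parameterization (a sigmoid over gradient, loss, and current parameter, with the hypothetical weights `w_f`, `b_f`, `w_i`, `b_i`) is a simplification assumed for illustration, not the paper's exact architecture.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_update(theta, grad, loss, w_f, b_f, w_i, b_i):
    """One gated step: theta_new = f * theta - i * grad (elementwise).

    Each coordinate gets its own gate values, computed from three
    per-coordinate features: gradient, loss, and current parameter.
    """
    features = np.stack([grad, np.full_like(theta, loss), theta], axis=-1)
    f = sigmoid(features @ w_f + b_f)  # forget gate: how much of theta to keep
    i = sigmoid(features @ w_i + b_i)  # input gate: a learned learning rate
    return f * theta - i * grad


# Special case: f saturated at 1 and i fixed at 0.1 recovers plain
# gradient descent with learning rate 0.1.
theta = np.array([1.0, -2.0, 0.5])
grad = 2.0 * theta                     # gradient of the loss sum(theta**2)
loss = float(np.sum(theta ** 2))
zeros = np.zeros(3)
stepped = gated_update(theta, grad, loss,
                       w_f=zeros, b_f=50.0,
                       w_i=zeros, b_i=np.log(0.1 / 0.9))
print(stepped, theta - 0.1 * grad)
```

Because ordinary gradient descent is one point in the space of gate settings, training the gates can only widen the set of update rules available to the learner.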

An interesting paper titled “Meta Networks” (MetaNet) came out just recently (March 2017) with an intriguing claim to be able to “acquire meta-level knowledge”. The architecture of the MetaNet is depicted as follows:

The “meta-info” that is studied in the paper is the loss gradient, and the system achieved state-of-the-art results in the regime of one-shot learning. A key constraint that we have to emphasize here is that a meta-learner can only learn from the data (or meta-info) that it is trained on. It is unclear to me why the loss gradient is sufficient information; however, since it appears to work, let’s give the paper the benefit of the doubt. I suspect, however, that for meta-learning to scale, one needs to find a subset of information from the base data to train on. It will be interesting to see what subset researchers look at in future papers. (Note: A recent paper by OpenAI seems to confirm this gradient-only approach)

Meta-learning (i.e. Learning to Learn) is just one kind of meta-process. Other researchers have worked on “Learning to Optimize”, “Learning to Compose”, “Learning to Plan”, “Inverse Learning” and “Learning to Communicate”. This is just a partial list of capabilities; in a later post I will attempt to provide a unifying conceptual framework for them. (Note: See Design Patterns for Deep Learning for more)

Meta-learning holds the intriguing potential to bootstrap itself and recursively improve its behavior by learning on itself. However, in the research papers I surveyed above, it is clear that there is a major obstacle that can’t be avoided. That obstacle is the requirement for training data.

Other meta-learning approaches (i.e. hyper-parameter optimization and architecture search) work from meta-data. Hyper-parameter optimization uses meta-data on things like the learning rate. Approaches from Google and MIT that search for different architectures use meta-data on the kinds and connections of different layers. These methods can synthesize new data by simple generation and simulation. Learning to learn, however, does not use meta-data, but rather observes the instance behavior and derives meta-data.
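For contrast, the “generation and simulation” loop of hyper-parameter optimization can be sketched as a random search over the learning rate. The Python below is an illustrative toy under stated assumptions: the simulated training run is plain gradient descent on f(x) = x², and the function names, search range, and trial count are all hypothetical.

```python
import random


def train_and_score(learning_rate, steps=20):
    """Simulate a training run: gradient descent on f(x) = x**2 from x = 5."""
    x = 5.0
    for _ in range(steps):
        x -= learning_rate * 2.0 * x  # gradient of x**2 is 2x
    return x * x                      # final loss


def random_search(num_trials=30, seed=0):
    """Generate candidate learning rates, simulate each, keep the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(num_trials):
        lr = 10 ** rng.uniform(-4, 0)  # log-uniform sample in [1e-4, 1]
        trials.append((train_and_score(lr), lr))
    return min(trials), trials


(best_loss, best_lr), trials = random_search()
print("best learning rate:", best_lr, "final loss:", best_loss)
```

The search never observes how the optimizee behaves step by step. It only generates meta-data (a learning rate), simulates, and scores the outcome, which is the distinction from learning to learn drawn above.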

Meta-learning obviously cannot learn what it has not previously seen. So there is a need to learn a learning algorithm that works across multiple architectures or even different domains. Without an ability to generate new training data, “learning to learn” by itself is unable to improve on itself indefinitely. What “learning to learn” is able to do, however, is improve itself within a given context. Unfortunately, it cannot learn a meta-language that is independent of context.

The representations that are learned by Deep Learning systems have context intertwined in them. We do not know of systematic ways of removing context. This is related to models that have invariances, or to representations from which all nuisance variables have been removed. The ideal generalized representation is one where context has been removed. That is, a system should be able to recognize a cat independent of lighting, shadows, angle or occlusion.

The additional consequence of context independent languages is that they are shareable in different contexts. That is, we can combine the context on demand with the internal representation to perform the prediction. Ultimately, for “learning to learn” to be successful, it is required that it is able to “learn a language independent of context” (model agnostic?).

Please read “Machines that Search for Deep Learning Architectures” for a taxonomy of Meta-Learning methods.
