# Taxonomy of Methods for Deep Meta Learning

Let’s talk about Meta-Learning, because this is one confusing topic. I wrote a previous post, “Deconstructing Meta-Learning”, which explored “Learning to Learn”. I realized, though, that there is another kind of Meta-Learning that practitioners are more familiar with. This kind of Meta-Learning can be understood as algorithms that search for and select different DL architectures. Hyper-parameter optimization is an instance of this; however, there are other, more elaborate algorithms that follow the same prescription of searching for architectures.

Hyper-parameter optimization is a common technique in machine learning. Daniel Saltiel has a short post on the “State of Hyperparameter Selection” that covers Grid Search, Random Search (Bengio et al.) and Gaussian Processes as ways to sample different machine learning models. These are standard techniques and are exploited by Facebook in their FBLearner platform:

“Many machine learning algorithms have numerous hyperparameters that can be optimized. At Facebook’s scale, a 1 percent improvement in accuracy for many models can have a meaningful impact on people’s experiences. So with Flow, we built support for large-scale parameter sweeps and other AutoML features that leverage idle cycles to further improve these models.”
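Grid search and random search, two of the techniques mentioned above, can be sketched in a few lines. This is only an illustration: the search space `SPACE` and the scoring function `toy_score` are made-up stand-ins for a real model's hyperparameters and validation accuracy.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative assumptions.
SPACE = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}


def toy_score(config):
    """Stand-in for validation accuracy; best at lr=0.01, batch_size=64."""
    return -abs(config["learning_rate"] - 0.01) - abs(config["batch_size"] - 64) / 1000.0


def grid_search(space, score):
    """Evaluate every combination in the grid and return the best one."""
    keys = sorted(space)
    combos = itertools.product(*(space[k] for k in keys))
    candidates = (dict(zip(keys, values)) for values in combos)
    return max(candidates, key=score)


def random_search(space, score, n_trials, rng):
    """Evaluate n_trials randomly sampled configurations and return the best one."""
    candidates = [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_trials)
    ]
    return max(candidates, key=score)


if __name__ == "__main__":
    print(grid_search(SPACE, toy_score))
    print(random_search(SPACE, toy_score, n_trials=5, rng=random.Random(0)))
```

Grid search costs one evaluation per grid point, so it grows multiplicatively with every added hyperparameter, while random search lets you fix the number of trials up front.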

There are many hyperparameters that one can choose from in the regime of Deep Learning architectures. A recent paper, “Evolving Deep Neural Networks”, provides a comprehensive list of the global parameters typically used in conventional search approaches (e.g. learning rate) as well as further hyperparameters that describe the architecture of the Deep Learning network in more detail.

In this research, what is explored is an algorithm, CoDeepNEAT, for optimizing deep learning architectures through evolution. The claim is that the evolution-inspired approach gives a five- to thirty-fold speedup over state-of-the-art Bayesian optimization algorithms on a variety of deep-learning problems. The approach is based on an idea called the SuccessiveHalving algorithm. The algorithm uniformly allocates a budget to a set of hyperparameter configurations, evaluates the performance of all configurations, throws out the poorest-performing half, and repeats until one configuration remains.
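The SuccessiveHalving loop described above is simple enough to sketch directly. This is a minimal illustration, not the paper's implementation: the per-round budget, the `toy_evaluate` function, and the learning-rate candidates are all assumptions made for the example.

```python
import math


def successive_halving(configs, evaluate, budget_per_round):
    """Repeatedly split the budget evenly, score everyone, keep the better half."""
    survivors = list(configs)
    while len(survivors) > 1:
        # Uniform allocation: every surviving configuration gets the same share.
        per_config = budget_per_round / len(survivors)
        ranked = sorted(survivors, key=lambda c: evaluate(c, per_config), reverse=True)
        # Keep the top half (rounded down, but never fewer than one).
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors[0]


def toy_evaluate(learning_rate, budget):
    """Hypothetical stand-in for 'train with this budget, return validation score'."""
    final_score = 1.0 - abs(math.log10(learning_rate) + 2) / 4  # best at lr = 0.01
    return final_score * (1.0 - math.exp(-budget))  # more budget, closer to final


if __name__ == "__main__":
    candidates = [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
    print(successive_halving(candidates, toy_evaluate, budget_per_round=8.0))
```

Because the number of survivors halves each round while the round budget stays fixed, the configurations that survive longest receive progressively more budget each.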

Two recent papers that were submitted to ICLR 2017 explore the use of Reinforcement learning to learn new kinds of Deep Learning architectures (“Designing Neural Network Architectures using Reinforcement Learning” and “Neural Architecture Search with Reinforcement Learning”).

The first paper describes the use of Reinforcement Q-Learning to discover CNN architectures; you can find some of their generated CNNs in Caffe here: https://bowenbaker.github.io/metaqnn/. These are the different parameters that are sampled by the MetaQNN algorithm:

The second paper (Neural Architecture Search) uses Reinforcement Learning (RL) to train an architecture generator LSTM to build a language that describes new DL architectures. The LSTM is trained via a policy gradient method to maximize the expected accuracy of the architectures it generates. The research explores this method for generating convolutional architectures (CNN) as well as recurrent architectures (RNN). For CNN generation the following parameters were used:

and for the RNN generation the following was used:

The generator RNN is a two-layer LSTM. Each architecture it generates is trained for 50 epochs. The reward used for updating the generator RNN is the maximum validation accuracy of the last 5 epochs, cubed (see paper). After 12,800 architectures have been trained (Google has this luxury), the candidate architecture that achieves the best validation accuracy is selected. The candidate architecture is then run through hyperparameter optimization (i.e. grid search) to find the best performing instance.
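The reward described above is easy to state in code. The function name and signature below are my own; only the rule itself (maximum validation accuracy over the last 5 epochs, cubed) comes from the description.

```python
def controller_reward(validation_accuracies, last_n=5):
    """Reward for one generated architecture.

    validation_accuracies: per-epoch validation accuracy, in training order.
    Returns the best accuracy among the last `last_n` epochs, cubed.
    """
    recent = validation_accuracies[-last_n:]
    return max(recent) ** 3


if __name__ == "__main__":
    # Hypothetical accuracy curve for a single child network.
    history = [0.10, 0.50, 0.60, 0.70, 0.80, 0.90, 0.85]
    print(controller_reward(history))
```

Since accuracies lie between 0 and 1, cubing widens the relative gap between a good and a slightly better architecture.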

Here is an example of LSTM cells that were generated by this system:

Clearly incomprehensible to most humans. Mind-blowing and state-of-the-art.

An even more recent paper (“Large-Scale Evolution of Image Classifiers”), also from Google Brain, employs an evolutionary algorithm using mutation operators that are inspired by “rules of thumb” coming from various DL papers. The evolutionary algorithm uses repeated pairwise competitions between random individuals and selects the better performing individual from each pair (i.e. tournament selection). The set of mutation operators that the authors used is as follows:

- ALTER-LEARNING-RATE.
- IDENTITY (effectively means “keep training”).
- RESET-WEIGHTS (sampled as in He et al. (2015), for example).
- INSERT-CONVOLUTION (inserts a convolution at a random location in the “convolutional backbone”. The inserted convolution has 3 × 3 filters, strides of 1 or 2 at random, number of channels same as input. May apply batch-normalization and ReLU activation or none at random).
- REMOVE-CONVOLUTION.
- ALTER-STRIDE (only powers of 2 are allowed).
- ALTER-NUMBER-OF-CHANNELS (of random conv.).
- FILTER-SIZE (horizontal or vertical at random, on random convolution, odd values only).
- INSERT-ONE-TO-ONE (inserts a one-to-one/identity connection, analogous to insert-convolution mutation).
- ADD-SKIP (identity between random layers).
- REMOVE-SKIP (removes random skip).
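
The pairwise tournament with mutation can be sketched as follows. This is a toy under stated assumptions: individuals are plain dicts, only four of the operators above are imitated, `toy_fitness` replaces actual training, and replacing the loser with a mutated copy of the winner is how I read the procedure rather than something spelled out above.

```python
import copy
import math
import random


# Toy versions of a few mutation operators; an individual is a plain dict.
def alter_learning_rate(individual, rng):
    individual["learning_rate"] *= rng.choice([0.5, 2.0])


def identity(individual, rng):
    pass  # "keep training": the architecture is left unchanged


def insert_convolution(individual, rng):
    individual["num_convs"] += 1


def remove_convolution(individual, rng):
    individual["num_convs"] = max(1, individual["num_convs"] - 1)


MUTATIONS = [alter_learning_rate, identity, insert_convolution, remove_convolution]


def toy_fitness(individual):
    """Hypothetical stand-in for validation accuracy after training."""
    return -abs(individual["num_convs"] - 6) - abs(math.log10(individual["learning_rate"]) + 2)


def tournament_step(population, fitness, rng):
    """Pick two random individuals; a mutated copy of the winner replaces the loser."""
    winner, loser = rng.sample(range(len(population)), 2)
    if fitness(population[winner]) < fitness(population[loser]):
        winner, loser = loser, winner
    child = copy.deepcopy(population[winner])
    rng.choice(MUTATIONS)(child, rng)
    population[loser] = child


if __name__ == "__main__":
    rng = random.Random(0)
    population = [{"learning_rate": 0.1, "num_convs": 1} for _ in range(10)]
    for _ in range(500):
        tournament_step(population, toy_fitness, rng)
    print(max(population, key=toy_fitness))
```

Because the winner of each pair always stays in the population, the best fitness seen so far can never get worse, which is part of why the process needs no supervision once started.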

This “Neuro-Evolution” approach reaches some interesting conclusions. Specifically, that it is capable of “constructing large, accurate networks” and that “the process described, once started, needs no participation from the experimenter.” The impression some have of this paper is that it could only have been executed by the folks at Google, with “computation at unprecedented levels”. Clearly this is a brute-force approach that could be refined if a neural network were trained to emit these rules rather than relying on random evolution. A very interesting discovery in this paper is that some of the better performing networks simply stacked convolutions on top of each other without non-linearities. This defies convention; however, there may in fact be some justification for it! (Note: another paper with a similar genetic algorithm)

In Neural Architecture Search, an LSTM was trained to generate architectures. We can think of Meta-Learning in a general sense as being “machines that generate other machines”. Hyper Networks also fall into the category of machines generating other machines. Hyper Networks are DL architectures that learn how to generate the weights of another DL architecture. The claimed utility of the approach is that it is a kind of generalization of weight sharing. Unlike the other methods mentioned above, in both Hyper Networks and Learning to Learn a neural network replaces the entire functionality rather than specifying hyper-parameters:

Note that Transfer Learning is similar to both of these in that another network provides the weight initialization. In addition, these are only for single instances and are not generative. Perhaps one could use a generative technique (e.g. VAEs, GANs, etc.) to generate optimal DL architectures.
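To make the “machines that generate other machines” idea concrete, here is a forward-pass-only sketch of a hypernetwork in plain Python. It is a simplified illustration, not the architecture from the Hyper Networks paper: the class name, the single linear projection, and the sizes are all assumptions.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class TinyHyperNetwork:
    """Maps a small layer embedding z to the full weight matrix of a main-network layer."""

    def __init__(self, embed_dim, n_in, n_out, rng):
        self.n_in, self.n_out = n_in, n_out
        # The only "learned" parameters here: a projection from embedding to weights.
        self.projection = [
            [rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
            for _ in range(n_in * n_out)
        ]

    def generate_weights(self, z):
        flat = matvec(self.projection, z)
        return [flat[r * self.n_in:(r + 1) * self.n_in] for r in range(self.n_out)]


def main_layer(weights, x):
    """One layer of the main network, using weights it does not own."""
    return [math.tanh(a) for a in matvec(weights, x)]


if __name__ == "__main__":
    rng = random.Random(0)
    hyper = TinyHyperNetwork(embed_dim=2, n_in=4, n_out=4, rng=rng)
    # Two layers share one hypernetwork; each has its own 2-number embedding.
    embeddings = [[1.0, 0.0], [0.3, 0.7]]
    activations = [0.5, -0.2, 0.1, 0.9]
    for z in embeddings:
        activations = main_layer(hyper.generate_weights(z), activations)
    print(activations)
```

If both layers used the same embedding they would get identical weights, which is plain weight sharing; different embeddings relax that, which is the sense in which the approach generalizes weight sharing.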

In all the above approaches, the method employs different search mechanisms (i.e. Grid, Gaussian Processes, Evolution, Q-Learning, Policy Gradients) to discover (among the many generated architectures) better configurations. The key idea to emphasize is that hyper-parameters can be generalized into a form of a language. A Domain Specific Language (DSL) with the above hyper-parameters as its vocabulary can serve as a general basis for generating new architectures.
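As a rough illustration of hyper-parameters serving as the vocabulary of a language, the sketch below samples “sentences” that describe architectures. The vocabulary, the layer kinds, and the textual form are invented for the example and do not come from any of the papers above.

```python
import random

# Hypothetical vocabulary: each layer kind with the values its parameters may take.
VOCABULARY = {
    "conv": {"filters": [16, 32, 64], "kernel": [1, 3, 5], "stride": [1, 2]},
    "pool": {"size": [2, 3]},
    "dense": {"units": [64, 128]},
}


def sample_architecture(rng, max_layers=4):
    """Sample a random architecture: some conv/pool layers followed by one dense layer."""
    layers = []
    for _ in range(rng.randint(1, max_layers)):
        kind = rng.choice(["conv", "pool"])
        params = {name: rng.choice(options) for name, options in VOCABULARY[kind].items()}
        layers.append((kind, params))
    layers.append(("dense", {"units": rng.choice(VOCABULARY["dense"]["units"])}))
    return layers


def to_sentence(architecture):
    """Render an architecture as a sentence in the toy DSL."""
    tokens = []
    for kind, params in architecture:
        args = ", ".join(f"{name}={value}" for name, value in sorted(params.items()))
        tokens.append(f"{kind}({args})")
    return " -> ".join(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(to_sentence(sample_architecture(rng)))
```

Any of the search mechanisms listed above (grid, evolution, Q-Learning, policy gradients) can then be seen as a different way of choosing which sentences to generate and evaluate next.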

It is instructive to understand that DSLs are a meta-model. There is, however, an even more abstract level: the meta meta-model. I wrote about this in “The Meta Model and Meta Meta-Model of Deep Learning”. The key point of that post was that the best vocabulary for the meta meta-model is not obvious. The meta meta-model specifies the kinds of hyper-parameters available in the meta-model. We could clearly have a language that specifies things down to the description of every neuron. However, we want to strike an effective balance of abstraction. There is also an entire spectrum as to what is mutable and trainable.

The delineation between what happens in hyper-parameter optimization and what happens while training a DNN is actually somewhat arbitrary. We see this in newer architectures like DeepMind’s PathNet, also an architecture search method, where hyper-parametrization and training all happen in the same network (or fabric). PathNet re-uses lower layers (effectively exploring initialization schemes) and explores multiple ways to connect a network.

We have a glimpse of a DSL-driven architecture in my previous post, “A Language Driven Approach to Deep Learning Training”, where a quite general prescription is presented. Specifically: (1) define a DSL, (2) generate data from the DSL, (3) train a DL network on the samples, then (4) use the network to guide a search algorithm to find better solutions. In that post, I described the DSL as being something that can be applied to any domain. In this article, it is clear that the same program-induction-inspired approach can be used to search for Deep Learning architectures.

Here is a summary of the different Deep Meta Learning methods discussed here.

The variety clearly corresponds to the kind of meta-data that is manipulated to generate simulated architectures. In summary, current meta-learning capabilities involve either searching for architectures or networks inside networks. The former is an established technique for automatically exploring better architectures, and the latter performs well at automatically fine-tuning algorithms.