Design by evolution

How to evolve your neural network. AutoML: time to evolve.

The gist (tl;dr): Time to evolve! I’m gonna give a basic example (in PyTorch) of using evolutionary algorithms to tune the hyper-parameters of a DNN.

For most machine learning practitioners, designing a neural network is an art form. Usually it begins with a common architecture, and then parameters are tweaked until a good combination of layers, activation functions, regularisers, and optimisation parameters is found. Guided by popular architectures (VGG, Inception, ResNets, DenseNets and others), one iterates through variations of the network until it achieves the desired balance of speed and accuracy. But as the available processing power increases, it makes sense to start automating this network optimisation process.

For shallow models like Random Forests and SVMs we are already able to automate the tweaking process through hyper-parameter optimisation. Popular toolkits like sk-learn provide methods for searching the hyper-parameter space. In its simplest form, hyper-parameter search is a grid search over all possible parameter combinations, or random sampling from a parameter distribution (read this post). These approaches face two problems: a) they waste resources exploring bad regions of the parameter space, and b) they are inefficient at handling a dynamic set of parameters, so it is hard to alter the structure of the solver itself (i.e. the architecture of a DNN). More efficient methods like Bayesian optimisation deal with (a) but still suffer from (b). It is also hard to explore models in parallel in the Bayesian optimisation setting.
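As a quick illustration of this style of search, here is a minimal sketch of a random hyper-parameter search in sk-learn; the model and the parameter ranges are placeholders of my own, not something from the original post:

from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# sample 10 random (C, gamma) combinations and cross-validate each one
search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": [0.01, 0.1, 1, 10, 100],
                         "gamma": [1e-4, 1e-3, 1e-2, 1e-1]},
    n_iter=10,
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)

Note that the search space is a fixed, flat dictionary: there is no natural way to express “add or remove a layer” in this mould, which is exactly problem (b).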

More power!

While the idea of automatically identifying the best models is not new, the recent large increase in processing power makes it more achievable than ever before, especially when the models we want to optimise are computationally hungry (e.g. DNNs).

The time has come! And it’s so important that even Google decided to feature it at its annual Google I/O conference (~16:00 min), and I’m sure many others in the industry are doing so too. It is already a spotlighted project in our team @ AimBrain.



Problem Setting

One way to think of hyper-parameter optimisation is to pose it as a meta-learning problem.

Can we make an algorithm that will explore which networks perform better at a given problem?

Note: Maybe meta-learning is a bit of an abuse of the term; let’s not confuse it with approaches like learning to learn (e.g. the really interesting “Learning to learn by gradient descent by gradient descent” from DeepMind, https://arxiv.org/abs/1606.04474). So allow me to use the term meta-learning loosely.

Our goal is to define how many hidden layers (green) the network will have and the parameters of each one.

The goal is to explore both the architecture and the parameter space of the model in order to optimise its performance on a given dataset. This problem is by nature complicated and the rewards are sparse. When we say that the reward is sparse we mean that we need to train a network to a sufficient point and then evaluate it; we only get a score once the train-evaluate cycle is done. This score tells us how the whole system has performed, and it is not a differentiable function! Does that remind you of something? Yes, it is a typical reinforcement learning scenario!

Quoting Wikipedia on RL:

Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximise some notion of cumulative reward.
Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
Credits: Wikipedia

The agent in our scenario is a model, and the environment is the dataset we use for training and evaluation. The interpreter is the process that analyses the reward of each episode and sets the state of the agent (or the policy; in our setting it sets the parameters of the network).

Typically, reinforcement learning problems are modelled as a Markov decision process. The goal is to optimise the agent’s total reward, and at each step you are called to decide whether to a) exploit the outcome given the model you have built so far or b) explore a new action. The agent forms a policy which improves the more it interacts with the environment.

Note: this topic is out of the scope of this post; the best introductory book on it is probably “Reinforcement Learning: An Introduction” by R. Sutton and A. Barto.

Evolutionary algorithms

An alternative approach to solving the reinforcement learning scenario is through evolutionary algorithms. Inspired by biological evolution, an evolutionary algorithm searches the solution space by creating a population of solutions. It then evaluates each solution and evolves the population according to the fitness (score) of each one. Evolution involves the selection and mutation of the fittest members of the population. As a result, the population evolves to increase its overall fitness and produce viable solutions to the problem.

The cycle of an evolutionary algorithm

The illustration above demonstrates the cycle of evolution. The two parts needed to design an evolutionary algorithm are a) the selection process and b) the crossover/mutation strategy that follows.

Selection: the way parents are picked; common practice is to pick the k best and some random individuals for diversity. More advanced selection techniques involve creating different subgroups of the population (usually referred to as species) and then selecting the top k within each species, which helps preserve diversity in the solution space. Another popular method is tournament selection, where randomly chosen individuals compete in a tournament and the winners are the ones selected to pass on their genes.

Crossover: the way two (or more) parents are mixed to produce an offspring. This is highly dependent on the way the problem is structured. A common approach is to describe each parent as a list of elements (usually numerical values) and then select random parts from each parent to compose a new list (genome). Read more

Mutation: the process of randomly altering the genome. It’s a major exploration factor and helps maintain the diversity of the population. Read more
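To make the crossover and mutation operators a bit more concrete, here is a minimal sketch on plain list genomes; the genome layout, value ranges and rates are made up purely for illustration:

import random

def one_point_crossover(parent_a, parent_b):
    # single-point crossover: take the head of one parent and the tail of the other
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate_genes(genome, rate=0.1, low=0.0, high=1.0):
    # resample each gene independently with probability `rate`
    return [random.uniform(low, high) if random.random() < rate else gene
            for gene in genome]

child = mutate_genes(one_point_crossover([0.1, 0.2, 0.3, 0.4],
                                          [0.9, 0.8, 0.7, 0.6]))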

Implementation

The implementation uses PyTorch to create an agent that will explore DNNs for solving a simple classification task. MNIST is used for this experiment since it’s small and fast to train even on the CPU. We’ll create a population of DNN models and evolve it for N steps.

I assume that you already know the basics of what a DNN is and are familiar with PyTorch. If not, a good starting point for PyTorch can be found here on the official page.

The evolution scheme implements the tournament selection method. The high-level algorithm, in rough Python, is this:

import random
new_population = []
while len(new_population) < population_size:
    # choose k (tournament size) individuals from the population at random, ranked by fitness
    tournament = sorted(random.sample(population, k), key=fitness, reverse=True)
    # keep the best with probability p1, the second best with p2, the third with p3
    r = random.random()
    if r < p1:
        selected = tournament[0]
    elif r < p1 + p2:
        selected = tournament[1]
    else:
        selected = tournament[2]
    new_population.append(mutate(selected))  # mutate and append to the new population

Side Note: The crossover problem. Crossover can be complicated when it comes to blending structures. How do you merge the architectures of two parents? How is this affected by training-from-scratch and fine-tuning scenarios? One of the recent papers from Miikkulainen et al. tackles this issue with a very elegant solution called CoDeepNEAT. Based on the idea of Evolino, an architecture is composed of modules, and each module is subject to evolution itself. The architecture is a blueprint that combines components. In such a setting it makes sense to mix the components of the parents, since each component is a complete micro-network. For simplicity I opted not to include crossover in this implementation, but a technique like NEAT / CoDeepNEAT can definitely be advantageous. (I’m planning to post a follow-up on these techniques.)

Basic building blocks

The first thing we need to define is the solution space for each model. Each individual represents an architecture. For simplicity we stack n layers; each layer has three parameters: a) number of hidden units, b) activation type and c) dropout rate. As for the global parameters, we choose between different optimizers, learning rates, weight decays and numbers of layers.
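For concreteness, such a search space might be encoded roughly as follows; the specific names, ranges and choices here are my own illustration rather than the exact ones used in the repository:

# per-layer choices: a) hidden units, b) activation type, c) dropout rate
LAYER_SPACE = {
    "n_hidden": [32, 64, 128, 256, 512],
    "activation": ["relu", "tanh", "sigmoid", "elu"],
    "dropout": [0.0, 0.1, 0.3, 0.5],
}

# global choices: optimiser, learning rate, weight decay and depth
NET_SPACE = {
    "optimizer": ["adam", "sgd", "rmsprop"],
    "lr": [1e-4, 3e-4, 1e-3, 3e-3, 1e-2],
    "weight_decay": [0.0, 1e-5, 1e-4],
    "n_layers": [1, 2, 3, 4],
}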

Now that we have defined the space in which our models live, we need to create three basic functions (a rough sketch of all three follows the list below):

  • randomise a network:

Initially, we randomly sample the number of layers and the parameters of each layer. Sampled values fall within a predefined range. When initialising a parameter we also generate a random parameter id. It is not used at the moment, but it lets us keep track of individual layers: when a model is mutated, the old layers could be fine-tuned while only the mutated ones are re-initialised. This should provide significant speed-ups and stabilise the solutions.

Note: depending on the problem we might want to impose different constraints, for example on the total number of parameters, the total number of layers, or the FLOPs per cycle.
  • mutate a network:

Each network element is assigned a mutation probability. A mutation alters a parameter by resampling it from the parameter space.

  • make the network:

The corresponding class instantiates a model from its “genome”.
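Since the original gists are not reproduced in this post, here is a minimal sketch of what those three functions could look like, reusing the hypothetical LAYER_SPACE / NET_SPACE dictionaries from above; the actual classes in the repository differ in detail (for instance, they also carry the parameter ids mentioned earlier):

import random
import torch.nn as nn

ACTIVATIONS = {"relu": nn.ReLU, "tanh": nn.Tanh, "sigmoid": nn.Sigmoid, "elu": nn.ELU}

def random_genome():
    # sample the global parameters, then one entry per hidden layer
    genome = {key: random.choice(values) for key, values in NET_SPACE.items()}
    genome["layers"] = [{key: random.choice(values) for key, values in LAYER_SPACE.items()}
                        for _ in range(genome["n_layers"])]
    return genome

def mutate_genome(genome, p_mutate=0.1):
    # resample every parameter independently with probability p_mutate
    for key in ("optimizer", "lr", "weight_decay"):
        if random.random() < p_mutate:
            genome[key] = random.choice(NET_SPACE[key])
    for layer in genome["layers"]:
        for key in LAYER_SPACE:
            if random.random() < p_mutate:
                layer[key] = random.choice(LAYER_SPACE[key])
    return genome

def build_model(genome, n_inputs=784, n_classes=10):
    # turn a genome into a plain nn.Sequential MLP for MNIST
    modules, width = [], n_inputs
    for layer in genome["layers"]:
        modules += [nn.Linear(width, layer["n_hidden"]),
                    ACTIVATIONS[layer["activation"]](),
                    nn.Dropout(layer["dropout"])]
        width = layer["n_hidden"]
    modules.append(nn.Linear(width, n_classes))
    return nn.Sequential(*modules)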


So now we have the basic building blocks: how to create a random network, mutate its architecture, and train it. The next step is to create the genetic algorithm that performs the selection and mutation of the best-performing individuals. Every model is trained in parallel and doesn’t require any information sharing with the other agents, which lets the optimisation process scale linearly with the number of processing nodes available.

Code for the GP optimiser:
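The optimiser gist isn’t embedded here either, but in spirit it loops over generations: train and score every genome, then build the next population with the tournament scheme shown earlier. A compressed, hypothetical version might look like this (train_and_score is a placeholder for the actual train/evaluate cycle):

import copy
import random

def evolve(population_size=50, n_epochs=10, k=3, probs=(0.85, 0.10, 0.05)):
    population = [random_genome() for _ in range(population_size)]
    for epoch in range(n_epochs):
        # fitness = accuracy after a short train/evaluate cycle (placeholder helper)
        scored = sorted(((train_and_score(build_model(g), g), g) for g in population),
                        key=lambda item: item[0], reverse=True)
        new_population = []
        while len(new_population) < population_size:
            # tournament selection: pick the 1st/2nd/3rd best of k random individuals
            tournament = sorted(random.sample(scored, k), key=lambda item: item[0], reverse=True)
            rank = random.choices([0, 1, 2], weights=probs)[0]
            new_population.append(mutate_genome(copy.deepcopy(tournament[rank][1])))
        population = new_population
    return scored[0]

Because every train_and_score call is independent of the others, this inner evaluation can be distributed across workers, which is what the utility classes mentioned below take care of.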

Evolutionary algorithms look super simple, right? Well, they are! And they can be very successful, especially if you find good mutation/crossover functions for the individuals.

The repository includes some extra utility classes, like the worker and scheduler classes, which enable the GP optimiser to do the model training and evaluation in parallel.

Running the code

Let’s put all of the above together and run the experiment.
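In terms of the sketch above, this boils down to something like the following; the actual script in the repository also wires in the MNIST loading and the parallel workers:

# run 10 generations of 50 individuals with tournaments of size 3
best_score, best_genome = evolve(population_size=50, n_epochs=10, k=3)
print(best_score, best_genome)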

Let’s visualize some of the results!

This is the scoring of a population of 50, with a tournament size of 3. The models are trained on only 10,000 examples and then evaluated.

At first glance it seems that the evolution is not doing much: solutions are near top performance from the first generation, and the maximum performance is reached in epoch 7. In the following figure we use a box plot, where each box describes the quartiles of the population. We notice that most individuals perform well, but also that the boxes get tighter as the population evolves.

Left: distribution of solutions. Right: box plot of the solutions per epoch. The box shows the quartiles of the solutions, while the whiskers extend to show the rest of the distribution. The black dot is the mean value of the solutions; we can notice an increasing trend.
A different evolution run.

To better understand the method’s performance, it is good to compare it against a completely randomised population search: no evolution is performed between epochs and the individuals are reset to a random state.

Left: distribution of solutions. Right: box plot of the solutions per step of random generation.

The evolution performs better by a small margin (93.66% vs 93.22%), and while the random population search does produce some good solutions, the variance of the models is much larger, meaning resources are wasted searching suboptimal architectures. Comparing this with the evolution figures, we see that evolution clearly generates more competent solutions and manages to evolve structures that consistently achieve higher performance.

A few things worth mentioning:

  • MNIST is a fairly easy dataset; even one-layer networks can achieve high accuracies.
  • Optimizers like Adam are less sensitive to the learning rate; they find a good solution as long as the network has enough parameters.
  • During training the model sees only 10k examples (1/5 of the training set). Good architectures might achieve even higher accuracies if we choose to train them for longer.
  • Limiting the number of examples also affects how many layers we can learn: deeper models require more examples. To counter this we also add a layer-removal mutation, allowing the population to regulate the number of layers.

The size of the experiment is not ideal for demonstrating the strength of such methods. Take a look at the related work below, which points to papers using much larger experiments on harder datasets.

Next Step

We just developed a simple evolutionary algorithm that implements a tournament selection scheme. Our algorithm only considers the winning solutions and mutates them to create offspring. The next step is to implement more advanced methods for generating and evolving the population. Some suggested improvements:

  • Reuse the weights of the parent for the common layers
  • Merge layers from 2 potential parents
  • The architectures don’t have to be sequential; you can explore more connection types between layers (splits, merges, etc.)
  • Add extra layers on top and do fine tuning

All of the above have been subjects of interest in the AI research field. One of the popular methods is NEAT (NeuroEvolution of Augmenting Topologies) and its extensions. NEAT variations use evolutionary algorithms not only to create the network but also to set its weights. Evolving the agent’s weights can be successful in a typical sparse-reward RL scenario; on the other hand, when (x, y) input/output pairs are available, gradient descent methods perform better.

Related Work

This is just an introductory post to draw some attention to this very interesting approach to machine learning model exploration. I’ll link some papers of interest that describe the state of the art in this domain.

*In no particular order


Get in touch twitter.com/techabilly | offbit.github.io
Check out what we do in AimBrain @ http://aimbrain.com

Code

You will need PyTorch to run the project.

Clone the project repository here: https://github.com/offbit/evo-design
