Large-Scale Evolution of Image Classifiers
Paper source: https://arxiv.org/pdf/1703.01041.pdf
Neural networks have proven effective at solving difficult problems, but designing their architectures can be challenging, even for image classification alone. Evolutionary algorithms provide a technique to discover such networks automatically. Despite significant computational requirements, we show that evolving models that rival large, hand-designed architectures is possible today. We employ simple evolutionary techniques at unprecedented scales to discover models for the CIFAR-10 and CIFAR-100 datasets, starting from trivial initial conditions. To do this, we use novel and intuitive mutation operators to navigate large search spaces. We stress that no human participation is required once evolution starts and that the output is a fully trained model. Throughout this work, we place special emphasis on the repeatability of results, the variability in the outcomes and the computational requirements.
Neural networks have proven their capability in both academic research and industrial applications, and various architectures can be constructed for specific real-world tasks. However, creating effective architectures remains a hand-designed art. This paper therefore proposes a novel approach to the architecture-optimization problem of neural networks for image classification. The proposed method uses intuitive mutations within an evolutionary algorithm to automatically arrive at an optimal network architecture, and the whole optimization process requires no human participation to obtain a fully trained model. To give an insightful look into this work, we introduce the details of applying evolutionary algorithms to neural networks and compare with previously published papers in the same domain. Finally, we give a brief evaluation of the overall quality of this work.
In the image classification domain, neural networks are highly successful classifiers across multiple difficult tasks, as long as sufficient training data is available. However, these top-performing network models are only attained through many years of research and analysis by researchers and engineers, so designing neural network architectures automatically has become a popular topic. One intuitive way to automate architecture design is search. We therefore apply the idea of the evolutionary algorithm: inspired by the mechanism of biological evolution, the algorithm treats candidate solutions as individuals in a large population, and a fitness function measures the quality of each candidate. An iterative selection process then takes the relatively better individuals as 'parents' that undergo reproduction, mutation and recombination to generate new offspring. By repeating these operations, each generation inherits advantages from its parents while mutation adds diversity to the population, so the search can eventually reach an optimal solution.
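The select-reproduce-mutate loop described above can be sketched in a few lines. Below is a minimal, generic evolutionary loop on a toy one-dimensional problem; all names and settings here are our own illustration, not the paper's system, which evolves trained networks rather than numbers:

```python
import random

def evolve(fitness, init, mutate, pop_size=20, generations=200, seed=0):
    """Minimal evolutionary loop: keep a population, repeatedly replace
    the least-fit individual with a mutated copy of a fitter parent."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness)      # worst first
        parent = rng.choice(ranked[pop_size // 2:])   # pick from fitter half
        ranked[0] = mutate(parent, rng)               # replace the worst
        population = ranked
    return max(population, key=fitness)

# Toy problem: maximize f(x) = -(x - 3)^2; individuals are floats.
best = evolve(
    fitness=lambda x: -(x - 3.0) ** 2,
    init=lambda rng: rng.uniform(-10, 10),
    mutate=lambda x, rng: x + rng.gauss(0, 0.5),
)
```

After a couple of hundred generations the population clusters near the optimum at x = 3; the paper's system follows the same loop but with architectures as individuals and validation accuracy as the fitness.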
Neuro-evolution was originally used only to evolve the connection weights of a neural network. Take the NEAT (NeuroEvolution of Augmenting Topologies) algorithm (Stanley & Miikkulainen) as an example: it uses three kinds of mutations, namely modifying a weight, adding a weight connection between existing nodes, and inserting a node by splitting an existing connection. Other similar approaches also used evolution for hyper-parameter search. However, these methods apply only to rather small networks compared to the widely used modern ones (e.g. AlexNet, VGGNet); they focused on the efficiency of the evolutionary process rather than on scale, which is inadequate for state-of-the-art network optimization. Non-evolutionary neuro-discovery methods, such as Bayesian optimization and reinforcement learning, are better suited to current 'deep' networks, but their drawback is obvious: the number of network layers is still chosen by researchers rather than by the algorithms themselves. Hence, the proposed work is novel in studying large-scale search of network architectures from trivial initial conditions, without any human intervention. The experimental comparisons are shown as follows.
According to Tables 1 and 2, the classification performance on C10+ and C100+ under the proposed evolutionary method surpasses both the hand-crafted approaches and the other automatically discovered architectures, and requires no post-processing stage.
In the following parts of this article, we explain the proposed large-scale evolution method in detail.
Following the working principle of evolutionary algorithms, the method treats a trained architecture as an individual, so a population can be created from multiple models, with accuracy on a validation set serving as the fitness value. The paper proposes a graph as the data structure encoding each individual's basic architecture. In this graph, the vertices represent rank-3 tensors, as is common in convolutional neural networks: two dimensions are the spatial coordinates of the image and the third represents the channels (e.g. the RGB color channels of the input). The edges represent identity connections or convolutions and carry the mutable parameters. Evolution then proceeds through pairwise competitions between randomly chosen individuals: the lower-fitness model is weeded out, while the better one serves as a parent that reproduces a copy; to increase diversity in the population, a mutation is applied to the copy during reproduction. Repeating these pairwise tournaments searches the large space for ever-better offspring. To reach the required scale, a massively parallel, lock-free infrastructure was developed: many computers operate asynchronously and communicate via a shared file system in which directories denote individuals, with removal and garbage collection of dead individuals handled for computational efficiency. Reproduction thus evolves the whole population, with each mutation chosen at random from a predetermined set. The mutation operations include altering the learning rate, keeping the model unchanged (identity), resetting the weights, inserting or removing a convolution, altering strides, channel numbers or filter sizes, and adding or removing skip connections.
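As a rough illustration of the pairwise-tournament scheme, the sketch below evolves a toy 'DNA' (a dict of two mutable parameters) under a simplified mutation set. In the real system the DNA encodes the architecture graph and fitness is the validation accuracy of a trained model; everything here (the parameter names, the toy fitness, the mutation effects) is a stand-in of our own:

```python
import random

def random_dna(rng):
    # The real system starts from a trivial linear model; here the "DNA"
    # is just two mutable hyper-parameters.
    return {"learning_rate": 0.1, "num_filters": 8}

# Simplified mutation set, loosely inspired by the paper's operators
# (alter-learning-rate, alter-number-of-channels, identity, ...).
MUTATIONS = ["alter_learning_rate", "alter_num_filters", "identity"]

def mutate(dna, rng):
    child = dict(dna)  # reproduction: mutate a copy, not the parent
    op = rng.choice(MUTATIONS)
    if op == "alter_learning_rate":
        child["learning_rate"] *= rng.choice([0.5, 2.0])
    elif op == "alter_num_filters":
        child["num_filters"] = max(1, child["num_filters"] + rng.choice([-4, 4]))
    return child  # "identity" leaves the copy unchanged

def toy_fitness(dna):
    # Stand-in for validation accuracy: prefers lr near 0.01, 64 filters.
    return -abs(dna["learning_rate"] - 0.01) - abs(dna["num_filters"] - 64) / 64

def tournament_step(population, rng):
    """One evolution step: sample two individuals at random, remove the
    worse one, and replace it with a mutated copy of the better one."""
    i, j = rng.sample(range(len(population)), 2)
    worse, better = sorted((i, j), key=lambda k: toy_fitness(population[k]))
    population[worse] = mutate(population[better], rng)

rng = random.Random(0)
population = [random_dna(rng) for _ in range(50)]
for _ in range(2000):
    tournament_step(population, rng)
best = max(population, key=toy_fitness)
```

In the paper, many workers run `tournament_step` asynchronously against the shared file system, each training the mutated child before it re-enters the population.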
During this mutation process, the parameters yield a dense search space, meaning that no upper limit is imposed on any of them; all model depths are likewise attainable. This unbounded nature of the parameters results in the exploration of a truly large set of possible architectures. In other words, both the parameters and the architecture of the neural network can be evolved without human intervention.
With the theoretical background explained, we turn to the initial settings and validation strategy for the experiments. For initialization, it is well known that even a single trained convolutional neural network is a strong classifier that may obtain relatively good accuracy. The paper therefore begins its experiments with a population of very simple individuals: they contain no convolutions and perform poorly at classification, and the learning rate is initialized to the large value of 0.1. This setting forces the individuals to learn and to evolve into strong classifiers, discovering improvements through mutation, and it prevents the experiments from being 'rigged' toward a successful end.
Another strategy to speed up evolution is weight inheritance. Here, inheritance means that individuals can inherit part or all of their parents' weights whenever possible. As for the reporting methodology, 'the best model' within an experiment refers to the model with the highest validation accuracy; likewise, 'the best experiment' among all experiments is the one whose best model achieves the highest validation accuracy.
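A minimal sketch of weight inheritance, in our own naming (the paper applies the same idea to the TensorFlow variables of the mutated architecture): the child keeps the parent's trained weights for every layer whose name and shape survive the mutation, and re-initializes only what changed:

```python
import math
import random

def init_layer(shape, rng):
    # Fresh random initialization for a layer of the given shape.
    return [rng.gauss(0, 0.1) for _ in range(math.prod(shape))]

def inherit_weights(parent, child_shapes, rng):
    """parent: {layer_name: (shape, weights)}; child_shapes: {layer_name: shape}.
    Layers whose name and shape are unchanged by the mutation keep the
    parent's trained weights; new or reshaped layers start from scratch."""
    child = {}
    for name, shape in child_shapes.items():
        if name in parent and parent[name][0] == shape:
            child[name] = (shape, parent[name][1])         # inherited
        else:
            child[name] = (shape, init_layer(shape, rng))  # re-initialized
    return child

rng = random.Random(0)
parent = {"conv1": ((3, 3, 8), init_layer((3, 3, 8), rng)),
          "conv2": ((3, 3, 16), init_layer((3, 3, 16), rng))}
# A mutation changed conv2's channel count; conv1 is untouched.
child = inherit_weights(parent, {"conv1": (3, 3, 8), "conv2": (3, 3, 32)}, rng)
```

Because inherited layers resume from trained weights, a child does not have to relearn everything its parent already knew, which is what makes evolution at this scale affordable.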
Beyond training and testing strategies, computation cost is another crucial aspect of the experiments. The system is implemented in TensorFlow, in which the genetic mutations and the convolutions can both be treated as TF (TensorFlow) operations. For each of these TF operations, the theoretical number of floating-point operations (FLOPs) required is estimated, and each individual is assigned a computation cost of

cost = Ft·Nt·E + Fv·Nv,

where Ft and Fv denote the FLOPs needed to train and validate on a single sample, Nt and Nv are the numbers of training and validation samples, and E is the number of training epochs. The computation cost of an experiment is then the sum of the costs of all its individuals.
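The cost bookkeeping (training FLOPs per sample, times training samples, times epochs, plus one validation pass, summed over all individuals) is simple arithmetic; a small helper with our own naming makes it concrete:

```python
def individual_cost(flops_train, flops_valid, n_train, n_valid, epochs):
    """Theoretical cost of one individual: per-sample training FLOPs times
    the number of training samples times epochs, plus one validation pass."""
    return flops_train * n_train * epochs + flops_valid * n_valid

def experiment_cost(individuals):
    """Cost of a whole experiment: the sum over all its individuals."""
    return sum(individual_cost(**ind) for ind in individuals)

# Small illustrative numbers, not the paper's actual measurements.
cost = individual_cost(flops_train=10, flops_valid=5,
                       n_train=100, n_valid=20, epochs=3)
```

With these toy numbers the individual costs 10·100·3 + 5·20 = 3100 FLOPs; the real experiments accumulate on the order of 10¹⁹ to 10²⁰ FLOPs this way.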
The next step is running the algorithm stated above. Each experiment evolves a population in a few days, as illustrated by the example in the following figure:
After the training phase, the proposed method obtains 94% accuracy on the CIFAR-10 dataset using 9 × 10¹⁹ FLOPs. Applying the same neuro-evolutionary algorithm with the same parameters to CIFAR-100 yields 76.3% accuracy using 7 × 10¹⁹ FLOPs. As discussed before, these results are quite competitive with state-of-the-art hand-designed results on both datasets.
A statistical analysis was also performed on the results across multiple experiments. The evolutionary progress is shown in the following figure:
The analysis concludes that the rather large search space must be searched thoroughly to reach a better optimum, and that increasing the mutation rate in a proper manner helps escape local minima. Accuracy also increases when the meta-parameter T, the number of training steps per individual, is increased: with a larger T, an individual needs to undergo fewer identity mutations to reach a given level of training.
Other parameter-setting tricks were also applied for further exploration; their performance is shown as follows.
To sum up, this work proposes a novel neuro-evolution algorithm for large-scale search. Its novelty lies in handling rather large neural networks (CNNs, etc.). Using new mutation strategies, the method achieves quite competitive performance on classification tasks, and the approach transfers well (from CIFAR-10 to CIFAR-100). The drawback is the expensive computation cost: typical applications do not have many parallel high-performance computing machines available for a single task. Another consideration is that classification can nowadays be handled well by a single powerful CNN classifier, so such an exhaustive search may cost too much for a limited accuracy improvement. However, if this method can be extended to tasks with more room for improvement, e.g. segmentation or detection, this attempt is a good start with high potential.
Author: Angulia Yang| Editor: Junpei Zhong| Localized by Synced Global Team: Zhen Gao