The high costs of meta-learning
And how to address them
Meta-learning is a growing research axis and, although it is still at an early stage, impressive results have already been achieved. For example, it has been shown that meta-learning can outperform the best handcrafted neural networks in many domains, e.g. image classification and object detection. However, one of the biggest concerns with meta-learning today is the high demand for computational resources. It is often already very costly to train a single model, let alone a full ensemble of different algorithms and parameters.
Meta-learning is expensive
While it might be feasible to pay this high cost in big data centers, it makes meta-learning especially hard to apply on commodity hardware. Moreover, it is difficult to even anticipate how many computational resources a given learning algorithm will require, which often leads to substantially over- or undersized resources. As a consequence, the available computational resources are often not optimally utilized. Last but not least, knowing the resource requirements in advance would avoid situations where an algorithm runs for hours or days before crashing with an out-of-memory exception, an unfortunate scenario that still happens with many learning frameworks today.
Various meta-learning frameworks have been developed over the last few years, both open source and commercial: AutoKeras, AutoGluon, Auto-WEKA, TPOT, and Google’s AutoML, to name just a few. A lot of progress has been made, especially on the search algorithm (meta search) itself. However, scalability problems and high computational costs remain big challenges.
How to make meta-learning cheap(er)?
At its core, meta-learning is about exploring many different model configurations. Looking at it from a high level, ideally we should only have to specify the maximum allowed computational resources (GPU and main memory) and the maximum allowed time, and then, when the time has elapsed, get the best model found under these constraints. Thus, the goal must be to maximize the number of models that can be explored on given hardware resources in a given time.
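To make the intended usage concrete, here is a minimal, purely hypothetical interface sketch. The names SearchBudget and meta_learn are assumptions for illustration only, not part of CalcGraph or any existing library; the point is simply that the caller specifies a resource and time budget and gets back the best model found within it.

```python
from dataclasses import dataclass

@dataclass
class SearchBudget:
    max_memory_mb: int  # upper bound on GPU and main memory to allocate
    max_time_s: int     # wall-clock budget for the whole search

def meta_learn(dataset, budget: SearchBudget):
    """Explore as many model configurations as the budget allows and
    return the best model found when the time has elapsed."""
    ...
```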
Obviously, the meta search itself also has to be optimized, i.e. models shouldn’t be explored blindly. Different approaches, e.g. based on random search, evolutionary algorithms, and reinforcement learning, have been successfully applied to reduce the search space. This is a full domain in itself and not the subject of this article. Nonetheless, in the end, meta-learning comes down to exploring as many models as possible: the more models we can explore in a given time on given hardware resources, the more likely it is that we will find a better model.
Meta-learning using the CalcGraph
In a previous blog post, we presented CalcGraph, a computational graph model specifically designed for meta-learning. We discussed ideas on how to make meta-learning efficient, in the sense that it can be executed on commodity hardware and optimally utilizes the available resources, and predictable, in the sense that the computational resources it uses can be defined before execution. In principle, the only settings needed for a meta-learning run are the amount of memory to allocate and the total search time.
Based on this information, we can automatically explore as many model configurations as possible until the maximum allowed search time has been reached. Then, the “best” model found within this time frame gets selected. The following figure depicts a meta-learning meta-model.
This meta-model is a refined version (targeted at deep learning) of the generic meta-learning meta-model we discussed in a previous post. In the following, we describe its main concepts.
As its name suggests, the CalcGraphExplorer is responsible for exploring the search space, i.e. different CalcGraph configurations. It can be seen as the controller or scheduler of the meta-learning process. The explorer can be configured with the maximum allowed time for the exploration, the maximum allowed memory to allocate, the maximum number of models to train in parallel, and the train/test/cross-validation split ratio. These parameters are called meta-learning parameters. As can be seen in the figure, the explorer has three relations to the CalcGraph models that it explores: one for the already explored models, one for the models currently in exploration, and one for the models that still need to be explored (prioritized in a queue).
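As a rough illustration of this bookkeeping (not the actual CalcGraph implementation), a Python sketch might look as follows. All class and field names are assumptions, and the queue is modeled as a simple priority heap.

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class ExplorerConfig:
    max_search_time_s: float               # total wall-clock budget
    max_memory_mb: int                     # memory the explorer may allocate
    max_parallel_models: int               # models trained at the same time
    split_ratio: tuple = (0.7, 0.2, 0.1)   # train / test / cross-validation

@dataclass
class Explorer:
    config: ExplorerConfig
    explored: list = field(default_factory=list)        # already evaluated
    in_exploration: list = field(default_factory=list)  # currently running
    queue: list = field(default_factory=list)           # pending, kept as a priority heap

    def enqueue(self, priority: float, model) -> None:
        # heapq is a min-heap, so negate the priority to pop the best model first;
        # id(model) breaks ties between equal priorities
        heapq.heappush(self.queue, (-priority, id(model), model))

    def next_model(self):
        return heapq.heappop(self.queue)[2] if self.queue else None
```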
The explorer uses a TimeManager to determine how much time should be allocated to each CalcGraph model and when a model has to be switched out so that another one can execute. Depending on the number of models the CalcGraph explorer should explore in parallel, it allocates one or several CalcHeaps for the execution of the models. Since we know exactly how much memory each model will take, we can easily execute models in parallel and thus utilize the available resources optimally.
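The post does not prescribe how the TimeManager apportions time; the sketch below assumes, for illustration, that the remaining budget is divided evenly among the pending models.

```python
import time

class TimeManager:
    """Sketch: divide the remaining search time among the pending models."""

    def __init__(self, total_time_s: float):
        self.deadline = time.monotonic() + total_time_s

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

    def slice_for(self, pending_models: int) -> float:
        # time slice granted to the next model before it is switched out
        return self.remaining() / max(1, pending_models)
```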
The BoardScore maintains the score of the different CalcGraph models; each model has its own score. Scores are composite, taking into account the time consumed, memory consumed, train score, test score, and cross-validation score. As in multi-objective optimization, the board score is organized as a Pareto front of the different models. The CalcGraph explorer can compare the score of the currently running model against previously run ones. This shows how promising the current solution is and allows, for example, stopping the exploration of a model when a better one has already been found.
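For illustration, a Pareto front over such composite scores can be computed with a standard dominance check. The sketch below keeps only three of the five objectives mentioned above for brevity; the names are assumptions, not the CalcGraph API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Score:
    time_s: float       # lower is better
    memory_mb: float    # lower is better
    test_score: float   # higher is better

def dominates(a: Score, b: Score) -> bool:
    """True if a is at least as good as b in every objective and strictly better in one."""
    no_worse = (a.time_s <= b.time_s and a.memory_mb <= b.memory_mb
                and a.test_score >= b.test_score)
    strictly_better = (a.time_s < b.time_s or a.memory_mb < b.memory_mb
                       or a.test_score > b.test_score)
    return no_worse and strictly_better

def pareto_front(scores: list) -> list:
    """Models not dominated by any other model."""
    return [s for s in scores
            if not any(dominates(o, s) for o in scores if o is not s)]
```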
The DatasetMetadata contains statistical information about the data set, like its size, dimension, etc., and its features. Features can be regular features or derived features. Each hyperparameter uses a DrawingDistribution, which specifies which values the explorer can choose for that parameter: for example, a random number between a minimum and a maximum value. It could also offer options to draw from a range linearly, or to draw from a range on a log scale, as for the learning rate, which is often varied step-wise by factors of ten. The drawing distribution can also be a fixed value, telling the controller not to modify this value. The concrete realizations of each of these elements can be arbitrarily complex and are research axes in their own right.
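The following sketch mirrors the three variants described above (fixed value, linear range, log-scale range). The class names are assumptions for illustration, not CalcGraph types.

```python
import math
import random

class Fixed:
    """Fixed value: tells the controller not to modify this hyperparameter."""
    def __init__(self, value):
        self.value = value
    def draw(self):
        return self.value

class LinearRange:
    """Draw uniformly between a minimum and a maximum value."""
    def __init__(self, low: float, high: float):
        self.low, self.high = low, high
    def draw(self) -> float:
        return random.uniform(self.low, self.high)

class LogRange:
    """Draw uniformly on a log scale, e.g. learning rates varied by factors of ten."""
    def __init__(self, low: float, high: float):
        self.low, self.high = low, high
    def draw(self) -> float:
        return 10 ** random.uniform(math.log10(self.low), math.log10(self.high))

# Example: learning_rate = LogRange(1e-5, 1e-1).draw()
```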
The search process
As mentioned before, the search process itself can be quite complex. Currently, we use a simple grid search that explores the search space with the same density everywhere. This is not optimal from a pure optimization perspective, since the most promising search regions should be explored more than the least promising ones. Nonetheless, this relatively simple algorithm already achieves good results. In the future, we plan to improve the search algorithm as follows: the first generation of models to explore is generated randomly and uniformly from the search region (as is already the case right now). After executing each model, new models are spawned around the same region as the parent model. The difference is that the priority for exploring these new models is set equal to the score of the parent that generated them. This way, models from the most promising regions get a higher priority than models from the least promising search regions. Since the queue has a maximum size, the lowest-priority models naturally get deleted from it, resulting in a process of elimination similar to what happens in genetic algorithms. Several other ways of generating models exist; this topic is an independent research axis in itself.
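A minimal sketch of this priority scheme, under assumed details the post leaves open (queue bound, how children are perturbed around the parent):

```python
import random

MAX_QUEUE_SIZE = 256  # assumed bound; the post only says the queue has a maximum size

def spawn_children(parent_config: dict, parent_score: float, n: int = 4) -> list:
    """Spawn n configurations near the parent; each inherits the parent's
    score as its exploration priority."""
    children = []
    for _ in range(n):
        child = {k: v * random.uniform(0.9, 1.1) if isinstance(v, float) else v
                 for k, v in parent_config.items()}
        children.append((parent_score, child))
    return children

def enqueue(queue: list, candidates: list) -> None:
    """Keep only the highest-priority candidates and eliminate the rest,
    similar to selection in genetic algorithms."""
    queue.extend(candidates)
    queue.sort(key=lambda pc: pc[0], reverse=True)
    del queue[MAX_QUEUE_SIZE:]
```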
The meta-learning process
If enough computational resources (GPU/RAM/grid computing) are available, the controller can take advantage of them to run several CalcGraph models in parallel. For the sake of simplicity, let us assume a hardware configuration able to run eight CalcGraph configurations (models) in parallel. The controller then first creates eight CalcHeaps to be linked dynamically with the eight running instances, plus a queue of CalcGraph configurations to try next. In this example, eight instances can be evaluated in parallel, while an independent task generates and queues new configurations to explore next. Whenever an instance finishes, either because learning has converged or because of a timeout, its CalcHeap is dissociated from the previously running instance and linked to the highest-priority CalcGraph waiting in the queue. This way, a model switch happens with very limited overhead in terms of memory and resource allocation. This is even more relevant when using several GPUs or grid computing, where resource allocation has high latency. At the very end of the meta-learning phase, when the total time runs out, all eight allocated CalcHeaps are freed and only the Pareto front of the best configurations is presented.
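The following sequential sketch stands in for the parallel loop described above: a fixed number of pre-allocated heaps are reused across model switches until the total budget runs out. The heap size, the train_one callback, and all names are assumptions for illustration; they are not the real CalcGraph runtime.

```python
import time

N_PARALLEL = 8
HEAP_SIZE_BYTES = 64 * 1024 * 1024  # assumed per-model heap size

def run_search(candidates, total_time_s: float, train_one):
    """Reuse N_PARALLEL pre-allocated heaps across model switches until the
    total time budget is exhausted (sequential stand-in for parallel execution)."""
    heaps = [bytearray(HEAP_SIZE_BYTES) for _ in range(N_PARALLEL)]
    deadline = time.monotonic() + total_time_s
    results = []
    it = iter(candidates)
    while time.monotonic() < deadline:
        for heap in heaps:          # in the real system each heap hosts a model running in parallel
            config = next(it, None)
            if config is None:      # nothing left to explore
                return results
            results.append(train_one(config, heap, deadline))
    return results                  # the Pareto front of these results is what gets presented
```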
Acknowledgement
This work is supported by the Luxembourg National Research Fund, FNR (https://www.fnr.lu/), under the Industrial Fellowship program (grant 12490856).