Auto is the new black — Google AutoML, Microsoft Automated ML, AutoKeras and auto-sklearn
Motivation: Life is hard
Achieving state-of-the-art performance on a given data set is hard. It usually implies carefully selecting the right data pre-processing tasks, picking the right algorithm, model and architecture, and pairing them with the right set of parameters. This end-to-end process is usually called a machine learning pipeline. There is no rule of thumb for which direction to go and, with more models being developed all the time, even picking the right model is becoming challenging. Hyper-parameter tuning usually requires walking or sampling over all the possible values and just trying them out, with no guarantee of finding anything useful. In this context, automating the selection and tuning of machine learning pipelines has long been one of the goals of the machine learning community. This kind of task is usually referred to as meta-learning: learning about learning.
It also seems it has been part of our landscape since the beginning of time. A funny fairy tale…
Once upon a time, there was a sorcerer who trained models in a framework that no longer exists, in a programming language that nobody codes in anymore. One day, an old man asked him to train a model for a mysterious dataset.
The sorcerer tried to train the model in thousands of different ways, without luck. He went to his library looking for a way out, and he found a book about a special spell. A spell that could send him to a hidden realm where everything was discovered, every possible model was tried and every optimization technique was done. He cast the spell and was sent there. He then saw exactly what to do to get a better model. And so he did. Before returning, he could not resist the temptation of bringing all that power with him, so he conferred all the wisdom of the realm into a stone he called Auto. And so he returned. The legend says that whoever masters the power of the stone will be able to train any model they ever wanted.
Scary, isn’t it? I don’t know whether the story is true, but back in modern life the big players in the machine learning space seem interested in making it a reality (probably with some alterations). In this post, I will share which options are available and try to help you build some intuition about what each of them is doing (because although they all have the word “auto” in their names, they share almost nothing in common).
Azure Automated Machine Learning (now GA)
Cloud-based: Yes (Evaluation, Training can be done in any compute target)
Supports: Classification, Regression
Techniques: Probabilistic Matrix Factorization + Bayesian optimization
Training framework: sklearn
The idea behind this approach is that if two data sets have similar (i.e. correlated) results for a few pipelines, it’s likely that the remaining pipelines will produce similar results as well. This may sound familiar to you, especially if you have worked with collaborative filtering for recommendations before: if two users liked the same items in the past, they are more likely to like similar ones in the future.
Solving the problem implies two things: learning a hidden representation that captures the relationships between different data sets and different machine learning pipelines, so as to predict the accuracy a pipeline will achieve on a given data set; and learning a function that successfully informs which pipeline to try next. The first task is addressed by building a matrix with the balanced accuracy that different pipelines obtained over different data sets. The paper describing the method specifies they tried 42,000 different ML pipelines over around 600 data sets. What you see today in Azure is probably different, but you get the idea. The authors state that the hidden representation successfully captured information not just about the models, but also about the hyper-parameters and the data set characteristics (note that this learning happens in an unsupervised fashion).
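To build some intuition for the factorization step, here is a toy, non-probabilistic sketch with made-up accuracies: fit rank-2 embeddings for pipelines and data sets to the observed cells of the accuracy matrix via stochastic gradient descent, then read the missing cell off a dot product. The real system uses probabilistic matrix factorization at a vastly larger scale; everything below is illustrative.

```python
import random

random.seed(0)
observed = {                      # (pipeline, dataset) -> balanced accuracy (made up)
    (0, 0): 0.80, (0, 1): 0.60,
    (1, 0): 0.82, (1, 1): 0.62,
    (2, 0): 0.50,                 # pipeline 2 on dataset 1 is the unknown cell
}
rank = 2
P = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(3)]  # pipeline factors
D = [[random.gauss(0, 0.1) for _ in range(rank)] for _ in range(2)]  # dataset factors

def predict(i, j):
    """Predicted accuracy of pipeline i on dataset j."""
    return sum(P[i][k] * D[j][k] for k in range(rank))

# Plain SGD on the squared reconstruction error of the observed cells.
for _ in range(2000):
    for (i, j), acc in observed.items():
        err = acc - predict(i, j)
        for k in range(rank):
            P[i][k], D[j][k] = (P[i][k] + 0.05 * err * D[j][k],
                                D[j][k] + 0.05 * err * P[i][k])

print(round(predict(2, 1), 2))  # filled-in estimate for the unseen pair
```

Because the two data sets are correlated (pipelines 0 and 1 score about 0.2 lower on dataset 1), the model predicts that pipeline 2 will also score below its 0.50 on dataset 0.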
The model described so far can predict the expected performance of each ML pipeline as a function of the pipelines already evaluated, but it does not yet give any guidance as to which pipeline should be tried next. Since they use a probabilistic version of matrix factorization, the method produces a predictive posterior distribution over the performance of the pipelines, enabling the use of an acquisition function (Bayesian optimization) to guide the exploration of the ML pipeline space. Basically, the method picks as the next pipeline the one that maximizes the expected improvement in accuracy.
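A minimal sketch of that acquisition step, assuming the posterior over each pipeline’s accuracy is summarized by a mean and a standard deviation (the pipeline names and numbers below are invented): expected improvement rewards candidates that are either promising or uncertain.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """Expected improvement of a candidate whose predicted accuracy
    has the given posterior mean and standard deviation."""
    if std == 0.0:
        return max(0.0, mean - best_so_far)
    z = (mean - best_so_far) / std
    # Standard normal pdf and cdf via math.erf (no SciPy needed).
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best_so_far) * cdf + std * pdf

def next_pipeline(posterior, best_so_far):
    """posterior: {pipeline_id: (mean, std)} from the factorization."""
    return max(posterior,
               key=lambda p: expected_improvement(*posterior[p], best_so_far))

posterior = {"pipeline_a": (0.80, 0.01),   # confident, small possible gain
             "pipeline_b": (0.78, 0.10)}   # uncertain, could be much better
print(next_pipeline(posterior, best_so_far=0.79))
```

Note how the uncertain pipeline wins here even though its mean is lower: the posterior spread is what makes exploration possible.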
However, recommendation systems suffer from a very particular problem: cold start. If a new data set comes along (i.e. your data set), the model has no clue what this new data set is similar to. To solve the cold-start problem, some meta-features are computed from the data set to capture characteristics like the number of observations, number of classes, value ranges, etc. Using those metrics, they identify a close data set in the space of known data sets. They try 5 different proposed pipelines on it before starting to use the acquisition function to inform which pipeline to try next. Notice that this method doesn’t require access to the actual data set, just to those meta-features, which are computed locally (which also makes it cheaper).
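The cold-start step can be pictured like this. This is a sketch under my own assumptions: the meta-feature names are illustrative, and plain Euclidean distance stands in for whatever similarity measure the service actually uses.

```python
import math

def meta_features(X, y):
    """Cheap, locally computed dataset descriptors (illustrative set)."""
    n = len(X)
    labels = set(y)
    return {
        "n_observations": n,
        "n_features": len(X[0]),
        "n_classes": len(labels),
        "class_entropy": -sum((y.count(l) / n) * math.log(y.count(l) / n)
                              for l in labels),
    }

def closest_dataset(new_mf, known):
    """known: {dataset_name: meta_feature_dict}. Euclidean distance in
    meta-feature space stands in for the real similarity measure."""
    def dist(a, b):
        return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))
    return min(known, key=lambda name: dist(new_mf, known[name]))

known = {
    "small-tabular": {"n_observations": 5, "n_features": 2,
                      "n_classes": 2, "class_entropy": 0.69},
    "image-like":    {"n_observations": 10000, "n_features": 784,
                      "n_classes": 10, "class_entropy": 2.30},
}
mf = meta_features([[1, 2], [3, 4], [5, 6], [7, 8]], ["a", "a", "b", "b"])
print(closest_dataset(mf, known))  # the small tabular data set is closest
```

Only `mf` ever needs to leave your machine, which is the privacy argument made above.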
Google AutoML (Beta)
Cloud-based: Yes (Training and evaluation)
Supports: CNN, RNN, LSTM for classification.
Techniques: Reinforcement learning with policy gradient updates
Training framework: TensorFlow
When it comes to neural networks, the success of recent state-of-the-art models is associated with a paradigm shift from feature design to architecture design. That is, building machine learning architectures that are able to learn the best representations from the data in an unsupervised way, instead of directly engineering such features (which is complex and requires a lot of knowledge about the data). However, designing architectures still requires a lot of knowledge and takes time. The idea behind Google AutoML is to create a meta-model capable of learning to design, generate and propose architectures such that a model built from the proposed architecture performs well on the data set of interest (wow).
They use Neural Architecture Search, implemented as an RNN that generates (samples) architectures encoded as variable-length sequences of tokens (which is a very sophisticated way of saying “a string”). The sequence of tokens can be seen as a sequence of actions to perform in order to build the architecture.
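To make the token encoding concrete, here is a simplified decoder for such a sequence, assuming the controller emits one (filter height, filter width, filter count) triple per convolutional layer and ignoring strides and skip connections; this is my own illustration, not the paper’s actual encoding.

```python
def decode(tokens):
    """tokens: flat list sampled by the controller RNN, read in
    groups of three: filter height, filter width, filter count."""
    layers = []
    for i in range(0, len(tokens), 3):
        h, w, f = tokens[i:i + 3]
        layers.append({"filter_height": h, "filter_width": w, "filters": f})
    return layers

print(decode([3, 3, 24, 5, 5, 36]))
# Two conv layers: 3x3 with 24 filters, then 5x5 with 36 filters.
```

Because the sequence has variable length, the same controller can propose networks of any depth.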
Once an architecture is generated, the proposed model is constructed and trained, and the accuracy it achieves is recorded. The RNN is trained using reinforcement learning, with a policy that updates the parameters of the RNN so that it generates better architectures over time.
The model will eventually achieve an accuracy R on the data set. We can then think of using R as a reward signal along with reinforcement learning to train the RNN. However, such a reward is non-differentiable, which is why they propose to use a policy gradient method to iteratively update the parameters, as suggested in Williams (1992), with some alterations. Since the training process can be very time consuming, they use distributed training and asynchronous parameter updates to speed up the learning process, as in Dean et al. (2012).
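The REINFORCE idea can be shown on a toy version of the problem. In this sketch (entirely my own construction, not the paper’s controller), a softmax policy picks between two filter sizes, the reward is a hard-coded stand-in for the child model’s accuracy, and a moving-average baseline reduces the variance of the gradient estimate:

```python
import math
import random

random.seed(0)
theta = [0.0, 0.0]          # controller logits for the choices [3x3, 5x5]
baseline, lr = 0.0, 0.5

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

for step in range(200):
    probs = softmax(theta)
    action = random.choices([0, 1], weights=probs)[0]
    reward = 0.9 if action == 0 else 0.6     # stand-in for child accuracy R
    baseline = 0.9 * baseline + 0.1 * reward # moving average, reduces variance
    # REINFORCE: d log pi(action) / d logit_i is (1 - p_i) for the taken
    # action and -p_i for the others; scale by the advantage (R - baseline).
    for i in range(2):
        grad = (1.0 if i == action else 0.0) - probs[i]
        theta[i] += lr * (reward - baseline) * grad

print(softmax(theta))  # probability mass shifts toward the 3x3 choice
```

In the real system the "reward" is an actual training run of the child network, which is exactly why the process is so expensive.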
Which models can it generate? Based on the published paper, for convolutional architectures they use rectified linear units as non-linearities (Nair & Hinton, 2010), batch normalization (Ioffe & Szegedy, 2015) and skip connections between layers (Szegedy et al., 2015 and He et al., 2016a). For every convolutional layer, the controller can select a filter height in [1, 3, 5, 7], a filter width in [1, 3, 5, 7], and a number of filters in [24, 36, 48]. For strides, it has to predict a stride in [1, 2, 3]. For RNNs and LSTMs, the architecture supports the selection of an activation function in [identity, tanh, sigmoid, relu]. The number of input pairs to the RNN cell (the “base number”) is set to 8.
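Even this modest-looking menu of choices produces a huge search space, which is worth sanity-checking with a quick count:

```python
from itertools import product

# Per-layer choices reported in the paper for convolutional layers.
filter_heights = [1, 3, 5, 7]
filter_widths = [1, 3, 5, 7]
num_filters = [24, 36, 48]
strides = [1, 2, 3]

layer_space = list(product(filter_heights, filter_widths, num_filters, strides))
print(len(layer_space))  # 144 configurations for a single layer
# A 10-layer network already has 144**10 (~3.8e21) candidate architectures,
# which is why a learned controller beats exhaustive search.
```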
AutoKeras (open source)
Supports: CNN, RNN, LSTM for classification.
Techniques: Efficient Neural Architecture Search with Network Morphism
Training framework: Keras
AutoKeras builds on the same idea as Google AutoML: it uses an RNN controller trained in a loop that samples a candidate architecture, i.e. a child model, and then trains it to measure its performance on the task of interest. The controller then uses this performance as a guiding signal to find more promising architectures. However, what we didn’t mention before is how computationally expensive the process can be. Neural Architecture Search is computationally expensive and time consuming; e.g. Zoph et al. (2018) use 450 GPUs for roughly 40,000 GPU hours. On the other hand, using fewer resources tends to produce poorer results. To address this issue, AutoKeras uses Efficient Neural Architecture Search (ENAS).
ENAS applies a concept similar to transfer learning: parameters learned for a particular model on a particular task can be reused by other models on other tasks. Hence, ENAS forces all generated child models to share weights, deliberately avoiding training each child model from scratch. The authors of the paper show that not only is sharing parameters among child models possible, it also allows for very strong performance.
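The weight-sharing mechanics can be sketched as a single shared pool that every sampled child looks its layer weights up in. This is a deliberately minimal illustration (weights are single floats here, and the operation names are invented), not ENAS’s actual implementation:

```python
import random

random.seed(1)
shared_weights = {}  # (layer_index, operation) -> weights (here: one float)

def get_weights(layer_index, operation):
    """Lazily initialize weights once per (position, op) pair; every
    child that uses this pair afterwards reuses the same entry."""
    key = (layer_index, operation)
    if key not in shared_weights:
        shared_weights[key] = random.gauss(0.0, 0.1)
    return shared_weights[key]

def build_child(architecture):
    """architecture: list of operation names, one per layer."""
    return [get_weights(i, op) for i, op in enumerate(architecture)]

child_a = build_child(["conv3x3", "conv5x5"])
child_b = build_child(["conv3x3", "maxpool"])
# Both children share the layer-0 conv3x3 weights:
print(child_a[0] == child_b[0])  # True
```

Training either child updates the shared entries, so no child ever starts from scratch; that is where the orders-of-magnitude speedup comes from.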
auto-sklearn
Supports: Classification, Regression.
Techniques: Bayesian optimization + automated ensemble construction
Training framework: sklearn
Auto-sklearn is based on the definition of the CASH (Combined Algorithm Selection and Hyperparameter optimization) problem used by Auto-WEKA, and on the same idea behind Azure Automated ML: it considers the problem of simultaneously selecting a learning algorithm and setting its hyper-parameters. The main difference they propose is to incorporate two extra steps into the main process: a meta-learning step at the beginning and an automated ensemble construction step at the very end, as explained in this paper.
The method characterizes data sets using a total of 38 meta-features, including simple, information-theoretic and statistical ones such as the number of data points, features and classes, the data skewness, and the entropy of the targets. Using this information, they select k configurations to seed Bayesian optimization. Note that this meta-learning approach draws its power from the availability of a repository of data sets, in the same way Azure Automated ML does.
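The warm-starting step could look roughly like this: rank the repository’s data sets by distance in meta-feature space and return the best-known configurations of the k nearest ones. The repository contents, meta-feature vectors and configuration names below are all made up for illustration.

```python
import math

def seed_configurations(new_mf, repository, k=2):
    """repository: {dataset: {"meta": [...], "best_config": pipeline_id}}.
    Returns the best-known configurations of the k most similar data
    sets, used to warm-start the Bayesian optimization."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    ranked = sorted(repository,
                    key=lambda d: dist(new_mf, repository[d]["meta"]))
    return [repository[d]["best_config"] for d in ranked[:k]]

repository = {
    "census-like": {"meta": [100, 4, 2], "best_config": "random_forest"},
    "credit-like": {"meta": [120, 5, 2], "best_config": "svm_rbf"},
    "digits-like": {"meta": [70000, 784, 10], "best_config": "sgd"},
}
print(seed_configurations([110, 4, 2], repository, k=2))
```

Bayesian optimization then starts from configurations that worked on similar problems instead of from random ones.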
After the Bayesian optimization phase is done, they construct an ensemble of all the models they tried out. The idea at this step is to save all the hard work invested in training each model. Rather than discarding these models in favor of the single best one, they store them and eventually construct an ensemble out of all of them. This automatic ensemble construction avoids committing to a single hyper-parameter setting and is thus more robust (and less prone to over-fitting). They construct the ensemble using ensemble selection (a greedy procedure that starts from an empty ensemble and iteratively adds the model that maximizes ensemble validation performance).
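The greedy procedure can be sketched in a few lines. This toy version uses majority voting over hard class labels and tiny made-up validation predictions; auto-sklearn’s actual implementation works on predicted probabilities and adds models with replacement in the same greedy spirit.

```python
def ensemble_accuracy(members, predictions, y_val):
    """Majority vote of the selected members on the validation set."""
    correct = 0
    for i, true in enumerate(y_val):
        votes = [predictions[m][i] for m in members]
        pred = max(set(votes), key=votes.count)
        correct += (pred == true)
    return correct / len(y_val)

def ensemble_selection(predictions, y_val, size):
    """Start empty; at each step add (with replacement) whichever stored
    model most improves validation accuracy of the ensemble."""
    members = []
    for _ in range(size):
        best = max(predictions,
                   key=lambda m: ensemble_accuracy(members + [m],
                                                   predictions, y_val))
        members.append(best)
    return members

predictions = {            # validation-set predictions of each stored model
    "m1": [0, 1, 1, 0],
    "m2": [0, 1, 0, 0],
    "m3": [1, 1, 1, 1],
}
y_val = [0, 1, 1, 0]
print(ensemble_selection(predictions, y_val, size=2))
```

Because models are added with replacement, a strong model can appear several times, which effectively weights it more heavily in the vote.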
Conclusions and opinions
Each of these methods has its own advantages and disadvantages, as well as a different space to compete in. Azure Automated ML and auto-sklearn are built on the same idea, work for regression and classification tasks, are less compute-intensive, and are hence cheaper to run. They don’t need to see the entire data set (only the locally computed meta-features, as long as the candidate models are trained on compute you control), which makes them appropriate if you care about privacy. However, they rely heavily on the set of data sets they have already seen. They cannot try anything new beyond the ML pipelines that were considered ahead of time. I personally would hesitate to call these methods meta-learning.
On the other hand, Google AutoML and AutoKeras use the same approach: trying to learn a way to construct models from scratch. This is a more ambitious goal, which is why it is also more limited in its action space (CNN, RNN, LSTM). However, their reinforcement learning approach lets them explore new ways of constructing models (Google claims their method found a model 1.05x better than one they already had). This sounds more like meta-learning. However, RL methods consume a lot of computing power, which is why they charge you 20 USD per hour. AutoKeras’s performance optimizations, which trade some accuracy for speed, are really attractive in this setting (plus it is open source, which is again good if you care about privacy).