Are You Still Using Grid Search for Hyperparameter Optimization?
Let's discuss the ideas behind how to search the hyperparameters of your machine learning models in a smart fashion.
When we are training a machine learning model, we have some choices to make: which model to use, how to prepare our data set, how to handle outliers, and so on.
One of those choices is the hyperparameters; these are the parameters that control the learning process but can't be derived from the training itself. Some examples (made concrete in the code sketch after this list) are:
- The learning rate, epochs, and the number of layers/neurons in a neural network.
- The number of trees, max depth, and quality criteria in a random forest.
- The kernel, gamma, and C values of support vector machines.
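To make this concrete, here is how some of those hyperparameters appear when instantiating models in scikit-learn; the values are purely illustrative, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hyperparameters are fixed before training, as constructor arguments.
forest = RandomForestClassifier(n_estimators=300, max_depth=10, criterion="gini")
svm = SVC(kernel="rbf", gamma=0.1, C=1.0)

# In contrast, regular parameters (the trees' split thresholds, the SVM's
# support vectors) are learned from the data when we call .fit().
```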
Some of the questions that come along are: What hyperparameters should I set for this model? Are the default ones good enough? Can I improve the performance just by trying a few more architectures? And how can I decide which options to try? All these questions are related to hyperparameter tuning [1].
In this post, I'll discuss:
- The standard strategies for hyperparameter tuning in supervised models.
- The problem with uninformed strategies.
- The trade-offs of informed strategies.
- How to implement genetic algorithm (GA) ideas for hyperparameter tuning.
- What extra features we could bring to the GA.
- References and examples of Python packages with several implementations.
If you want to go right away to some code with examples, you can check my other Medium post with that content.
1. Base Strategies:
There are several options for finding the hyperparameters that are going to boost your model's performance; to mention a few:
- Grid Search
- Randomized Grid Search
- Bayesian Optimization
- Genetic Algorithms
Both Grid Search and Randomized Grid Search are what we could call a "brute-force approach," meaning that the choices of hyperparameters are not made in an informed way, but by just trying some options and hoping one works.
They have the advantage of being simple to implement and can be fast enough to get decent results: try some combinations and choose the best based on some metric. For two hyperparameters, the process may look like this:
The above picture represents how Grid and Randomized Grid Search might perform when trying to optimize a model whose scoring function (e.g., the AUC) is the sum of the green and yellow areas, with the contribution to the score given by the height of the areas, so basically only the green one is significant for the score.
2. Visualizing the problem
You can see that the Randomized layout is somewhat better than the Grid one, since it explores more distinct values of the optimization function; and as one of the parameters was not significant, Grid Search "wasted" six extra models with no important changes in the score. In both cases, though, it is only by chance that we get a value near or far from the optimum.
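In scikit-learn, both strategies are available as drop-in classes; here is a minimal sketch (the search spaces are illustrative):

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid Search: tries every combination of the listed values (3 x 3 = 9 models).
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Randomized Search: samples 9 combinations from the given distributions.
randomized = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=9,
    cv=5,
    random_state=0,
)
randomized.fit(X, y)

print(grid.best_params_, randomized.best_params_)
```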
I'll oversimplify this representation into a one-dimensional function, just to gain some intuition:
In the above plot, the red line represents the cross-validation accuracy (or any other metric) that the estimator may achieve, and the dots are fixed choices of hyperparameters placed at their accuracy score. For example, if this were a neural network, the dots could represent the following:
dot 1: {"learning_rate": 0.001, "layers": 5, "optimizer": "Adam"}
dot 2: {"learning_rate": 0.01, "layers": 3, "optimizer": "Adagrad"}
dot 3: {"learning_rate": 1, "layers": 6, "optimizer": "SGD"}
dot 4: {"learning_rate": 0.001, "layers": 8, "optimizer": "RMSprop"}
dot 5: {"learning_rate": 0.1, "layers": 20, "optimizer": "SGD"}
According to this, we could end up at any of the values along that line; we want to get as high as we can, but by randomly or manually selecting the dots, we might or might not get good results.
OK... we get the problem, so how can we make it better?
3. The informed way
The other two methods that I mentioned are Bayesian Optimization [3] and Genetic Algorithms [4]. These, we could say, follow an informed criterion to make a choice; what they have in common is that they follow a sequential process that tries to find a better set of hyperparameters based on the decisions made in the past. If we think of it as an iterative process, it could look like this:
You can see that the first iteration might look like what a Randomized Grid Search would do, and that's OK, we have to start at some point; the idea is to explore the space of hyperparameters smartly, making progress in each iteration, modifying the number of sets to try, creating new ones, etc.
Of course, as expected in data science, this is not a silver bullet solution; several things might come up, to mention some:
- The exploration-exploitation dilemma: Should I go to less-explored regions in case I'm missing something? Or should I keep trying dots near the regions I already know show promising results?
- Local maxima: This comes as a possible consequence of the first one; what if my optimization function is non-convex, non-linear, and noisy? What if I get stuck on a local maximum, thinking it is the best (global) solution?
- Resource consumption: As you can see, we have made this an iterative process that may consume more resources than a simple Grid Search would do.
- New parameters: How many iterations should I run? How can I control exploration vs. exploitation? Should I create more sets in each iteration? Aren't those just more hyperparameters that I have to decide?
Fortunately, there are several implementations that have already (to some extent) taken care of those questions, and the final impact of those choices on the score we are optimizing is less sensitive than the choice of the hyperparameters of the machine learning model itself.
You could check, for example, sklearn-genetic-opt for the genetic algorithms approach or scikit-optimize for the Bayesian one. I will explain the genetic algorithms approach.
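As a quick taste of the Bayesian route before moving on, here is a hedged sketch using scikit-optimize's BayesSearchCV (argument names may vary slightly between versions); it mirrors GridSearchCV, but each new candidate is picked based on the scores of the previous ones:

```python
from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

opt = BayesSearchCV(
    SVC(),
    {"C": Real(1e-2, 1e2, prior="log-uniform"),
     "gamma": Real(1e-3, 1e1, prior="log-uniform")},
    n_iter=20,  # number of sequential, informed trials
    cv=5,
    random_state=0,
)
opt.fit(X, y)
print(opt.best_params_)
```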
4. Genetic Algorithms (GA) Approach
Genetic algorithms are metaheuristics inspired by natural selection; they are used in optimization and search problems in general and are usually based on a set of functions such as mutation, crossover, and selection [2]. Let's call these the genetic operators.
I'll use the following terms interchangeably in this section to make the connection between the GA and machine learning:
- One choice of hyperparameters → An individual
- Population → Several individuals
- Generation → One fixed iteration that contains a fixed population
- Fitness value → Cross-validation score
There are several variations, but in general, the steps to follow look like this (a minimal end-to-end sketch comes right after the list):
- Generate a randomly sampled population (different sets of hyperparameters); this is generation 0.
- Evaluate the fitness value of each individual in the population; in machine learning terms, get the cross-validation scores.
- Generate a new generation by using several genetic operators.
- Repeat steps 2 and 3 until you meet a stopping criterion.
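Here is that loop as a minimal, self-contained sketch; the individuals are plain dicts of hyperparameters, `fitness` stands in for a cross-validation score, and all the names and operators are simplified illustrations of mine, not a library's API:

```python
import random

def sample_individual():
    return {"learning_rate": 10 ** random.uniform(-4, 0),
            "layers": random.randint(1, 20)}

def fitness(ind):
    # Placeholder: in practice, train the model with these hyperparameters
    # and return its cross-validation score.
    return -abs(ind["learning_rate"] - 0.01) - abs(ind["layers"] - 6) / 20

def crossover(parent_1, parent_2):
    # Each hyperparameter is taken from one of the two parents at random.
    return {k: random.choice([parent_1[k], parent_2[k]]) for k in parent_1}

def mutate(ind, rate=0.1):
    # With a low probability, resample one hyperparameter entirely.
    if random.random() < rate:
        ind["layers"] = random.randint(1, 20)
    return ind

population = [sample_individual() for _ in range(10)]  # generation 0
for generation in range(20):
    ranked = sorted(population, key=fitness, reverse=True)
    elite = ranked[:2]  # elitism: the best individuals survive untouched
    children = [mutate(crossover(*random.sample(ranked[:5], 2)))
                for _ in range(len(population) - len(elite))]
    population = elite + children

print(max(population, key=fitness))
```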
Let's go step by step.
4.1-4.2 Create generation 0 and evaluate it:
As mentioned, you could generate a random set of several hyperparameters, or you could include a few manually selected ones that you already tried and consider suitable candidates.
Each set usually gets encoded in the form of a chromosome, a binary representation of the set; for example, if we set the size of the first generation to be three individuals, it would look like this:
So in this generation, we get three individuals mapped to a chromosome (binary) representation using the encoding function represented by the red arrow. Each box in the chromosome is a gene, and a fixed section of the chromosome is one of the hyperparameters.
Then we get the cross-validation score (fitness) of each candidate using a scoring function shown as the purple arrow.
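To make the encoding concrete, here is a toy version of the red-arrow function and its inverse; the bit widths and value tables are arbitrary choices of mine:

```python
LEARNING_RATES = [0.001, 0.01, 0.1, 1]
OPTIMIZERS = ["Adam", "Adagrad", "SGD", "RMSprop"]

def encode(ind):
    # Each hyperparameter owns a fixed section (genes) of the chromosome.
    lr = f"{LEARNING_RATES.index(ind['learning_rate']):02b}"  # 2 genes
    layers = f"{ind['layers']:05b}"                           # 5 genes
    opt = f"{OPTIMIZERS.index(ind['optimizer']):02b}"         # 2 genes
    return lr + layers + opt

def decode(chromosome):
    return {"learning_rate": LEARNING_RATES[int(chromosome[:2], 2)],
            "layers": int(chromosome[2:7], 2),
            "optimizer": OPTIMIZERS[int(chromosome[7:], 2)]}

print(encode({"learning_rate": 0.001, "layers": 5, "optimizer": "Adam"}))
# -> 000010100
```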
4.3 Create a new generation:
Now we can create a new set of candidates; as mentioned, there are several genetic operators, and I'm going to show the most common ones:
Crossover:
This operator consists of taking two parent chromosomes and mating them to create new children. The way we select the parents could be through a probability distribution that gives more probability to the individuals with higher fitness in the generation. Let's say individuals number 1 and 3 got selected; then we can take two random points of each parent and make the crossover, like this:
Now the children represent a new set of hyperparameters; if we decode each child, we could get, for example:
Child 1: {"learning_rate": 0.015, "layers": 4, "optimizer": "Adam"}
Child 2: {"learning_rate": 0.4, "layers": 6, "optimizer": "SGD"}
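In code, a two-point crossover over chromosome strings like the ones above can be sketched as follows (the cut points are chosen at random):

```python
import random

def two_point_crossover(parent_1, parent_2):
    # Pick two cut points and swap the middle segment between the parents.
    a, b = sorted(random.sample(range(1, len(parent_1)), 2))
    child_1 = parent_1[:a] + parent_2[a:b] + parent_1[b:]
    child_2 = parent_2[:a] + parent_1[a:b] + parent_2[b:]
    return child_1, child_2

child_1, child_2 = two_point_crossover("000010100", "110011010")
```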
But making crossovers over the same sets of hyperparameters might give similar results after some iterations, leaving us stuck with the same kind of solutions; that is why we introduce other operations, like mutation.
Mutation:
With a low enough probability (< ~0.1), this operator randomly changes one of the genes or a whole hyperparameter to create more diverse sets.
Let's take, for example, child one from the previous image, pick a random gene, and change its value:
Or it could even change a whole parameter, for example, the optimizer:
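A bit-flip mutation can be as simple as the following sketch, where each gene flips its value with a small probability:

```python
import random

def mutate(chromosome, rate=0.05):
    # Flip each gene (bit) independently with probability `rate`.
    genes = [("1" if gene == "0" else "0") if random.random() < rate else gene
             for gene in chromosome]
    return "".join(genes)

mutated_child = mutate("000010100")
```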
Elitism:
This selection strategy refers to picking the best individuals of each generation to make sure their information is propagated across generations. It is very straightforward: just select the best k individuals based on their fitness value and copy them to the next generation.
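A minimal sketch of elitism, assuming we already have the fitness value of each individual:

```python
def elitism(population, fitness_values, k=2):
    # Rank the individuals by fitness and keep the k best, untouched.
    ranked = sorted(zip(population, fitness_values),
                    key=lambda pair: pair[1], reverse=True)
    return [individual for individual, _ in ranked[:k]]
```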
So after performing those operations, a new generation may look like this:
From now on, just repeat the process for several generations until you meet a stopping criterion (sketched right after this list), which could be, for example:
- A maximum number of generations was reached.
- The process has run longer than the time budget.
- There are no performance improvements (below a threshold) from the last n generations.
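Those three rules can be combined into a single check; here is a sketch with illustrative names and thresholds:

```python
import time

def should_stop(history, start_time, generation, max_generations=50,
                budget_seconds=3600, threshold=1e-4, patience=5):
    # history: best cross-validation score obtained in each generation.
    if generation >= max_generations:
        return True
    if time.time() - start_time > budget_seconds:
        return True
    # No improvement above `threshold` over the last `patience` generations.
    if (len(history) > patience and
            max(history[-patience:]) - history[-patience - 1] < threshold):
        return True
    return False
```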
The following image is an example of what we could get after running our GA optimizer for several generations; it's a regression model using a random forest with the r-squared metric:
We could plot the hyperparameters that were tried by the algorithm, as in the dot plot, but in this case, following a more realistic representation, we can plot the histogram of sampled values for each parameter together with a scatter plot similar to the first one.
The darker the color in this picture, the more estimators were fitted in that region; the "score" is the same r-squared.
So at this point, the question is, how can we implement this? Do I have to make it from scratch? As mentioned before, there are already several packages that may help with this; we'll see the features of one of those.
Note: I'm the author of the following package; if you want to know further, contribute or make some suggestions, you can check the documentation and the GitHub repository at the end of this article.
Introducing Sklearn-genetic-opt
Sklearn-genetic-opt is a Python package that uses evolutionary algorithms from the DEAP package to choose the set of hyperparameters that optimizes (maximizes or minimizes) the cross-validation scores; it can be used for both regression and classification problems. At this point, Sklearn-genetic-opt is compatible with any scikit-learn (or scikit-learn compatible) regressor or classifier.
This package has the following features:
- GASearchCV: The principal class of the package; it holds the evolutionary cross-validation optimization routine and has an API similar to scikit-learn's GridSearchCV.
- GAFeatureSelectionCV: A complementary class of the package; it performs feature selection using the same evolutionary algorithms as GASearchCV, but tries to minimize the number of features used while optimizing the cv-scores, all with a fixed set of hyperparameters.
- Algorithms: Set of different evolutionary algorithms to use as an optimization procedure.
- Callbacks: Custom evaluation strategies to generate early stopping rules, logging, or your custom logic (see the sketch right after this list).
- Adapters: Strategies to change the parameters as a function of the generations.
- Plots: Generate pre-defined plots to understand the optimization process. For example, the last two plots came from the package.
- MLflow: Built-in integration with MLflow to log and manage the lifecycle of all the hyperparameters, cv-scores, and the fitted models.
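For instance, here is a hedged sketch of the callbacks and plots features; the names follow the package's documentation at the time of writing and may change between versions (`evolved_estimator` stands for an already fitted GASearchCV instance, like the one in the example further below):

```python
from sklearn_genetic.callbacks import ConsecutiveStopping
from sklearn_genetic.plots import plot_fitness_evolution

# Stop if the fitness has not improved for 5 consecutive generations.
callback = ConsecutiveStopping(generations=5, metric="fitness")
# evolved_estimator.fit(X_train, y_train, callbacks=callback)

# After fitting, visualize how the best score evolved per generation.
# plot_fitness_evolution(evolved_estimator)
```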
So the package works by following procedures similar to the ones described in the last section, but it adds some extra features like the callbacks, MLflow logging, and different algorithm choices; it looks like this:
And this is how the MLflow logging may look:
And this is a straightforward example of how to use it for hyperparameter tuning; it's a minimal sketch along the lines of the package's documentation (argument names may vary slightly between versions):
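```python
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The search space: each hyperparameter gets a data type and bounds.
param_grid = {
    "min_weight_fraction_leaf": Continuous(0.01, 0.5, distribution="log-uniform"),
    "bootstrap": Categorical([True, False]),
    "max_depth": Integer(2, 30),
    "n_estimators": Integer(100, 300),
}

evolved_estimator = GASearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
    population_size=10,
    generations=25,
    n_jobs=-1,
)

evolved_estimator.fit(X_train, y_train)
print(evolved_estimator.best_params_)
print(accuracy_score(y_test, evolved_estimator.predict(X_test)))
```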
You can learn more about each part of this code in this article.
So that is it! Thank you for reading; I hope it was helpful for thinking about your tuning strategies. Here are the related links for the last package in case you want to check it out:
Documentation: https://sklearn-genetic-opt.readthedocs.io/en/stable/
Repository: https://github.com/rodrigo-arenas/Sklearn-genetic-opt
References
[1] https://en.wikipedia.org/wiki/Hyperparameter_optimization
[2] https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
[3] https://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/tutorials/tut8_adams_slides.pdf