In this post I compare the optimization algorithms implemented so far in CTLearn, all of them applied to the same telescope and with the same configuration. The comparison highlights the similarities and differences between the optimization methods and lets us assess their performance in terms of metric improvement, exploration and exploitation of the hyperparameter space, rate of convergence, improvement trends, etc.
To begin with, the configuration of the optimization runs performed is shown below:
- Metric optimized: AUC
- Number of random evaluations: 20
- Model optimized: single_tel
- Telescope optimized: MST NectarCam
- Iterations performed: 100
- The remaining options have been set to their default values.
In addition, the space of hyperparameters to optimize is the following:
- Number of filters in layer 1: [16, 64]
- Number of filters in layer 2: [16, 128]
- Number of filters in layer 3: [16, 256]
- Number of filters in layer 4: [16, 512]
- Kernel size of layer 1: [2, 10]
- Kernel size of layer 2: [2, 10]
- Kernel size of layer 3: [2, 10]
- Kernel size of layer 4: [2, 10]
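As a rough illustration of this search space, the sketch below writes the ranges as inclusive integer bounds and draws the 20 random evaluations from them. It assumes the random points are sampled uniformly and independently per hyperparameter; the key names and the `sample_random_points` helper are hypothetical and are not CTLearn configuration keys or functions.

```python
import random

# Search space from the list above, as (low, high) inclusive integer bounds.
# The key names are illustrative only, not CTLearn configuration keys.
SEARCH_SPACE = {
    "layer1_filters": (16, 64),
    "layer2_filters": (16, 128),
    "layer3_filters": (16, 256),
    "layer4_filters": (16, 512),
    "layer1_kernel": (2, 10),
    "layer2_kernel": (2, 10),
    "layer3_kernel": (2, 10),
    "layer4_kernel": (2, 10),
}


def sample_random_points(space, n_points, seed=0):
    """Draw n_points hyperparameter sets uniformly at random (assumed scheme)."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(low, high) for name, (low, high) in space.items()}
        for _ in range(n_points)
    ]


# Number of random evaluations taken from the configuration above.
initial_points = sample_random_points(SEARCH_SPACE, n_points=20)
print(initial_points[0])
```

Each sampled dictionary is one candidate network configuration; a model-based optimizer would use the scores of such points to fit its first surrogate.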
Four different optimization algorithms have been tested:
- Bayesian optimization based on tree-structured Parzen estimators (TPE).
- Bayesian optimization based on Gaussian processes.
- Bayesian optimization based on random forests.
- Bayesian optimization based on gradient boosted trees.
A more detailed description of these methods can be found in my previous post. The results obtained are displayed below:
We can see that the greatest improvement in the optimized metric (AUC-ROC) is obtained by the tree Parzen estimators algorithm, followed by the Gaussian processes method and the gradient boosted trees method, which also achieves the biggest improvement in accuracy. The random forests algorithm is by far the worst performer. Moreover, it obtains its best result at iteration 12, while the remaining algorithms reach their greatest improvements in the last iterations. This suggests that the random forests method does not take advantage of an improved surrogate, which should get closer to the actual objective function as the optimization process moves on.
The best hyperparameter sets found by the various algorithms differ, and they do not lead to the same results:
We can also plot the evolution of the metrics versus the iterations of the optimization algorithms:
The trend of the metric values is positive for the tree Parzen estimators and Gaussian processes methods, while it is slightly negative for the random forests method and clearly negative for the gradient boosted trees algorithm. This may be due to a greater tendency of the latter method to explore the hyperparameter space.
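One simple way to put a number on such a trend is the slope of a least-squares line fitted to the metric values against the iteration index. The sketch below is only an illustration of that idea: the `trend_slope` helper is hypothetical, the AUC values are made up, and the post does not state how the trends in the plots were actually computed.

```python
def trend_slope(values):
    """Least-squares slope of values against the iteration index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    # Covariance and variance sums; their common 1/n factor cancels out.
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Made-up AUC values, for illustration only (not results from the runs above).
example_auc = [0.80, 0.82, 0.81, 0.84, 0.85]
slope = trend_slope(example_auc)
print("positive trend" if slope > 0 else "non-positive trend")
```

A positive slope means later iterations tend to score higher than earlier ones; a negative slope is compatible with an optimizer that keeps sampling unexplored, often worse, regions.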
It is interesting to visualize the progress of the optimization algorithms by showing the best-to-date result at each iteration:
The fastest convergence is achieved by the tree Parzen estimators method, closely followed by the Gaussian processes and gradient boosted trees methods, with the random forests algorithm being the slowest to converge.
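The best-to-date curve described here is a running maximum over the per-iteration scores. A minimal sketch follows, using made-up AUC values rather than results from these runs; the helper names `best_so_far` and `iteration_of_best` are hypothetical.

```python
from itertools import accumulate


def best_so_far(scores):
    """Running maximum: best score seen up to and including each iteration."""
    return list(accumulate(scores, max))


def iteration_of_best(scores):
    """1-based iteration at which the overall best score first appears."""
    return scores.index(max(scores)) + 1


# Made-up AUC values, for illustration only.
example_scores = [0.80, 0.83, 0.81, 0.86, 0.84]
print(best_so_far(example_scores))        # [0.8, 0.83, 0.83, 0.86, 0.86]
print(iteration_of_best(example_scores))  # 4
```

A curve that flattens early, with a small `iteration_of_best`, corresponds to an optimizer whose later iterations never beat an early result.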
Now we can check how the searches performed by the different algorithms evolve. The plots show the histograms of the explored values and, for each pair of hyperparameters, the scatter plot of the sampled values, with the evolution of the search represented by color.
We can see that each optimization method converges to certain regions of the space that the algorithm considers more promising and that are therefore explored further. These regions are different for each algorithm.
Finally, the Gaussian processes, random forests and gradient boosted trees methods make it possible to plot the pairwise partial dependence of the objective function for each dimension of the hyperparameter space:
These graphs give us intuition about how sensitive the objective function is to each hyperparameter. This way we can decide which parts of the space may require a more fine-grained search and which hyperparameters barely affect the score and can potentially be dropped from the search. For example, we can see from the charts above that changes in the kernel size of the first layer and in the number of filters in the first and fourth layers barely affect the metric score.
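To make the idea of partial dependence concrete, the sketch below computes a one-dimensional version of it: one hyperparameter is fixed to each value of a grid, the surrogate prediction is averaged over many sampled values of the other hyperparameters, and the spread of the resulting curve indicates sensitivity. This is a simplification of the pairwise plots above. `toy_surrogate` is a hypothetical stand-in that ignores its first input on purpose; it is not a model fitted in these runs, and `partial_dependence_1d` is an illustrative helper rather than a library function.

```python
import random


def partial_dependence_1d(model, samples, dim, grid):
    """Average model prediction with dimension `dim` fixed to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for point in samples:
            modified = list(point)  # copy so the original sample is untouched
            modified[dim] = value
            total += model(modified)
        curve.append(total / len(samples))
    return curve


def toy_surrogate(x):
    """Hypothetical surrogate: depends on x[1] (a kernel size), ignores x[0]."""
    return 0.9 - 0.001 * (x[1] - 5) ** 2


rng = random.Random(0)
# Each sample is [number of filters, kernel size], drawn from assumed ranges.
samples = [[rng.randint(16, 64), rng.randint(2, 10)] for _ in range(200)]

flat = partial_dependence_1d(toy_surrogate, samples, dim=0, grid=[16, 32, 64])
curved = partial_dependence_1d(toy_surrogate, samples, dim=1, grid=[2, 5, 10])

print("spread along the ignored dimension:", max(flat) - min(flat))
print("spread along the sensitive dimension:", max(curved) - min(curved))
```

A nearly flat curve, as for the ignored dimension here, is the pattern that marks a hyperparameter as a candidate for removal from the search.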