Hyper-parameter optimization, or tuning, is the problem of choosing an appropriate set of hyper-parameters for a learning algorithm. A hyper-parameter is a parameter whose value controls the learning process itself; the values of the other parameters (usually the node weights) are, by contrast, learned from data.
One of the frustrating things about training deep neural networks is the sheer number of hyper-parameters you have to deal with, ranging from the learning rate alpha to the momentum term beta (if you are using momentum), or the beta-one, beta-two, and epsilon hyper-parameters of the Adam optimization algorithm. You may also need to pick the number of layers and the number of hidden units in each layer, and you may want to use learning rate decay, so that you are not stuck with a single fixed learning rate alpha. Then there is the mini-batch size to select. It turns out that some of these hyper-parameters are more important than others.
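As a concrete illustration, the hyper-parameters listed above could be collected in a single dictionary. The names and most values here are my own illustrative choices; only the Adam defaults (beta-one 0.9, beta-two 0.999, epsilon 1e-8) are the commonly cited defaults from the original Adam paper.

```python
# Illustrative collection of the hyper-parameters mentioned above.
# Names and values are a plausible starting point, not a recommendation.
hyperparams = {
    "learning_rate": 0.001,            # alpha: usually the most important
    "momentum": 0.9,                   # beta, if using momentum
    "adam_beta1": 0.9,                 # standard Adam defaults
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_layers": 4,                   # hidden layers
    "hidden_units": [64, 64, 32, 16],  # units per hidden layer
    "learning_rate_decay": 0.95,       # per-epoch decay factor
    "mini_batch_size": 64,
}
print(sorted(hyperparams))
```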
Ranking by color (red > yellow > purple):
In most applications, I'd say the most critical hyper-parameter to tune is alpha, the learning rate.
Beyond alpha, there are a few other hyper-parameters I would tune next. One is the momentum term, for which 0.9 is a good default.
I would also tune the mini-batch size to make sure the optimization algorithm runs efficiently, and sometimes I experiment with the number of hidden units as well.
Those are basically the three I would rank second in importance after the learning rate alpha. Third in importance, after fiddling with the others, come the number of layers and the learning rate decay, both of which can also make a real difference.
But this isn't a hard and fast rule, and other deep learning practitioners may disagree with this or have different intuitions about it.
So if you're trying to tune a set of hyper-parameters, how do you pick the set of values you want to explore?
Try Random Values: Don’t use Grid
In earlier generations of machine learning algorithms, if you had two hyper-parameters (call them hyper-parameter one and hyper-parameter two), it was common practice to sample points in a grid and explore those values systematically.
Here I am placing down a five-by-five grid. It might be larger or smaller than five by five, but in this example you would try all 25 points and then pick whichever hyper-parameter setting works best. When the number of hyper-parameters is fairly small, this method works okay.
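A minimal sketch of the grid approach described above. The specific hyper-parameter names and ranges are my own illustrative choices, not from the source:

```python
import itertools

# Illustrative 5x5 grid over two hyper-parameters.
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]    # hyper-parameter 1: learning rate
epsilons = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]   # hyper-parameter 2: Adam epsilon

# Grid search tries every combination: 25 trials in total.
grid = list(itertools.product(alphas, epsilons))
print(len(grid))  # 25
```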
What we do in deep learning, and what I recommend you do instead, is to pick the points at random. The reason is that it's hard to know in advance which hyper-parameters will be the most important for your problem.
To take an example, suppose hyper-parameter one turns out to be alpha, the learning rate, and, to take an extreme example, that hyper-parameter two is the epsilon in the denominator of the Adam update. Then your choice of alpha matters a great deal, while the choice of epsilon barely matters at all. If you sample on the grid, you have tried only five values of alpha, and you may find that all the different epsilon values give essentially the same result. So you have trained 25 models but tried only five values of the learning rate alpha, which is the hyper-parameter that actually matters.
If instead you sample at random, you will have tried 25 distinct values of the learning rate alpha, so you are more likely to find a value that works really well.
In practice, you may be searching over more hyper-parameters than two. With, say, three hyper-parameters, instead of searching over a square you are searching over a cube, where the third dimension is hyper-parameter three, and by sampling within this three-dimensional cube you try many more values of each of the three hyper-parameters.
In reality, you may have far more than three hyper-parameters, and it is often hard to know in advance which ones will turn out to be the most important for your application. Sampling at random rather than on a grid ensures that you are testing a rich spectrum of values for the most important hyper-parameters, whatever they might be.
- grid search: only n distinct values of alpha are tried
- random sampling: up to n*n distinct values of alpha
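The contrast in the two bullets above can be sketched as follows. This is a toy comparison with n = 5; the log-uniform sampling ranges are my own illustrative choices:

```python
import random

random.seed(0)
n = 5

# Grid search: n x n trials, but only n distinct alpha values.
grid_alphas = [10 ** e for e in (-4, -3, -2, -1, 0)]
grid_epsilons = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
grid_trials = [(a, eps) for a in grid_alphas for eps in grid_epsilons]

# Random search: the same n*n trials, each with a freshly sampled alpha,
# so (almost surely) n*n distinct alpha values.
random_trials = [
    (10 ** random.uniform(-4, 0), 10 ** random.uniform(-9, -5))
    for _ in range(n * n)
]

print(len({a for a, _ in grid_trials}))    # 5 distinct learning rates
print(len({a for a, _ in random_trials}))  # 25 distinct learning rates
```

Note the log-uniform sampling for alpha: drawing the exponent uniformly spreads the trials evenly across orders of magnitude, which is usually what you want for a learning rate.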
Coarse to fine
Another common practice when you are searching over hyper-parameters is to use a coarse-to-fine sampling scheme.
So let's assume in this two-dimensional example that you sample these points, and you notice that one point works best and a few other points around it also perform pretty well. In the coarse-to-fine scheme, you then zoom in to a smaller region of the hyper-parameter space and sample more densely within it, again perhaps at random. If you suspect the best hyper-parameter setting lies in this area, you concentrate more resources on searching inside this blue square. So a coarse sample of the complete square tells you which smaller square to focus on next, and you then sample more densely within that smaller square.
This coarse-to-fine search is therefore also often used. By trying out these different values of the hyper-parameters, you can then pick whichever value does best on your training set objective, or best on your dev set, or whatever metric you are trying to optimize in your hyper-parameter search.
The two main takeaways are to sample at random rather than on a grid, searching sufficiently widely, and optionally to consider a coarse-to-fine refinement of the search. But there is more to hyper-parameter search than this.
- zoom into smaller regions of hyperparam space and re-sample more densely.
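The coarse-to-fine procedure above can be sketched as follows. The `score` function here is a stand-in for actually training a model and measuring dev-set performance, and all ranges and names are illustrative assumptions:

```python
import random

random.seed(1)

def score(alpha, beta):
    # Stand-in for training a model and evaluating it on a dev set.
    # Here the (unknown-to-the-search) optimum sits at (0.1, 0.9).
    return -((alpha - 0.1) ** 2 + (beta - 0.9) ** 2)

def sample(lo_a, hi_a, lo_b, hi_b, n):
    # Sample n random points in the rectangle [lo_a, hi_a] x [lo_b, hi_b].
    return [(random.uniform(lo_a, hi_a), random.uniform(lo_b, hi_b))
            for _ in range(n)]

# Coarse pass: 25 random points over the full square.
coarse = sample(0.0, 1.0, 0.0, 1.0, n=25)
best_a, best_b = max(coarse, key=lambda p: score(*p))

# Fine pass: zoom into a smaller square around the coarse winner
# and re-sample more densely there.
r = 0.1
fine = sample(best_a - r, best_a + r, best_b - r, best_b + r, n=50)
best = max(fine, key=lambda p: score(*p))
print(best)
```

The fine pass spends twice as many trials on a region one hundredth the area of the original square, which is exactly the "concentrate resources" idea described above.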