Today deep learning is applied to several different areas of application, and intuitions about hyperparameter settings from one area of application can or may not move to another. There’s a lot of mixing between the domains of different applications. So one pleasant development in deep learning is that people from different application domains are increasingly reading research papers from other application domains to look for cross-fertilization inspiration.
Finally, in terms of how people look for hyperparameters, I see perhaps two main fields of thinking. One approach is if you put one pattern to babysitting. And typically you do this because you might have a large data set but not a lot of computing resources, not a lot of CPUs and GPUs, so you can essentially train only one model or a very limited number of models at a time.In that case you could babysit the model slowly, even as it’s practicing. So, you could initialize your parameter as random on Day 0 for example, and then start training. And you are gradually watching your learning curve, perhaps the cost function J or your data set error, or something else, decrease gradually over the first day. Then at the end of the first day, you might say, “gee,” looks like it’s learning very well, I’m going to try to raise the learning rate a bit and see how it’s doing. And maybe then it does better. And then your success on Day 2. And two days later you say, all right, it’s still doing pretty well.And now, you ‘re in Day 3. Then you look at it sort of every day and keep nudging the boundaries up and down. And every one day you found your learning rate was too high. So, you might go back to the pattern of the previous day, and so on. But one day at a time you are kind of babysitting the model even as it’s training over a course of many days or over several different weeks.So that’s one strategy, and people babysitting one pattern, looking out for success and gently nudging up or down the learning rate. But that’s typically what happens when you don’t have enough computing resources to simultaneously train a lot of models.
Another solution would be if you were to train several parallel models. So you might have some setting of the hyperparameters and just let it run on its own, either for a day or even for multiple days, and then get some learning curve like that; and this might be a plot of cost function J or cost of your training error or cost of your data set error, but some metric in your tracking. And then you could start up a different model with a different setting of the hyperparameters at the same time. And so, your second model might produce a specific curve of learning, increasing one that looks like this. I’ll say it looks better. Or you could train a lot of different models in parallel, where these orange lines are different models, right, and you can try a lot of different hyperparameter settings and then only select the one that works better quickly at the end.
And, to make an comparison, I’ll name the panda approach to the approach on the left. Once pandas have kids, they have very few kids, usually one child at a time, and then they really put a lot of work into helping the baby panda to survive. So this is babysitting really. A model, or a panda infant. Whereas on the right the approach is more like what the fish do. I’ll call it the caviar technique. In one mating season there are some fish that lay over 100 million eggs. So the way that fish reproduce is that they lay a lot of eggs and don’t pay too much attention to any of them but just see that perhaps one of them, or maybe a bunch of them, will do well.
So the way to choose between these two approaches is really a function of how much computational resources you have. If you have enough computers to train a lot of models in parallel then by all means take the caviar approach and try a lot of different hyperparameters and see what works.
In Short: Two major Training Schools
* Panda approach: One model for babysitting
- Huge data collection, limited resources in computation, can only train one computer as it trains the computer. Look at learning curve, try to adjust hyparams once a day.
* Caviar approach: Parallel testing of multiple models
-Have enough power to render computations.
-Choose the best model / hyperparams being trained simultaneously in parallel.