Smarter Parameter Sweeps (or Why Grid Search Is Plain Stupid)

Anyone that ever had to train a machine learning model had to go through some parameter sweeping (a.k.a. hyper-parameter optimization) to find a sweet spot for algorithm parameters. For random forests the parameters in need of optimization could be the number of trees in the model and the number of features considered at each split, for a neural network, there is the learning rate, the number of hidden layers, the number of hidden units in each layer, and several other parameters.

Hyper-parameter optimization requires the use (and maybe the abuse) of a validation set on which you can’t trust your performance metrics anymore. In this sense it is like a second phase of learning, or an extension to the learning algorithm itself. The performance metric (or the objective function) can be visualized as a heat-map in the n-dimensional parameter-space or as a surface in an n+1-dimensional space (the dimension n+1 being the value of that objective function). The bumpier this surface is (the more local minima and saddle points it has), the harder it becomes to optimize these parameters. Here are a couple of illustrations for two such surfaces defined by two parameters, the first one is mostly well behaved:

While the second one is more bumpy and riddled with several local minima:

The most common method at selecting algorithm parameters is by far the ubiquitous grid-search. In fact, the word “parameter sweep” actually refers to performing a grid search but has also become synonymous with performing parameter optimization. Grid-search is performed by simply picking a list of values for each parameter, and trying out all possible combinations of these values. This might look methodical and exhaustive. But in truth even a random search of the parameter space can be MUCH more effective than a grid search!

This amazing paper by Bergstra et al. claims that a random search of the parameter space is guaranteed to be more effective than grid search (and quite competitive in comparison with more sophisticated techniques).

Surprising, ha? Why should random search be better than the much more robust-looking grid-search? Here is why:

The idea is that in most cases the bumpy surface of the objective function is not as bumpy in all dimensions. Some parameters have much less effect on the cost function than others, if the importance of each parameter is known, this can be encoded in the number of values picked for each parameter in the grid-search. But that’s not typically the case, and anyway, just using random search allows the exploration of more values for each parameter, given the same amount of trials:

(The beautiful illustration is taken from the same paper referenced above)

More elaborate ways of optimizing algorithm hyper-parameters exist, in fact whole start-ups have been built around the idea (one of them recently acquired by twitter). A couple of libraries and several research papers tackle the problem, but for me, random sweeps are good enough for now.