Tuning Hyperparameters (part II): Random Search on Spark

Benoit Descamps
Towards Data Science
May 11, 2018


In part II of this series, I continue my take on hyperparameter optimisation strategies; this time I want to take a closer look at Random Search from the point of view of Spark. Each part of the series can be read separately, so feel free to check out part I.

Random Search and Distributed Machine Learning Frameworks

No matter how well you designed your algorithms, and no matter how beautiful the mathematics may be, if the client requires a relatively short training time on a huge volume of data, you had better find a way to deliver!

Luckily for us, distributed computing is the peanut butter to Random Search’s jelly. Let us first recall why Random Search is more efficient than Grid Search.

Grid Search’s biggest enemy is the curse of dimensionality: every additional hyperparameter, with its respective set of values, multiplies the number of trials (for instance, five hyperparameters with four values each already require 4^5 = 1024 trials). Rather than fixing the values of the search space, it has been shown that it is more advantageous to sample it. To understand why, let us have a look at Figure 1, taken from the original paper [1].

Figure 1: Grid Search vs Random Search

As we can see, and as is often the case, some hyperparameters are more decisive than others. With Grid Search, even though 9 trials were run, we actually only tried 3 different values of the important parameter. With Random Search, the same 9 trials test 9 different values of the decisive parameter.

Because each sample of the hyperparameter configuration is drawn independently of the others, we can see how easily this could be parallelised. Enter Spark!

Spark

Spark is a popular open-source framework for distributed computing on a cluster. It offers a wide range of libraries for manipulating structured data, streaming, distributed graph processing and, most importantly for this discussion, machine learning, i.e. Spark MLlib.

Spark MLlib recently received a huge boost thanks to the work of Microsoft’s Azure Machine Learning team, which released MMLSpark. From a practical machine-learning perspective, MMLSpark’s most notable feature is access to the extreme gradient boosting library LightGBM, which is the go-to quick-win approach for most Data Science proofs of concept.

Now that every Data Scientist’s favorite library can be trained on a cluster, we are only missing a proper hyperparameter-tuning framework. The original Spark MLlib unfortunately only provides an implementation of Grid Search. MMLSpark offers hyperparameter tuning with Random Search, but sadly the sampling is only uniform.

In Practice…

Uniform sampling is a great step, but it is not optimal for many hyperparameters. The learning rate and the regularisation hyperparameters come to mind in the case of extreme gradient boosting algorithms like LightGBM. Such parameters should be sampled on a log scale rather than uniformly on an interval, as sketched below.
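As a minimal sketch (using plain scala.util.Random; the interval bounds are purely illustrative), sampling on a log scale amounts to sampling the exponent uniformly and then exponentiating:

```scala
import scala.util.Random

// Log-uniform sampling of a learning rate in [1e-4, 1e-1]:
// draw the exponent uniformly in [-4, -1], then exponentiate.
val learningRates = Seq.fill(5)(math.pow(10.0, -4.0 + 3.0 * Random.nextDouble()))
```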

So how do we hack Spark MLlib to satisfy our needs?

Let us first look at the key ingredients in Spark below.
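Here is a minimal sketch of those ingredients (the logistic regression, the evaluator and the fold count are placeholder choices of mine): ParamGridBuilder produces an Array[ParamMap], which CrossValidator consumes through setEstimatorParamMaps.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Grid Search: every combination of the listed values becomes one ParamMap.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()                      // Array[ParamMap]

// CrossValidator evaluates the estimator once per ParamMap.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingData)   // trainingData: an assumed DataFrame
```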

As we can see, the grid of hyperparameter values is defined as an Array[ParamMap] built by an instance of the ParamGridBuilder class. Thus, in order to remain compatible with Spark’s CrossValidator, let us redefine the build() and addGrid methods.

Rather than adding a list of values for each hyperparameter to the grid, we would like to define a distribution from which we later sample the configurations.

Breeze is a popular Scala library for numerical processing with a great variety of distributions within breeze.stats.distributions.
For example, in the case of a logistic regression we might want to define the following sampling space,
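A sketch of such a sampling space could look as follows (the choice of distributions and bounds is mine; note that newer Breeze releases may require an explicit implicit RandBasis in scope):

```scala
import breeze.stats.distributions.Uniform

// Regularisation strength (regParam): log scale, sample the exponent and exponentiate.
val regParamDistr = Uniform(-4.0, -1.0).map(exp => math.pow(10.0, exp))

// Elastic-net mixing parameter (elasticNetParam): uniform on [0, 1] is reasonable.
val elasticNetDistr = Uniform(0.0, 1.0)

// A categorical hyperparameter (e.g. fitIntercept) is simply an Array of choices.
val fitInterceptChoices = Array(true, false)
```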

On the one hand, we wish to sample from a distribution; on the other hand, in the case of a set of categorical choices, we should be able to pass an Array of choices.

We can propose the following solution,
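The following is a minimal sketch of such a builder (the names RandomGridBuilder and addDistr are mine; see the repository linked at the end for the full version): each hyperparameter is mapped either to a Breeze Rand distribution or to an Array of choices, and build() draws n independent ParamMaps so the result plugs straight into CrossValidator.

```scala
import scala.collection.mutable
import breeze.stats.distributions.Rand
import org.apache.spark.ml.param.{Param, ParamMap}

class RandomGridBuilder(n: Int) {

  // Each hyperparameter maps to either a Rand[_] (to sample from)
  // or an Array[_] (categorical choices picked uniformly at random).
  private val paramDistributions = mutable.Map.empty[Param[_], Any]

  def addDistr[T](param: Param[T], distr: Any): this.type = {
    distr match {
      case _: Rand[_] | _: Array[_] => paramDistributions.put(param, distr)
      case _ => throw new IllegalArgumentException(
        "Expected a breeze.stats.distributions.Rand or an Array of choices")
    }
    this
  }

  // Draws n independent hyperparameter configurations; returning an
  // Array[ParamMap] keeps us compatible with CrossValidator.
  def build(): Array[ParamMap] = {
    Array.fill(n) {
      val paramMap = ParamMap.empty
      paramDistributions.foreach { case (param, distr) =>
        val value = distr match {
          case d: Rand[_]  => d.sample()
          case a: Array[_] => a(scala.util.Random.nextInt(a.length))
        }
        paramMap.put(param.asInstanceOf[Param[Any]], value)
      }
      paramMap
    }
  }
}
```

Note that addDistr is deliberately loosely typed (distr: Any) so that both distributions and Arrays of choices can be passed through the same method.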

Let us now try a final test for LightGBM,
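A sketch of what such a test could look like, combining the hypothetical RandomGridBuilder above with MMLSpark’s LightGBMClassifier (the import path and parameter names follow the MMLSpark releases current at the time of writing; column names and data are placeholders):

```scala
import breeze.stats.distributions.Uniform
import com.microsoft.ml.spark.LightGBMClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

val lgbm = new LightGBMClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")

// 20 configurations: learning rate on a log scale, numLeaves as a categorical choice.
val randomGrid = new RandomGridBuilder(20)
  .addDistr(lgbm.learningRate, Uniform(-3.0, -1.0).map(e => math.pow(10.0, e)))
  .addDistr(lgbm.numLeaves, Array(15, 31, 63, 127))
  .build()

val cv = new CrossValidator()
  .setEstimator(lgbm)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(randomGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingData)   // trainingData: an assumed DataFrame
```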

There we go!

The code with more examples is available here [6].

More Awesome Stuff I Wrote!

@ Tuning Hyperparameters (part I): SuccessiveHalving

@ Custom Optimizer in TensorFlow

@ Regression Prediction Intervals with XGBOOST

References:

  1. J. Bergstra and Y. Bengio, Random Search for Hyper-Parameter Optimization, JMLR, 2012
  2. Spark, https://spark.apache.org/
  3. MMLSpark, https://github.com/Azure/mmlspark
  4. Breeze, https://github.com/scalanlp/breeze
  5. J. Bergstra, R. Bardenet, Y. Bengio and B. Kégl, Algorithms for Hyper-Parameter Optimization, NIPS, 2011
  6. Github: https://github.com/benoitdescamps/Hyperparameters-tuning
