Tuning hyper-parameters with CodeFlare Pipelines

GridSearchCV() is often used for hyper-parameter tuning of a model built with sklearn pipelines. It performs an exhaustive search over specified parameter values for a pipeline, and implements a fit() method and a score() method. The parameters of the pipeline used to apply these methods are optimized by cross-validated grid search over a parameter grid.

Here we show how to convert an example that uses GridSearchCV() to tune the hyper-parameters of an sklearn pipeline into one that uses CodeFlare (CF) Pipelines grid_search_cv(). We use the "Pipelining: chaining a PCA and a logistic regression" example from the sklearn documentation.

In this sklearn example, a pipeline chains a PCA and a LogisticRegression. The n_components parameter of the PCA and the C parameter of the LogisticRegression are defined in a param_grid, with n_components taking values in [5, 15, 30, 45, 64] and C given by np.logspace(-4, 4, 4). GridSearchCV() explores all 20 combinations of n_components and C values to find the one with the highest mean_test_score.

After running GridSearchCV().fit(), the best parameters for PCA__n_components and LogisticRegression__C, together with the cross-validated mean_test_score values, are printed out as follows. In this example, the best n_components chosen for the PCA is 45.

The PCA explained variance ratio and the best n_components chosen are plotted in the top chart. The classification accuracy and its std_test_score are plotted in the bottom chart. The best n_components can be obtained by calling best_estimator_.named_steps['pca'].n_components on the object returned by GridSearchCV().
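For reference, the sklearn side of this example, condensed from the "Pipelining: chaining a PCA and a logistic regression" example in the sklearn documentation, looks roughly like this:

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Chain a PCA and a LogisticRegression into one pipeline
pipe = Pipeline(steps=[
    ("pca", PCA()),
    ("logistic", LogisticRegression(max_iter=10000, tol=0.1)),
])

# 5 x 4 = 20 parameter combinations to explore
param_grid = {
    "pca__n_components": [5, 15, 30, 45, 64],
    "logistic__C": np.logspace(-4, 4, 4),
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

# The chosen n_components for the PCA step
best_n = search.best_estimator_.named_steps["pca"].n_components
```

This is the version we convert to CodeFlare Pipelines in the steps below.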

Converting to CF pipelines grid_search_cv()

We next describe the step-by-step conversion of this example to one that uses CodeFlare Pipelines.

Step 1: importing codeflare.pipelines packages and ray

We first need to import the various codeflare.pipelines packages, including Datamodel and Runtime, as well as ray, and then call ray.shutdown() and ray.init(). Note that, in order to run this CodeFlare example notebook, you need a running Ray instance.

import ray
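Filled out, the import cell above might look like the following. The module names follow the CodeFlare Pipelines example notebooks; treat them as assumptions if your installed version differs:

```python
# Assumed module layout of the codeflare.pipelines package
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt

import ray

# Shut down any previous Ray instance, then start a fresh one
ray.shutdown()
ray.init()
```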

Step 2: defining and setting up a codeflare pipeline

A CodeFlare pipeline is defined by EstimatorNodes and the edges connecting them. In this case, we define node_pca and node_logistic and connect the two nodes with pipeline.add_edge(). Before we can execute fit() on a pipeline, we need to set up the proper input to the pipeline.
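Assuming the Datamodel package is imported as dm, the pipeline definition and its input might be sketched as follows. Names such as EstimatorNode, add_edge, PipelineInput and add_xy_arg follow the CodeFlare Pipelines examples and should be treated as assumptions here:

```python
import codeflare.pipelines.Datamodel as dm
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Wrap each sklearn estimator in an EstimatorNode
pipeline = dm.Pipeline()
node_pca = dm.EstimatorNode("pca", PCA())
node_logistic = dm.EstimatorNode(
    "logistic", LogisticRegression(max_iter=10000, tol=0.1)
)

# Connect the two nodes: PCA output feeds the LogisticRegression
pipeline.add_edge(node_pca, node_logistic)

# Set up the pipeline input before calling fit()
pipeline_input = dm.PipelineInput()
pipeline_input.add_xy_arg(node_pca, dm.Xy(X_digits, y_digits))
```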

Step 3: defining pipeline param grid and executing Codeflare pipelines grid_search_cv()

The CodeFlare Pipelines runtime converts an sklearn param_grid into a CodeFlare Pipelines param grid. We also specify the default KFold parameter for running the cross-validation. Finally, the CodeFlare Pipelines runtime executes grid_search_cv().
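A sketch of this step, again with the method names (PipelineParam.from_param_grid, rt.grid_search_cv) assumed from the CodeFlare Pipelines examples, and with pipeline and pipeline_input as defined in Step 2:

```python
import numpy as np
from sklearn.model_selection import KFold
import codeflare.pipelines.Datamodel as dm
import codeflare.pipelines.Runtime as rt

# Same sklearn-style param_grid as before, keyed by node name
param_grid = {
    "pca__n_components": [5, 15, 30, 45, 64],
    "logistic__C": np.logspace(-4, 4, 4),
}

# Convert the sklearn param_grid into a CodeFlare pipeline param grid
pipeline_param = dm.PipelineParam.from_param_grid(param_grid)

# Default KFold for the cross-validation
kf = KFold(5)

# Execute the grid search over all 20 pipeline variants
# (pipeline and pipeline_input come from Step 2)
result = rt.grid_search_cv(kf, pipeline, pipeline_input, pipeline_param)
```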

Step 4: parsing the returned result from grid_search_cv()

As the CodeFlare Pipelines project is still under active development, APIs to access some attributes of the pipelines explored by grid_search_cv() are not yet available. As a result, slightly more verbose code is needed to get the best pipeline, its associated parameter values and other statistics from the object returned by grid_search_cv(). For example, we need to loop through all 20 explored pipelines to find the best one. And, to get the n_components of an explored pipeline, we first call .get_nodes() on the returned cross-validated pipeline, then .get_estimator(), and finally .get_params().
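Following the description above, finding the best pipeline and its n_components might look roughly like this. The structure of result (a mapping from each explored cross-validated pipeline to its per-fold scores) and the accessor names are assumptions based on the text:

```python
import statistics

# Assumed: 'result' from grid_search_cv() maps each explored
# cross-validated pipeline to its list of per-fold scores
best_pipeline = None
best_mean_score = 0.0
for cv_pipeline, scores in result.items():
    mean_score = statistics.mean(scores)
    if mean_score > best_mean_score:
        best_mean_score = mean_score
        best_pipeline = cv_pipeline

# Drill down through get_nodes() -> get_estimator() -> get_params()
# to recover, e.g., the n_components of the PCA node
for node_name, node in best_pipeline.get_nodes().items():
    params = node.get_estimator().get_params()
    if "n_components" in params:
        print(node_name, params["n_components"])
```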

Due to differences in the cross-validation splits, the CodeFlare Pipelines grid_search_cv() produces a best pipeline with n_components = 64 for the PCA, and a second-best with n_components = 45. We print out the parameters of the second-best and best pipelines as follows.

The corresponding plots are similar to those from sklearn's GridSearchCV(), except that the n_components chosen for the best score is 64 for the CodeFlare Pipelines grid_search_cv().

The Jupyter notebook of this example is available here. Please download it and try it out to understand how you might convert an sklearn example to one that uses Codeflare pipelines. And please let us know what you think.



CodeFlare: simplifying the integration, scaling and acceleration of complex multi-step analytics and machine learning pipelines on the hybrid multi-cloud.
