Hyperparameter Tuning with Kubeflow Katib

Hoi · Published in Analytics Vidhya · Sep 16, 2019 · 6 min read


Introduction

Katib is a tool that comes with Kubeflow and specializes in hyperparameter tuning. Kubeflow is rather new, and so is Katib. Meanwhile, there are already plenty of other hyperparameter tuning frameworks, so why bother with Katib?

To me, the greatest advantage of using Katib for hyperparameter tuning is its direct integration with Kubernetes (K8s) through Kubeflow. This provides a few advantages:

A. Enjoy the Scalability of K8s

If training a model is slow, hyperparameter tuning is 10x to 100x slower. The horizontal scalability of K8s means that you can train any number of models in parallel. This is fully managed by Katib, including hyperparameter generation and metrics collection.

B. Direct Deployment of a Tuned Model to K8s

While we won’t discuss how to deploy a model to K8s in this article, you have probably noticed that more and more machine learning models are deployed to K8s for production. They enjoy high availability, auto-scaling, rolling model updates and many other advantages over a single web server.

Frameworks like Seldon or Kubeflow Fairing ease the job of deploying models to K8s for machine learning engineers. Katib, being part of the Kubeflow ecosystem, ensures that your model is ready to deploy to K8s without much hassle.

C. Some Other Features of Katib

  1. Generic enough that ANY model wrapped as a Docker container can be used.
  2. Preemptible instances (GCP) / spot instances (AWS) can be used to greatly cut the infrastructure cost. The auto-scaling of K8s means that you pay only for the computation power you actually use.
  3. A basic UI for result housekeeping and visualization. By versioning the study configuration file and the Docker container image, results are reproducible and managed.
  4. Several prebuilt algorithms for tuning hyperparameters, from random sampling to reinforcement learning approaches.

Set Up Kubeflow Katib (One-off)

A. Set Up Kubeflow

This article assumes that you have a K8s cluster set up with Kubeflow installed; otherwise, follow the official Kubeflow documentation to set one up. Make sure that your “kubectl” command line is configured correctly.
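A quick sanity check, assuming Kubeflow was installed into the standard “kubeflow” namespace, is to confirm that kubectl points at the right cluster and can see the Kubeflow pods:

    # Verify the active context and that Kubeflow's pods are running
    kubectl config current-context
    kubectl get pods -n kubeflow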

B. Set Up a K8s Node Pool

With Kubeflow running on K8s, the next thing is to create a node pool for running the hyperparameter tuning jobs. Below we use GCP (GKE) as an example of how to achieve this; an equivalent gcloud command is sketched after the steps.

  1. Inside Kubernetes Engine, locate your Kubeflow K8s cluster and choose “Add Node Pool”.
  2. Give your node pool a name; we will need it later. Enable auto-scaling with a minimum of 0 and a maximum of 100 nodes, and fill in 0 for “number of nodes”.
  3. Choose the machine type that you prefer and select “Enable pre-emptible nodes (beta)” to use pre-emptible instances for a lower cost.
  4. Save your configuration. You should see your node pool show up in the K8s panel.
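If you prefer the command line to the console UI, a node pool with roughly the same settings can be created with gcloud; the cluster name, pool name, zone and machine type below are placeholders you would replace with your own.

    # Create an auto-scaling pool of pre-emptible nodes for the tuning jobs
    gcloud container node-pools create katib-pool \
        --cluster=my-kubeflow-cluster \
        --zone=us-central1-a \
        --machine-type=n1-standard-8 \
        --preemptible \
        --enable-autoscaling --min-nodes=0 --max-nodes=100 \
        --num-nodes=0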

Using Katib

A. The Katib Dashboard

You can visit the Katib dashboard from the drop-down menu of Kubeflow. Below is an example of my dashboard, which includes two studies.

The Katib dashboard showing two studies and multiple study jobs.

A study is a collection of study jobs.

Each study job is a collection of runs. For example, a study job may contain the metrics of 50 runs with 50 different sets of hyperparameters.

B. Prepare a Docker Image

Katib supports any kind of machine learning model that fulfills the following requirements:

  1. It is built as a Docker image and pushed to a registry accessible from K8s,
  2. It takes hyperparameters as command-line arguments, e.g. “--epochs=10”, “--lr=0.5”,
  3. It prints metrics to stdout in the format “metric=metric_value”.

My personal preference is to use s2i, which can turn a Python package directly into a Docker image. As a result, my build script contains just three lines to build, tag and push the image.
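For example, a build script along the following lines would do; the s2i builder image, the image name and the registry path are placeholders for your own.

    # Build the Python package in the current directory into a Docker image
    s2i build . centos/python-36-centos7 my-model

    # Tag and push it to a registry that the K8s cluster can pull from
    docker tag my-model gcr.io/my-project/my-model:v1
    docker push gcr.io/my-project/my-model:v1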

My Docker image is executed with the command “python -u train.py <arguments> <hyperparameters>”, which runs the training and prints the metrics.
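Before handing the image to Katib, it is worth checking this contract locally; the image name, flags and metric names below are placeholders for whatever your training script uses.

    # Smoke-test the image: pass hyperparameters as flags and expect
    # "metric=metric_value" lines on stdout, e.g. "accuracy=0.93"
    docker run --rm gcr.io/my-project/my-model:v1 \
        python -u train.py --epochs=10 --lr=0.5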

C. Create a Study Job

You can use the Katib dashboard UI to create a new study, but personally I find it much simpler to directly apply a K8s config. Below is a walkthrough of a study job configuration, field by field; a template sketch follows the list.

  1. Fill in metadata.name, spec.studyName and spec.owner with the identifier of your study job, the identifier of your study and your name respectively.
  2. spec.objectivevaluename is the main objective metric to be collected by Katib. Additional metrics can be specified in spec.metricsnames for cross-checking. Make sure that your program prints these metrics to stdout in the format “metric=metric_value”.
  3. spec.parameterconfigs lets you specify the range of each hyperparameter. Sampled hyperparameters will be passed into your program in the form “--hyperparameter=hyperparameter_value”.
  4. spec.workerSpec.goTemplate.rawTemplate is the actual job to be run. Make sure you fill in the image that you uploaded and the command to be executed; note that the sampled hyperparameters are injected into the command by Katib.
    Furthermore, you can specify which node pool runs your jobs via the nodeSelector attribute.
  5. spec.suggestionSpec lets you specify the strategy for choosing hyperparameters. In this example, we choose the simplest random sampling strategy and execute 10 runs in parallel. You may schedule more jobs in parallel as long as your node pool can support them.
    Personally, I have scheduled 100 parallel jobs training a relatively large model without any issues. Katib also supports other strategies such as grid search, Hyperband and Bayesian optimization; consult the example configuration YAMLs for more info.
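For reference, here is a sketch of what such a configuration can look like. It follows the v1alpha1 StudyJob API that the field names above refer to; the exact schema depends on your Katib version, and the job name, image, node pool, metrics and hyperparameters below are placeholders, so consult the official Katib example YAMLs for a definitive template.

    apiVersion: "kubeflow.org/v1alpha1"
    kind: StudyJob
    metadata:
      namespace: kubeflow
      name: my-study-job              # identifier of this study job
    spec:
      studyName: my-study             # identifier of the study
      owner: your-name
      optimizationtype: maximize
      objectivevaluename: accuracy    # main metric printed by the training script
      metricsnames:                   # additional metrics for cross-checking
        - loss
      parameterconfigs:               # hyperparameter search space
        - name: --lr
          parametertype: double
          feasible:
            min: "0.001"
            max: "0.1"
        - name: --loss-function
          parametertype: categorical
          feasible:
            list:
              - cross_entropy
              - dice
      workerSpec:
        goTemplate:
          rawTemplate: |-
            apiVersion: batch/v1
            kind: Job
            metadata:
              name: {{.WorkerID}}
              namespace: kubeflow
            spec:
              template:
                spec:
                  nodeSelector:
                    cloud.google.com/gke-nodepool: katib-pool   # node pool created earlier
                  containers:
                  - name: {{.WorkerID}}
                    image: gcr.io/my-project/my-model:v1
                    command:
                    - "python"
                    - "-u"
                    - "train.py"
                    {{- with .HyperParameters}}
                    {{- range .}}
                    - "{{.Name}}={{.Value}}"
                    {{- end}}
                    {{- end}}
                  restartPolicy: Never
      suggestionSpec:
        suggestionAlgorithm: "random"
        requestNumber: 10             # number of runs scheduled in parallel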

After creating the configuration YAML, simply execute “kubectl apply -f study_job.yaml” to start hyperparameter tuning! You may delete and rerun the study job with “kubectl delete -f study_job.yaml” at any time.

D. Checking the Result

A typical screenshot of a study job result.

The figure above shows a typical study job result. Each line represents a single run’s configuration and its metrics.

You can even do some basic analysis by filtering the runs along each metric axis. For instance, the figure below shows that for my model, runs with a high IoU metric usually use a dice loss.

Doing analysis by showing only runs with high IoU.

Conclusion

In this article, we introduced the motivation for doing hyperparameter tuning on K8s via Kubeflow’s Katib tool. We showed how to set up Katib for the first time, wrap a model into a Docker image and apply a study job configuration YAML to tune hyperparameters. You can also do basic analysis through Katib’s UI to gain insight into hyperparameter choices.

This article is meant as a jumpstart guide for setting up a highly effective hyperparameter tuning solution. You will probably find some details missing from this article; feel free to leave me a comment with any questions.
