How Katib tunes hyperparameters automatically in a Kubernetes-native way

hg liu
Sep 24, 2019 · 7 min read

Katib is a Kubernetes-native system for hyperparameter tuning
and neural architecture search.
The system is inspired by Google Vizier and supports multiple ML/DL frameworks (e.g. TensorFlow and PyTorch).

GitHub: https://github.com/kubeflow/katib

Install Katib

Katib is released as a component of Kubeflow. The latest Kubeflow release, v0.6.2, includes the Katib v1alpha2 version. You can install it by following this guideline.

Here I focus on the Katib v1alpha3 version, which is still in development and will be included in the Kubeflow 0.7.0 release. Nearly all of its features are now ready to be tested, and you can install it as below (Kubernetes v1.12+ must already be installed).

git clone git@github.com:kubeflow/katib.git
bash katib/scripts/v1alpha3/deploy.sh

After installation, you can verify that the deployments are available.

# kubectl get deploy -n kubeflow
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
katib-controller     1         1         1            1           56s
katib-db             1         1         1            1           54s
katib-manager        1         1         1            1           55s
katib-manager-rest   1         1         1            1           55s
katib-ui             1         1         1            1           54s

Try it out

After installing Katib v1alpha3, you can run kubectl apply -f katib/examples/v1alpha3/random-example.yaml to try the first Katib example.
Then you can get the new Experiment as below. I will now introduce the Katib concepts based on this example.

# kubectl get experiment random-example -n kubeflow -o yaml
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  ...
  name: random-example
  namespace: kubeflow
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - accuracy
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
    name: --lr
    parameterType: double
  - feasibleSpace:
      max: "5"
      min: "2"
    name: --num-layers
    parameterType: int
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: docker.io/katib/mxnet-mnist-example
                command:
                - "python"
                - "/mxnet/example/image-classification/train_mnist.py"
                - "--batch-size=64"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
status:
  ...

Experiment

When you want to tune hyperparameters for your machine learning model before training it further, you just need to create an Experiment CR like the one above. Let's look at the fields included in Experiment.spec:

  • trialTemplate: Your model should be packaged as an image, and its hyperparameters must be configurable by arguments (in this case) or environment variables so that Katib can automatically set their values in each trial to evaluate how the hyperparameters perform. You can train your model by wrapping the model image in a Kubernetes Job (in this case), a Kubeflow TFJob, or a Kubeflow PyTorchJob (for the latter two, you should also install the corresponding component). You can define the job as a raw string (in this case), or refer to it in a ConfigMap. See the struct definition here.
  • parameters: This field defines the search space of the hyperparameters you want to tune for your model. Katib generates hyperparameter combinations within this space based on the specified tuning algorithm, then instantiates the .HyperParameters template scope of the trialTemplate field above. See the struct definition here.
  • algorithm: Many hyperparameter tuning algorithms can be used to choose an optimal set of hyperparameters for a learning model. For now Katib supports the random, grid, hyperband, bayesian optimization and tpe algorithms (more algorithms are being developed). You can also develop a new algorithm for Katib noninvasively (we will document the guideline for developing an algorithm for Katib soon). See the struct definition here.
  • objective: When a training job with a set of generated hyperparameters starts, we need to monitor how well the hyperparameters work with the model through the metrics specified by objectiveMetricName and additionalMetricNames. The best objectiveMetricName value (maximized or minimized based on type) and the corresponding hyperparameter set are recorded in Experiment.status. If the objectiveMetricName metric for a hyperparameter set exceeds (is greater or less than, based on type) the goal, Katib stops trying further hyperparameter combinations. See the struct definition here.
  • metricsCollectorSpec: When developing a model, developers usually print or record its metrics to stdout or files during training. Katib can automatically collect these metrics with a metrics collector sidecar container. Collectors for stdout, file and tfevent sources (the collector kind is specified by the collector field, and the metrics source by the source field) are available now, and more kinds of collectors are planned. See the struct definition here.
  • maxTrialCount: Katib can generate many hyperparameter sets to test, but once the total number of generated sets exceeds maxTrialCount, hyperparameter tuning for the model stops.
  • maxFailedTrialCount: Jobs with certain hyperparameter sets may fail for some reason. If the number of failed hyperparameter sets exceeds maxFailedTrialCount, hyperparameter tuning for the model stops with a Failed status.
  • parallelTrialCount: This field specifies the maximum number of hyperparameter sets to be tested in parallel.
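As a sketch of the ConfigMap-based alternative to the raw-string trialTemplate mentioned above, the spec might look like the following. Note that the field names (templateSpec, configMapName, configMapNamespace, templatePath) and the ConfigMap/key names are my assumptions about the v1alpha3 API, so check the struct definition before relying on them:

```yaml
# Hypothetical sketch: referencing a trial template stored in a ConfigMap
# instead of embedding it as a raw string. Field and object names below
# are assumptions and may differ from the actual v1alpha3 API.
trialTemplate:
  goTemplate:
    templateSpec:
      configMapName: trial-template        # ConfigMap holding the template
      configMapNamespace: kubeflow
      templatePath: mxnet-mnist-template   # key inside the ConfigMap data
```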

Trial

For each set of hyperparameters, Katib internally generates a Trial CR, which includes the hyperparameter key-value pairs, the job manifest string with the parameters instantiated, and some other information, as below. The Trial CR is used for internal logic control, and end users can ignore it.

# kubectl get trial -n kubeflow
NAME                      STATUS      AGE
random-example-fm2g6jpj   Succeeded   4h
random-example-hhzm57bn   Succeeded   4h
random-example-n8whlq8g   Succeeded   4h
# kubectl get trial random-example-fm2g6jpj -o yaml -n kubeflow
apiVersion: kubeflow.org/v1alpha3
kind: Trial
metadata:
  ...
  name: random-example-fm2g6jpj
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-example
    uid: c7bbb111-de6b-11e9-a6cc-00163e01b303
spec:
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - accuracy
    goal: 0.99
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: --lr
    value: "0.027435456064371484"
  - name: --num-layers
    value: "4"
  - name: --optimizer
    value: sgd
  runSpec: |-
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: random-example-fm2g6jpj
      namespace: kubeflow
    spec:
      template:
        spec:
          containers:
          - name: random-example-fm2g6jpj
            image: docker.io/katib/mxnet-mnist-example
            command:
            - "python"
            - "/mxnet/example/image-classification/train_mnist.py"
            - "--batch-size=64"
            - "--lr=0.027435456064371484"
            - "--num-layers=4"
            - "--optimizer=sgd"
          restartPolicy: Never
status:
  completionTime: 2019-09-24T01:38:39Z
  conditions:
  - lastTransitionTime: 2019-09-24T01:37:26Z
    lastUpdateTime: 2019-09-24T01:37:26Z
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2019-09-24T01:38:39Z
    lastUpdateTime: 2019-09-24T01:38:39Z
    message: Trial is running
    reason: TrialRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2019-09-24T01:38:39Z
    lastUpdateTime: 2019-09-24T01:38:39Z
    message: Trial has succeeded
    reason: TrialSucceeded
    status: "True"
    type: Succeeded
  observation:
    metrics:
    - name: Validation-accuracy
      value: 0.981489
  startTime: 2019-09-24T01:37:26Z
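To make the StdOut collector's job concrete: the observation.metrics above come from scanning the training container's log lines for the metric names declared in the objective spec. Here is a minimal Python sketch of that idea, with illustrative log lines; it is not Katib's actual implementation, and the real collector's line format rules may differ:

```python
import re

# Metric names as declared in the Experiment's objective spec.
METRIC_NAMES = {"Validation-accuracy", "accuracy"}

# Matches whole lines like "Validation-accuracy=0.981489".
LINE_RE = re.compile(r"^(?P<name>[\w-]+)=(?P<value>[0-9.]+)$")

def parse_metrics(stdout_lines):
    """Keep the latest value observed for each known metric name."""
    observed = {}
    for line in stdout_lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("name") in METRIC_NAMES:
            observed[m.group("name")] = float(m.group("value"))
    return observed

logs = [
    "Epoch[9] Train-accuracy=0.993210",   # ignored: not a bare metric line
    "Validation-accuracy=0.981489",
    "accuracy=0.981489",
]
print(parse_metrics(logs))
# → {'Validation-accuracy': 0.981489, 'accuracy': 0.981489}
```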

Suggestion

Katib internally creates a Suggestion CR for each Experiment CR. The Suggestion CR records the hyperparameter tuning algorithm name in the algorithmName field, and how many hyperparameter sets Katib is asking the algorithm to generate in the requests field. The CR also tracks all generated hyperparameter sets in status.suggestions. As with Trial, the Suggestion CR is used for internal logic control and end users can ignore it.

# kubectl get suggestion random-example -n kubeflow -o yaml
apiVersion: kubeflow.org/v1alpha3
kind: Suggestion
metadata:
  ...
  name: random-example
  namespace: kubeflow
  ownerReferences:
  - apiVersion: kubeflow.org/v1alpha3
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-example
    uid: c7bbb111-de6b-11e9-a6cc-00163e01b303
spec:
  algorithmName: random
  requests: 3
status:
  ...
  suggestions:
  - name: random-example-fm2g6jpj
    parameterAssignments:
    - name: --lr
      value: "0.027435456064371484"
    - name: --num-layers
      value: "4"
    - name: --optimizer
      value: sgd
  - name: random-example-n8whlq8g
    parameterAssignments:
    - name: --lr
      value: "0.013743390382347042"
    - name: --num-layers
      value: "3"
    - name: --optimizer
      value: sgd
  - name: random-example-hhzm57bn
    parameterAssignments:
    - name: --lr
      value: "0.012495283371215943"
    - name: --num-layers
      value: "2"
    - name: --optimizer
      value: sgd
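The relation between spec.requests and status.suggestions is simple arithmetic: the algorithm service is only asked for the sets that have not been generated yet. A rough Python sketch of that decision (mirroring the bookkeeping, not Katib's actual code):

```python
def sets_to_generate(requests, suggestions):
    """How many new hyperparameter sets the algorithm service should
    produce: the requested total minus what was already generated."""
    return max(0, requests - len(suggestions))

# With spec.requests: 3 and no suggestions generated yet, the algorithm
# service is asked for 3 sets; once all 3 exist, it is asked for 0.
print(sets_to_generate(3, []))                  # 3
print(sets_to_generate(3, ["s1", "s2", "s3"]))  # 0
```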

What happens after an Experiment CR is created

When a user creates an Experiment CR, the Katib controllers, namely the experiment controller, trial controller and suggestion controller, work together to tune hyperparameters for the user's machine learning model.

Katib workflow
  1. An Experiment CR is submitted to the Kubernetes API server. The Katib experiment mutating and validating webhooks are called to set default values for the Experiment CR and to validate it, respectively.
  2. The experiment controller creates a Suggestion CR.
  3. The suggestion controller creates the algorithm deployment and service based on the new Suggestion CR.
  4. When the suggestion controller verifies that the algorithm service is ready, it calls the service to generate spec.requests - len(status.suggestions) hyperparameter sets and appends them to status.suggestions.
  5. The experiment controller finds that the Suggestion CR has been updated and generates a Trial for each new hyperparameter set.
  6. The trial controller generates a job based on the runSpec manifest with the new hyperparameter set.
  7. The related job controller (Kubernetes batch Job, Kubeflow PyTorchJob or Kubeflow TFJob) generates the Pods.
  8. The Katib Pod mutating webhook is called to inject the metrics collector sidecar container into the candidate Pod.
  9. While the ML model container runs, the metrics collector container in the same Pod collects metrics from it and persists them to the Katib DB backend.
  10. When the ML model job ends, the trial controller updates the status of the corresponding Trial CR.
  11. When a Trial CR finishes, the experiment controller increases the requests field of the corresponding Suggestion CR if needed, and everything goes back to step 4. If the trials meet one of the end conditions (maxTrialCount exceeded, maxFailedTrialCount exceeded, or goal reached), the experiment controller wraps everything up, and the best hyperparameter set is recorded in the .status.currentOptimalTrial field.
# kubectl get experiment random-example -o yaml
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  ...
  name: random-example
  namespace: kubeflow
spec:
  ...
status:
  ...
  currentOptimalTrial:
    observation:
      metrics:
      - name: Validation-accuracy
        value: 0.981489
    parameterAssignments:
    - name: --lr
      value: "0.027435456064371484"
    - name: --num-layers
      value: "4"
    - name: --optimizer
      value: sgd
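Conceptually, picking currentOptimalTrial amounts to comparing each finished trial's objective metric according to the objective type. A minimal Python sketch of that selection (not Katib's code; the second trial's value below is a made-up number for illustration):

```python
def best_trial(trials, metric_name, objective_type):
    """Pick the trial whose objective metric is best.
    trials: dicts like {"name": ..., "metrics": {metric_name: value}}.
    objective_type: "maximize" or "minimize".
    """
    candidates = [t for t in trials if metric_name in t["metrics"]]
    if not candidates:
        return None
    key = lambda t: t["metrics"][metric_name]
    if objective_type == "maximize":
        return max(candidates, key=key)
    return min(candidates, key=key)

trials = [
    {"name": "random-example-fm2g6jpj", "metrics": {"Validation-accuracy": 0.981489}},
    {"name": "random-example-hhzm57bn", "metrics": {"Validation-accuracy": 0.9}},  # hypothetical value
]
print(best_trial(trials, "Validation-accuracy", "maximize")["name"])
# → random-example-fm2g6jpj
```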

Future

There are still some features planned for the future, listed below. Any contribution is welcome, and you can join the #katib channel of the Kubeflow Slack workspace for questions or discussion.

  • Early stopping.
  • More neural architecture search (NAS) support.
  • More hyperparameter tuning algorithms.
  • More DB backends for persisting model metrics (currently only MySQL is implemented).
  • More metrics collectors.

About Me

My name is Hou Gang Liu, and I work at IBM as an Advisory Software Developer contributing to Kubeflow. I started contributing to Katib and other Kubeflow components in December 2018. I currently serve as a Katib maintainer, manifest owner and KFServing reviewer, and I am also driving Kubeflow contributions for IBM in China. Kubeflow is growing fast, and we look forward to more and more contributors and users joining this great community. Stay tuned for the coming Kubeflow 0.7 and 1.0 releases; I will share more updates then.

My github: https://github.com/hougangliu
