Leverage your Kubernetes orchestration tool to tune machine learning models.
At idealo, we use Argo Workflows & Pipelines to orchestrate our Kubernetes deployments. As machine learning engineers, we found that Argo is useful not only for this main purpose but also for running compute-intensive machine learning jobs. This lets us use our cluster’s spare capacity for hyperparameter optimization.
Hyperparameter optimization is a way of improving machine learning models, where many modified model versions are trained and then evaluated to identify the best performing candidate. For example, an artificial neural network could be trained with different numbers of neurons to identify the ideal network size. Because hyperparameter optimization usually requires many repetitive training cycles, it makes sense to parallelize the individual training jobs.
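The neuron-count example above can be sketched as a naive exhaustive search. This is a toy illustration: the `validation_loss` function below is hypothetical and merely simulates a full train-and-evaluate cycle.

```python
# Illustrative sketch of exhaustive hyperparameter search.
# In practice, each call below would be a full training run.
def validation_loss(n_neurons: int) -> float:
    # Hypothetical stand-in: real code would train a network with
    # `n_neurons` hidden units and return its validation loss.
    return (n_neurons - 48) ** 2 / 1000 + 0.1

candidates = [16, 32, 48, 64, 128]
results = {n: validation_loss(n) for n in candidates}
best = min(results, key=results.get)
print(best)  # the candidate with the lowest toy loss
```

Because each candidate is evaluated independently, these training runs are embarrassingly parallel, which is exactly what the Argo setup below exploits.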
How We Use Argo for Hyperparameter Optimization
We start our parallel training jobs through the orchestration tool Argo and manage the hyperparameter optimization via the Python package Optuna. The Argo workflow first starts containers that download the training data and create a database to log results. Then, it starts a study in Optuna with several parallel workers that perform the hyperparameter optimization. Each of these parallel workflow steps repeatedly trains and evaluates the model for different hyperparameters. Optuna picks the hyperparameters for the training in an intelligent way by considering promising past runs. In a final step, the best parameter combination is printed.
The Argo workflow steps that perform the hyperparameter optimization repeat until a stopping condition is met, such as a maximum training time. The stopping condition could also be based on cluster load, which would allow the cluster’s spare capacity to be used optimally.
Note that many Kubernetes clusters are not provisioned with GPU or TPU nodes. Therefore, running machine learning jobs on Kubernetes makes the most sense for models that perform well on CPUs.
The Argo workflow definition for hyperparameter optimization of an XGBoost model can be found here: https://github.com/drosin/argo-hyperparameter-tuner.
Two Ways of Starting Training Jobs
In the repo, we include two Argo workflows that differ in how training jobs are scheduled:
- The workflow `hyperparameter-tuner-argo-level.yaml` starts a new pod for each training job. This is preferred if you want a clean environment for each run or you have problems with memory leakage.
- The workflow `hyperparameter-tuner-optuna-level.yaml` runs all training jobs of one parallel branch within a single pod. Here, Optuna handles starting the training jobs. This is preferred when individual training jobs finish quickly compared to the pod creation time.
The Bottom Line
You can combine a Kubernetes orchestration tool, such as Argo, with Optuna to run hyperparameter optimization jobs on your cluster. This lets you take advantage of a cluster’s spare capacity to optimize CPU-based machine learning models.
Do you love machine learning? Have a look at our vacancies.