We’ve recently used Kubeflow to build a Machine Learning app in AWS. It’s worked out well, but would we choose to do it again?
What is Kubeflow?
Kubeflow is an open source set of tools for building ML apps on Kubernetes.
The project is attempting to build a standard for ML apps that is suitable for each phase in the ML lifecycle: experimentation, data prep, training, testing, prediction, etc.
Everything can be orchestrated with Kubeflow Pipelines, which are controllable from a simple UI. There are also integrated notebook servers for quick experimentation and easy access to the cluster’s resources.
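Under the hood, a pipeline is just a DAG of containerised steps that must run in dependency order. A toy sketch of that ordering logic (this is only the concept, not the Kubeflow Pipelines API, and the step names are made up):

```python
# Toy sketch: a pipeline is a DAG of steps, each of which Kubeflow runs
# in its own container. This only illustrates the ordering logic.
from graphlib import TopologicalSorter

# step name -> list of upstream steps it depends on (illustrative names)
steps = {
    "prep": [],
    "train": ["prep"],
    "evaluate": ["train"],
    "deploy": ["evaluate"],
}

# A topological sort guarantees dependencies run before their dependants,
# which is exactly what the pipeline orchestrator enforces.
order = list(TopologicalSorter(steps).static_order())
print(order)  # -> ['prep', 'train', 'evaluate', 'deploy']
```

Kubeflow handles this scheduling for you, launching each step's container on the cluster once its upstream steps have finished.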
It’s a very flexible and extensible framework because it relies on Kubernetes to manage all code execution, resource management, and networking. Any other required Kubernetes applications (such as Dask in our case) can share the same cluster. We’ve also deployed a cluster autoscaler, which works seamlessly with Kubeflow, enabling the cluster to crunch some seriously large numbers with no added complexity.
Comparisons to Airflow
With Kubeflow, each pipeline step is isolated in its own container, which drastically improves the developer experience versus a monolithic solution like Airflow. That said, this perhaps shouldn’t be counted as a benefit, since Airflow has a Docker Operator that essentially replicates the concept. Similarly, we could create a simple microservice to trigger Kubernetes jobs (and have done for other projects) and use Houston to orchestrate them.
Still, it’s nice to have this as a core concept. One key benefit is the direct integration between containers and the UI: I can click on a stage in my pipeline, view the container logs, and get scheduling information directly from Kubernetes, e.g. “Unschedulable: 0/4 nodes are available: 1 Insufficient pods, 3 Insufficient memory”.
I have seen people claim that Airflow’s huge library of operators gives it a clear edge over Kubeflow, but this overlooks how easy it is to convert an Airflow operator into a Kubeflow component. There’s even a function to do this in the Kubeflow SDK:
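The kfp v1 SDK shipped a helper for this (`kfp.components.create_component_from_airflow_op`). The stub below only illustrates the mechanism, without requiring kfp or Airflow to be installed; the operator class and its arguments are made up for illustration:

```python
# Hedged sketch of the idea: wrap an Airflow-style operator's execute()
# method as a plain function, which kfp can then package as a component.

def airflow_op_to_component(op_class, **op_kwargs):
    """Return a function that instantiates the operator and runs it."""
    def component_fn():
        op = op_class(**op_kwargs)
        # Airflow operators do their work in execute()
        return op.execute(context={})
    return component_fn

# Minimal stand-in for a real operator (a real pipeline would import
# e.g. Airflow's BashOperator instead).
class EchoOperator:
    def __init__(self, message):
        self.message = message

    def execute(self, context):
        return f"echo: {self.message}"

step = airflow_op_to_component(EchoOperator, message="hello")
print(step())  # -> echo: hello
```

The real SDK helper goes further, packaging the wrapped function with an Airflow base image so it runs as a containerised pipeline step.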
Another benefit over Airflow that can’t be overstated is the responsiveness of the UI. Status changes update in real time. I can terminate a pipeline and see that it worked almost immediately. Best of all, unlike Airflow, it doesn’t randomly crash!
Does it Benefit Data Scientists?
Kubeflow boasts that it saves data scientists from all of the boring stuff they don’t want to do (i.e. infrastructure and deployment), and will supposedly allow them to use the same code locally as in production. This is a bold claim, as in previous projects we have found that it was worthwhile to rewrite entire ML pipelines in order to productionise them.
This claim is true to some extent; data scientists have easy access to the full compute power of the cluster from their notebooks, which are nicely separated from other processes thanks to Kubernetes namespaces. However, every script needed to be converted by hand into a Kubeflow component to get it running in a pipeline. For each one this meant writing a Dockerfile, building a container, creating a component spec, and adding a custom container spec to make sure it ran on the right nodes and requested enough memory. This is a lot of overhead. I have seen others criticise Kubeflow for requiring too much k8s and DevOps expertise, but I think it’s actually pretty clean considering the alternatives. Kubeflow lets you customise everything, which always comes at the expense of simplicity.
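Each of those hand-written component specs looked roughly like the following. This is a minimal sketch in the kfp component.yaml format; the image name, file names, and argument names are illustrative, not taken from the real project:

```yaml
# Hedged sketch of a Kubeflow component spec (component.yaml).
# The image, command, and parameter names are illustrative only.
name: Train model
description: Trains the model on the prepared dataset
inputs:
  - {name: training_data, type: String, description: Path to input data}
outputs:
  - {name: model, type: String, description: Path to the trained model}
implementation:
  container:
    image: example-registry/train:latest   # built from the step's Dockerfile
    command: [python, train.py]
    args:
      - --data
      - {inputValue: training_data}
      - --model-out
      - {outputPath: model}
```

Multiply this by every script in the pipeline, plus a Dockerfile and a container build for each, and the overhead adds up quickly.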
The reusability of components is also a big benefit. There are already loads that have been contributed by the community. Unfortunately, we didn’t find a use for any of them, as is often the case in data science projects; our modelling approach evolved and became very specific to the problem, and therefore required bespoke components.
In conclusion then: Kubeflow isn’t helping us much more than a no-framework solution would, but the framework is a nice one to work with, and the customisability is great. We were able to run the entire application from Kubeflow pipelines, and we made development easier for ourselves with custom Jupyter Lab notebook servers.
For this project, the components and pipelines were worked on by the engineers while the scientists focused on modelling, but now that we’ve created these components we should be able to complete our next project without as much engineering assistance. We could be heading toward a future where data scientists never need to worry about infrastructure. This is the core concept of Kubeflow, and is shared by SageMaker…
Why not just use SageMaker?
Amazon’s SageMaker offers a very similar solution, except it’s fully managed, ‘optimised’ for ML, and comes with lots of integrated tools such as notebook servers, Auto-ML, and monitoring.
Managed and integrated does not mean easy to use, though. SageMaker pipelines look almost identical to Kubeflow’s, but their definitions require lots more detail (like everything on AWS), and do very little to simplify deployment for scientists.
The main reason we chose not to use it, however, is because Kubeflow allows us to keep the entire application portable between cloud providers. Multicloud portability is an underrated but increasingly important factor when making architectural decisions. Vendor lock-in can put companies at a strategic disadvantage by preventing them from using the tools that their competitors have access to, or forcing them to use a more expensive provider.
Our code can run anywhere that supports Kubernetes (so everywhere), but this doesn’t mean we have to forgo all proprietary software; it’s also possible to run SageMaker jobs via Kubeflow components, allowing us to benefit from SageMaker’s tools while sticking to the Kubeflow framework. Then, if we wanted to migrate to GCP, we could do so without a complete rewrite. In future projects we can reuse components whether they’re AWS, GCP, or Azure.
Would I Choose it Over Houston?
I don’t have to. Houston could be used to string together multiple separate pipelines, and link them to other tasks outside the cluster.
I would pick Kubeflow over Airflow for an ML project because it scales better and offers a better developer experience. I’d also pick it over SageMaker because it’s simpler, more portable, and has access to SageMaker anyway. I think Kubeflow could one day be an unbeatable tool for data science projects, but it’s not there yet.