Building ML Pipelines

Kubernetes with Argo for the Win

John Aven
Hashmap, an NTT DATA Company
Oct 28, 2020 · 10 min read


Kubernetes is a generally untapped compute resource. It sits in that unfamiliar territory between IaaS and PaaS. Many engineers are apprehensive about using it, IT management is fearful of adding it to the stack, and technologists keep begging for it to be used — but why?

  • To engineers, another abstraction for pipeline code deployment is yet another technology the ‘company’ requires them to learn.
  • To management, it is another piece of tech to support — which means more stuff & more debt.
  • To the technologist, those are boneheaded objections (yes, every technologist has a colossal ego) — there is no reason to hold back progress when you can modernize and move forward with better tooling.

Well, they are all wrong — and they are all correct. The question is how to reconcile this. To do so, you need a business case to work from. For this situation, let’s address the execution of pipelines developed by data scientists.

Use Case: Our data science program, while moving to the cloud, is all over the place. The team has failed to establish a consistent and reliable approach to solving machine learning problems, and we have been told that the issue is in their workflows. It has been reported that each data scientist, machine learning engineer, and similar role across the various parts of the company uses a different approach to building pipelines, builds their own tooling, and runs code across multiple systems — requiring movement of data that may not be necessary. We want a solution that allows all of the workflows to be created in a single way that enforces running the code in a specified and blessed environment. We don’t want the data scientists to need to know, or worry about, where that environment is. As we are new to the cloud and want to be prepared for the future — we don’t want to play catch-up now only to play it again in short order — but we also don’t want the majority of our technical employees to have to worry about how we have modernized (e.g., learning a lot of new technology to be productive).

At each level (only a few are shown here), the tech options can be staggering — and it is easy to fall into analysis paralysis.

There are many different ways this could be addressed. In a consulting engagement, we’d gather a lot more information to propose the best fit for your organization; the topic here, though, is Kubernetes and Argo. And yes, this particular combination is a good choice — for more reasons than are expressed in the use case above. A mature implementation will involve more than what we discuss here, but this will be a good starting point for thinking about the approach and its feasibility in your organization.

Kubernetes

First off, what is Kubernetes, and why do technologists love it so much? (We do too — after all, we are technologists.) For starters, the ability to package an application into a deployable unit that can run on nearly any system is desirable. Imagine not having to rework it for each different system — that means worrying far less about the operational space. This is a big part of where DevOps (another discussion) comes in. I am talking about Docker — one of the greatest containerization technologies out there. What Kubernetes (k8s for short) does is provide an environment to consistently manage (or orchestrate) these containers in a distributed (multi-machine) fashion, along with the interactions between them, using common deployment paradigms.

Having a consistent environment to orchestrate these containers is one thing — but how you do so is also important. Kubernetes provides a declarative interface that allows engineers to clearly and consistently define how things are put together and deployed. These declarations are made in YAML files, known as manifests, which specify application configuration. That means the declarations are equivalent to defining how an application is configured — convenient, right?

Example Argo Workflow

These manifest files, which declare your application’s deployed configuration, are given types that represent abstractions of what they configure. These include Controllers (Jobs, Deployments, and DaemonSets — which run a Pod on every node) that deploy Pods (the smallest unit of work), and other abstractions like Services that manage how the Pods these Controllers deploy are reached. The engineering extension of these concepts is the Operator, which adds additional capabilities or encapsulates highly reusable software units.
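As a minimal sketch of what such a manifest looks like — the names, labels, and image below are illustrative placeholders, not from any real project — here is a Deployment that declares two replicas of a containerized app:

```yaml
apiVersion: apps/v1
kind: Deployment                # a Controller type
metadata:
  name: example-app             # hypothetical name
spec:
  replicas: 2                   # the Controller keeps two Pods running at all times
  selector:
    matchLabels:
      app: example-app
  template:                     # the Pod (smallest unit of work) being managed
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:1.0   # placeholder image
          ports:
            - containerPort: 8080
```

Applying this file (e.g., kubectl apply -f deployment.yaml) tells Kubernetes the desired state; the platform then works continuously to make reality match the declaration.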

Now, you may hear the term tossed around — Cloud Native — but what does that mean? To be clear, building something to run on cloud vendor services does not make your application cloud-native, and using a cloud-native application does not make you cloud-native. What does make it cloud-native is that it is containerized (e.g., with Docker) and deployable to a container orchestration platform (e.g., Kubernetes), most often using modern DevOps tooling and development practices.

Evolution of cloud computing from defining the cloud, to running on a cloud provider, to deploying solutions that run on ANY cloud.

Since these platforms can be deployed in (or are offered as a service by) many cloud environments, and can be used as a core component of a private cloud, these solutions are naturally highly portable. While there will be small differences between environments, once one deployment is prepared, the work required to adapt it to a new environment is minimal.

Argo

Now that you know what Kubernetes is, we need to discuss Argo. Argo is a custom Kubernetes operator designed to provide a workflow orchestration layer over step-wise processes with dependencies — what most modern computing solutions call a DAG (directed acyclic graph).

A quick search will lead you to other DAG-related tools; you may be familiar with some of them: Apache Airflow, Prefect, Luigi, etc. — but these technologies are not cloud-native. Like everything else in Kubernetes, Argo is declarative, built to run across environments, and made to manage containerized workloads. In fact, once it is installed on your Kubernetes cluster — extending the Kubernetes API and CLI with a Workflow resource — you can deploy an Argo manifest that declaratively captures the idea of a DAG.

Assume that we have the machine learning workflow depicted below:

Clearly, this is an oversimplification of real-world pipelines, but it will serve its purpose.

Each step is packaged into a Docker image and executed in a declarative fashion, with the DAG fully specified in the Argo manifest. This allows reusable Docker images with common pre-canned routines to be shared across workflows in Kubernetes — my discussion around using Docker for ML work can be found here. By letting an Argo manifest specify the pipeline, you are declaring the order in which its steps are to be executed — and in doing so, you are also saying that everything will be executed on a Kubernetes cluster and that Argo will take care of scheduling, monitoring, and tracking the progress of your pipeline.
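To make that concrete, here is a minimal sketch of an Argo Workflow manifest for a three-step pipeline (prepare → train → evaluate). The image name, Python module names, and step names are hypothetical placeholders, not taken from the workflow figure above:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-        # Argo appends a unique suffix per run
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:                          # the DAG, declared in full
        tasks:
          - name: prepare-data
            template: run-step
            arguments:
              parameters:
                - name: step
                  value: prepare
          - name: train
            template: run-step
            dependencies: [prepare-data]   # runs only after prepare-data succeeds
            arguments:
              parameters:
                - name: step
                  value: train
          - name: evaluate
            template: run-step
            dependencies: [train]
            arguments:
              parameters:
                - name: step
                  value: evaluate
    - name: run-step                # one reusable container template for every stage
      inputs:
        parameters:
          - name: step
      container:
        image: registry.example.com/ml-steps:1.0   # placeholder image with pre-canned routines
        command: [python, -m]
        args: ["pipeline.{{inputs.parameters.step}}"]
```

Submitting the manifest (e.g., with kubectl create -f or argo submit) is all it takes to have the run scheduled on the cluster.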

The benefits (UI)

The Argo UI is just awesome. It provides views of every version of the pipeline as it was executed. Being able to visualize each pipeline run is one thing — and an important feature in its own right — but it is not the only feature that Argo provides.

Argo’s UI tracks, for each execution, the STDIO logs of each phase — useful for identifying execution issues when they arise.

Now, when you run your code often and training your models seems to be taking longer than you would expect, you need to track down which stage is the bottleneck. Argo tracks this timing information for you and presents it visually in the UI, showing you exactly which stages take the longest to execute.

Anyway, Why ML with Argo?

Machine learning is naturally a composable process. Each stage of a workflow, or a collection of stages, can be grouped into a single ‘atomic’ processing step. In general, these are stitched together in custom processes, sometimes with Airflow. But almost always it comes down to a custom process — often an in-house solution that is exceedingly rigid (usually written by software engineers with little to no data science experience). Or worse yet, your orchestration is manual, and you have to move things (data, models, etc.) around between environments. Don’t do that! Take steps toward more automated environments. A Kubernetes environment with Dockerized workflows is a great first step in doing so.

Data science is a software engineering practice — even if many data scientists don’t feel that way. As such, you must monitor all of the various training steps in a centralized fashion. Argo provides this by letting you observe the status of pipelines and the outcome of each step. This isn’t about tracking the ML results, but the software component of what you are doing. See my article on MLflow for approaches to dealing with that other, unique need.

Next, and perhaps most important to a data scientist, is the need to support polyglotism — developing a solution using multiple programming languages. This is one big reason that many processes are done manually: different needs dictate different environments. Today, that trade-off is no longer necessary — containerized solutions are a great way to support polyglotism, as the sketch below shows.
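As a hedged illustration — the images and inline scripts here are placeholders, not from any real pipeline — a single Argo Workflow can chain a Python step and an R step, each running in its own language runtime:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: polyglot-
spec:
  entrypoint: pipeline
  templates:
    - name: pipeline
      dag:
        tasks:
          - name: featurize
            template: python-step
          - name: score
            template: r-step
            dependencies: [featurize]   # the R step runs after the Python step
    - name: python-step
      container:
        image: python:3.9               # stock Python runtime for one stage
        command: [python, -c, "print('featurizing...')"]
    - name: r-step
      container:
        image: r-base:4.2.1             # stock R runtime for the next stage
        command: [Rscript, -e, "cat('scoring...')"]
```

The two stages only need to agree on where they read and write data; the languages never have to know about one another.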

Now, when you subscribe to a cloud provider and use their tools (hopefully not dogmatically), you end up locking yourself into their space and forcing your data scientists to use their tools — not tools fit for your purpose. This does not mean you should avoid solutions from a cloud provider — you need to assess the fit, and perhaps some of their solution space will work for you. But always be mindful. A cloud-native solution, based on Kubernetes deployments, provides a framework that empowers you to lift and shift to a new provider far more easily than a cloud-provider-specific solution does. And while moving data during a migration is difficult enough, moving your data processes is harder — much harder.

The Bad (manual work to create DAG)

Building a DAG in Argo is not very pleasant. From a data scientist’s point of view, this is a lot more work than is desirable, and creating/managing many of these pipelines, possibly with small variations, gets messy. This operational work is more than I honestly enjoy doing (all engineers are lazy — facts). As the manifest sketched earlier suggests, I have little desire to spell out the template for each stage — where the images are located, what the entry point to my code is, and so on — all before even getting to the DAG itself.

The Good (Stay Tuned for Links)

It would be much better to have an abstraction that builds these files for you from a minimal declaration of a DAG — and fortunately, such a tool exists. See this blog on LearningAwps. It is a flexible framework designed to help you deploy your ML code, with minimal change to your existing processes, into an operationally managed environment — moving you closer to production code with ease: the MLOps approach.

How Hashmap Can Help

The next step is deciding whether Kubernetes and Argo should be part of the data analytics solution for your organization. Hashmap can help you here. Our machine learning and MLOps experts are here to help you on your journey — to bring you and your organization to the next level. Let us help you get ahead of your competition and become truly efficient in your data analytics.

If you’d like assistance along the way, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, data science, MLOps, and various other technology consulting services.


John Aven, Ph.D., is the Director of Engineering at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.
