Making Unplanned Interrupts Self-Service with Argo Workflows
Author: Andrew Kim
Sitting at the tail end of operational demands is familiar to any DevOps engineer. DevOps teams usually work in an operationally driven environment where priorities are often unplanned. Giving users the ability to service their own requests is a core goal for the DevOps team at SailPoint. Ultimately, we want to automate ourselves out of operationally driven interrupts.
Not too long ago we introduced not only ArgoCD into our toolbelt but also Argo Workflows. Argo Workflows is an open-source, CNCF-hosted project designed from the ground up as a container-native workflow engine for orchestrating jobs on Kubernetes. We’ll share some of our approaches to automating ourselves out of certain repeatable tasks.
A customer of DevOps is really anyone who needs anything from DevOps. DevOps teams often work closely not only with software engineering teams, but also with customer service, professional services, and sales organizations. Unplanned work is no stranger in SaaS organizations. As we grew, we found ourselves constantly balancing operational interrupts. We were scripting our interrupt tasks to reduce our time to deliver, but scripting did not remove the context switching. It was clear we needed to automate ourselves out of certain asks and make them self-service.
Hello Argo Workflows
In this example we will use a common ask of DevOps across any SaaS organization: onboarding tenants. Even in a multi-tenant architecture, it is sometimes necessary to deploy infrastructure components when creating tenants. Those components can be infrastructure, aliases, services, records, or configs. In an ever-changing microservice environment, this is often challenging because the underlying components must not only exist, but their dependencies must also be updated with them. We were able to get our processes down to a handful of operational scripts that could be checked out and run, but the process still wasn’t fully automated. We still had challenges: some scripts could not run in parallel, some needed a feedback loop, and we had to deal with order of operations, race conditions, throttling, and error handling.
Our goal was to make the entire process self-service and invisible to the user. This is where we say hello to Argo Workflows. Naturally, the question arises, “Don’t you already have Jenkins? Why not create a Jenkins job for this?” Great question. Jenkins, at its core, is really a CI tool. This doesn’t mean you can’t use it for workflows. Many do and have had success with it. What we were really looking for was the ability to rapidly iterate through workflow development, which felt more natural in Argo Workflows. Additionally, deploy automation was already being handled by ArgoCD, so Argo was the natural direction to lean. We needed a cloud-native tool that was a workflow engine at its core, with support for multi-step workflows with properly tracked dependencies, container-native execution, region support, and support for Kubernetes. In Argo, each step in the workflow is a container. The containers can run in series or in parallel, handlers can easily be added, and a graphical UI that shows dependencies is always a plus.
First, we focused on containerizing the scripts, since the bulk of the infrastructure components were created and deployed there. Luckily, this was the quick and easy part. We were able to create steps that all kicked off in parallel, and Argo WF handled this relatively easily. We updated the scripts where needed, linked dependencies appropriately, and got the template for the workflow started rather quickly.
Secondly, we had to tackle how the workflow would hit certain admin APIs that lived internally. This proved to be a bit more challenging for several reasons. It needed to be and stay secure, and since this was one of the first use cases of this type, we knew that whatever we did would most likely be used as an example to rinse and repeat. Most of the operational tooling lived in an operationally focused VPC. Product-focused endpoints lived in their own VPCs. We couldn’t just start poking holes in the security groups. We ended up deciding to peer the Ops VPC to the product VPCs. Peering came with other benefits, opening up options such as the ability to initiate workflows, Jaeger spans, and logging capabilities.
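To make the "each step is a container, with dependencies tracked" idea concrete, here is a minimal sketch of what such a workflow definition could look like, built as a Python dict. The task names, image, and script paths are hypothetical placeholders and not our actual onboarding workflow; only the manifest field names (`entrypoint`, `templates`, `dag`, `tasks`, `dependencies`) come from Argo Workflows itself.

```python
import json


def container_template(name: str, script: str, tenant: str) -> dict:
    """One workflow step: a container that runs a single onboarding script."""
    return {
        "name": name,
        "container": {
            # Hypothetical image and script paths, for illustration only.
            "image": "registry.example.com/devops/onboarding:latest",
            "command": [script, "--tenant", tenant],
        },
    }


def build_onboarding_workflow(tenant: str) -> dict:
    """Return an Argo Workflow manifest (as a dict) with a small DAG."""
    dag_tasks = [
        # No dependencies listed, so these two tasks can start in parallel.
        {"name": "create-records", "template": "create-records"},
        {"name": "create-config", "template": "create-config"},
        # Starts only after both tasks above have completed successfully.
        {
            "name": "call-admin-api",
            "template": "call-admin-api",
            "dependencies": ["create-records", "create-config"],
        },
    ]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "onboard-tenant-"},
        "spec": {
            "entrypoint": "onboard",
            "templates": [
                {"name": "onboard", "dag": {"tasks": dag_tasks}},
                container_template("create-records", "/scripts/create_records.sh", tenant),
                container_template("create-config", "/scripts/create_config.sh", tenant),
                container_template("call-admin-api", "/scripts/call_admin_api.sh", tenant),
            ],
        },
    }


if __name__ == "__main__":
    # Kubernetes accepts manifests in JSON as well as YAML.
    print(json.dumps(build_onboarding_workflow("acme"), indent=2))
```

The point of the sketch is the shape: independent tasks fan out in parallel, and the admin API call waits on both of them through its `dependencies` list rather than through ordering baked into a script.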
Thirdly, after working through the admin API calls, rules, and permissions, we had to tackle logging. To keep things light, we don’t keep logs in Argo WF for long, nor do we want them there. Since we use Elasticsearch and Kibana heavily, we did not want Argo WF logs to be an exception. Because we had peered the Ops VPC, this was a breeze. We ended up passing all the workflow logs through Fluentd into Elasticsearch.
Next, we needed to figure out how to trigger the workflow. After all, the end user of this workflow wasn’t going to be DevOps. We couldn’t ask outside teams to log in to Argo Workflows and start kicking off jobs left and right. We needed a solution that would be event driven. This is where Argo Events comes into play. Argo Events is an event-based dependency manager for Kubernetes that helps define multiple dependencies from a variety of event sources. We came up with a plan to trigger the workflow by publishing to an SNS topic. The solution was to post to the SNS topic, where the topic subscription filtered on region and product, ensuring that only the SQS queue in the region specified in the SNS payload received the message.
Things in DevOps would not be complete if we just left them alone and didn’t monitor them. In this workflow, we are interested in the job statuses and results. Here again, Argo WF lets you define custom Prometheus metrics to emit, which enables tracking of workflow state, duration, and failure rates that can be tied into an alerting setup. The metrics also provide a basis for further enhancements and tuning. We decided to push the errors as low-priority alerts through PagerDuty and also to a Slack channel dedicated to these alerts.
So we’ve now found a way to kick off the workflow, but we haven’t made it user friendly yet. After all, we’re trying to automate ourselves out of certain repeatable tasks. This part was straightforward, as we have an internal tool with a UI that serves as a home for DevOps automation and internal tooling. We simply added a button that was enabled under certain conditions for certain user groups. Our internal tool already handles user auth, permissions, and auditing, so plugging the workflow in was easy.
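The filtering described above can be sketched roughly as follows. This is a simplified local model, assuming that region and product travel as SNS message attributes and that each regional queue's subscription uses an exact-match filter policy; the topic ARN, tenant name, and attribute names are hypothetical. The keyword arguments are in the shape that boto3's `sns.publish()` accepts, but nothing here calls AWS.

```python
import json


def build_publish_kwargs(topic_arn: str, tenant: str, region: str, product: str) -> dict:
    """Build keyword arguments in the shape boto3's sns.publish() accepts."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps({"tenant": tenant, "region": region, "product": product}),
        # Assumption: subscriptions filter on message attributes, not the body.
        "MessageAttributes": {
            "region": {"DataType": "String", "StringValue": region},
            "product": {"DataType": "String", "StringValue": product},
        },
    }


def matches_filter_policy(policy: dict, message_attributes: dict) -> bool:
    """Local model of exact-match string filtering on a subscription.

    Every key in the policy must be present on the message, and the
    message's value must be one of the values the policy allows.
    """
    for key, allowed_values in policy.items():
        attribute = message_attributes.get(key)
        if attribute is None or attribute["StringValue"] not in allowed_values:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical filter policies for two regional SQS subscriptions.
    us_queue_policy = {"region": ["us-east-1"], "product": ["product-a"]}
    eu_queue_policy = {"region": ["eu-central-1"], "product": ["product-a"]}

    kwargs = build_publish_kwargs(
        "arn:aws:sns:us-east-1:123456789012:onboarding-requests",
        tenant="acme",
        region="us-east-1",
        product="product-a",
    )
    attrs = kwargs["MessageAttributes"]
    print("US queue receives it:", matches_filter_policy(us_queue_policy, attrs))
    print("EU queue receives it:", matches_filter_policy(eu_queue_policy, attrs))
```

With this arrangement the publisher does not need to know which queue exists in which region; it only labels the request, and the subscriptions decide who receives it.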
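As a rough illustration of the custom metrics mentioned above, the sketch below attaches a failure counter and a duration gauge to a workflow template, again expressed as a Python dict. The metric names and label are hypothetical; the `metrics.prometheus` structure and the `{{status}}` and `{{duration}}` variables follow the Argo Workflows metrics format, so check them against the version you run.

```python
def with_step_metrics(template: dict, workflow_name: str) -> dict:
    """Return a copy of an Argo template with Prometheus metrics attached."""
    labels = [{"key": "workflow", "value": workflow_name}]
    metrics = {
        "prometheus": [
            {
                # Hypothetical metric name; incremented only when the step fails.
                "name": "onboarding_step_failures_total",
                "help": "Number of failed onboarding steps",
                "labels": labels,
                "when": "{{status}} == Failed",
                "counter": {"value": "1"},
            },
            {
                # Hypothetical metric name; records how long the step took.
                "name": "onboarding_step_duration_seconds",
                "help": "Duration of the most recent onboarding step",
                "labels": labels,
                "gauge": {"value": "{{duration}}"},
            },
        ]
    }
    # Copy rather than mutate, so the original template stays reusable.
    return {**template, "metrics": metrics}


if __name__ == "__main__":
    step = {"name": "call-admin-api", "container": {"image": "example:latest"}}
    print(with_step_metrics(step, "onboard-tenant"))
```

A counter like this is the kind of signal an alerting rule can watch to route failures to a low-priority channel.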
A Pattern Emerges
The Argo Workflows solution was first implemented to make frequent requests self-service. This allowed DevOps to rapidly build self-service tools and enable the rest of the organization. As SailPoint added features and products to its portfolio, we found ourselves using the workflow as a framework for making additional onboarding requests self-service. The workflow concept would remain the same, but the infrastructure, admin APIs, and data flow would change accordingly. The process has become somewhat of a rinse and repeat. At a high level, the basic workflow for onboarding a tenant in a product has become something along the lines of: updating infrastructure and configuration, posting to admin APIs, and updating statuses. The workflows are designed with enough flexibility to retry specific steps and continue on failures in certain cases, which enables rapid automation and development without falling back to refactoring application logic.
Essentially, we have taken frequent requests that used to be operational interrupts, causing engineers to constantly switch context, and turned them into a workflow plumbed up to our internal UI. This greatly reduces the complexity, handoffs, delays, and SLAs our organization faced. We gained a workflow engine that allowed us to rapidly iterate over workflows, was Kubernetes-native, and gave us a design that can be used over and over again.
We’ve also integrated Argo WF into other daily use cases such as runbook automation, backup automation, and what we call “the wheel of fun”, which randomly picks people to volunteer.
Delivery time has improved, engineers are interrupted less, and both internal customer satisfaction and the ability to self-serve have gone up. Overall, this enables DevOps to automate rapidly and keep up not only with newly formed features, functions, and requests but also with the company’s changing portfolio of offerings.
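The "retry specific steps and continue on failures" flexibility maps onto two Argo Workflows settings: `retryStrategy` on a template and `continueOn` on a task. The sketch below shows one way they could be applied; the retry limit, backoff values, and task names are illustrative assumptions rather than our production settings.

```python
def with_retries(template: dict, limit: int = 3) -> dict:
    """Return a copy of a template that is retried when its container fails."""
    return {
        **template,
        "retryStrategy": {
            "limit": str(limit),
            "retryPolicy": "OnFailure",
            # Assumed backoff: wait 10s, then double the wait on each retry.
            "backoff": {"duration": "10s", "factor": "2"},
        },
    }


def tolerate_failure(dag_task: dict) -> dict:
    """Return a copy of a DAG task whose failure does not stop the workflow."""
    return {**dag_task, "continueOn": {"failed": True}}


if __name__ == "__main__":
    # Hypothetical step that calls a throttled admin API, so retries help.
    api_step = with_retries({"name": "call-admin-api"}, limit=5)
    # Hypothetical non-critical task: a failed status update should not
    # block the tasks that depend on it.
    status_task = tolerate_failure(
        {"name": "update-status", "template": "update-status"}
    )
    print(api_step)
    print(status_task)
```

Keeping retry and tolerance decisions in the workflow definition is what lets a flaky or throttled step be handled without changing the script or service behind it.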