Adopting a Modern Approach to Application Management

Robert Pountney
Published in Workday Technology
Feb 14, 2023 · 10 min read

Introduction

This article will detail the recent experience of Workday Peakon Employee Voice in modernising its application deployment workflows, embracing GitOps and improving its application management platform to align with industry standards. It will cover the new approach to application management, the design choices taken during the implementation process, key insights, lessons learned and last but not least, some notable features that make for a great developer and operator experience.

Objectives

  • Decouple CI and CD, with CI being environment agnostic and only responsible for creating and publishing application images.
  • Implement GitOps best practices and provide a single source of truth (or rather “desired state”) for app config stored in git.
  • Continuously monitor and reconcile the application’s state on the target cluster to the desired state defined in Git, using a pull-based model to prevent configuration drift.

Tooling

There are many tools that facilitate the decoupling of CI and CD. The idea is to have CI responsible only for creating and publishing application images, while delegating deployment to a CD tool. For our purposes, ArgoCD was selected. ArgoCD’s GitOps functionality allows application configuration to be centralized and versioned within a Git repository, providing a clear and auditable source of truth. Furthermore, ArgoCD’s automatic reconciliation continuously monitors the application’s state on the target cluster and reconciles it to the desired state defined in Git, using a pull-based model. This prevents configuration drift in a hands-off manner, rather than relying on someone to fix a broken pipeline.

Architecture

Diving into the specifics, we decided to host ArgoCD on its own dedicated “Management” cluster, with other environment clusters (for example, dev, staging, prod etc.) as target clusters that Argo can deploy to. The alternative approach would have been to install ArgoCD on each environment cluster.

We weighed up the pros and cons of each:

Management Cluster:

Advantages

  • Single view for deployment activity across all clusters
    — Great developer experience
  • Single control plane, simplifying the installation and maintenance
    — Great operator experience
  • Single server for easy API/CLI integration

Disadvantages

  • Scaling requires tuning of the individual apps and components
  • Single point of failure for deployments (if a whole cluster counts as “single”), increasing the risk of downtime.
  • Admin credentials for all clusters in one place
  • The requirement to maintain a separate “Management” cluster.
  • The consideration of significant network traffic between ArgoCD and other clusters.

Instance Per Cluster

Advantages

  • Distributes load per cluster
  • No direct external access is required.
  • Eliminates traffic leaving the cluster.
  • An outage on one cluster will have no impact on other clusters.
  • Any credentials are scoped per cluster.

Disadvantages

  • The requirement to maintain multiple ArgoCD instances and duplicate configuration.
  • There is no single view of all clusters, which would mean switching URLs to view each cluster through the UI.
  • At a particular scale, each instance could still require tuning.
  • API/CLI integrations need to specify which instance to communicate with.

Reviewing the above, we can appreciate that the ‘instance per cluster’ architecture approach would allow for greater separation of concerns, and provide greater protection to the production environment. However, these advantages come at a cost. This approach would mean having to maintain ArgoCD (and any additional cluster-components) in multiple places, forcing the developer to switch URLs for each cluster view. This is an additional administrative burden that cannot be ignored.

In the end, for our specific requirements, we decided the benefits of the management cluster outweighed those of the alternative. At the time of writing, scaling wasn’t a huge concern for Workday Peakon Employee Voice. We had only a few apps (approximately 10–15) per environment, which we felt was manageable on a single cluster. Should this change, we noted that the “Management Cluster” could scale horizontally by adding more nodes. In addition, tuning can be applied across all apps should we need it. The manifest-generate-paths annotation, for example, tells ArgoCD which paths in the repository affect an application, so its cached manifests are not invalidated by every new Git commit, only by commits that touch those paths. This reduces the load on the cluster and improves performance.
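As a rough illustration of that tuning, the sketch below builds an ArgoCD Application manifest as a Python dictionary and sets the manifest-generate-paths annotation. The annotation key is the one ArgoCD documents; the application name, repository URL, path and cluster address are hypothetical placeholders, not our real configuration.

```python
import json

# Annotation key documented by ArgoCD; all other values here are hypothetical.
MANIFEST_PATHS_ANNOTATION = "argocd.argoproj.io/manifest-generate-paths"


def application_manifest(name, repo_url, path, dest_server, namespace):
    """Build an ArgoCD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {
            "name": name,
            "namespace": "argocd",
            # "." means: only commits touching this app's own source path
            # should invalidate its cached manifests.
            "annotations": {MANIFEST_PATHS_ANNOTATION: "."},
        },
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": "HEAD"},
            "destination": {"server": dest_server, "namespace": namespace},
        },
    }


app = application_manifest(
    name="app1-dev",                                         # hypothetical
    repo_url="https://github.com/example/k8s-config.git",    # hypothetical
    path="apps/app1/dev",                                    # hypothetical
    dest_server="https://dev-cluster.example.com",           # hypothetical
    namespace="app1",
)
print(json.dumps(app, indent=2))
```

The printed JSON could be applied as-is, since Kubernetes accepts JSON manifests as well as YAML.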

Finally, the better developer and operator experience of having all applications in one place mattered more to us than the security and redundancy (HA) benefits we would gain by splitting ArgoCD out per cluster. We concluded that any security disadvantages posed by the “Management Cluster” could be mitigated by implementing proper Role-Based Access Control (RBAC) to restrict access in the single cluster, rather than restricting access at the cluster level.

Application Configuration Repository

Once we had decided on the service architecture, we moved on to considering how to manage application configuration. Our primary aim was to decouple the application configuration from the application source code repositories. Here, again, we had two options: create an application configuration repository for each application, or maintain a single source for all apps. We concluded that the operational overhead of maintaining additional repositories (which would only grow with each new application) was neither feasible nor desirable. Thus, a single application configuration repository was born.

Our single repository contained the following:

  • Configuration for each application per its environment (for example, development, staging, production etc.).
  • Configuration for any cluster-components (for example, any services running in a particular cluster that any apps depend on).
  • Perhaps most importantly, configuration for ArgoCD itself on the management cluster.

We felt this satisfied our primary aim whilst providing efficiencies to our team, and thus for Workday Peakon Employee Voice.

Lessons Learned

The migration to a new system for managing and deploying applications can have its challenges and uncertainties, and our experience with this process was no different. However, with time and effort, we were able to overcome these obstacles and uncover some valuable insights. I will share some of the key lessons learned during our journey that may not be immediately evident to those who have not undergone a similar experience.

Image Updater & Rollbacks

ArgoCD provides an automated solution to application deployments, but it does not provide a way to update the application configuration, specifically frequently changing config values such as image tags. Images are updated regularly with every change to the application source code, meaning the configuration is always changing. This can be remedied by updating the application configuration as the final step in your CI pipeline. Alternatively, images can be updated via an agent running on the management cluster. This allows for the workflow to be standardised and removes the need to introduce custom steps to pipelines.

ArgoCD image-updater is a separate tool from the core ArgoCD system. It operates as an agent within the cluster, continuously checking for new versions of the images that are currently deployed. When a newer version of an image is found in your container registry, such as AWS Elastic Container Registry (ECR), image-updater will automatically update the image on the cluster. This streamlined process helps to ensure that deployments are always running the latest version of their images.

With our first iteration of the Argo image-updater we decided to use the git write-back method, which allows for a full GitOps approach. This means when an image is updated via our CI pipeline and pushed to ECR, the image version is pushed to git as a commit that updates a .argocd-source-<appName>.yaml file and forms part of the desired state for app configuration.
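To make this more concrete, here is a minimal sketch of the annotations that opt an Application into image-updater with git write-back, together with the name of the file the commit updates. The annotation keys come from the image-updater documentation; the alias, registry address and app names are hypothetical.

```python
# Annotation prefix used by ArgoCD image-updater.
PREFIX = "argocd-image-updater.argoproj.io"


def image_updater_annotations(alias, image, write_back="git"):
    """Annotations asking image-updater to track `image` for an Application."""
    if write_back not in ("git", "argocd"):
        raise ValueError("write_back must be 'git' or 'argocd'")
    return {
        f"{PREFIX}/image-list": f"{alias}={image}",
        f"{PREFIX}/write-back-method": write_back,
    }


def write_back_file(app_name):
    """File that the git write-back commit updates for a given Application."""
    return f".argocd-source-{app_name}.yaml"


annotations = image_updater_annotations(
    alias="app1",  # hypothetical alias
    image="123456789012.dkr.ecr.eu-west-1.amazonaws.com/app1",  # hypothetical ECR repo
)
print(annotations)
print(write_back_file("app1-dev"))
```

Each Application resource gets its own write-back file, which matters for the behaviour described next.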

With the git write-back method, we discovered that the revision history in ArgoCD can be somewhat unsuitable for rollbacks. In our case we had multiple environments that would be updated at the same time if there was an image update in ECR. This shouldn’t be problematic, but we found that when using the git write-back method, every time a new application image was pushed, multiple commits (one per environment) were pushed back to our application configuration repository as part of the image-updater process. For example, upon a fresh image push of app1 to ECR, app1-dev and app1-staging (two separate application resources in ArgoCD) would both have commits generated, updating .argocd-source-app1-dev.yaml and .argocd-source-app1-staging.yaml respectively in the app config repo. When multiple commits happen in a short timeframe, ArgoCD is unable to sync the application resources at the same rate as the commits are generated. Unfortunately, this results in Sync operations covering multiple commits rather than just one.

This has significant implications. Revision history entries for the application resource of app1-dev can point to a git commit that updates .argocd-source-app1-staging.yaml. This happens because there are multiple commits between Sync operations, only one of which updated app1-dev. If the commit that updates app1-staging happens to be the most recent commit at the time of the Sync, it is the one referenced in the sync revision history for app1-dev. As a result, if a developer wishes to roll back to a previous application revision, it isn’t obvious which revision to roll back to, as the associated commit could be completely unrelated to that application resource.
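The effect can be reproduced with a deliberately simplified model, in which a sync just records whichever commit is at the head of the shared repository. The classes and commit IDs below are invented for illustration and are not how ArgoCD is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    sha: str
    changed_file: str


@dataclass
class App:
    name: str
    history: list = field(default_factory=list)

    @property
    def source_file(self):
        return f".argocd-source-{self.name}.yaml"

    def sync(self, repo):
        """Record the repo head as this app's synced revision."""
        head = repo[-1]
        self.history.append(head)
        return head


repo = []  # shared config repo, modelled as a linear list of commits
dev = App("app1-dev")
staging = App("app1-staging")

# One image push: image-updater writes one commit per environment, back to back.
repo.append(Commit("c1", dev.source_file))
repo.append(Commit("c2", staging.source_file))

# The sync for app1-dev only runs after both commits have landed.
recorded = dev.sync(repo)
print(recorded.sha, recorded.changed_file)
# prints: c2 .argocd-source-app1-staging.yaml
```

The history entry for app1-dev ends up pointing at c2, a commit that only touched the staging file, even though the change that mattered to app1-dev was c1.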

To get around this issue we experimented with the argocd write-back method instead, which interacts directly with the ArgoCD API rather than creating a git commit. When a new image is pushed to ECR, image-updater triggers an update of the app in ArgoCD directly rather than updating the image tag in git. This direct update is essentially a parameter override of the image tag, applied to the latest “revision” of an application.

Using the argocd write-back method for all applications means each application “revision” is associated with an application configuration change. The version of the application source code that is deployed is represented by the image-tag applied to that “revision” set via parameter override.

This removes any messy commit history in the application configuration repository. Further, it means every application image update will be represented by a “revision” with an overridden image-tag. The obvious downside is that our application configuration repository (k8s-config) would no longer be the single source of “desired state” for the whole configuration, as the current image tag would not be there. However, this downside is offset by a revision history that can be more easily interpreted by any developer who needs to roll back.

Quality of Life Features

Our team was pleased to uncover several features that greatly improved our workflows, with minimal configuration. Here are a couple of features that we have found most useful.

Preview Apps

Preview apps allow developers to test functionality without merging their changes to the master branch, facilitating faster feedback loops and reducing the risk of breaking changes.

Preview apps are individual apps spun up on the target cluster based on PR feature branches in GitHub. Making use of the Pull Request Generator, ArgoCD ensures the latest version of the code from the PR branch is deployed to the corresponding preview app if a “preview” label is present. In our case the name of the preview app consisted of the underlying app name and the PR number, so that each preview app has a unique name in the cluster.
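A setup along these lines can be sketched as an ApplicationSet using the Pull Request generator, built here as a Python dictionary. The generator fields and the {{number}} and {{head_sha}} template variables are the ones ArgoCD documents; the repository names, paths and cluster address are hypothetical, and the assumption that images are tagged with the PR's head commit SHA is ours for the sake of the example.

```python
def preview_applicationset(app_name, owner, repo, config_repo_url, image, dest_server):
    """Build an ApplicationSet that creates one preview app per labelled PR."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        "metadata": {"name": app_name + "-preview", "namespace": "argocd"},
        "spec": {
            "generators": [{
                "pullRequest": {
                    # Only PRs carrying the "preview" label produce an app.
                    # A private repo would also need a tokenRef (omitted here).
                    "github": {"owner": owner, "repo": repo, "labels": ["preview"]},
                    "requeueAfterSeconds": 60,
                }
            }],
            "template": {
                # App name + PR number keeps each preview app unique.
                "metadata": {"name": app_name + "-{{number}}"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": config_repo_url,
                        "path": "apps/" + app_name + "/preview",  # hypothetical layout
                        "targetRevision": "HEAD",
                        # Assumption: CI tags images with the PR head SHA.
                        "kustomize": {"images": [image + ":{{head_sha}}"]},
                    },
                    "destination": {
                        "server": dest_server,
                        "namespace": app_name + "-{{number}}",
                    },
                },
            },
        },
    }


appset = preview_applicationset(
    app_name="app1",
    owner="example-org",                                      # hypothetical
    repo="app1",                                              # hypothetical
    config_repo_url="https://github.com/example-org/k8s-config.git",  # hypothetical
    image="registry.example.com/app1",                        # hypothetical
    dest_server="https://dev-cluster.example.com",            # hypothetical
)
```

When the PR is closed or the label is removed, the generator stops producing parameters for it and the corresponding preview app is removed.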

Notifications

ArgoCD Notifications continuously monitors applications and allows us to notify users about important changes in the application state. Many services are compatible with ArgoCD Notifications, including Slack, the primary communication platform used by Workday. It can also send custom notifications to services that are not supported out of the box by using webhooks, which send a generic HTTP request with a templated request body and URL. In our case, using triggers and templates, we were able to configure Slack notifications for all the application scenarios that were important to us. Triggers describe the scenario in which a notification is sent, and templates generate the content of the notification message.

Slack

  • Deploy notifications — when an app is successfully deployed (note that we have different Slack channels per environment, so if a staging app is deployed then the notification will be delivered to the staging channel).
  • Sync failed — when an app has failed to sync, meaning the resources could not be spun up successfully (e.g. the deployment could not pull its image).
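Routing those two notifications to a per-environment channel comes down to subscription annotations on each Application. The sketch below generates them; the annotation format and the on-deployed and on-sync-failed trigger names follow the ArgoCD Notifications documentation and default catalog, while the channel names are hypothetical.

```python
# Hypothetical mapping of environment to Slack channel.
CHANNELS = {
    "dev": "deploys-dev",
    "staging": "deploys-staging",
    "production": "deploys-production",
}

# Trigger names as defined in the ArgoCD Notifications default catalog.
TRIGGERS = ("on-deployed", "on-sync-failed")


def slack_subscriptions(environment):
    """Annotations subscribing an Application to Slack for its environment."""
    channel = CHANNELS[environment]
    return {
        f"notifications.argoproj.io/subscribe.{trigger}.slack": channel
        for trigger in TRIGGERS
    }


print(slack_subscriptions("staging"))
```

These annotations would sit in the Application's metadata alongside any others, so a staging app only ever notifies the staging channel.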

GitHub

For preview apps, it was useful for developers to be able to see whether their preview app had been deployed. We therefore set up a GitHub notification via webhook that updates the most recent commit of the PR with a status showing whether the preview app was deployed. This notification updates the status based on three trigger scenarios: when the app is initially being deployed (running), when the app has deployed successfully (succeeded), or when the app has failed to deploy for some reason (error, failed).
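The mapping behind those scenarios is small enough to sketch. In our setup it lives in ArgoCD notification templates rather than Python, so the function below only illustrates the logic: it translates an ArgoCD operation phase into the body of a GitHub commit status. The state values are those accepted by the GitHub commit status API; the context string and URLs are hypothetical.

```python
# ArgoCD operation phase -> GitHub commit status state.
PHASE_TO_GITHUB_STATE = {
    "Running": "pending",
    "Succeeded": "success",
    "Failed": "failure",
    "Error": "error",
}


def commit_status(phase, app_name, argocd_url):
    """Return a GitHub commit status body, or None for phases we ignore."""
    state = PHASE_TO_GITHUB_STATE.get(phase)
    if state is None:
        return None
    return {
        "state": state,
        "context": f"preview/{app_name}",  # hypothetical context label
        "description": f"Preview app {app_name}: {phase.lower()}",
        "target_url": f"{argocd_url}/applications/{app_name}",
    }


print(commit_status("Succeeded", "app1-42", "https://argocd.example.com"))
```

The resulting body would be posted to the status endpoint for the PR's head commit, which is what makes the result visible directly on the pull request.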

Buildkite

Finally, we also used ArgoCD notifications in an atypical way: to trigger integration tests (via web-selenium) based on the Argo application state. If a staging application was successfully deployed and deemed “healthy”, the system would trigger a webhook with a JSON payload to start a Buildkite build that performs the integration tests. Even though this was not the intended use for ArgoCD notifications, we found that it works nicely and is easy to set up without the need for custom scripts or additional services.
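As an illustration of the condition and the request involved, the sketch below checks an Application's status and, if it qualifies, prepares the URL and JSON body for Buildkite's create-build REST endpoint. The status fields are the ones ArgoCD exposes on an Application; the organisation and pipeline slugs, branch and message are hypothetical, and nothing is actually sent.

```python
import json


def should_trigger_tests(app):
    """True when the last operation succeeded and the app reports Healthy."""
    status = app.get("status", {})
    phase = status.get("operationState", {}).get("phase")
    health = status.get("health", {}).get("status")
    return phase == "Succeeded" and health == "Healthy"


def buildkite_request(app, org, pipeline):
    """Return (url, json_body) for a Buildkite build, or None if not ready."""
    if not should_trigger_tests(app):
        return None
    url = (
        "https://api.buildkite.com/v2/organizations/"
        f"{org}/pipelines/{pipeline}/builds"
    )
    body = {
        "commit": "HEAD",
        "branch": "master",  # hypothetical branch
        "message": f"Integration tests for {app['metadata']['name']}",
    }
    return url, json.dumps(body)


staging_app = {
    "metadata": {"name": "app1-staging"},
    "status": {
        "operationState": {"phase": "Succeeded"},
        "health": {"status": "Healthy"},
    },
}
print(buildkite_request(staging_app, "example-org", "integration-tests"))
```

In ArgoCD Notifications the same check is expressed as a trigger condition and the request as a webhook template, so no separate service is needed.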

Closing thoughts

Migrating applications to a different platform will always pose its challenges. Workday Peakon Employee Voice adopted ArgoCD for its hands-off approach to application deployment. Currently, although our setup is still being fine-tuned, it has generally been performing quite well.

Notably, the monorepo structure has been effective with our relatively small number of applications. We recognize that as our platform expands and more applications require management, additional scaling considerations will likely need to be evaluated. For now, however, we have found a good compromise between absolute “GitOps” and developer experience, especially when it comes to app rollbacks, by making use of the argocd write-back method. Additionally, developer experience has been improved by simple-to-use preview apps, and ArgoCD notifications have allowed us to easily set up Slack notifications for key events such as application deployment and failure.

In conclusion, we have successfully accomplished our goal of separating CI and CD whilst also providing a platform for applications that is self-healing. The user interface provides a clear and user-friendly overview for developers, and the automatic reconciliation feature has reduced the workload for operations teams.
