GitOps on Kubernetes with ArgoCD

This is the first post in our series on Managing Complex Kubernetes Clusters. It introduces how we used ArgoCD to enforce GitOps by preventing any means of deployment to the cluster other than a commit to the GitOps repo.

by Fong Han Ken and Brian Claus

As noted in our introduction, we are building a Kubernetes-native PaaS for one of our clients. Their current platform runs on thousands of VMs and grew organically over six years. It is excruciating to manage because what is running on each machine is opaque: no one has a precise understanding of what is deployed across the fleet or what changes have been made.

Complex Infrastructure System — Source ITWire.com

Infrastructure provisioning is managed through complex, imperative Jenkins pipelines which are proprietary and whose dependencies no one person truly understands. There are also complex sets of pre- and post-scripts used to validate the state of the system each time a component is deployed. A deployment failure in one component, or an error in the pre-/post-scripts, halts further deployments and requires manual intervention. The Jenkins pipeline code and the pre-/post-scripts are commonly the only documentation of how the system should run in production. Troubleshooting relies on a shrinking pool of operations engineers who have been with the project since the beginning and whose knowledge is irreplaceable.

Entering the world of GitOps

What is the craze about this new thing called GitOps? In short, it is just version control and automation. GitOps boils down to automating the deployment of a desired state of the system via a version-controlled repository (repo). Amendments to the system must be committed to the repo, giving full traceability and auditability of changes. The repo contains declarative specifications for all applications, components and environments, removing the toil of complex imperative dependency management and pre-/post-script checks. It also fits very well with Kubernetes.

We follow four tenets for GitOps:

1. The desired state of the system is stored and version-controlled in a repo

2. All amendments to the system state are traceable and auditable through commits within the repo

3. All environments (e.g. dev, sandbox, stg, prod) are declarative specifications within the repo (see the sketch after this list)

4. Changes are only made via commits to the repo
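To make tenet 3 concrete, a per-environment declarative specification can be as small as this (a sketch assuming a Kustomize layout; the paths and patch file are hypothetical):

```yaml
# environments/stg/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base            # shared manifests used by every environment
patches:
  - path: replicas.yaml   # stg-specific override, committed to the same repo
```

Each environment gets its own directory like this, so a diff between two commits is a complete, auditable description of what changed in that environment.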

GitOps requires tooling to pull your system specifications from a repo and deploy them to the infrastructure that runs your workloads. In our use case, that infrastructure was Kubernetes. Ideally, the tooling should also enforce GitOps by preventing any means of deployment to the cluster other than a commit to the GitOps repo.

ArgoCD

For tooling, we chose ArgoCD over Flux for a number of reasons (as of the time we made the decision):

1. Provides a UI which is friendly to operations teams tasked with running the platform post-implementation

2. Offers out-of-the-box support for managing multi-cluster and multi-repo deployments

3. Provides a syncing feature that enforces the specifications in the repo

4. Detects and rolls back changes that deviate from the repo, with the help of a policy engine

5. Integrates with Helm and Kustomize seamlessly

6. Widely used, with a productive and active open-source community

We would like to highlight point 4 because of how important it is to enforce the GitOps workflow and ensure no changes are made outside of it. This requires additional tooling, such as Open Policy Agent (OPA) or Kyverno, to enforce policies during deployments and block alternate means of deployment, as sketched below.
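As a sketch of what that enforcement can look like with Kyverno (the service-account name below matches a default ArgoCD install, and the resource kinds are only examples), a policy can deny writes that do not originate from ArgoCD's application controller:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gitops-deployments
spec:
  validationFailureAction: Enforce
  background: false   # request.userInfo is only available at admission time
  rules:
    - name: block-non-argocd-writes
      match:
        any:
          - resources:
              kinds:
                - Deployment   # example kinds; extend to whatever you manage
                - StatefulSet
      validate:
        message: "Direct changes are blocked; commit to the GitOps repo instead."
        deny:
          conditions:
            all:
              - key: "{{ request.userInfo.username }}"
                operator: NotEquals
                value: "system:serviceaccount:argocd:argocd-application-controller"
```

With a policy along these lines, a direct `kubectl apply` from an engineer's laptop is rejected at admission, while the same change committed to the repo flows through ArgoCD untouched.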

How does ArgoCD work?

ArgoCD handles the deployment part of the CI/CD pipeline. When a new image version lands in the container registry and the repo is updated with the new deployment definition, ArgoCD picks up the change and applies it to the cluster. It also periodically synchronizes the specifications in the repo with the cluster to ensure that the cluster maintains its desired state according to the repo. ArgoCD is agnostic as to how images end up in the registry; separate tooling and processes handle the CI.
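That periodic synchronization maps to the Application resource's syncPolicy. A minimal excerpt (field names are from the ArgoCD Application spec; this is a fragment, not a full manifest):

```yaml
# Excerpt of an Application spec: keep the cluster converged on the repo.
syncPolicy:
  automated:
    prune: true     # remove resources that were deleted from the repo
    selfHeal: true  # revert out-of-band changes made directly to the cluster
```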

ArgoCD uses two primary Custom Resource Definitions (CRDs) for deployments:

1. Argo Application — Each Application represents a single source repository where the Kubernetes configuration to be deployed to the cluster resides. The repo should contain all resources (Helm templates, Kustomize overlays or plain YAML manifests) intended to be deployed by ArgoCD

2. Argo Project — Groups Argo Applications into sub-categories (a minimal example follows this list)
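For illustration, a minimal AppProject, the CRD behind an Argo Project, might look like this (the repo URL is a placeholder):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: k8s-infra
  namespace: argocd
spec:
  description: Kubernetes infrastructure tooling
  sourceRepos:
    - https://git.example.com/platform/k8s-infra.git  # placeholder repo
  destinations:
    - server: https://kubernetes.default.svc
      namespace: "*"   # allow deployment to any namespace in this cluster
```

Beyond grouping, the project also constrains which repos and clusters its Applications may use, which reinforces the GitOps boundary.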

To live by our rule that the cluster maintains its desired state according to the repo, we configured and deployed ArgoCD itself using an Argo Application CRD.

How does it all begin?

When ArgoCD first starts up, an “app of apps” is deployed as an Argo Application CRD, which in turn deploys all the other Argo Applications containing the configurations of the tools we want in the cluster.
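A sketch of such a root Application, assuming the child Application manifests live in an argocd/applications directory of the repo (the URL and paths are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/gitops.git  # placeholder repo
    targetRevision: main
    path: argocd/applications   # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd           # child Applications are themselves CRs in argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

Syncing this one resource fans out into every other Application, which is what makes a from-scratch rebuild of the cluster a single step.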

Use Case — Air-Gapped Kubernetes Environment

We have created three sets of projects:

1. Default (which is always present) contains ArgoCD

2. K8s Infra contains all of our K8s-related tooling (e.g. Prometheus, Fluent Bit, Istio)

3. PaaS Apps contains all the applications that form our platform as a service (e.g. Airflow, Spark, Trino)

ArgoCD’s synchronization feature can be suspended for debugging, troubleshooting and experimentation. However, if it is turned back on before all your manual changes to ArgoCD-managed resources have been captured and committed back to the repo, it will undo them (see the sketch below).
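A sketch of that toggle on the Application spec (the same syncPolicy fields shown earlier; removing the automated block suspends automation without deleting the Application):

```yaml
# Automated: ArgoCD reverts any drift back to the repo state.
syncPolicy:
  automated:
    prune: true
    selfHeal: true
---
# Suspended for debugging: drift is reported as OutOfSync but not reverted.
syncPolicy: {}
```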

We enforce our GitOps model by blowing away our stage environment after every sprint to test whether the platform comes back up with all our new changes. This was a hard lesson: the first time we blew away our stage environment, we lost two weeks. We had accumulated three months of build activity, coupled with a lack of discipline in separating our build and stage environments.

The practice we have since adopted is to build in our build environment and then deploy to the stage environment. However, some build activities must be done directly in the stage environment due to dependencies on internal client systems.

We also use Bitbucket for staging and GitHub for our build activities. Common problems we ran into after blowing away staging include:

  • Naming collisions — we initially used ‘stage’ to describe a dev environment configuration in GitHub and the staging environment in the client environment
  • Incorrect images, image versions and image names in the client internal image registry. There is a different import process for internal images as the environment is air-gapped. Our build environments are not air-gapped, to give us flexibility.
  • Naming conventions used in helm templates
  • Client specific configurations missing from our external development environments

Other than losing work after re-enabling syncing in ArgoCD (it overwrites manual changes we missed while troubleshooting), we have not had many ArgoCD-specific problems. Our staging environment is air-gapped, which makes on-the-fly troubleshooting difficult, but that really has nothing to do with ArgoCD.

ArgoCD has simplified the deployment of a complex PaaS to the point where we, or anyone, can spin up or destroy copies of the platform at will, either on-prem or in the cloud.


One-click self-deployment with ArgoCD

Below is a snippet of how an Argo Application CRD is configured: every Argo Application points to a repo containing the specifications for the cluster, as well as the endpoint of the cluster it deploys to.

Example ArgoCD application file
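A minimal, illustrative sketch of such a file (the repo URL, revision, path and names are placeholders, not the actual client configuration):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus              # illustrative name
  namespace: argocd
spec:
  project: k8s-infra            # one of the Argo Projects described above
  source:
    repoURL: https://git.example.com/platform/gitops.git  # placeholder repo
    targetRevision: main
    path: k8s-infra/prometheus  # where this app's manifests live in the repo
  destination:
    server: https://kubernetes.default.svc  # the cluster endpoint
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true      # delete resources removed from the repo
      selfHeal: true   # revert manual drift back to the repo state
```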


The other two posts in this series are:

One-click Bootstrap Deployment of ArgoCD

Structuring Your Repo for ArgoCD

Other resources we found useful

Understanding ArgoCD: Kubernetes GitOps Made Simple

ArgoCD — Declarative GitOps CD for Kubernetes

FluxCD, ArgoCD or Jenkins X: Which is the right GitOps tool for you?

Machine Learning Helps Manage Complex Infrastructure — Source of the gorilla diagram
