On Amazon EKS and GitOps

Dirk Michel
10 min read · Jul 20, 2022


As many have stated before, the term GitOps has come to encapsulate the idea of extending and applying battle-tested developer-centric practices and tools into areas that tend to fall under the purview of operations. The assertion is that the things we have come to value in the software development space can be applied to improve established day-2 operational areas and enable the transition to a “you build it, you run it” paradigm.

“The Way of GitOps” has us describe the desired state in version control as declarative, intent-based specifications, which a control loop mechanism then continuously reconciles into a running system. Defined that way, one might quickly think of Kubernetes… And indeed, some magic can be found when the two areas, Kubernetes and GitOps, combine to create something wonderfully useful.

With GitOps, we can push the “developer experience” pattern further and apply it to a wide range of cloud-native operational areas.

Many of us now need to deploy and run a fleet of Amazon EKS clusters with defined sets of cluster add-ons and user application workloads. Once such environments are successfully deployed, we must keep them updated, upgraded, security patched, and generally adapted to our changing needs and requirements. All these moving parts can attract an uncanny amount of toil and undifferentiated heavy lifting, and lead to brittle, error-prone duct-taping.

This post highlights the approach of combining Amazon EKS with the FluxCD GitOps family of CNCF projects to arrive at an efficient day-1 and day-2 cluster fleet and user application environment. The diagram below illustrates the target environment.

Deploying and operating Amazon EKS clusters and user applications with GitOps

For those on a tight time budget: The TL;DR of the following sections is to show that Amazon EKS and GitOps together provide a path towards a developer experience for deploying and operating clusters and applications in a unified way. We describe a way to arrive at a flexible, extensible, and consistent GitOps approach with dedicated git repositories and layouts for platform teams and application teams across a fleet of Amazon EKS clusters.

Let’s unpack our approach on how to arrive at this setup.

Amazon EKS Cluster Initialisation

In the beginning, we define the baseline Amazon EKS cluster “template”, which ought to be the managed service with the essentials only, to arrive at a basic repeatable definition that is generalised and reused everywhere. What determines the specific characteristics of any given cluster will then depend on the declared intent in source control… but more on that later.

The Amazon EKS cluster essentials would instantiate little more than the managed control plane and an AWS Fargate profile… just enough to deploy a set of Amazon EKS managed add-ons. The following diagram illustrates this.

Bootstrapping the initial Amazon EKS cluster and EKS managed add-ons

The baseline fleet clusters can be provisioned according to preference, for example, with Cluster API or with CLI-based tools such as eksctl, Terraform, and AWS CDK. Equally, there are choices to be made when deciding how to run cluster add-ons on Amazon EKS, but deploying them onto AWS Fargate is my preference, as described here.
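To make this concrete, a minimal sketch of such a baseline definition with eksctl could look like the following; the cluster name, region, versions, and namespaces are illustrative assumptions rather than a prescribed layout.

# baseline-cluster.yaml: a minimal, repeatable cluster definition (sketch)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: cluster-n-production      # hypothetical fleet member
  region: eu-west-1               # assumed region
  version: "1.22"

# Amazon EKS managed add-ons for the essentials
addons:
  - name: vpc-cni
  - name: coredns
  - name: kube-proxy

# AWS Fargate profile to host the cluster add-ons we deploy later
fargateProfiles:
  - name: cluster-addons
    selectors:
      - namespace: flux-system
      - namespace: karpenter

Creating the cluster is then a single eksctl create cluster -f baseline-cluster.yaml call, which keeps the per-cluster definition small and reusable across the fleet.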

An essential consideration in the context of cluster add-ons is IAM Roles for Service Accounts, or IRSA for short. We typically want some of our cluster add-ons to interact with AWS APIs and automatically provision AWS services on our behalf. For this to happen, we create IAM resources and associate the resulting IAM roles with Kubernetes Service Accounts in our Amazon EKS clusters. This way, we can duly authorise cluster add-ons and provide them with fine-grained AWS IAM credentials.
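The association itself is expressed as an annotation on the Service Account. A sketch, assuming a hypothetical IAM role created for a load balancer controller add-on, would look like this; the account ID and role name are placeholders:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: aws-load-balancer-controller
  namespace: kube-system
  annotations:
    # IRSA: pods using this Service Account receive temporary credentials
    # for the referenced IAM role (placeholder ARN)
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cluster-n-alb-controller

Tools such as eksctl can create both the IAM role and the annotated Service Account in one step, which fits nicely into the cluster initialisation phase.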

Cluster add-ons with GitOps

Now we enter the world of GitOps. We need two building blocks for this: a git-compatible source control manager containing our declarative intent definitions and a reconciler agent running on the cluster itself. One could opt for an approach with a centralised agent that pushes changes out into participating clusters, but I’ve come to value a distributed agent approach that places the “smarts” onto each cluster.

Git Layout: The layout of the git repositories is important to get right. A general-purpose pattern that is useful in many situations is to define a single git repository for the cluster fleet, as this lends itself to a typical situation where a dedicated platform team manages the fleet. This keeps your access management clean and tidy.

That git repository would have a particular directory structure. The key concept here is that we isolate the definition of the “catalogue of common things” that we want to define only once and separate them from the various clusters of our fleet that then only need to reference some or all of the available “catalogue items”.

The /common directory in the snippet below would contain the common catalogue items. We then have the convenience of defining our common add-ons only once, and we have the option of propagating changes to all participating clusters. Equally, other sub-directories can be added flexibly as we expand the common foundation over time, for example, with service-mesh or validation workload definitions. The /clusters directory is where we keep our cluster definitions, which associate each cluster with a selection of the “catalogue” definitions held in the /common directory.

├── common
│   ├── cluster-addons
│   │   ├── sources
│   │   └── add-ons
│   ├── service-mesh
│   └── validation
├── clusters
│   ├── cluster-1
│   │   ├── cluster-1-production
│   │   └── cluster-1-test
│   └── cluster-n
│       ├── cluster-n-production
│       │   └── cluster-n-production-specific
│       └── cluster-n-test
│           └── cluster-n-test-specific

Operators or other workloads that are not common across clusters but are specific to each cluster can always be added to the applicable cluster directory when needed, as illustrated by the /clusters/cluster-n/cluster-n-production/cluster-n-production-specific directory. Now we can add our FluxCD definition files into the git directory structure.

FluxCD Common Definitions: But what kind of files do we add to the /common directory? The FluxCD controllers support a range of options beyond straight Kubernetes manifests when consuming workload definitions, and we’ll be using helm charts as our preference. We can use a set of FluxCD Custom Resource Definitions, or CRDs for short, such as the Source Controller’s HelmRepository CRD and the Helm Controller’s HelmRelease CRD. So we place the various HelmRepository definition files into the /common/cluster-addons/sources directory and the HelmRelease definitions into the /common/cluster-addons/add-ons directory.

As with any helm chart you’d want to consume from a helm repository via the helm CLI, for example, you’d need to define the helm repository it comes from, the name and version of the chart, the release name, plus credentials if it happens to be a private helm repository. We do the same thing here, just in code.
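As an illustration, a pair of definitions for a single add-on could look roughly like the following; the chart, repository URL, version range, and values are assumptions for the sketch rather than a recommended configuration.

# common/cluster-addons/sources/eks-charts.yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: eks-charts
  namespace: flux-system
spec:
  interval: 1h
  url: https://aws.github.io/eks-charts

---
# common/cluster-addons/add-ons/aws-load-balancer-controller.yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: aws-load-balancer-controller
  namespace: flux-system
spec:
  interval: 10m
  targetNamespace: kube-system
  chart:
    spec:
      chart: aws-load-balancer-controller
      version: "1.4.x"                 # illustrative chart version range
      sourceRef:
        kind: HelmRepository
        name: eks-charts
        namespace: flux-system
  values:
    clusterName: placeholder           # overridden per cluster, see below
    serviceAccount:
      create: false                    # reuse the IRSA-annotated Service Account
      name: aws-load-balancer-controller

Private helm repositories would additionally carry a secretRef on the HelmRepository that points at the repository credentials.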

FluxCD Cluster Definitions: Now we add our definition files to the /clusters directory. For each cluster, we can declare pointers to the “catalogue of common things” we want it to have. For example, we can point all clusters to the cluster add-ons, but only selected clusters would also point to the service-mesh. Notice how the git layout we chose affords us this modularity. These pointers can be defined as FluxCD Kustomization files that contain the sourceRef and the path to the “common catalogue” items we want, for example ./common/cluster-addons. Additionally, we can leverage the patchesStrategicMerge facility of Kustomize to pass in any overrides we need whilst consuming the common items. This need arises frequently: some helm charts only deploy correctly when specific parameters are provided as helm values, such as the cluster name, and we may want to override a default helm chart value with one that fits a given cluster better, such as a different number of replicas for an add-on Deployment.

And here the modularity concept clearly emerges: We can now construct each cluster from common parts by passing in any specifics we may need through patchesStrategicMerge.
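A sketch of such a per-cluster pointer, assuming the bootstrap GitRepository source is named flux-system and reusing the hypothetical add-on from the earlier snippet, could look like this:

# clusters/cluster-n/cluster-n-production/cluster-addons.yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                  # the fleet bootstrap repository
  path: ./common/cluster-addons
  patchesStrategicMerge:
    # cluster-specific overrides applied while consuming the common catalogue
    - apiVersion: helm.toolkit.fluxcd.io/v2beta1
      kind: HelmRelease
      metadata:
        name: aws-load-balancer-controller
        namespace: flux-system
      spec:
        values:
          clusterName: cluster-n-production
          replicaCount: 2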

FluxCD Operator: The git repository is now structured and populated with our definitions, and we are ready to deploy the reconciler agent into a running Amazon EKS cluster. Conceptually, each cluster equipped with the reconciler agent places a “watch” onto the defined git repository sources and executes a control loop that detects changes in git and then applies these changes to the running cluster. In our case, the cluster add-ons are deployed once FluxCD is bootstrapped for the first time. The diagram illustrates the continuous deployment model of cluster add-ons into AWS Fargate.

Deploying self-managed add-ons into Amazon EKS with FluxCD

New add-on definitions can be committed to Git at any time, and clusters that keep a “watch” on, or are “subscribed” to, the git repository will receive them. The same applies to changes to already defined add-ons, such as add-on version upgrades or helm value changes. The “GitOps Way” can open a window to effectively maintaining and operating larger cluster fleets with a relatively lean platform team.

Application workloads with GitOps

We can apply the GitOps approach to user application workloads as well. GitOps can permeate our mental models and practices across the various layers of an Amazon EKS cluster, and we can also extend it to apply to user applications.

Extended Git Layout: A helpful pattern for user applications is to define a dedicated git repository for our application definitions. Akin to the repo structure we saw for platform-team components such as cluster add-ons, a dedicated and separate git repository can work well in many situations. This keeps access management tidy and enables application teams to take control of their application deployments and configurations. Notice that without that separation, we’d have both platform teams and application teams accessing a single git repository containing files from both teams. This can lead to unintended problems, as Git does not (and should not need to) cater for role-based access controls at a directory level.

The /applications directory in the snippet below would contain the available application items, and the /clusters directory would contain the FluxCD Kustomization files that associate the clusters with the applications we want to deploy.

├── applications
│   ├── app-1
│   └── app-n
├── clusters
│   ├── cluster-1
│   │   ├── cluster-1-production
│   │   └── cluster-1-test
│   └── cluster-n
│       ├── cluster-n-production
│       │   └── cluster-n-production-specific
│       └── cluster-n-test
│           └── cluster-n-test-specific

FluxCD Application Definitions: The application definitions can be helm-packaged applications, as we saw for the cluster add-ons, in which case we’d define the corresponding HelmRepository source and HelmRelease files. Sources other than helm repositories are also possible, such as buckets and git repositories. Plain Kubernetes application manifests are also supported by FluxCD, including applications templated with Kustomize.

FluxCD Multi-Source: The running Amazon EKS cluster would now need to be made aware of the application Git repository. This can happen through the GitRepository CRD, and we can define another git source that FluxCD would commence reconciling against.

In fact, we can define as many supplementary Git repository sources as we need, allowing us to flexibly extend the Git structure as our needs evolve and change over time.

The GitRepository definition file contains the connection details of the application git repository, such as the URL, the branch or tag reference, and credentials if needed. This definition file needs to be placed into the initial FluxCD bootstrap git repository, under the path of a given cluster, for example /clusters/cluster-n/cluster-n-production/. This ensures FluxCD finds any supplementary Git repository definitions and applies them to the cluster we want the applications to run on. Once found, the remaining reconciliation chain kicks off automatically. Don’t place the file elsewhere, as FluxCD would not pick it up :-)
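A hedged sketch of such a supplementary source, together with the Kustomization that consumes it, might look like the following; the repository URL, branch, and secret name are placeholders.

# clusters/cluster-n/cluster-n-production/applications-source.yaml
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/applications   # hypothetical application repo
  ref:
    branch: main
  secretRef:
    name: applications-git-auth                      # only needed for private repos

---
# Consume the application repository's per-cluster pointers
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: applications
  path: ./clusters/cluster-n/cluster-n-production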

FluxCD commences reconciling against the application Git repository once all things are in their proper place. The Kubernetes Scheduler would then attempt to place the application workloads onto available data plane worker nodes. At this point, however, the scheduler will not find any capacity to place the workloads on, and they may stay in the Pending state indefinitely… unless we define a node auto-scaler amongst our cluster add-ons. A popular choice for groupless node scaling on Amazon EKS is Karpenter. Karpenter detects the pending workloads and provisions new EC2 instances according to the instructions we define in the Karpenter Provisioner CRD. The diagram below illustrates the placement of user applications onto Karpenter-provisioned EC2 instances.

Deploying user applications into Amazon EKS with FluxCD
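For completeness, a minimal sketch of such a Provisioner, assuming the v1alpha5 API that was current at the time of writing and with illustrative requirements and limits, could look like this:

# common/cluster-addons/add-ons/karpenter-provisioner.yaml (illustrative path)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
  limits:
    resources:
      cpu: "200"                  # cap the total capacity this Provisioner may create
  ttlSecondsAfterEmpty: 60        # scale empty nodes back in
  providerRef:
    name: default                 # references an AWSNodeTemplate with subnet/SG selectors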

Conclusions

GitOps can be a flexible, extensible, and consistent operating model for building and deploying Amazon EKS clusters and cloud-native user applications. Defining a thoughtful git repository structure that aligns with organisational team structures can significantly enhance the success probability of the GitOps model. This generally accelerates adoption and starts delivering on the “promise” of improved empowerment of the involved teams by extending well-known developer practices and tools to deploying and configuring clusters and applications. The “Amazon EKS and GitOps” combination is a viable and demonstrable path towards a developer experience for deploying and operating clusters and applications in a unified way.
