Hundreds of Weekly Deployments Managed with GitOps

Chick-fil-A Team · Published in chick-fil-atech · Nov 8, 2022 · 7 min read

by Christopher Lane

Today, a significant portion of our sales are flowing through our digital channels, including The Chick-fil-A App and chick-fil-a.com. We have seen significant growth in these channels since March 2020, when the impacts of COVID-19 started in North America.

Our digital properties are powered by our Digital Experience Engine (DXE). DXE is a cloud-based microservices architecture composed of about one hundred services, running in our Kubernetes-based application platform. We have hundreds of developers working on DXE and, collectively, they push thousands of commits and open hundreds of pull requests every week.

Our deployment process is entirely GitOps-based and all our developers need to traverse this process to get their code deployed every single day, with no formal blackout periods for deploying code. So, how do we do it?

Why GitOps?

GitOps is a pattern for managing the state of Kubernetes clusters using git as the source of truth. The entire state of the cluster is declared in manifests stored in a git repository, which we call an atlas, and any changes to the manifests follow well-known git processes (feature branches, pull requests, etc.). Once the manifests are version controlled in git, a state reconciler called ArgoCD automatically applies changes from the repository to the cluster.
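To make the reconciler's role concrete, here is a minimal sketch of what an ArgoCD Application pointing at an atlas repository could look like. The repo URL, path, and names are hypothetical placeholders, not our actual configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-api                 # hypothetical service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/atlas.git   # hypothetical atlas repo
    targetRevision: main
    path: apps/example-api          # directory holding the rendered manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: example-api
  syncPolicy:
    automated:
      prune: true                   # remove resources that were deleted from the atlas
      selfHeal: true                # revert manual drift back to the declared state
```

With automated sync enabled, a merged pull request against the atlas is all it takes to change the cluster.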

The principles and advantages of GitOps are described in Weaveworks’ Guide To GitOps. We won’t repeat all of them here, but the key ones for us were:

1. The state of the cluster is described declaratively. This means the cluster state can be treated as code and managed like any other codebase.

2. The state is versioned in git. This makes deployment processes look like regular git processes. Deployments can be staged and reviewed in feature branches and pull requests. Rollbacks can be applied via git revert, and in general, we can lean on git’s performance, security and flexibility when managing the state of the cluster.

3. Changes to the cluster state can be applied automatically. Developers can start deploying applications without needing escalated permissions to the cluster (or any permissions at all). This is a big advantage for us because our development teams tend to know git processes significantly better than Kubernetes tooling.

Traffic Pattern

Let’s look at the number of requests through DXE on a typical day:

DXE Traffic Pattern

There’s nothing special about this day… we just picked one. (All times are US/Eastern.) The figure probably isn’t terribly surprising: requests ramp up during breakfast, spike at the peak of the lunch rush, and hit another, smaller peak during dinner.

Overall, we exceed ~300K rpm at lunch, average about ~120K rpm and handle more than 150M requests during a given day. If we’re running a promotion, these numbers can easily double.

Architecture

Next, let’s review the high-level DXE architecture:

DXE Architecture

It’s relatively straightforward. Requests from clients are routed to services running in Kubernetes. We currently have clusters in multiple regions and the setup is identical in each. We regularly route traffic to different regions to ensure we don’t have any sneaky, hidden dependencies in a particular one.

Our services are mostly Java Spring Boot applications with some Go and Python applications mixed in. The applications are backed by NoSQL databases, and the data in these stores is streamed to a data warehouse to support real-time analytics. The “digital-born” orders (orders that originate from a digital property) are combined with the traditional orders coming from the point-of-sale (POS) terminals in restaurants, giving internal teams a complete view of transactions in near-real time.

Let’s zoom in on the Kubernetes platform that runs DXE in the figure above:

Chick-fil-A Application Platform

We think of Chick-fil-A’s Application Platform (CAP) as a distribution of Kubernetes. As shown above, CAP is a series of layers: k8s, sys, core, and app. This certainly isn’t the complete list of components (the image would be unreadable if we added them all), but hopefully it gives a sense of the kinds of components included in CAP. The layers are meant to be composable, with teams free to add or remove components to taste. CAP does, however, ship opinionated defaults, so folks don’t have to worry about choosing components unless they want or need to.

The base k8s layer is Amazon’s Elastic Kubernetes Service (EKS). On top of this layer is the sys layer, which includes:

• external-secrets for integrating AWS Secrets Manager with native Kubernetes secrets (see the sketch after this list)

• external-dns for managing Route 53 entries

• the bottlerocket update operator for updating and patching our nodegroups

• the cluster autoscaler for scaling the cluster

• the aws-load-balancer-controller for managing AWS load balancers

• … and a variety of other cluster and service metric providers
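As an illustration of how the sys layer plugs into AWS, here is a minimal sketch of an ExternalSecret that syncs an AWS Secrets Manager entry into a native Kubernetes Secret. The store, secret names, and keys are hypothetical, and the exact fields depend on which external-secrets version is installed:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-api-secrets          # hypothetical name
spec:
  refreshInterval: 1h                # re-sync from AWS Secrets Manager hourly
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager        # hypothetical store backed by AWS Secrets Manager
  target:
    name: example-api-secrets        # the native Kubernetes Secret the operator creates
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/example-api/db-password   # hypothetical Secrets Manager entry
```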

Next up is the core layer. This is the largest layer of CAP and the one where a team might switch out a component or two (for example, swapping ArgoCD for a proprietary, in-house GitOps operator). By default, though, you get:

• ArgoCD as the GitOps operator, the heart of CAP. Argo watches the atlas, applying any changes to ensure the cluster is always in sync with the declared state in git.

• the cloudwatch-adapter so that we can scale our services based on cloudwatch metrics (see the sketch after this list)

• fluentd for parsing and routing logs to various log sinks

• the Open Policy Agent (OPA) for creating and maintaining admission controllers

• prometheus-operator for maintaining the prometheus and grafana stacks

• thanos-operator for managing long-term storage of prometheus metrics in S3

• cert-manager for obtaining, renewing, and using certificates

• … and a variety of other tooling, including operators to manage advanced deployment patterns, track costs, manage CI/CD jobs and mock frameworks
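As one example of how application teams consume the core layer, the cloudwatch-adapter lets a HorizontalPodAutoscaler scale on an external, CloudWatch-backed metric. The metric name, replica counts, and threshold below are hypothetical and depend entirely on how the adapter is configured:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api                  # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          name: example-queue-depth  # hypothetical metric exposed by the cloudwatch-adapter
        target:
          type: AverageValue
          averageValue: "30"         # add pods when the per-pod average exceeds 30
```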

Sitting on top of CAP is the app layer where our developers work. Obviously, our teams rely on the components provided by CAP, but we draw a clear line of responsibility. CAP is responsible for the k8s, sys and core layers, while developers are responsible for the app layer.

Deployment Process

How do teams deploy their applications to the app layer? The deployment process looks like:

Deployment Process

Let’s walk through the workflow step-by-step:

1. The process starts with a developer pushing a commit to the mainline branch of the application repo. This triggers a build workflow in GitHub Actions.

2. The application container is built, all tests are run, and the image is pushed to our Artifactory registry.

3. The workflow pulls the base manifests for the application type from our “feeder” kustomize repos. These feeder repos are worth a dedicated blog post on their own, but at a high level, they contain our default manifests for various application types (think java-api, go-api, python-api, or react-app). The default manifests are merged with any application-specific overlays found in the application repo using kustomize (a rough sketch of such an overlay follows these steps). Developers are free to patch anything in the base manifests, but the defaults should work for most applications. This gives teams an easy button for deploying their applications to Kubernetes without feeling overwhelmed by having to write and maintain a (potentially gigantic) pile of yaml.

4. The complete, merged application manifests are then committed and pushed to the atlas.

5. ArgoCD watches for changes to the atlas, and once the new manifests are pushed, it will pull down and apply them to the cluster.

6. The new image is pulled from Artifactory.

7. The new version of the application is deployed to the cluster.
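To make step 3 a bit more concrete, here is a minimal sketch of what an application-side kustomization.yaml pulling a feeder base could look like. The repo URL, ref, and patch contents are purely illustrative and not our actual feeder layout:

```yaml
# kustomization.yaml in the application repo (illustrative only)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # remote base from a hypothetical feeder repo of default java-api manifests
  - https://github.com/example-org/feeder-java-api//base?ref=v1.2.3

patches:
  # application-specific overlay merged on top of the feeder defaults
  - patch: |-
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: example-api
      spec:
        replicas: 4
        template:
          spec:
            containers:
              - name: app
                resources:
                  requests:
                    memory: 1Gi
```

In this sketch, the feeder base would supply the Deployment, Service, and other defaults, and the overlay only overrides what genuinely differs for this application.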

That’s it!

What’s Next?

Currently, this process is implemented almost entirely using shell scripts in GitHub Actions. We’ve thrown virtually all UI/UX out the window in the name of GitOps. While we feel good about that trade-off, it’s one we shouldn’t have to make. We want both GitOps and a nice UI/UX. We would love to partner with the community to help solve this problem.
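To give a rough sense of what “shell scripts in GitHub Actions” means in practice, here is a heavily simplified sketch of a workflow covering steps 1 through 4 of the process above. The registry, repo URLs, identities, and commands are hypothetical placeholders, not our actual pipeline:

```yaml
# .github/workflows/deploy.yaml (heavily simplified, illustrative only)
name: build-and-deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # Steps 1 and 2: build the container and push it to an Artifactory registry
      # (tests and registry authentication omitted for brevity)
      - name: Build and push image
        run: |
          docker build -t example.jfrog.io/dxe/example-api:${GITHUB_SHA} .
          docker push example.jfrog.io/dxe/example-api:${GITHUB_SHA}

      # Step 3: render the feeder base plus application overlays with kustomize
      - name: Render manifests
        run: |
          kustomize edit set image app=example.jfrog.io/dxe/example-api:${GITHUB_SHA}
          kustomize build . > rendered.yaml

      # Step 4: commit the rendered manifests to the atlas; ArgoCD takes it from there
      - name: Push to atlas
        run: |
          git clone https://github.com/example-org/atlas.git
          cp rendered.yaml atlas/apps/example-api/manifests.yaml
          cd atlas
          git config user.name "ci-bot" && git config user.email "ci-bot@example.com"
          git add . && git commit -m "deploy example-api ${GITHUB_SHA}" && git push
```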

We also only briefly touched on a critical component: the feeder repos that house our default manifests. Creating these from scratch is non-trivial and time-consuming, requiring expert knowledge of Kubernetes and a broad understanding of each team’s needs and use cases. It’s tough to get right, and we’ve suffered through degraded performance and outages while iterating on these manifests.

We are actively working on and investing in our Kubernetes and GitOps platform at Chick-fil-A. If this is an area of interest for you, we welcome feedback and community partnership on this. Feel free to leave us a comment or send me a message on LinkedIn.
