Stumbling Into GitOps at the Edge

Brian Chambers · Published in chick-fil-atech · May 16, 2023

by Brian Chambers, Jake Wasdin, Jamey Hammock, Captain Jerry

What do you call git-based deployments?

In the winter of 2017, we were thinking about how we would deploy and manage a massive fleet of Kubernetes clusters across each of our Chick-fil-A restaurant locations to help us achieve our Restaurant Capacity and Internet of Things goals.

The whole purpose of the Edge was to run applications that could interact with our IoT ecosystem and create solutions that make restaurants more efficient and effective and create great experiences for Customers, Operators, and Team Members. We knew that our edge platform would need to provide the capability for last-mile deployment. What does that mean? The previous miles are an application team’s pipelines and processes that ultimately produce a production-ready artifact (checked into an artifact repo) and its associated configuration. The last mile is getting that artifact and configuration deployed to and scheduled in each restaurant on the timeline the team defines.

One of our goals with edge deployments was to allow multiple teams to deploy their application across our fleet of clusters at various cadences. Team A may make small changes that have minimal user impact and can be released to the entire fleet in a single day (or hour). Team B may have to train restaurant team members and roll out at a much slower pace. We wanted to support both of these paradigms, while ensuring that all clusters in the fleet eventually converge on a “golden image”. In short, not all clusters will be the same at any given time, but they will all end up the same eventually (over days, weeks, or possibly months).

We weighed many of the popular deployment tooling options at the time (most of which I have forgotten) but elected to do something a little weird. We decided we would use git repositories as a tool for storing our deployment configurations and that we would synchronize changes from git repos in the cloud down to restaurants with a lightweight agent at the edge.

This “git deployment” approach had a few obvious benefits…

  • declarative edge deployments
  • reproducible cluster configurations
  • developer-friendly and familiar tools
  • native change tracking
  • familiar and reasonable API (git)
  • potential for clean rollbacks

Shortly after, we presented our Edge architecture at a conference (QConNY 2018) and learned that a few others were using this approach and had named it “GitOps” (thanks to Alexis from Weaveworks who introduced me to the term).

How GitOps works

Here’s how our architecture works.

Atlas

An “Atlas” is a git repository that contains configuration for a restaurant location’s cluster and all deployed applications.
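As a rough illustration, an Atlas repo might be laid out something like the sketch below, with cluster-level configuration alongside a directory per application team. The directory and file names here are hypothetical, not our actual layout:

```
atlas-store-00001/
  cluster/                   # cluster-level config (namespaces, RBAC, etc.)
  apps/
    team-a/
      kitchen-display/
        deployment.yaml
        configmap.yaml
    team-b/
      order-routing/
        deployment.yaml
```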

Vessel

Vessel is a custom Golang application that lives in-cluster in each restaurant. Its job is to pull changes from git via the API and apply those changes to Kubernetes via its API.
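The real Vessel talks to the git and Kubernetes APIs directly, but the shape of the loop is easy to sketch. The version below simply shells out to git and kubectl on a timer; the repo path, subdirectory, and interval are hypothetical, and this is not the actual Vessel implementation:

```go
// A minimal sketch of a Vessel-style sync loop (illustrative only).
package main

import (
	"log"
	"os/exec"
	"time"
)

// run executes a command and logs its combined output.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	log.Printf("%s %v: %s", name, args, out)
	return err
}

func main() {
	const atlasDir = "/var/lib/vessel/atlas" // local clone of this restaurant's Atlas repo

	for {
		// Pull the latest deployment configuration from the cloud-hosted Atlas repo,
		// then apply whatever manifests it contains to the in-restaurant cluster.
		if err := run("git", "-C", atlasDir, "pull", "--ff-only"); err != nil {
			log.Printf("git pull failed: %v", err)
		} else if err := run("kubectl", "apply", "-R", "-f", atlasDir+"/apps"); err != nil {
			log.Printf("kubectl apply failed: %v", err)
		}
		time.Sleep(60 * time.Second)
	}
}
```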

Scaling to 2,800+ Restaurants

Since we have ~2,800 restaurant locations, we have ~2,800 Atlas git repos. One per restaurant. This supports our design goal of being able to apply application changes to arbitrary groups of restaurants at the cadence desired by each application team.

Edge Deployment Orchestrator

In order to make this approach scale, we needed to invest in automation, specifically for templating a given change across 1..n Atlas repos.

Deployment Orchestrator is an API to programmatically deploy components to edge clusters in the Chick-fil-A ecosystem. It provides additional protection by restricting clients to only deploy to application directories designated for them. The directories designated for clients are pulled from the token used to authenticate any given request.

The deployment orchestrator API has two main functions:

  1. It orchestrates the deployment of a component into the Atlas repositories for the requested locations.
  2. It keeps track of the deployment status of the component at each of these locations.
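As a concrete sketch of the request shape and the directory restriction described above, consider the snippet below. The field names, token claims, and helper are purely illustrative and do not come from the actual orchestrator API:

```go
// Illustrative shapes for a deployment request and the per-client directory check.
package main

import (
	"fmt"
	"strings"
)

// DeployRequest is roughly what an application team might submit.
type DeployRequest struct {
	Component       string   // e.g. "kitchen-display"
	Version         string   // artifact version to roll out
	Directory       string   // target directory inside each Atlas repo
	Locations       []string // explicit restaurant numbers, and/or...
	DeploymentGroup string   // ...a saved, named group of locations
}

// TokenClaims carries the application directories a client may deploy into,
// derived from the token used to authenticate the request.
type TokenClaims struct {
	ClientID    string
	AllowedDirs []string
}

// authorize rejects requests that target a directory outside the client's allowance.
func authorize(c TokenClaims, req DeployRequest) error {
	for _, dir := range c.AllowedDirs {
		if req.Directory == dir || strings.HasPrefix(req.Directory, dir+"/") {
			return nil
		}
	}
	return fmt.Errorf("client %s may not deploy to %s", c.ClientID, req.Directory)
}

func main() {
	claims := TokenClaims{ClientID: "team-a", AllowedDirs: []string{"apps/team-a"}}
	req := DeployRequest{
		Component:       "kitchen-display",
		Version:         "1.4.2",
		Directory:       "apps/team-a/kitchen-display",
		DeploymentGroup: "pilot-stores",
	}
	fmt.Println(authorize(claims, req)) // <nil>: team-a stays inside its own directory
}
```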

When an application team initiates a deployment, they specify the list of locations they wish to deploy to. This can be all locations, an enumerated list of locations, or, as teams most often prefer, a saved, named group of arbitrary locations that we call a “deployment group”.

The orchestrator rolls the changes out to the target locations in a set of stages, each with a size and a success/failure threshold that the deploying team can define. If the failures in a stage exceed that threshold, the deployment stops. This is, in essence, a canary deployment.

“Roll out” means writing changes to the Atlas repo. These changes are applied asynchronously via Vessel running in each cluster.
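The staging logic itself is simple to sketch. In the version below, the writeToAtlas and awaitStatus helpers are hypothetical stand-ins, and the stage size and failure ratio are just example values, not the orchestrator's actual code:

```go
// A sketch of a staged rollout with a failure threshold (illustrative only).
package main

import "fmt"

// deployStaged writes the change to Atlas repos one stage at a time and stops
// early if too many locations in a stage fail to converge.
func deployStaged(locations []string, stageSize int, maxFailureRatio float64,
	writeToAtlas func(loc string) error, // commit the change to a location's Atlas repo
	awaitStatus func(loc string) bool, // true once the edge agent reports the component healthy
) error {
	for start := 0; start < len(locations); start += stageSize {
		end := start + stageSize
		if end > len(locations) {
			end = len(locations)
		}
		stage := locations[start:end]

		failed := 0
		for _, loc := range stage {
			if err := writeToAtlas(loc); err != nil || !awaitStatus(loc) {
				failed++
			}
		}
		if float64(failed)/float64(len(stage)) > maxFailureRatio {
			return fmt.Errorf("stopping: %d of %d locations failed in stage starting at %d",
				failed, len(stage), start)
		}
	}
	return nil
}

func main() {
	locations := []string{"00001", "00002", "00003", "00004"}
	err := deployStaged(locations, 2, 0.25,
		func(loc string) error { fmt.Println("writing Atlas for", loc); return nil },
		func(loc string) bool { return true },
	)
	fmt.Println("deployment result:", err)
}
```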

To determine that a component has successfully deployed, the orchestrator relies on an agent running at the edge that uses the Kubernetes API to watch the status and health of deployed components. As the versions, rollout status, and health of specifically labelled resources in the cluster change, the agent relays that information to the orchestrator. When the orchestrator sees that the component has been updated to the expected version, the rollout is complete, and all containers are in a healthy state, it considers the deployment of that component to that location complete and successful.
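A stripped-down version of such an agent could be built on a Kubernetes watch. The sketch below watches Deployments carrying a hypothetical tracking label and logs whether they have converged; the real agent relays this information to the orchestrator rather than logging it, and the label keys here are assumptions:

```go
// A sketch of an edge agent watching labelled Deployments (illustrative only).
package main

import (
	"context"
	"log"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Watch only the Deployments the orchestrator cares about (hypothetical label).
	w, err := client.AppsV1().Deployments("").Watch(context.Background(),
		metav1.ListOptions{LabelSelector: "deploy.cfa.example/tracked=true"})
	if err != nil {
		log.Fatal(err)
	}
	for event := range w.ResultChan() {
		d, ok := event.Object.(*appsv1.Deployment)
		if !ok {
			continue
		}
		desired := int32(1)
		if d.Spec.Replicas != nil {
			desired = *d.Spec.Replicas
		}
		healthy := d.Status.UpdatedReplicas == desired && d.Status.AvailableReplicas == desired
		// The real agent would relay this to the orchestrator in the cloud.
		log.Printf("%s/%s version=%s healthy=%v", d.Namespace, d.Name,
			d.Labels["app.kubernetes.io/version"], healthy)
	}
}
```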

One downside of our current approach is that it is “all-or-nothing”. One bad config that fails to apply at the end can prevent other changes from being applied. We also have to manage this system ourselves.

The secret about secrets

What about secrets deployments? Here’s how they work…

On the cloud side, we have an instance of HashiCorp Vault. This Vault is the repository of all local secrets for all edge clusters.

When templating configuration for applications, we apply tags to the “deployment” objects in our YAML to map which applications need which secrets.

At the edge, we have a custom “secrets operator” built on the Kubernetes Operator pattern that polls our cloud-deployed instance of Vault and applies any new secrets in-cluster. The operator then restarts any pods that are labeled as caring about that secret.
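In spirit, the polling side of that operator looks something like the sketch below, using the Vault and Kubernetes client libraries. The Vault path, namespace, secret name, and label selector are all hypothetical, and a real operator would track which secrets actually changed rather than rewriting them every cycle:

```go
// A sketch of a Vault-polling secrets operator loop (illustrative only).
package main

import (
	"context"
	"log"
	"time"

	vault "github.com/hashicorp/vault/api"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	vc, err := vault.NewClient(vault.DefaultConfig()) // VAULT_ADDR / VAULT_TOKEN from env
	if err != nil {
		log.Fatal(err)
	}
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	k8s, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Read this restaurant's secret material from the cloud Vault (KV v2; path is hypothetical).
		s, err := vc.Logical().Read("secret/data/edge/store-00001/kitchen-display")
		if err != nil || s == nil {
			log.Printf("vault read failed: %v", err)
			time.Sleep(time.Minute)
			continue
		}
		data, _ := s.Data["data"].(map[string]interface{})

		secret := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{Name: "kitchen-display-secrets", Namespace: "apps"},
			StringData: map[string]string{},
		}
		for k, v := range data {
			if str, ok := v.(string); ok {
				secret.StringData[k] = str
			}
		}

		// Mirror the secret into the cluster, creating it if it does not exist yet.
		ctx := context.Background()
		if _, err := k8s.CoreV1().Secrets("apps").Update(ctx, secret, metav1.UpdateOptions{}); err != nil {
			if apierrors.IsNotFound(err) {
				_, err = k8s.CoreV1().Secrets("apps").Create(ctx, secret, metav1.CreateOptions{})
			}
			if err != nil {
				log.Printf("secret apply failed: %v", err)
			}
		}

		// Restart pods labelled as consumers of this secret so they pick up new values.
		if err := k8s.CoreV1().Pods("apps").DeleteCollection(ctx, metav1.DeleteOptions{},
			metav1.ListOptions{LabelSelector: "secrets.cfa.example/uses=kitchen-display-secrets"}); err != nil {
			log.Printf("pod restart failed: %v", err)
		}

		time.Sleep(5 * time.Minute)
	}
}
```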

Enabling teams

Application teams are able to consume the Deployment Orchestrator API in order to self-service deploy their applications across the fleet at any time they desire.

Future Considerations

Over the last 4 years, a number of solid GitOps tools have emerged and matured around us, Flux and ArgoCD in particular. We use ArgoCD in all of our cloud environments already. At some point we will likely pivot to an OSS tool instead of our custom solution. Since our solution is fairly simple, easy to understand, and has not caused us many issues… we have deferred that move so far. However, it is not necessarily advantageous to maintain a custom tool if there is a good open source tool that serves the same purpose.

One disadvantage of our current tooling is that it is relatively slow to execute the deployment of changes across so many git repos. The throughput of our git server ends up being the limiting factor for large-scale deployments or rollbacks. While this is tolerable in good circumstances, it can delay recovery from an outage or performance degradation when doing a massive rollback of a change across the entire fleet. It is also challenging to query our entire collection of Atlas repos to see what is deployed where.

Our team continues to work on these items, and more, as we continue to mature our practice over time.
