Enterprise Restaurant Compute
by the CFA Enterprise Restaurant Compute Team
The last time we talked publicly about our Edge Kubernetes deployment was summer of 2018.
Since then, we have completed a chain-wide deployment and run it in production for almost 4 years. Every Chick-fil-A restaurant has an Edge Compute cluster running Kubernetes. We also run a large-scale cloud-deployed infrastructure to support our restaurant footprint.
We have integrated with several of our restaurant systems, assisting with Kitchen Production processes and onboarding the mobile payment terminals used in our Drive Thru. In total, tens of thousands of devices are deployed across our restaurants, actively providing telemetry data from a wide variety of smart equipment (fryers, grills, etc.).
Our purpose today is to catch readers up to our current state and share what has changed over the past 4 years. There are still many exciting opportunities for the platform on the horizon, but we’ll leave that for another day…
Where we left off
The goal of the Restaurant Edge Compute platform was to create a robust platform in each restaurant where our DevOps Product teams could deploy and manage applications to help Operators and Team Members keep pace with ever-growing business, whether in the kitchen, the supply chain, or in directly serving customers.
This was an ambitious project and the first to be deployed in our industry at scale.
In researching tools and components for the platform, we quickly discovered existing offerings were targeted towards cloud or data center deployments. Components were not designed to operate in resource constrained environments, without dependable internet connections, or to scale to thousands of active Kubernetes clusters. Even commercial tools that worked at scale did not have licensing models that worked beyond a few hundred clusters. As a result, we decided to build and host many of the components ourselves.
From the beginning, the goal was to build a standards-based platform and conform to well-understood specs and align to industry best practices.
As you might expect, our first release adhered to our overall design goals, but was a little rough around the edges (pun intended). We used an MVP approach and deployed things into the field so that we could start learning.
Then and Now
Where did we start and what has changed over time? Let’s dig in.
We decided to standardize on consumer-grade Intel NUCs. Deploying a three-node cluster of these NUCs gave us a high level of reliability and capacity, plus the architectural flexibility for HA configurations in the future.
We have not made any changes to this design to date and have been very pleased with this consumer-grade hardware decision, though we are likely to add more compute and memory capacity per node in our scheduled refresh.
For the first release, we landed on using Ubuntu as the base OS. The design was to use a very basic, no-frills image; just a few call-home scripts set to automatically run on first boot to start the provisioning process and configure the node in the cluster.
From the start, our design goal was to enable drop-shipping NUCs to restaurants with no restaurant-specific configuration to be made manually. In other words, all provisioning is dynamic and on-the-fly (but with a number of security features baked in that prevent malicious devices from joining a cluster and/or talking to our secure cloud services).
One thing we never shared much about is a service called Edge Commander (EC), which is part of our cluster bootstrapping and management process.
Every edge cluster node is built with the same image that includes a series of disk partitions and some nifty tricks using OverlayFS that ultimately allow us to persist some data long-term (such as the Edge Commander check-in service), but also achieve the ability to remotely “wipe” other partitions on the node, such as the one Kubernetes lives on.
How does it work? Each node checks in with Edge Commander regularly and takes work commands in the form of “wipes,” after which the node returns to its base image and then requests the latest “bootstrap script.” It then executes that script and rejoins the restaurant cluster (or creates a new one if all nodes were wiped and no other node has created a new cluster yet). This allows us to remotely wipe devices and re-provision Kubernetes clusters to react to production issues or upgrade K3s.
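The node side of this loop can be modeled as a small decision function. The command and step names below are illustrative labels for the flow described above, not the actual Edge Commander protocol:

```python
def handle_command(command: str, cluster_exists: bool) -> list[str]:
    """Decide what a node does after an Edge Commander check-in.

    Hypothetical sketch: command and step names are stand-ins for the
    real EC wire protocol, which was not published.
    """
    if command != "wipe":
        return ["noop"]  # nothing to do; check in again later
    steps = [
        "wipe_k8s_partition",      # return the node to its base image
        "fetch_bootstrap_script",  # pull the latest bootstrap from the cloud
        "run_bootstrap_script",
    ]
    # Rejoin the restaurant cluster, or create a new one if every node was wiped.
    steps.append("join_cluster" if cluster_exists else "create_cluster")
    return steps
```

The key property is that the same image and the same loop run on every node; which cluster a node ends up in falls out of the bootstrap script it fetches, not anything baked into the device.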
This service has worked surprisingly well, as it gives us the ability to remote-wipe a node, but it's also quite scary: a mistake in the code base or bootstrap scripts could have massive implications across our thousands of clusters.
We knew we wanted to standardize on Kubernetes to run our platform and ultimately landed on Rancher’s open source K3s implementation. K3s is a stripped down, spec-compliant version of Kubernetes and has proven to be very simple to set up and support at scale. Since we are not running in the cloud, we do not need many of the cloud service features that make Kubernetes a rather large project. We do try to avoid using any implementation-specific features to allow easy switching in the future as required.
We have been very happy with this decision and have no plans to change in the near future.
When we built our first platform release, there were not great off-the-shelf solutions for a GitOps agent that could run at the edge in a resource constrained environment. We ended up building our own agent called ‘Vessel’ that polls a Git repo (a unique repo for each store) and applies any requested changes to the cluster. It was a simple solution that has worked very well.
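A minimal polling agent in this spirit might look like the following. This is a hypothetical sketch, not Vessel's actual implementation; the repo path, branch name, and manifest layout are assumptions:

```python
import subprocess

def should_apply(last_applied_sha: str, remote_head_sha: str) -> bool:
    """Apply only when the store's repo has moved past what was last applied."""
    return remote_head_sha != last_applied_sha

def poll_once(last_applied_sha: str, repo_dir: str = "/var/lib/vessel/atlas") -> str:
    """One polling cycle: fetch, compare, and apply if the repo changed.

    Returns the SHA that is now applied, so the caller can persist it
    between cycles.
    """
    subprocess.run(["git", "-C", repo_dir, "fetch", "origin"], check=True)
    head = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "origin/master"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if should_apply(last_applied_sha, head):
        subprocess.run(["git", "-C", repo_dir, "merge", "--ff-only", "origin/master"], check=True)
        subprocess.run(["kubectl", "apply", "-f", repo_dir], check=True)
        return head
    return last_applied_sha
```

Because the loop is driven purely by comparing the last-applied commit to the remote head, the agent is idempotent and survives restarts and flaky connectivity: if a cycle fails, the next poll simply tries again.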
We also host our own open source GitLab instance in our cloud Kubernetes cluster. We had hoped to avoid the burden of hosting our own Git server, but we couldn't find a cost-effective hosted solution with a licensing model that would work with thousands of clients polling every few minutes.
For GitOps, we opted for a simple model where each location is assigned its own Git repo, which we call an “Atlas.” New deployments to a restaurant just require merging the new configuration into the master branch of the Atlas. There are tradeoffs in this approach for enterprise management, but it made deployments, visibility of deployed state, and auditing much simpler.
Initial Release Design
Here is a simple diagram showing what our initial release design looked like.
Supporting a Chain-Wide Deployment
One of the greatest challenges we solved was transforming from functional MVP into a scalable, supportable platform that could be maintained by a relatively small team. The fundamentals of the platform were all in place, but there were still manual steps that were required in the provisioning and support processes that needed to be addressed.
API First Strategy
The first order of business was to wrap all of the manual processes and validation checks in RESTful APIs. We created a comprehensive API suite for each of the steps, then built orchestration layers on top to start automating the manual processes.
Creating a comprehensive, well-documented Postman project enabled us to quickly leverage the new APIs and defer the development of a Support Team web UI.
We leveraged OAuth to provide granular access to the API suite, which let us easily lock down specific functions while opening up non-invasive status and reporting endpoints to our customers. That was a huge win.
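Conceptually, the authorization model maps each endpoint to a required OAuth scope. The endpoint and scope names below are made up for illustration:

```python
# Hypothetical scope model: read-only endpoints are broadly available,
# while invasive operations require an elevated scope.
ENDPOINT_SCOPES = {
    "get_location_health": "status:read",
    "get_deployment_report": "status:read",
    "rebuild_node": "node:admin",
    "wipe_node": "node:admin",
}

def is_authorized(token_scopes: set[str], endpoint: str) -> bool:
    """Allow the call only if the OAuth token carries the endpoint's scope."""
    required = ENDPOINT_SCOPES.get(endpoint)
    return required is not None and required in token_scopes
```

With this split, a consumer team's token can be granted `status:read` alone, giving them full visibility without any ability to trigger invasive operations.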
Dedicated Roll Out Team
How did we roll out so many devices across the chain in a relatively short time?
Our core development team is small and lacked the capacity to support the platform (Edge/Cloud Infrastructure, Core Services, Client SDK), develop new capabilities, and also execute the chain-wide rollout.
We pre-shipped and installed the three NUCs at every restaurant chain-wide in advance of the complete rollout, so all that remained was the configuration and verification steps. With our API suite in place, we quickly stood up a semi-technical support team dedicated to rolling out the platform, monitoring status, and solving the more straightforward support issues. We leveraged pair support, playbooks, and a documentation feedback loop to quickly ramp up the rollout team; within a few weeks the team was mostly self-sufficient, and it achieved chain-wide rollout within a few months.
We also needed to implement an organized structure to provide exceptional support for the platform while continuing to develop new capabilities and scale.
Our goal is to automate where it is practical, and push the remaining support work as high in the support chain as possible. This frees up our technical staff to continue to innovate and improve the platform.
We accomplished this through a feedback loop between the First Tier Support and Support DevOps teams. All issues initiate through the first tier. When a new or complex issue arises that they are not equipped to resolve, it gets forwarded to the Support DevOps team. The two teams work together to solve the issue, while the Tier 1 team updates documentation and playbooks so they can handle the next similar occurrence. A weekly support retrospective helps feed the Support DevOps team backlog for improvements and Auto Remediation Opportunities. The Support DevOps team also influences the New Development Team’s backlog to help prioritize new tools or capabilities to improve supportability.
This support model has been very successful. The First Tier Support team is able to resolve the vast majority of alerts that arise — often before any issue is even detected in a restaurant.
Monitoring and Auto-Remediation
With over 2,500 active K3s clusters, we needed to improve our monitoring processes to proactively identify and repair issues with the clusters. We developed a multi-faceted approach.
We established a synthetic client running as a container in the cluster to test our core platform capabilities and analyze problems (service issues, data latency, etc.). When issues are discovered, the client reports to our cloud control plane via an API, which alerts the support team and triggers automated remediation processes.
Since the Kubernetes cluster is self-healing, a node failure does not necessarily represent an outage as workloads are automatically rebalanced between other active nodes in the cluster.
To detect node failures, we deployed simple “heartbeat pods” on each node in the cluster. These pods periodically report status (and a little metadata) to an API endpoint in the cloud. The endpoint applies logic that uses lack of heartbeats to trigger an alert to support staff and to kick off auto-remediation processes if needed.
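The cloud-side logic reduces to tracking the last heartbeat seen per node and flagging any node that has gone quiet. A small sketch, with an assumed staleness threshold (the real value wasn't published):

```python
from datetime import datetime, timedelta

# Assumed threshold: a node missing heartbeats for 5 minutes is considered down.
STALE_AFTER = timedelta(minutes=5)

def stale_nodes(last_heartbeat: dict[str, datetime], now: datetime) -> list[str]:
    """Return nodes whose most recent heartbeat is older than the threshold."""
    return sorted(n for n, t in last_heartbeat.items() if now - t > STALE_AFTER)
```

Note the inversion: the edge never reports "I am down"; the cloud infers failure from the *absence* of heartbeats, which also catches nodes that lose power or connectivity entirely.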
Leveraging weekly support retrospectives, we quickly discovered patterns between errors, validation, and remediation steps. Since all support tools were API-enabled, we were able to build orchestration flows on top of the APIs and automate remediation for the most commonly occurring issues.
A simple process example would be a failed node alert in a working cluster. The production support team a) validates the issue by calling a location health API, b) calls another API to remotely rebuild the node, c) waits for the node to come back online, and d) calls the health API again to verify the node came back up and joined the cluster successfully. If the node did not come back healthy, we typically repeated the process a few times, then eventually submitted a ticket to our vendor to hot-swap the bad node. This process was relatively simple to automate and trigger through the alerts infrastructure, an orchestration layer, and the existing APIs.
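That flow is straightforward to express as an orchestration function. In this sketch the health-check and rebuild calls are injected callables standing in for the real APIs, and the retry count is an assumption:

```python
from typing import Callable

def remediate_node(
    node: str,
    check_health: Callable[[str], bool],  # stands in for the location health API
    rebuild: Callable[[str], None],       # stands in for the remote rebuild API
    max_attempts: int = 3,                # assumed retry count
) -> str:
    """Validate, rebuild, verify; retry a few times, then escalate to the vendor."""
    if check_health(node):
        return "healthy"  # the alert was transient; nothing to do
    for _ in range(max_attempts):
        rebuild(node)  # remote wipe + re-bootstrap; blocks until the node is back online
        if check_health(node):
            return "recovered"
    return "ticket_vendor"  # open a vendor ticket to hot-swap the bad node
```

Injecting the API calls keeps the decision logic testable on its own, which matters when a bug in an auto-remediation flow could wipe nodes across thousands of clusters.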
Adding a few simple auto remediation flows has dramatically reduced the support burden on the team.
As we iterated on improving the support infrastructure, the development team continued to develop new platform capabilities to promote self service and ease of deployment/support.
Our GitOps model was simple. We made manual changes early on, but very quickly wrote a minimalist tool called “Fleet” that allowed us to take a cluster configuration change (deployment) and apply it to multiple restaurants. This worked, but as the platform grew, we needed a better way for consumers to orchestrate their deployments across the chain and to see deployed versions along with deployment successes and failures.
In our second iteration, we created a new Deployment Orchestration API to help teams effectively manage workload deployments. Along with the API, we deployed a matching Feedback Agent on each cluster to report deployments and status back to the cloud.
We also used this to enable the creation of self-managed canary deployment patterns along with automatic chain-wide releases.
As a result of these changes, teams are able to finely tune deployments and have observability over their deployments, resulting in higher-confidence deployments.
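One way to picture the canary pattern: order a chain-wide release as a canary wave followed by fixed-size batches of the remaining locations. The wave sizing and store identifiers below are purely illustrative:

```python
def rollout_waves(
    stores: list[str], canary_stores: set[str], wave_size: int
) -> list[list[str]]:
    """Order a chain-wide release: canary locations first, then the rest in batches."""
    canary = [s for s in stores if s in canary_stores]
    rest = [s for s in stores if s not in canary_stores]
    waves = [canary] if canary else []
    waves.extend(rest[i:i + wave_size] for i in range(0, len(rest), wave_size))
    return waves
```

The feedback agent's per-cluster status reports are what make this safe: a team can hold the next wave until the canary wave reports healthy deployments.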
In our early deployment stages, we allowed internal DevOps Product teams direct access to the restaurant K3s cluster to get status, retrieve logs, etc., since they wanted this data in near real-time. We had a basic log-exfiltration capability, but latency challenges and network congestion on subpar networks made it very difficult to use.
Given that we desired to minimize remote access to our clusters, we quickly moved to a second iteration where we provided API endpoints to abstract the developers from the cluster, but enabled retrieval of logs and status on demand.
In our third phase iteration (which is where we are today), we added a more robust Log Exfiltration capability.
To provide this capability, we leveraged an open source project called Vector to collect and forward logs from the edge clusters to the cloud. We provided shared compute log collection and a logging endpoint for smart equipment outside the cluster to help with centralized log shipping as well.
Vector provides capabilities for filtering, store and forward, and automated rotation of logs. On the cloud side, we set up another Vector service to collect the logs from all the edge instances, apply rules, and forward the logs to the various tools our internal engineering teams use (Datadog, Grafana, CloudWatch, etc).
This centralized approach enabled the prioritization (or cutoff) of logging during times of low bandwidth (such as if we move from fiber to a backup LTE circuit) as well as abstracting producing clients from the downstream log destination and its consumers.
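The bandwidth-aware cutoff can be thought of as a priority filter on log level. The levels and thresholds here are assumptions for illustration, not Vector's actual configuration:

```python
# Lower number = higher priority. Levels are conventional, not CFA-specific.
LEVEL_PRIORITY = {"error": 0, "warn": 1, "info": 2, "debug": 3}

def should_ship(level: str, on_backup_lte: bool) -> bool:
    """On the backup LTE circuit, ship only warnings and errors; on fiber, ship everything."""
    cutoff = 1 if on_backup_lte else 3
    return LEVEL_PRIORITY.get(level, 3) <= cutoff
```

Because the filtering lives in the centralized shipping layer rather than in each producing client, the policy can change (or be temporarily relaxed for debugging) without touching any workload.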
We also added the custom capability to increase logging output for a limited time to support real-time production support troubleshooting / debugging.
Metrics and Dashboards
We also added the capability to leverage Prometheus Remote Write to collect metrics from all restaurants and forward to a central hosted Grafana instance in the cloud. Each K3s cluster is capturing metrics on health, nodes, and core service workloads, but we also offer a core service to enable client development teams to publish custom business metrics to our enterprise cloud instance. This model has been a great success and has greatly improved visibility of the platform as a whole across our entire infrastructure.
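For teams publishing custom business metrics, each sample ultimately boils down to a name, a label set, and a value. A tiny sketch of rendering one sample in Prometheus text exposition format (the real pipeline uses Prometheus Remote Write rather than text exposition, and the metric names here are invented):

```python
def format_metric(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample in Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"
```

Labeling every sample with the store identifier is what lets a single central Grafana instance slice the same dashboard per restaurant or across the whole chain.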
Today, we are surfacing a wide variety of Grafana Dashboards based on edge data, and are just starting to explore additional proactive monitoring and alerting based on historic trends, capacity headroom, etc.
Today, our Restaurant Compute Platform and its supporting processes are mature enough that we can offer a high level of reliability and customer support with relatively small development and support teams. This gives us a great place to run critical services to help us solve business challenges in our restaurant.
What have we learned?
- It took a lot of great engineering and smart tradeoffs to develop an MVP business critical Edge Compute platform with a small team.
- Operating 2,500+ Kubernetes clusters (with a small team) is hard work, but an API-First, just-enough-automation approach worked great for us.
- Coming from a cloud-first world, some of the biggest challenges at the edge are the constraints (compute capacity, limited network bandwidth, remote access). We would suggest investing a lot of time in learning your constraints (and potential ones) and deciding whether to remove them (which takes more time and money) or manage them. For example, we worked around some network management constraints by rolling a series of custom services that worked great, but carried a long-term management cost.
We expect to continue to iterate on the platform to improve stability, self-service, and observability, and to add new features as business needs and the technology landscape evolve.