Observability at the Edge

How Chick-fil-A provides observability for 2,800+ K8s clusters

Brian Chambers
chick-fil-atech
8 min read · Aug 8, 2023


by Brian Chambers, Alex Crane

Managing 2,800 Edge Kubernetes Clusters

Chick-fil-A runs a fleet of Edge Kubernetes clusters in each of our ~2,800 restaurants to enable highly available, business-critical workloads to run without an internet dependency. On top of this fleet we run a set of platform services that provide capabilities like identity management, application deployment, and observability across the fleet. We also manage hundreds of thousands of Internet of Things (IoT) devices such as fryers, grills, and tablets (user interfaces) that send billions of MQTT messages each month, in addition to operational metrics and logs.

Operating 2,800+ discrete footprints of Kubernetes is not without its challenges. The same is true for any of the edge applications deployed by our internal teams, which are customers of the platform. Anyone running an application needs strong observability capabilities, and this is certainly true at the Edge, too. In this post, we’ll share our observability architecture for the edge at Chick-fil-A.

Challenges at the Edge

The edge brings a lot of unique challenges and constraints that are somewhat foreign to the modern cloud native mindset.

  • Logging: There is simply too much application log volume to push debug-level output from every Kubernetes pod and IoT device to the cloud in real time.
  • Bandwidth: In some early iterations of our services, we consumed all of the available bandwidth by shipping back loads of extremely verbose logs, even to the detriment of credit card processing (not good!). Breaking production to observe production is not ideal.
  • Cost: Metric cardinality across 2,800 locations can become prohibitively expensive in proportion to the business value added.
  • Heterogeneous Origins: We collect logs, metrics, and telemetry data from many types of applications, including platform apps, other teams’ business apps, vendor-developed apps, and in-restaurant IoT devices.

Edge as a Platform

Our Edge environment provides a platform that internal teams can build systems and services on top of. A good metaphor would be the way AWS or Google provide cloud platforms that allow you to build applications, but do not dictate what tools you use or how you use them. We follow a similar paradigm, allowing teams to have “freedom within a framework” to pick different programming languages, observability tools, and so forth. For observability, this matters because we need our Edge platform to deliver the relevant metrics and logs for each team (with some degree of overlap across certain groups of logs and metrics) to their tool of choice. A logging pipeline tool makes perfect sense for this task.

Services provided by the platform team are in red. Other teams in blue.

Metrics

One goal in our Edge observability journey was to begin to create a mindset shift from “logs first” to “metrics first” for our application teams, a concept we call “first-class metrics”, with “log-derived metrics” as the alternative.

Before going too far, what is a metric and what should it be used for?

In our context, a metric is a data point collected over an interval of time that can provide insight into the behavior, performance, and health of the system. Metric types include:

  1. System Metrics: These are measurements about the system itself such as CPU usage, memory utilization, network bandwidth usage, disk I/O, etc. They provide insights into the overall performance and health of the system.
  2. Application Metrics: These are measurements specific to the application running on the system. For example, the number of active users, transactions per second, error rates, response times, etc. They are often key indicators of the performance and effectiveness of an application.
  3. Business Metrics: These are measurements that tie technical performance to business outcomes. Examples could include revenue generated per user, conversion rates, customer acquisition cost, etc. Sometimes these metrics are the only indicator that a system is truly not operating nominally.

Metrics are useful for establishing a baseline of nominal performance, monitoring the system against that baseline, troubleshooting anomalous behavior, and making decisions (human or automated).

We really like first-class metrics for understanding application behavior and providing good observability context with a smaller volume of data. Creating useful metrics (not too many, not too few) that indicate application performance is tough, and we are still on the journey of moving to full adoption of this paradigm. In our edge compute use cases, the typical HTTP status metrics are less useful since most of our applications do not take incoming HTTP requests, but rather speak MQTT to other applications and devices in our restaurant ecosystem. This requires some more critical thinking to determine what sorts of metrics will be useful for determining health.
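To illustrate what a first-class metric can look like for an MQTT-speaking edge application, here is a minimal Python sketch using the prometheus_client library. The metric names, labels, and port are hypothetical rather than our production conventions; the point is that the application emits purpose-built counters, histograms, and gauges on a /metrics endpoint instead of relying on log parsing.

```python
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical first-class metrics for an MQTT-based edge service.
MESSAGES_TOTAL = Counter(
    "mqtt_messages_processed_total",
    "MQTT messages processed, by topic and outcome",
    ["topic", "outcome"],
)
PROCESSING_SECONDS = Histogram(
    "mqtt_message_processing_seconds",
    "Time spent handling a single MQTT message",
    ["topic"],
)
LAST_MESSAGE_TS = Gauge(
    "mqtt_last_message_timestamp_seconds",
    "Unix timestamp of the most recent message seen on a topic",
    ["topic"],
)

def handle_message(topic: str, payload: bytes) -> None:
    """Process one message and record first-class metrics about it."""
    start = time.time()
    try:
        # ... application-specific handling of `payload` would go here ...
        MESSAGES_TOTAL.labels(topic=topic, outcome="ok").inc()
    except Exception:
        MESSAGES_TOTAL.labels(topic=topic, outcome="error").inc()
        raise
    finally:
        PROCESSING_SECONDS.labels(topic=topic).observe(time.time() - start)
        LAST_MESSAGE_TS.labels(topic=topic).set(time.time())

if __name__ == "__main__":
    # Expose /metrics for an in-cluster Prometheus scrape.
    start_http_server(9100)
    while True:
        time.sleep(60)
```

For pub/sub workloads like this, signals such as “time since last message on a topic” are often far more telling than the HTTP status metrics common in request-driven services.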

Vector

Vector is an open source tool, written in Rust, for building observability pipelines. It consists of a few key components:

  1. Sources: These are where data comes from. In other words, sources define the origin of the data Vector will work with. Vector supports system logs, application logs, Kubernetes logs, Docker logs, Prometheus metrics, data from cloud providers, and many other types of observability data.
  2. Transforms: Vector can also do in-flight transforms. This includes operations like filtering, parsing, sampling, adding fields, type conversions, anonymizing data, and so on.
  3. Sinks: Sinks are target destinations for data post-processing. This could be a storage system like AWS S3, a monitoring service like Datadog metrics, an event bus like Kafka, a database, or some other destination.
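To make these three concepts concrete, here is a minimal sketch of an edge pipeline in Vector’s TOML configuration format. The source names, scrape endpoint, ERROR-matching condition, file path, and cloud address are all illustrative assumptions, not our actual configuration.

```toml
# Illustrative edge Vector pipeline: collect pod logs and metrics,
# keep verbose logs locally, and ship only ERROR logs plus metrics upstream.

[sources.pod_logs]
type = "kubernetes_logs"            # tail logs from every pod on the node

[sources.app_metrics]
type = "prometheus_scrape"          # hypothetical scrape target
endpoints = ["http://my-app.default.svc.cluster.local:9100/metrics"]

[transforms.error_logs]
type = "filter"
inputs = ["pod_logs"]
condition = '.level == "ERROR"'     # assumes a parsed `level` field

[sinks.local_rolling_logs]
type = "file"                       # local retention for offline troubleshooting
inputs = ["pod_logs"]
path = "/var/log/edge/%Y-%m-%d-apps.log"
encoding.codec = "json"

[sinks.cloud_vector]
type = "vector"                     # forward to a centralized cloud Vector instance
inputs = ["error_logs", "app_metrics"]
address = "cloud-vector.example.com:6000"
```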

Edge Metric Collection

In our Kubernetes clusters, we use Prometheus to collect from the /metrics endpoint exposed on each of our pods. All of the metrics we collect are then exfiltrated to our Cloud Vector instance, which is configured as a sink on each Edge Vector.

A view of the in-restaurant architecture. Green and black lines indicate outbound logs and metrics.

In many cases, it has been a difficult transition as we attempt to separate metrics and logs into distinct streams; many tools do not handle this natively. In the event that the two are combined, we leverage a sidecar that scrapes metric-related tags from logs and forwards those metrics to Vector so that we can have a complete metric-driven view of the environment’s operations.
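One way to picture such a sidecar is the Python sketch below: a small process that tails a combined log stream, pulls out metric-related tags, and re-exposes them as Prometheus counters so they join the same metrics path as everything else. The JSON log format, field names, and the choice to re-expose via /metrics (rather than pushing to Vector directly) are assumptions for illustration only.

```python
import json
import sys
from prometheus_client import Counter, start_http_server

# Hypothetical: logs are JSON lines with an embedded "metric" tag, e.g.
#   {"level": "INFO", "metric": "orders_submitted", "value": 1, ...}
DERIVED = Counter(
    "log_derived_events_total",
    "Events derived from metric-tagged log lines",
    ["metric"],
)

def main() -> None:
    start_http_server(9101)  # scraped like any other /metrics endpoint
    for line in sys.stdin:   # e.g. piped from the combined log stream
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if "metric" in event:
            DERIVED.labels(metric=event["metric"]).inc(event.get("value", 1))

if __name__ == "__main__":
    main()
```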

Logs

Even though we prefer to default to first-class metrics, logs are still critical to many teams. We believe they will always be useful in certain contexts when troubleshooting difficult problems that require extra context beyond what metrics or traces provide.

Log Collection

By default, all edge-deployed applications (should) log at a DEBUG level. Vector collects these logs by scraping the Kubernetes logs interface and stores them locally. For each of our application deployments, we have a parameter that configures the amount of rolling storage allocated for its logs.

For other devices (fryers, grills, and other IoT things), we have also created an HTTP service running locally that allows out-of-cluster devices to post their logs for collection and exfiltration. These are generally vendor-provided solutions, and this enables us to handle their logs (or metrics) without them writing directly to their own cloud, which is undesirable as we prefer to control all outbound traffic in our environment.
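A minimal sketch of what such an intake service could look like is below, using only the Python standard library: it accepts POSTed log payloads from out-of-cluster devices and appends them to a local file that the edge pipeline can pick up. The path, port, and payload handling are assumptions, not the actual service.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical location watched by the local log pipeline.
LOG_PATH = "/var/log/edge/device-logs.jsonl"

class DeviceLogHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Append the raw payload; parsing and filtering happen downstream.
        with open(LOG_PATH, "ab") as f:
            f.write(body.rstrip(b"\n") + b"\n")
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    # Listen only on the restaurant-local network.
    HTTPServer(("0.0.0.0", 8080), DeviceLogHandler).serve_forever()
```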

Log Shipping

What is the overall strategy on when/how logs are shipped to the cloud?

Vector transforms filter the logs and by default forward only ERROR logs to our consolidated Cloud Vector. Logs flow in near-real-time any time the cluster has an active internet connection.

To prevent cascading effects from potential network saturation, we have separate VLANs with different bandwidth allocations for our POS and credit systems vs. our Edge network. In an offline scenario, logs are collected locally within the rolling storage allocation that was defined and exfiltrated once the restaurant is back online.

But wait! What if there is a situation where I need more information from my application logs and ERROR is not enough?

If a team needs more detailed logs for a temporary time period for production support purposes, they can call a custom, cloud-deployed API endpoint that will change the logging collection filter to DEBUG for a specific application (pod) at a restaurant or set of restaurants for an arbitrary time period, generally 30 minutes.

Once that time period expires, the logging collection filter returns to its default. Restoring the default is something that could easily be forgotten but that would have potential long-term impacts if not addressed, hence the automated control. Ultimately this solution ensures that production support is possible but has minimal long-term impact on our restaurant fleet. Over and over again, automation and self-service are essential to operating an edge platform successfully at scale.
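The shape of that control, with the automatic revert baked in, could look something like the Python sketch below. The function names, parameters, and the use of an in-process timer are hypothetical; the real implementation is a cloud-deployed API that updates the collection filter for the targeted restaurants.

```python
import threading

# Hypothetical in-memory view of per-(restaurant, app) log collection levels.
DEFAULT_LEVEL = "ERROR"
collection_levels: dict[tuple[str, str], str] = {}

def raise_log_level(restaurant_id: str, app: str, minutes: int = 30) -> None:
    """Temporarily collect DEBUG logs for one app at one restaurant."""
    key = (restaurant_id, app)
    collection_levels[key] = "DEBUG"
    # Automatic revert, so nobody has to remember to turn it back off.
    threading.Timer(minutes * 60, _restore_default, args=(key,)).start()

def _restore_default(key: tuple[str, str]) -> None:
    collection_levels[key] = DEFAULT_LEVEL

# Example: a team investigating an issue at a single restaurant.
raise_log_level("01234", "kitchen-display", minutes=30)
```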

Fan Out

Logs and metrics are delivered to an AWS cloud environment that is owned by our Enterprise Restaurant Compute team. We have a cloud-deployed Kubernetes cluster (EKS) that runs most of our control plane services. In this environment, we have a centralized “Cloud Vector” instance. It has sinks configured for each “deployment” that forward logs to the appropriate target for the audience that is interested in them.

This enables teams to use whatever tool makes sense for them and gives them autonomy to build reports and dashboards to support their application(s). Most of our teams send their logs and metrics to Grafana, Datadog, or CloudWatch.
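As a rough sketch, the Cloud Vector side of this fan-out might look like the following TOML: a source receiving events from the edge Vectors, a route transform splitting the stream, and per-team sinks. The routing key, sink choices, names, and credentials are illustrative only, not our actual configuration.

```toml
# Illustrative Cloud Vector fan-out configuration.

[sources.from_edge]
type = "vector"                      # receives events shipped by the edge Vectors
address = "0.0.0.0:6000"

[transforms.by_team]
type = "route"
inputs = ["from_edge"]
route.team_a = '.team == "team-a"'   # assumes events carry a team/app tag
route.team_b = '.team == "team-b"'

[sinks.team_a_datadog]
type = "datadog_logs"
inputs = ["by_team.team_a"]
default_api_key = "${DATADOG_API_KEY}"

[sinks.team_b_cloudwatch]
type = "aws_cloudwatch_logs"
inputs = ["by_team.team_b"]
group_name = "team-b-edge-logs"
stream_name = "{{ restaurant_id }}"  # hypothetical per-restaurant stream
region = "us-east-1"
encoding.codec = "json"
```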

In addition, all alerting events are sent to OpsGenie, which creates alerts for teams based on their on-call rotations.

No-no’s

There are a few things we do not allow for edge applications or out-of-cluster devices.

One of those is sending logs and metrics directly to the cloud. Doing this from a client application or IoT device would bypass all of the controls that we have in place to manage bandwidth constraints, particularly when a restaurant is on a backup network connection where bandwidth is highly constrained. In these cases, we risk network saturation and the prioritization of less important data egress over more critical traffic.

Conclusion

Observability at the Edge is a challenge and requires careful selection of key indicators of both platform and application health. A mindset shift from using logs as a crutch to creating first-class metrics is transformational but challenging to drive since it requires engineers to adopt a new paradigm. Nevertheless, using first-class metrics is a great way to get high quality observational data about a platform or application at the edge without generating too much outbound data. Vector has been a powerful tool to help us transform our observability streams on the fly to reduce outbound operational data to the required minimum so that we can maximize bandwidth for business telemetry instead. We have been really happy with how this architecture has developed to support our current and emerging needs at the Edge.
