Data Reliability at Chick-fil-A

Djelany Borges
Published in chick-fil-atech · 5 min read · Mar 1, 2024

Chick-fil-A has over 3,000 locations across the USA, Puerto Rico, and Canada, with over 8 million orders per day. The data being tracked and processed, including Restaurant data points, customer orders, and other business operations information, creates a data-rich landscape, but also a multitude of challenges. Data Reliability Engineering (DRE) helps Chick-fil-A approach these challenges and use its resources to build a reliable system that supports the business and customers on a daily basis.

Photo credit: https://www.mercurymediatechnology.com/en/blog/5-tips-to-help-you-understand-your-data/

The concept of Data Reliability Engineering stems from an old friend, Site Reliability Engineering (SRE). SRE applies engineering practices to evaluate the stability of the services and applications a team offers. DRE applies those same practices to the stability and quality of the data. The end goal is a set of tools and processes that teams can use to solve data challenges in a scalable way.

These processes and tools are applied to pipelines and data warehouses to monitor incidents, set expectations on the data, and bring visibility into the quality of the data being produced and consumed. When all of these are in place, the team in charge can quickly identify where the problem in the pipeline lies. In short, DRE treats data quality as an engineering problem, so that when something breaks in the infrastructure, the team can distinguish between a data problem and a software problem.

Photo credit: https://www.montecarlodata.com/blog-what-is-a-data-engineering-workflow-definition-key-considerations-and-common-roadblocks/

DRE Principles

DRE borrows its principles from SRE, and these principles help clarify which tools and processes to use.

Embrace risk. Accept that things will break at some point.

Set standards. Know what the data is expected to look like; otherwise, you have no way to measure it.

Reduce toil. When something breaks, the team needs to know exactly what to do.

Monitor everything. You don’t know what you don’t know.

Use automation. Remove barriers between knowledge and application.

Control releases. Make changes intentionally.

Favor simplicity. Less complex systems are more reliable.

How Chick-fil-A Applies DRE

As we look at our data pipelines and evaluate how to apply the DRE principles above, there are a few things we keep in mind. We focus on standards, frameworks, patterns, and best practices to empower our business partners with confidence in the analytic engineering work they are delivering. Usually, when people hear the word framework, the first question that comes up is which tool should be used. Our goal is for teams to start focusing on standards rather than tools.

Standards

Photo by Malhar Patel on Unsplash

Quality tool. A tool to evaluate data (Delta Live Tables, DBT, custom tool, etc.).

Metrics. We provide metrics that help stakeholders understand their data so they can better manage expectations for it.

Dashboard. Provide SRE-style monitoring of pipeline SLAs and bring visibility to all parties that depend on a specific pipeline (Grafana).

Support. On top of providing engineering support, we also support the platform environment (Airflow, Databricks, DBT, etc.).

Custom Use Case

A homegrown tool called Data Reliability Broker (DRB), formerly known as Data Quality Broker (DQB), was designed to run data quality checks. The tool takes advantage of Spark for its parallel processing power and the Great Expectations framework for writing data quality checks, and it is supported by our engineers 24/7. It also publishes metrics to a Grafana dashboard to bring visibility to anyone who depends on the data being processed through the pipeline.
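At its core, this is the Spark-plus-Great-Expectations pattern. Here is a minimal sketch of that pattern, assuming the legacy Great Expectations dataset API; the S3 path and column names are illustrative, not Chick-fil-A's actual schema.

```python
# Minimal sketch: wrap a Spark DataFrame in Great Expectations' legacy dataset
# API and evaluate a couple of expectations. Paths and columns are illustrative.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.appName("drb-sketch").getOrCreate()
orders = spark.read.parquet("s3://example-bucket/landing/orders/")  # hypothetical path

ge_orders = SparkDFDataset(orders)
ge_orders.expect_column_values_to_not_be_null("order_id")
ge_orders.expect_column_values_to_be_between("order_total", min_value=0, max_value=10000)

results = ge_orders.validate()
print("all checks passed:", results.success)
```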

There are 2 versions of this tool: DRB and DRB Core.

• For DRB, the user’s data is in S3 (it can be in parquet, CSV, delta, or JSON format) and the user does not own a compute space. They simply need to trigger the first Lambda with a payload like the sketch shown after this list.

• DRB Core is a Python package for engineering teams that already own a compute space and have their data in memory as a data frame. It conducts data validation using Great Expectations, reports the metrics to the Grafana dashboard, and notifies our engineers in case there is an issue.
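For the DRB flow, a trigger payload along these lines would point the broker at the rules, source, and output locations. The field names, bucket, and Lambda name are illustrative assumptions, not the actual DRB schema; the sketch assumes dq-broker-agent (described under Architecture below) is the entry-point Lambda.

```python
# Illustrative only: the real payload schema and Lambda name are owned by the
# DRB team. This sketch just shows the three S3 locations DRB needs.
import json
import boto3

payload = {
    "expectations_path": "s3://example-bucket/rules/orders_suite.json",   # Great Expectations rules
    "source_path": "s3://example-bucket/landing/orders/2024-03-01/",      # files to validate
    "output_path": "s3://example-bucket/dq-results/orders/2024-03-01/",   # where results are written
}

boto3.client("lambda").invoke(
    FunctionName="dq-broker-agent",   # assumed entry-point Lambda
    InvocationType="Event",           # fire-and-forget trigger
    Payload=json.dumps(payload).encode("utf-8"),
)
```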

Architecture

This is the architecture DRB uses to run data validation, generate metrics, and handle any potential issues with the pipeline.

User

The requirements to trigger DRB are:

• Expectations/rules written with Great Expectations and stored in S3. If these are not present, we profile the user’s data to help them better understand it.

• The S3 path of the source files to be ingested and run the expectations/rules against.

• The S3 path where the output results should be stored.
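To give a feel for the first requirement, here is a hedged sketch of a Great Expectations suite serialized as JSON and uploaded to S3. The bucket, key, and columns are illustrative; the exact layout DRB expects is defined by the team.

```python
# Sketch: store a Great Expectations suite (JSON) in S3 for DRB to read.
# Bucket, key, and column names are placeholders.
import json
import boto3

suite = {
    "expectation_suite_name": "orders_suite",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "order_id"},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "order_total", "min_value": 0, "max_value": 10000},
        },
    ],
}

boto3.client("s3").put_object(
    Bucket="example-bucket",
    Key="rules/orders_suite.json",
    Body=json.dumps(suite, indent=2).encode("utf-8"),
)
```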

DRB AWS services:

dq-broker-agent. Processes the user request.

dq-broker-preprocessor. Runs sanity checks and prepares the compute space (Glue job).

dq-broker-process. Keeps track of each step and its status.

Glue job. Evaluates the input file(s) against their expectations, writes the results, and publishes metrics and Slack notifications.

Support. CDOps is notified if an error is found in any step of the DRB pipeline. The DRB team is notified if it is a pipeline error, and the user’s point of contact is notified if it is a configuration error.

Report. Reports the run’s data quality metrics and any errors to the team Slack channel.
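The metric and notification plumbing is not specified here, so the sketch below is only illustrative: it assumes CloudWatch as the metrics store feeding Grafana and a Slack incoming webhook, with placeholder names and URLs.

```python
# Hedged sketch of the "publish metrics and notify" step. Assumes CloudWatch
# metrics (charted in Grafana) and a Slack incoming webhook; all names and
# URLs are placeholders, not the actual DRB configuration.
import boto3
import requests

def report(run_id: str, success: bool, failed_checks: int) -> None:
    # Publish a data quality metric that a Grafana dashboard can chart.
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataReliabilityBroker",
        MetricData=[{
            "MetricName": "FailedExpectations",
            "Dimensions": [{"Name": "RunId", "Value": run_id}],
            "Value": failed_checks,
            "Unit": "Count",
        }],
    )
    # Notify the team Slack channel if anything failed.
    if not success:
        requests.post(
            "https://hooks.slack.com/services/EXAMPLE/WEBHOOK",  # placeholder webhook
            json={"text": f"DRB run {run_id}: {failed_checks} expectation(s) failed"},
        )
```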

Conclusion

DRB allows us to better understand what the data being processed in our pipelines looks like. With this visibility, we can set the standards for our data. If something does go wrong in our pipelines, we can tell whether it was caused by the data we received or by a software problem.
