Site Reliability Engineering at Chick-fil-A
Reducing “cowtages” and improving “refryability.”
Written on October 5, 2018
by Caleb Hurd and Laura Jauch
Technology has become [nearly] as critical to Chick-fil-A’s growth and success as the delicious sandwiches we deliver and the “my pleasure” you receive in our restaurants. Technology has helped enable us to produce truly staggering sales per restaurant despite only being open six days per week (2016 numbers).
This has necessarily increased our focus on reliability as it relates to our technology stack, which brings us to the Site Reliability Engineering (SRE) practice at Chick-fil-A. In this blog post we are going to cover WHY SRE exists at Chick-fil-A, WHAT we do, and HOW we go about implementing it.
So, WHY is there SRE at Chick-fil-A?
Let’s be honest … because it’s cool. We want to have our avocado ~tech~ toast and eat it too!
Kidding aside, most of the readers probably already know that SRE started at Google when developers applied their dev skills to solving operational problems. The benefits of SRE include increased reliability for a software service while preventing over-engineering of the solution. The goal: let developers focus more of their time on new features, because let’s be real, new features are wayyyy more fun than ops work.
We have taken a tremendous amount of inspiration from both books by Google on the topic of SRE (their recently released SRE Workbook is a must-read for SREs).
We have tailored Google’s SRE practices to our own environment because at the end of the day our focus is on selling chicken, not ruling the world’s data (excuse us, “organizing the world’s information and making it universally accessible and useful”; we can’t help but poke fun at our heroes!). However, we can confirm that we don’t have any active or future secret Pentagon poultry projects.
SRE at Chick-fil-A
First, here are a few things to keep in mind as we define SRE at CFA:
· Our SRE practice is new and does not yet span all our teams.
· We are a small, crack team of SREs, so we have to be very strategic about our areas of focus.
· Not everything in this article is fully implemented yet, but we are very excited with the direction in which we are moving!
Enough with the caveats! What is the purpose of SRE at CFA?
Reliability
Mostly, we exist to reliably make reliability more … reliable.
Seriously though, the answer is:
1. To ensure that software is reliable enough for customers.
2. To enable developers by lowering delivery cycle time and reducing toil.
This is a fancy way of saying that we are a voice for our customers (internal and external) from a reliability perspective, and a voice for the developer from a tooling, time management and toil perspective.
On the surface this sounds great! Customers and developers both love us, right?
Unfortunately, when it comes to getting some customer and developer love, the odds are not in our favor. Customers expect things to be reliable. Developers want to focus on feature development and want to move fast. SRE stands “smack in the middle” of those competing needs. However, despite this considerable challenge, in the words of the late Han Solo …
“Never tell me the odds!”
So, with the odds against us, how do we navigate this tension between customers’ reliability expectations and developers’ “need for speed”? In part, by considering how we can take the emotion out of what makes a product “feel” stable to a customer. How can we use data to help us? How can we provide clarity for developers on an objective measure of what “reliable” means, so they don’t spend too much or too little time on reliability? By setting and reviewing Service Level Indicators (SLIs) and Service Level Objectives (SLOs)! Again, Google comes through with a great introduction on the topic.
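For the curious, the math behind SLIs, SLOs and error budgets is refreshingly simple. Here is a minimal, illustrative sketch in Python; the request counts and the 99.9% target are made-up numbers for the example, not our actual objectives:

```python
# Minimal illustration of SLI / SLO / error budget arithmetic.
# The request counts and the 99.9% target are made-up examples.

total_requests = 1_000_000          # requests served this month
failed_requests = 650               # requests that violated the SLI (e.g., 5xx or too slow)

sli = (total_requests - failed_requests) / total_requests   # measured availability
slo = 0.999                                                 # agreed objective

error_budget = 1 - slo                                      # fraction of requests allowed to fail
budget_spent = failed_requests / total_requests
budget_remaining = error_budget - budget_spent

print(f"SLI: {sli:.4%}, SLO: {slo:.1%}")
print(f"Error budget remaining this window: {budget_remaining:.4%} of requests")

if budget_remaining < 0:
    print("Budget exhausted: slow down feature launches, focus on reliability.")
else:
    print("Budget available: the pod can keep shipping features.")
```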
In addition to SLI/SLO, there are a few more central factors that directly relate to a developer’s ability to build reliable systems:
· Easy, frequent and safe deployments with quick feedback loops that enable us to “fail forward” and avoid red tape and process overhead.
· Tools that offer visibility into application and infrastructure performance.
· Monitoring so that the on-call team knows when things are down and can bring them back up quickly.
· Auto-remediation of problems so that human error is largely removed from the system (a sketch follows below).
· Automated coffee delivery to reduce toil and improve caffeinability (because having to get your own coffee is too dang hard first thing in the morning).
These are all activities, tools and processes that contribute to the overall stability of the product.
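To make the auto-remediation bullet a bit more concrete, here is a hedged, hypothetical sketch of a watchdog that probes a health endpoint and restarts the service after repeated failures instead of waking a human. The endpoint, service name and thresholds are placeholders, not anything we actually run:

```python
# Hypothetical auto-remediation sketch: probe a health endpoint and, after
# repeated failures, restart the service instead of paging a human.
# The URL, service name, and thresholds are placeholders.

import subprocess
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"   # placeholder endpoint
SERVICE = "menu-api"                          # placeholder systemd unit name
MAX_FAILURES = 3
CHECK_INTERVAL_SECONDS = 30

def is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

failures = 0
while True:
    if is_healthy():
        failures = 0
    else:
        failures += 1
        if failures >= MAX_FAILURES:
            # A known, non-creative fix: restart the service and reset the counter.
            subprocess.run(["systemctl", "restart", SERVICE], check=False)
            failures = 0
    time.sleep(CHECK_INTERVAL_SECONDS)
```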
Our pods (i.e., teams) at CFA are small, cross-functional, autonomous and (mostly) organized around products. They can choose to implement or reject SRE practices (within reason). This means we as SREs have to use influence, not authority, to improve a product or team (pod). The wonderful thing about not having authority and working strictly through influence is that it drives you way past “good enough” solutions and forces you to make “undeniably better” solutions. Otherwise, the fear of changing to something new will be greater than the perceived value of what is being offered to the team.
All of this adds up to a delicate balance of ops knowledge, dev chops, customer awareness and conflict resolution skills. But probably the most important attribute needed to be a successful SRE at CFA is compassion. If we aren’t empathetic towards the customers we are trying to serve, and the developers we are trying to help, they can and will reject our products, tools and philosophies. Also, blameless post-mortems are a cornerstone of an engineering culture that wants to lead the way; moving fast and effectively involves risk, which will produce mistakes. Those mistakes should be treated as system issues that need adjusting, rather than reasons to blame the human as if they were the originating cause.
SRE at CFA is as much a human-centric activity as it is a technical one.
As a small team, we must carefully choose where to invest our time so as to prevent thrashing members of our own group. That being said, we believe that SRE is a mindset and not a title. One of our approaches is to deputize engineers (à la Barney Fife) within the pods so that reliability can be an early consideration within their development and launch cycle. Shared ownership is a team win!
What We Do
That brings us to the second question: What does SRE do at Chick-fil-A?
We make deployments easier
· Pipelining
We improve uptime and reliability
· On-Call Process
· Load and scale architecture
· Monitoring
· Consolidated and searchable logging
We establish visibility into app performance
· APM + Monitoring
We help make things reliable enough… but not too reliable!
· Help set SLIs / SLOs / error budgets
We help the team move faster
· Error budgets enable product owners to take on more risk
· Infrastructure as code (reduce toil)
· Reduce the cost of failure (canary deployments/automatic rollbacks; see the sketch below)
We believe that these contributions to our development teams and the larger organization significantly improve the overall customer experience and our developers’ ability to run large-scale production workloads.
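To illustrate that last bullet, here is a hedged sketch of the decision logic behind a canary deployment with automatic rollback. The deploy, metric and rollback functions are hypothetical placeholders for whatever deployment tooling and monitoring a pod actually uses:

```python
# Sketch of the decision logic behind a canary deploy with automatic rollback.
# The deploy/metric/rollback functions below are hypothetical placeholders.

import random
import time

CANARY_TRAFFIC_PERCENT = 5        # send a small slice of traffic to the new version
ERROR_RATE_THRESHOLD = 0.01       # roll back if more than 1% of canary requests fail
OBSERVATION_WINDOW_SECONDS = 300  # how long to watch before promoting
CHECK_INTERVAL_SECONDS = 30

def deploy_canary(version: str, percent: int) -> None:
    print(f"Routing {percent}% of traffic to {version}")    # placeholder

def get_canary_error_rate(version: str) -> float:
    return random.uniform(0, 0.02)                          # placeholder metric read

def rollback(version: str) -> None:
    print(f"Rolling back {version}")                        # placeholder

def promote(version: str) -> None:
    print(f"Promoting {version} to 100% of traffic")        # placeholder

def run_canary(version: str) -> bool:
    deploy_canary(version, CANARY_TRAFFIC_PERCENT)
    deadline = time.time() + OBSERVATION_WINDOW_SECONDS
    while time.time() < deadline:
        if get_canary_error_rate(version) > ERROR_RATE_THRESHOLD:
            rollback(version)        # failure stays cheap and automatic
            return False
        time.sleep(CHECK_INTERVAL_SECONDS)
    promote(version)                 # canary looked healthy; take full traffic
    return True

if __name__ == "__main__":
    run_canary("v2.3.1")
```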
Of course, not all products have the same maturity level. In the SRE Workbook, multiple levels of maturity are suggested, from the inception of a product to its sunsetting. In our case we have identified three stages we are most interested in and have mapped our Goals (tactics) to Tools (strategies) based on the maturity level of these three phases. Some pods may require certain things out of order, and we respect that.
This is our maturity model:
How
We come, at last, to the third and final question: How does SRE go about accomplishing its WHAT so we can fulfill our WHY?
Here are some of the tools we currently use or are evaluating to fulfill our objectives. Most of our teams deploy to AWS, so you will see many AWS-centric tools:
Pipelining: Jenkins
Monitoring: Prometheus / Cloudwatch / DataDog / Logic Monitor — whatever is appropriate for the pod
On-call / blameless-postmortems: OpsGenie for the On-Call software and (hopefully in the future) statuspage.io to handle status pages and subscriptions/alerting to stakeholders
SLI/SLO: setting and reviewing them regularly, with Cloudwatch/Prometheus to track them (see the sketch after this list)
Consolidated and searchable logs: ElasticSearch and Splunk. One of the things we love most about ElasticSearch is that, in addition to easily searching for and graphing logs, its paid offering can alert on predetermined metrics directly into our on-call system (OpsGenie).
Infrastructure as code: Terraform, Ansible, Cloudformation
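As a small example of the SLI/SLO item above, here is a hedged Python sketch that pulls a 30-day availability SLI out of Prometheus via its HTTP query API. The Prometheus address, the job label and the http_requests_total metric name are assumptions about how a service might be instrumented, not a description of our actual dashboards:

```python
# Hedged sketch: read an availability SLI out of Prometheus over its HTTP API.
# The Prometheus URL, job label, and http_requests_total metric name are
# assumptions about the service's instrumentation.

import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"   # placeholder address
QUERY = (
    'sum(rate(http_requests_total{job="my-service",code!~"5.."}[30d]))'
    ' / '
    'sum(rate(http_requests_total{job="my-service"}[30d]))'
)
SLO = 0.999

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]

if result:
    sli = float(result[0]["value"][1])    # instant vector: [timestamp, value-as-string]
    print(f"30-day availability SLI: {sli:.4%} (SLO {SLO:.1%})")
    if sli < SLO:
        print("Below objective: the handoff meeting should prioritize reliability work.")
else:
    print("Query returned no data; check the metric name and labels.")
```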
For the purposes of this blog post, we’ve selected a single area to dive deeper on our implementation strategy: our on-call process. Since our pods are autonomous, they implement (or will likely implement) this process differently and as they deem most appropriate for their product and team.
Why should you be running an on-call process instead of letting your developers organically support their product?
1. You can start eliminating the “hero culture” where a handful of devs who have broad knowledge are the only ones who can fix issues. They will get burned out, they and their tribal knowledge will leave, and you will be left standing alone as they ride off into the sunset/galaxy (pick your favorite epic movie abandonment scenario).
2. The single on-call will get thrashed for a limited period as opposed to the entire team having to focus on an outage.
3. There is a clear owner/leader who is the point-person for resolving the outage.
4. The weekly rhythm of reviewing outages pushes the team (and product leadership) to root cause and address recurring issues as opposed to just “rebooting the server” when problems re-occur (but seriously, have you tried turning it off and on again?).
Below is an illustration of how a simple website that begins to return 500s is identified and flows through our on-call process:
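In code terms, the front of that flow might look something like this hedged sketch: a probe notices a run of 500s and opens an OpsGenie alert, which pages the On-Call. The site URL, API key and thresholds are placeholders, and in practice a monitoring tool raises the alert rather than a hand-rolled script:

```python
# Hedged sketch of the first step of the flow: a probe sees repeated 500s and
# opens an OpsGenie alert, which pages the On-Call. The site URL, API key, and
# thresholds are placeholders; in practice a monitoring tool raises the alert.

import requests

SITE_URL = "https://example-site.internal/"                  # placeholder site
OPSGENIE_ALERTS_URL = "https://api.opsgenie.com/v2/alerts"
OPSGENIE_API_KEY = "REPLACE_ME"                              # placeholder key
FAILURE_THRESHOLD = 3

def probe_once() -> bool:
    """Return True if the site answered with a server error (5xx) or didn't answer."""
    try:
        return requests.get(SITE_URL, timeout=5).status_code >= 500
    except requests.RequestException:
        return True

def page_on_call(message: str) -> None:
    requests.post(
        OPSGENIE_ALERTS_URL,
        headers={"Authorization": f"GenieKey {OPSGENIE_API_KEY}"},
        json={"message": message, "priority": "P1", "tags": ["website", "5xx"]},
        timeout=10,
    )

failed_probes = sum(probe_once() for _ in range(FAILURE_THRESHOLD))
if failed_probes == FAILURE_THRESHOLD:       # every probe in the run failed
    page_on_call("Website returning 500s on consecutive health probes")
```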
Some considerations to think through as you build your on-call process:
1. The On-Call should only get woken up in the event that the issue needs creative human intervention. If it does not require creativity, it should be root caused and the solution scripted/automated.
2. The On-Call should be empowered to make decisions to resolve the outage. No chain of command should get in the way, as it will simply prolong the outage. In the words of Yoda: “Do. Or do not. There is no try.”
3. If the On-Call is empowered to make critical decisions, the culture must be blameless in nature, otherwise fear of mistakes and their repercussions will cause On-Calls to freeze in the face of an outage.
4. Once the On-Call is awake, they need total visibility into the application: APM, time-series monitoring, searchable and consolidated logs (like elasticsearch), and any knowledge base or documentation that is useful. Having a directory with the cell numbers of the engineers on the team (and preferably the company) also comes in handy when others need to become engaged; you don’t want to add downtime because a phone number wasn’t handy!
5. Rotations should not put someone on call more often than every three weeks, or keep them off rotation for more than six weeks. Going on call too often results in burnout, and too much time off results in a lack of comfort with how all the systems fit together. Muscle memory, people!
6. If the on-call is new to the overall app and needs to engage another engineer, the newer on-call should do the write up and documentation for educational purposes.
7. Every alert that occurs twice should be root caused and fixed. Some suggest doing this after a single occurrence, but we don’t mind letting one-off blips go for the sake of our own sanity.
8. Each week, an on-call hand-off meeting occurs. A Product Manager MUST be present so that outages for the week can be prioritized into normal work queues for appropriate follow-up. SLIs and SLOs are also reviewed during this handoff to determine if the product is stable enough to continue with normal feature development or if efforts should be shifted to improving reliability.
A great read on alerting, how to cut down on noise, and how not to burn out your on-call can be found in the SRE Workbook, or in this epic write-up on the topic.
Stay tuned as we’ll be diving more deeply into other SRE areas in future blog posts, including how we take “chickens not pets” to a whole new level in our infrastructure.