Design and Test for the 7 Architecture Pillars

EA Principles Series Part 7

Brian Chambers
chick-fil-atech
7 min read · Mar 7, 2023


This is part 7 of a seven-part series where we unpack each of our Enterprise Architecture principles. As a reminder, we use these principles to build a shared mindset across all of the people implementing systems within Chick-fil-A’s ecosystem. This principle covers a lot of the key items we want teams to think about when they build, buy, and implement software.

Photo Credit: http://www.hip-health.com/seven-pillars-of-health/

Design and Test Systems for the 7 Architectural Design Pillars

Here is our principle verbatim. Since this one is long and has a lot of nuance, you’ll find our comments scattered throughout.

All systems that are built or bought should consider the following areas in their design and make decisions about what to do based on business requirements (SLA, etc). The Enterprise Architecture team is happy to provide a review or consultation on any team’s architecture prior to implementation to help weigh risk vs. reward for these investments.

This is the preamble. Let’s continue…

Usability

Usability — think about the user experience of your system, but also consider the role of the people that use the system and how it fits the larger flow of their work responsibilities on a daily basis. Strive to keep interfaces simple and intuitive. Strive for cohesion (especially across systems).

User experience is very important to us, especially as it relates to the experience of the Owner/Operators and their team members who work in our restaurants. Naturally this means trying to create intuitive user interfaces, but it also means that we need to be cognizant of the user’s experience in their role. Are they constantly changing between systems with different interfaces? Do we present things in a way that is intuitive and reduces cognitive load? Or do they have to context switch a lot, which is taxing in a very busy restaurant environment? Cohesion is king here.

Maintainability

Maintainability — ensure that code is clean, adheres to language-specific best practices, is appropriately commented, is easy to understand by everyone on the team (appropriately simple), and is testable. Maintainability is critical to enabling rapid iteration and change in a code base.

Maintainability is important to being able to make changes in the future. Software that is unnecessarily complex is not maintainable.

I recall a time working on a project where the code was completely functional and very technically “well-written.” It used all the latest and greatest features of the language it was written in… but it was so abstract and complex it was nearly impossible for a new person to reason about it and therefore make changes to it. That project was ultimately completely rewritten (in a simpler way, and in Golang — bonus!). The ability to understand and reason about code is critical to future changes happening quickly and successfully. Easy to understand is more important than as concise as possible.

It is also impossible to make changes rapidly and with confidence without a solid foundation of automated testing.

Scalability

Scalability — ensure that the system is designed to scale under load. In most cases, this will mean it scales horizontally (via containers or virtual machines). Systems should also be tested for scale regularly to ensure scale triggers, monitoring and configuration are accurate.

Scaling is table stakes for systems. If they can’t or don’t scale effectively, they provide a bad user experience and/or break important business functions. Most of our scaling these days is via pods in Kubernetes and via EC2 instances scaling behind our clusters. In this principle, we want to ensure teams think about their scale triggers, configuration, and monitoring capabilities. Are we using the right metrics to trigger scale events at the right times? Are we able to observe what is happening to our infrastructure / platform to capture those critical metrics (whether it’s HTTP requests, instance utilization, or whatever makes sense)? Have we tested our configuration under load to ensure it behaves as we anticipate?
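To make the "scale trigger" question concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), with no change while the ratio stays inside a tolerance band. The function name, the default bounds, and the example numbers are assumptions for illustration, not our production configuration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of an HPA-style scale decision (names and defaults are illustrative)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    # Inside the tolerance band we leave the replica count alone,
    # which avoids flapping on small metric changes.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas

    wanted = math.ceil(current_replicas * ratio)

    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))


# 4 pods averaging 90 requests/sec each against a target of 60 per pod:
print(desired_replicas(4, 90, 60))  # 6
```

Running a load test against numbers like these is a cheap way to check whether the chosen metric and target actually produce scale events when you expect them.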

Availability

Availability — each system should have Service Level Objectives (SLOs) that are agreed to with business partners. The product team should design the system to meet that service level from an availability perspective, taking into account factors such as the infrastructure footprint, least common denominators in availability (you are only as “up” as your weakest component), graceful degradation behaviors, and mean-time-to-recovery (MTTR) from failure. These factors may lead some teams to complex (but appropriate) multi-region architectures, while other teams maintain simple infrastructure footprints with rapid recovery strategies.

There are a lot of trade-offs that can be made here. What is better? A simple solution that can be easily developed and managed but that risks occasional breakages with quick recovery times? Or is it better to build a highly available but much more complex architecture that is more difficult to reason about?

We cannot answer these questions without some context about the problem we are solving. It depends on your objectives. Design for objectives. Don’t design for what is technically possible.

At Chick-fil-A we make trade-offs here all the time. With our Edge Compute stack, we elected to design a solution that was highly recoverable vs highly available given the objectives and constraints we faced.

With our Chick-fil-A One App (customer-facing, high scale, revenue-generating) back-end APIs, we have a much more highly available footprint that leverages additional regions and such.

In our environment, a cloud solution that is at least multi-AZ is the default. Whether a team elects to run active-active or active-passive multi-region or simply stick to a single region is up to them and their ability to execute on such a design while meeting their business goals. There is a cost to complexity (in real dollars, opportunity costs, and cognitive load). All of these should be considered in a good design.
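The "least common denominator" point in the principle can be put into rough numbers. The sketch below assumes component failures are independent (a simplification that rarely holds exactly) and uses hypothetical function names and availability figures. It shows three things: a chain of required components is less available than its weakest link, redundant copies raise availability, and an SLO translates into a downtime budget.

```python
import math


def serial_availability(components):
    """Every component is required, so availabilities multiply."""
    return math.prod(components)


def redundant_availability(availability, copies):
    """The service is up if at least one of the independent copies is up."""
    return 1 - (1 - availability) ** copies


def allowed_downtime_minutes(slo, days=30):
    """Downtime budget implied by an availability SLO over a window of days."""
    return (1 - slo) * days * 24 * 60


# Hypothetical chain: API gateway, application service, database.
chain = [0.999, 0.999, 0.995]
print(serial_availability(chain))         # lower than the weakest component (0.995)
print(redundant_availability(0.99, 2))    # two copies of a 99% component
print(allowed_downtime_minutes(0.999))    # roughly 43 minutes per 30 days
```

Numbers like these help frame the conversation with business partners: if the downtime budget is generous and recovery is fast, a simpler footprint may meet the objective without the added complexity of multi-region.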

Security

Security — security is a shared responsibility, and the designers and builders of systems must ensure they have a strong security posture throughout design, implementation, and maintenance. Chick-fil-A’s security team provides a wealth of documentation and tooling to assist teams in this area.

The key here is that we want everyone to remember that security is a shared responsibility.

Portability

Portability — portability is generally about designing for business continuity. At Chick-fil-A, we recommend using Cloud Platform services because of their high value add and lower maintenance requirements, but recommend being intentional about considering portability when designing a system. Ultimately this is a risk vs. reward scenario that each team must wrestle with.

Our principle suggests considering portability but not designing explicitly for it.

Let’s be honest — building a portable solution means finding common denominators and then building everything needed on top of them. It is a hard thing to do well. We trade portability for value from cloud/SaaS services on a regular basis, and are okay with it.

One thing we can do is use services that adhere to more standard (and therefore more portable) interfaces, for example PostgreSQL-compatible databases, instead of more proprietary ones. We believe there is a lot of value in proprietary services like AWS DynamoDB though, and we recognize that we are accepting vendor/solution lock-in in exchange for a great service. We have also standardized a large portion of our application workloads to run in containers, which brings a degree of potential portability across platforms.
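One lightweight way to "consider portability" without fully designing for it is to keep business logic behind a small interface, so the storage choice is isolated to one adapter. The sketch below is purely illustrative: `OrderStore`, `InMemoryOrderStore`, and `record_order` are hypothetical names, and the in-memory adapter stands in for whatever a real system would use (a DynamoDB-backed or PostgreSQL-backed adapter would implement the same two methods).

```python
from typing import Dict, List, Optional, Protocol


class OrderStore(Protocol):
    """The only storage surface the business logic is allowed to see."""

    def put(self, order_id: str, payload: dict) -> None: ...

    def get(self, order_id: str) -> Optional[dict]: ...


class InMemoryOrderStore:
    """Stand-in adapter; a cloud-specific adapter would expose the same methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, dict] = {}

    def put(self, order_id: str, payload: dict) -> None:
        # Store a copy so callers cannot mutate saved state by accident.
        self._rows[order_id] = dict(payload)

    def get(self, order_id: str) -> Optional[dict]:
        row = self._rows.get(order_id)
        return dict(row) if row is not None else None


def record_order(store: OrderStore, order_id: str, items: List[str]) -> dict:
    """Business logic depends on the interface, not on a specific vendor SDK."""
    payload = {"items": items, "item_count": len(items)}
    store.put(order_id, payload)
    return payload
```

The trade-off is the one described above: the narrower the interface, the easier a future move, but the fewer service-specific features the application can lean on.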

Our principle could be restated more simply as “when you are designing something new, consider how you might make your application portable if it was needed in the future.”

Recoverability

Recoverability — recoverability is about responding to unexpected accidents or disasters while minimizing the impact to a runtime system. Recoverability includes having a strong backup strategy (and regular restoration tests) and a design that factors in potential data loss or service disruption. While higher availability may be appropriate for some teams, other teams may benefit more from a model of fast failure and fast recovery.

This principle applies to all of our systems, whether they are cloud, legacy managed datacenter, or software-as-a-service (SaaS). If there are failures, can we recover?

There are numerous models that can be used, generally depending on data loss tolerance, downtime cost, and transaction volumes. Our product teams should pick what makes sense for their use case, and then test their approach regularly to ensure that 1) it still works and 2) they maintain the ability to flex that recovery muscle if ever needed.
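A regular restoration test can be as simple as restoring a backup into a scratch location and checking that what came back matches what was expected. Here is a minimal sketch using only the Python standard library. The function names are hypothetical, and it compares against the live source directory for brevity; a real test would more likely compare against a checksum manifest recorded at backup time, since the source keeps changing.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def create_backup(source: Path, archive: Path) -> None:
    """Write a gzip-compressed tar archive of the source directory."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(source, arcname=".")


def restore_test(source: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare file checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(scratch)
        return checksums(Path(scratch)) == checksums(source)
```

Scheduling something like `restore_test` and alerting when it returns `False` covers the "still works" half of the goal; having people run a recovery by hand from time to time covers the "muscle" half.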

For a SaaS solution, we usually cover backup and recovery scenarios via our Enterprise Architecture Questionnaire that is part of the vendor selection process. We also have a solution to help mitigate a ransomware event by ensuring critical backups are safe and unreachable by any potential attackers.

These are considerations that should be made at design time, not production operations time.

Conclusion

There is a lot to consider when designing any system. Many factors (business criticality, requirements, traffic volume, team capability, and more) will impact how each of these design principles is considered. The most important thing is that they are part of the up-front design thinking and not afterthoughts. That’s why we have an architecture principle. We may be providing higher-level, somewhat non-specific guidance here, but there is no one-size-fits-all solution across our systems portfolio.

Our principles are all about creating a common, shared mindset. That is what we hope to have accomplished around these critical design pillars.

This concludes our EA Principles Series. We have covered Maximize Cloud First, Steward Our Technology Portfolio and Minimize Long-term Technical Debt, Treat Data as an Enterprise Asset, Design for Composability, Implement Loosely Coupled Systems and Services, Build vs Buy, and finally our 7 Architecture Pillars.

We learned a lot from community feedback and are going to make a few principle enhancements internally as a result, so look for more on those learnings in a summary post in the future.
