Availability SLAs, the truth behind keeping services always available — Part II

Diogo Guerra · Published in Feedzai Techblog · 6 min read · Feb 20, 2019

In Part I, we discussed how challenging it can be to maintain high uptimes, from understanding theoretical availability calculations to the impact of response times when incidents occur. Even for large corporations with massive service footprints, cloud-based incidents happen and uptime suffers.

In this post, we’ll cover the planning and architecture of such systems, and the important questions that often go unasked but can seriously affect a system’s uptime.

The question we most often see go unasked is “what failures do you expect to tolerate?” It seems obvious, but engineers caught off guard by it tend to rush to answer “all”, especially when the system is deployed with redundancy. Tolerating every conceivable failure is obviously not feasible.

To look at failures, let’s break them into two broad categories: the ones caused by the software we build, and everything else. There is a reason for this breakdown: you can only change the software you build and make it more resilient. Everything else you may or may not be able to handle gracefully.

At Feedzai, the way we plan for and deal with the failures of our systems is by defining a failure model for each system. What does that mean?

First, we analyze the system and identify the functions that are critical for users. This allows us to:

  1. Make the critical functions clear to the whole team.
  2. Analyze the dependencies of those functions and establish clear critical paths for the system (see the example below).
  3. Focus failure testing by asserting the critical functions defined in the first step.
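For example, in a real-time fraud detection platform, “score an incoming transaction” would typically be a critical function whose critical path includes the ingestion API, the scoring engine, and its state store, while an offline reporting dashboard would sit outside that path (the features named here are purely illustrative).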

Second, we start defining the failures we expect the system to be able to handle. This step is also very important: there is an unlimited number of failures that can happen, from the most obvious ones, such as complete hardware or network failures, to trickier ones like network partitions, long stop-the-world garbage collection pauses, or incorrect human behavior.

After six years of running production systems, Feedzai has a solid list of failures that affect latency-sensitive Java distributed systems, and we keep maintaining that list as we see new failures happen.

Third and last, we define the possible states that a feature can be in after a failure. We defined three main states:

  • No Impact — The feature is not affected and continues working as expected.
  • Degraded — The feature’s operation is not compromised, but the system is working in a sub-optimal manner. For example, a node is down and the feature still works, but there is no redundancy left to handle further failures.
  • Failure — The feature is partially or fully unavailable.

For both Degraded and Failure, we also define sub-states indicating whether the system can automatically heal itself or whether manual intervention is needed to restore the system’s health.
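As a rough illustration (not Feedzai’s actual code), the states and recovery sub-states above could be modeled with something as small as:

```java
// Illustrative sketch of the states described above; names are hypothetical.
enum FeatureState {
    NO_IMPACT,  // the feature works as expected
    DEGRADED,   // the feature works, but the system is sub-optimal (e.g. redundancy lost)
    FAILURE     // the feature is partially or fully unavailable
}

// Sub-state for DEGRADED and FAILURE: how the system is expected to recover.
enum RecoveryMode {
    AUTO_HEAL,           // the system restores its own health without help
    MANUAL_INTERVENTION  // an operator must act to restore the system's health
}
```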

After defining features and failures, we build a matrix that is shared with our customers and is especially important for operations teams. An example of a failure model matrix can be found below.
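(The matrix below is a simplified, hypothetical illustration; the features and outcomes are made up for this post, not taken from an actual product.)

  • Single scoring node crash: transaction scoring is Degraded (auto-heals once the node restarts); the case management UI sees No Impact.
  • Primary database failover: transaction scoring is Degraded, and the case management UI is in Failure until the replica is promoted (both auto-heal).
  • Network partition between data centers: transaction scoring is Degraded; the case management UI is in Failure and requires manual intervention.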

The easy part is done (note that a failure model matrix can have anywhere from 10–20 cells to several hundred). The time-consuming part is to test, automate, and document it.

For testing, we have developed a simple abstraction over Docker and JUnit that allows engineers to model features and failures, and then assert the state of each feature after a failure. The developer just defines the expected states for the combinations, and the framework is responsible for setting up the infrastructure, injecting the failures, and asserting the resulting state.
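We have not published that framework, but to make the idea concrete, a test written against such an abstraction could look roughly like the sketch below. The Cluster API and the failure-injection calls are hypothetical; only the JUnit pieces are real.

```java
// Hypothetical sketch of a failure-model test. The Cluster/feature API is
// illustrative only; it is not the actual Feedzai framework.
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class ScoringFailureModelTest {

    @Test
    public void scoringSurvivesSingleNodeCrash() throws Exception {
        try (Cluster cluster = Cluster.fromDockerCompose("scoring-cluster.yml")) {
            cluster.start();

            // Inject the failure: kill one of the scoring nodes.
            cluster.node("scoring-2").kill();

            // Assert the state expected by the failure model matrix.
            assertEquals(FeatureState.DEGRADED,
                         cluster.feature("transaction-scoring").state());

            // The model says this failure auto-heals: wait and assert again.
            cluster.awaitAutoHeal("transaction-scoring", /* timeoutSeconds */ 120);
            assertEquals(FeatureState.NO_IMPACT,
                         cluster.feature("transaction-scoring").state());
        }
    }
}
```

Tests like this one exist for every combination in the matrix, which is why the framework, and not the developer, owns the infrastructure setup and the failure injection.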

After going through this process for the first time in each system, we consistently found problems that we were able to fix, making the system more resilient. The framework is also able to validate the system’s state for failures where we expect auto-healing to solve the problem.

Having a detailed failure model is essential when dealing with failures: not only are failures systematically tested, but operations teams also have troubleshooting guides written to cover every situation where the state is not “No Impact”.

A failure model is only as good as its correctness. Automating the failure model tests is critical to making the system more predictable. We have gone further and made this part of the definition of done for any feature development, and we review the model as part of post-mortem procedures.

Defining a failure model in this format is most helpful for the components whose development you control, but it is also useful for the other components, because it prepares teams for a more informed and reliable operation.

Understand the systems that you depend on, but don’t control

At the beginning of the article, we split failures into two categories: the ones caused by the software we build, and everything else. This section focuses on a subset of “everything else”: the cloud-based systems we depend on.

In Part 1 of this series, we did a simple exercise of calculating the theoretical availability of a system composed of multiple subsystems, most of which are completely out of the control of our operations team.
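As a quick refresher with illustrative numbers (not any provider’s actual SLA): if your critical path chains three dependencies offering 99.99%, 99.95%, and 99.9% availability, the best theoretical availability of the whole path is their product, 0.9999 × 0.9995 × 0.999 ≈ 0.9984, roughly 99.84%, which is already below each individual figure.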

These systems have an expected availability; however, since they are part of your critical path, you should prepare for their failures as well. More than prepare: you want to think through the potential failures and how they can affect your system.

One of the biggest challenges with planning and testing cloud-based services is that we have limited knowledge of both the physical and logical architecture of these systems.

For example, when using AWS, a very common question is whether the failure model should be designed to tolerate the failure of one Availability Zone or of a full Region (which is composed of a group of Availability Zones). When you look at incident history, you realize it is a cost/benefit tradeoff. This is a topic of recurring discussion with our customers, since more redundancy means more cost and more operational complexity.

My recommendation is to stay focused on the facts and on the information available, both on the web and from your cloud provider’s technical support team. For example, it is more common to have a single service disrupted across Availability Zones than to have all services of one Availability Zone failing. So, should you design to handle an AZ failure or a Region failure? There is no right answer. You will have to weigh the tradeoffs, the cost, the overhead, and the time it will take to handle each of them.

Another challenge with planning for failures is that it is hard to simulate them and know exactly how the system will behave, so that you can plan accordingly. For example: how do you test the failure of an RDS instance? Will all DynamoDB tables fail in the same way, or all at once?

There is no easy answer. The approach we take at Feedzai is to ask for the support of AWS technical teams and to come up with alternative solutions. For example, to test a complete AZ failure, we apply network ACLs that completely block the traffic between AZs, simulating a network partition. We made this procedure part of our Disaster Recovery tests.
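As a rough sketch of that last procedure (using the AWS SDK for Java v2, with placeholder ACL IDs and CIDR blocks rather than a real network layout), deny rules can be added to the network ACL of the subnets in one AZ so that traffic to and from the other AZs is dropped:

```java
// Illustrative sketch: add DENY rules to a subnet's network ACL so traffic
// to/from the other Availability Zones is dropped, simulating a partition.
// The ACL ID, rule numbers and CIDR blocks are placeholders.
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.CreateNetworkAclEntryRequest;
import software.amazon.awssdk.services.ec2.model.RuleAction;

public class SimulateAzPartition {

    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            String naclOfIsolatedAz = "acl-0123456789abcdef0";       // placeholder
            String[] otherAzCidrs = {"10.0.1.0/24", "10.0.2.0/24"};  // placeholders

            int ruleNumber = 10;
            for (String cidr : otherAzCidrs) {
                // Deny inbound traffic from the other AZs.
                ec2.createNetworkAclEntry(CreateNetworkAclEntryRequest.builder()
                        .networkAclId(naclOfIsolatedAz)
                        .ruleNumber(ruleNumber)
                        .protocol("-1")              // all protocols
                        .ruleAction(RuleAction.DENY)
                        .egress(false)
                        .cidrBlock(cidr)
                        .build());

                // Deny outbound traffic to the other AZs.
                ec2.createNetworkAclEntry(CreateNetworkAclEntryRequest.builder()
                        .networkAclId(naclOfIsolatedAz)
                        .ruleNumber(ruleNumber)
                        .protocol("-1")
                        .ruleAction(RuleAction.DENY)
                        .egress(true)
                        .cidrBlock(cidr)
                        .build());

                ruleNumber += 10;
            }
        }
    }
}
```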

These are questions that are very hard to test empirically, but if they are not asked, failures will happen, teams won’t be ready, and customers will be impacted. If you are designing a system for high availability, keep the non-obvious failures in mind and be obsessive about testing them.

Conclusion

In a world where distributed, highly available systems have become common and easy to build, it is important to return to the basics and ask the right questions in order to cover the fundamentals.

Thinking about, designing, testing, and documenting a failure model is key to understanding what to expect from the system and to preparing teams to deal with failures. Leveraging cloud-based platforms and services helps your system’s availability, but brings extra challenges when testing failure scenarios.

In the third part of this series, I will cover some of the operational challenges from deployments to monitoring and their impact on uptime.

Diogo Guerra
VP of Engineering @ Feedzai — Passionate about distributed systems and high performance