Availability SLAs, the truth behind keeping services always available - Part I

Diogo Guerra
Feedzai Techblog
Feb 7, 2019 · 7 min read

At Feedzai we provide services to the largest banks, payment processors and online retailers. These customers depend on the availability of Feedzai APIs to check for fraudulent activity and stop it. This is the first of a series of articles that cover the subject of service uptime and how SLAs are often defined without the proper thought process.

In today’s world, every customer expects services without disruption, especially customers that are transitioning from operating their own on-premises data centers to consuming cloud-based solutions. The cloud brings the promise of lower costs, higher flexibility, and even higher availability. However, that promise applies only to the infrastructure, not to the applications built on top of it.

A service without disruption would mean an uptime of 100%, but customers know that no service provider will accept that as an SLA, so they expect a contract with a commitment between 99.9% and 99.99% (if you are lucky; otherwise there will be customers asking for 99.999%).

You might think “yep, I can do it, I’ve designed my application with high availability over multiple cloud providers, there’s no single point of failure.” Then the problems start to appear: a developer makes a mistake, your provider fails in a way you were not expecting, and before you realize it, the SLA for that month has been breached.

Before going to the nitty-gritty details of how challenging it is to ensure such SLAs, let’s look at some outages in cloud services during 2018 and how they affect the availability percentage of those services:

  • May - AWS — On May 31st, affected customers were down for about 2 hours and 23 minutes due to a power outage affecting physical servers and networking devices. AWS’s core EC2 service, as well as RDS, Workspaces and Redshift, were all impacted.
  • June - Google Cloud — Google Compute Engine VMs started to be allocated with duplicate internal IP addresses, leading to a major service disruption that lasted 21 hours and 45 minutes.
  • November - Microsoft Azure — Storage disruptions in the West US region kept many Microsoft cloud customers cut off from their data for more than ten hours.

To analyze the impact of these outages on the availability percentage of these services, let’s look at the maximum downtime per year and per month that is acceptable to achieve a certain availability.

Downtime per availability percentage:

  Availability    Downtime per year    Downtime per month
  99%             3.65 days            7.31 hours
  99.9%           8.77 hours           43.83 minutes
  99.95%          4.38 hours           21.92 minutes
  99.99%          52.60 minutes        4.38 minutes
  99.999%         5.26 minutes         26.30 seconds

In general, all of these services measure their SLA on a monthly basis, meaning that the SLA gets reset every month (which is also why we should only look at the downtime-per-month column). One of the main reasons is that once a problem occurs, there is a high chance the SLA will be breached and a significant amount of time will be lost. With a monthly SLA, the credit or refund that the provider needs to return is proportional to the cost of that month instead of the cost of the whole year.

Let’s evaluate then the uptime of those services in the month of the outage:

  • May - AWS - 2.38 hours of downtime, 99.68% uptime
  • June - Google Cloud - 21.75 hours of downtime, 97% uptime
  • November - Microsoft Azure - 10 hours of downtime, 98.6% uptime

These are services provided by three of the largest cloud companies, with the best teams and years of operational experience. Note that these examples are the exception: in general, these services achieve their SLAs, otherwise they would be out of business, but the listed outages demonstrate how challenging it is to achieve such stability.

So, even without looking at more details, it should be safe to say that achieving high uptimes might not be as easy as it seems.

In this post, I will cover two generic topics about system availability and how uptime numbers are underestimated because engineers don’t look at the limitations of the services they depend on from a theoretical or practical standpoint. Instead, commitments are made on gut feeling or because competitors make them too.

SLA numbers can be deceiving at first glance

When designing the architecture of your system and choosing cloud-based services, it is important to understand whether they work in parallel or in series. A simple way to check this is to ask whether the whole system goes down if one of the services fails. If that’s the case, then those services are arranged in series: they are in the critical path of your service.

The maximum theoretical availability of the combined system is the product of the availability percentages of all the services in your critical path.

Let’s calculate the maximum availability of a system that uses a load balancer, a virtual machine for your application and a relational database. For reference, we use the expected availability of the corresponding AWS services:

Total Availability (≈99.83%) = Availability ELB (99.99%) * Availability EC2 (99.99%) * Availability Application (99.9%) * Availability RDS (99.95%)

Assuming that your application has a theoretical availability of 99.9% and every AWS service offers an availability above that, one might think that the availability of the combined system would also be at least 99.9%. That’s wrong: the theoretical availability of the system is roughly 99.83%, and the acceptable downtime per year suddenly grows from 8.77 hours to almost 15 hours. That worst case assumes the outages of the different services never overlap, which is not very likely in practice. At the same time, we can’t ignore the fact that all these services run on the same infrastructure and can fail together (as shown by the AWS incident in May).
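As a quick sanity check, here is a minimal sketch (in Python, using the illustrative SLA numbers from the example above) that multiplies the availabilities of the components in the critical path and converts the result into downtime budgets:

```python
# Availability of each component in the critical path (serial composition).
# The figures below are the illustrative SLA numbers used in the example above.
CRITICAL_PATH = {
    "ELB": 0.9999,
    "EC2": 0.9999,
    "Application": 0.999,
    "RDS": 0.9995,
}

HOURS_PER_YEAR = 365.25 * 24                   # ~8766 hours
MINUTES_PER_MONTH = HOURS_PER_YEAR / 12 * 60   # ~43,830 minutes


def serial_availability(availabilities):
    """Maximum theoretical availability of components chained in series."""
    total = 1.0
    for a in availabilities:
        total *= a
    return total


if __name__ == "__main__":
    total = serial_availability(CRITICAL_PATH.values())
    print(f"Combined availability: {total * 100:.2f}%")                      # ~99.83%
    print(f"Downtime per year:  {(1 - total) * HOURS_PER_YEAR:.1f} hours")   # ~14.9 hours
    print(f"Downtime per month: {(1 - total) * MINUTES_PER_MONTH:.1f} minutes")
```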

Be careful when choosing how many components sit in your critical path, and think about how your service can degrade gracefully instead of going down entirely.

The importance of self-healing systems in availability

In the previous section, we covered a theoretical aspect of the availability of a system. Now we will cover the raw nature of operating and supporting services 24/7 and how that can affect uptime.

Imagine that you are designing a service with a target of 99.95% monthly uptime. That means that your system can be down at most 21.92 minutes per month.

It is not rocket science to design an architecture that covers the most common failures such as hardware faults, system crashes, lack of disk space, and so on. In general: run multiple instances in parallel and replicate your components. The critical piece is to make sure your health checks are accurate; today’s cloud providers will then handle those failures for you.
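As a rough illustration of why replication helps, a group of parallel replicas is only unavailable when all of them are down at the same time. The sketch below shows the combined availability for an increasing number of replicas, assuming failures are fully independent (an assumption that correlated outages, like the ones above, routinely violate):

```python
def parallel_availability(single_instance_availability: float, replicas: int) -> float:
    """Availability of N replicas in parallel, assuming independent failures.

    The group is unavailable only when every replica is down at the same time.
    """
    unavailability = 1.0 - single_instance_availability
    return 1.0 - unavailability ** replicas


if __name__ == "__main__":
    # A single instance with 99% availability (illustrative number).
    for n in (1, 2, 3):
        print(f"{n} replica(s): {parallel_availability(0.99, n) * 100:.4f}%")
    # 1 replica(s): 99.0000%
    # 2 replica(s): 99.9900%
    # 3 replica(s): 99.9999%
```

Treat these numbers as an upper bound: in practice, failures of replicas that share infrastructure or run the same code are rarely independent.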

However, one of the most common situations is that the failure is at the application level and affects all your instances at once (e.g. a memory leak or a bug in the software). Suddenly your highly available setup is out of service and it’s 4 AM.

  • 04:00 — The clock starts ticking: a 22-minute countdown before the target is missed.
  • 04:03 — The alerting system fires after 3 minutes because, to reduce false positives, an alert is only sent after 2 consecutive failed checks.
  • 04:08 — The 24/7 operations team logs in and tries to understand what’s happening. After a few minutes looking at monitoring dashboards, they conclude this is not an infrastructure issue and escalate to the on-call application team.
  • 04:10 — The on-call engineer from the application team wakes up and logs in to understand what might be happening.
  • 04:20 — The on-call engineer looks at the logs and concludes that the system hit a limit and all the nodes need to be reconfigured and restarted.
  • 04:23 — All the nodes are reconfigured and a restart is triggered. The system needs 5 minutes to boot.
  • 04:28 — The system is back up.

In this hypothetical scenario, the system blew through its downtime budget by 6 minutes with just one unexpected incident, and that assumes a very aggressive timeline for people to understand the issue and escalate it.
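To make the arithmetic explicit, here is a tiny sketch that checks the incident against the monthly error budget (the 99.95% target and the 28-minute outage come from the scenario above):

```python
# Monthly error budget for a 99.95% uptime target, using an average month length.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12   # ~43,830 minutes
TARGET_UPTIME = 0.9995

budget_minutes = (1 - TARGET_UPTIME) * MINUTES_PER_MONTH   # ~21.92 minutes

# The outage in the timeline above: 04:00 to 04:28.
outage_minutes = 28

print(f"Monthly budget: {budget_minutes:.2f} minutes")
print(f"Outage:         {outage_minutes} minutes")
print(f"Overshoot:      {outage_minutes - budget_minutes:.1f} minutes")   # ~6.1 minutes
```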

This type of unexpected failure happens more frequently than one might suspect when you operate a fleet of almost 1000 servers. Smarter mechanisms that constantly evaluate the health of the system and can proactively fix issues are key to keeping uptimes high.

At Feedzai, our core platform is a complex stateful application server with considerable warm-up times. We introduced the concept of a watchdog that is programmed to systematically check for patterns of degraded health. It can also act on each one of those patterns with a proactive fix, auto-healing the system. This component has saved us from breaching SLAs many times: by the time the on-call team gets the alert, the system is already handling the error automatically.
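This post doesn’t describe the watchdog’s actual implementation, but the general pattern looks roughly like the sketch below: a loop that runs a set of health probes and, when a known degradation pattern is detected, applies the corresponding remediation. The probe and remediation functions here are hypothetical placeholders:

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("watchdog")


@dataclass
class HealthPattern:
    """A known pattern of degraded health and the action that remediates it."""
    name: str
    detect: Callable[[], bool]    # returns True when the degradation is present
    remediate: Callable[[], None]


def heap_usage_too_high() -> bool:
    # Hypothetical probe: e.g. read a metric from the application's monitoring endpoint.
    return False


def restart_application_node() -> None:
    # Hypothetical fix: e.g. trigger a rolling restart of the affected node.
    log.info("Triggering rolling restart of the affected node")


PATTERNS = [
    HealthPattern("high-heap-usage", heap_usage_too_high, restart_application_node),
]


def watchdog_loop(check_interval_seconds: int = 30) -> None:
    """Continuously evaluate health patterns and apply proactive fixes."""
    while True:
        for pattern in PATTERNS:
            try:
                if pattern.detect():
                    log.warning("Detected degraded health: %s", pattern.name)
                    pattern.remediate()
            except Exception:
                # A failing probe must never kill the watchdog itself.
                log.exception("Watchdog check %s failed", pattern.name)
        time.sleep(check_interval_seconds)


if __name__ == "__main__":
    watchdog_loop()
```

In a real deployment the remediation would also notify the on-call team, so humans stay in the loop even when the fix is automatic.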

Recovering the system quickly is as important as not allowing it to fail in the first place, because for the customer, as long as the service is not responding, the reason is not relevant.

Conclusion

Building systems that target monthly uptimes above 99% is not an easy task. It takes more than developing the right software or integrating a few cloud services and hoping for the best.

Even the largest cloud companies in the world have outages every year that affect their uptime commitments and cost them millions of dollars.

Analyzing your system and calculating its theoretical availability will help you understand which commitments you can make for your business. However, that availability can be highly affected by your ability to react to failures and recover the system.

Stay tuned for Part II of this series where we will discuss how Feedzai plans and tests the service in order to achieve industry-leading uptimes.

Thanks to all the engineering team at Feedzai. This post contains learnings that were collectively gained over the years by many people at Feedzai.

Update: Part II is now available.
