When designing a new service for production, the architecture can get complicated pretty quickly, especially when striving to build highly available services. Balancing the availability and reliability of a service is a challenge: extra reliability consumes ever larger amounts of engineering and cloud resources in pursuit of the mythical “100% uptime”.
100% uptime is a near-impossible target. Even the internet does not have 100% uptime. There are many obstacles between the client and the server that are completely beyond the control of the engineering team, such as client ISPs, core network components and cloud providers. While we can mitigate some of these issues, it comes at greater cost. For example, to mitigate the failure of a single cloud provider you would need to run your application across multiple cloud providers, with a method of seamless failover between them, and test that failover periodically as part of your disaster recovery testing.
In this post we are going to talk about how Kudos manages service reliability without having to compromise agility or increase operational overhead.
Service Level Objectives
In Google’s Site Reliability Engineering book they describe reliability targets as Service Level Objectives (SLO) which are measured by one or more Service Level Indicators (SLI). At Kudos, we use the same strategies for our reliability.
Google defines Service Level Objectives as:
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
Please note, these are different from Service Level Agreements (SLAs), which some people will be familiar with.
SLOs are typically internal objectives that help guide the design, operation and management of a service. An SLA is a legal agreement between a provider and a customer that outlines the repercussions of the provider failing to meet those objectives, typically in the form of financial compensation or service credits. However, both SLOs and SLAs tend to measure the same reliability characteristics of a service.
So if we use our fictitious takeaway shop as our example, we can discuss which of these reliability characteristics we care about for the ordering service.
To keep it simple we will use availability (AKA uptime) as our example. We want our customers to be able to order takeaway with as little hindrance as possible, so we want the service to be highly available.
We typically measure the availability of a service as the percentage of time the service is available to the customer over a period of time. This means we can define our Availability SLO for the service as: 99.9% of requests are successful over 1 month.
99.9% availability is about 43 minutes of downtime in a 1 month period. This is referred to as three nines of reliability. There is a nice table with the breakdown of each of the nines in Wikipedia’s High Availability article, and online calculators such as uptime.is do the same sums for you.
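The arithmetic behind the nines table is simple enough to sketch (a minimal example, assuming a 30-day month):

```python
# Downtime allowed by an availability SLO over a given window.
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted by the SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

for nines in (99.0, 99.9, 99.99):
    print(f"{nines}% -> {allowed_downtime_minutes(nines):.1f} minutes/month")
```

99.9% over a 30-day month works out to 43.2 minutes, the figure quoted above.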
When defining an SLO it is good to keep in mind the Service Level Agreements (SLAs) of dependencies such as the cloud providers you use. For example, the SLAs of AWS and Google Cloud are public.
Service Level Indicators
Once we know what our Availability SLO is, we need to define some Service Level Indicators (SLIs) in order to measure that availability.
The Google book describes SLIs as:
An SLI is a service level indicator — a carefully defined quantitative measure of some aspect of the level of service that is provided.
So an SLI is just a measurement we can use to validate that we are meeting our SLO.
Choosing an SLI normally involves a bit of a discussion about where you should measure your objective from, as there are many possible collection points.
For example, we could choose to measure our SLI from the web server logs. However, if we do that we will miss requests that never reach the application, like failures at the load balancer or uncaught exceptions in the application. Still, it would be very fast to get up and running with the server-side logs, as you already have them available to you.
On the other hand, we could choose to measure the SLI using client-side instrumentation such as analytics tools or frontend libraries, but then we would also be measuring the customer’s internet connection and the reliability of their phone/laptop/tablet. It also means writing and maintaining an additional library within your codebase.
There are pros and cons to each of these collection points, and they need to be discussed with the business on a regular basis, both while designing the application and while running it in production.
For our examples, we shall use synthetic client probes that check every minute that our ordering system responds with an HTTP 200 response.
We can now write our SLO as: 99.9% of remote probes sent every 1 minute return a good response over a 1 month period, then set up our remote probe to poll our services every minute and alert us if it is failing.
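A minimal version of such a probe might look like the sketch below. The URL is a placeholder, not a real Kudos endpoint:

```python
import time
import urllib.request

PROBE_URL = "https://ordering.example.com/health"  # hypothetical endpoint

def probe(url: str, timeout: float = 10.0) -> bool:
    """One synthetic probe: True only if the service answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection errors, timeouts and non-2xx responses all count as failures.
        return False

def run_probe_loop(url: str, interval_s: float = 60.0, iterations=None) -> None:
    """Poll once a minute, emitting one result per probe for the alerting pipeline."""
    n = 0
    while iterations is None or n < iterations:
        print(f"{int(time.time())} probe_success={probe(url)}")
        time.sleep(interval_s)
        n += 1
```

In practice a hosted checker such as Pingdom runs this loop for you from multiple geographic locations, which is far more representative than probing from a single machine.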
Once we have the monitoring in place, we should write up the SLO and SLI into a document and have that document reviewed and agreed with the rest of the business that the SLOs and SLI are acceptable for the reliability objective for that service.
Now we have established the Availability SLO for our takeaway ordering service, we can start looking at Error Budgets.
Another way of looking at our SLO is as the acceptable amount of availability for our ordering service; the Error Budget is then the remainder of the 100%.
In the example of our Availability SLO for the ordering service, we would have an Error Budget of 0.1%: the 43 minutes of downtime referred to when we were choosing our SLO.
Now we have the Error Budget for the ordering service we can decide what we want to spend that budget on. We can use that budget to provide benefit to our customers by adding new features quickly, A/B testing those features, upgrading infrastructure or fixing the service when the probes start to fail.
The Error Budget is an excellent indicator for the reliability of the service and can be used as a trigger to help reduce the operational overhead of the service.
If we start to consume too much of our budget, we can freeze feature development of that service to focus on reliability work like adding auto-scalers, optimising database queries or building resilience features into the product like rate limiting and circuit breaking. This helps keep the service as reliable as we have all agreed while still pushing forward with new features.
So if an incident consumed 80% of our error budget within the one-month window we set for our SLO, we can focus heavily on the preventative actions that come out of our blameless postmortem.
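Continuing the 99.9%/30-day example, tracking budget consumption is just a ratio. A sketch (the 35-minute outage is an invented figure for illustration):

```python
# Error budget accounting for a 99.9% availability SLO over a 30-day window.
SLO_PERCENT = 99.9
WINDOW_MINUTES = 30 * 24 * 60                              # 43,200 minutes
BUDGET_MINUTES = WINDOW_MINUTES * (1 - SLO_PERCENT / 100)  # ~43.2 minutes

def budget_consumed(downtime_minutes: float) -> float:
    """Percentage of the monthly error budget consumed by this much downtime."""
    return 100 * downtime_minutes / BUDGET_MINUTES

# A single 35-minute outage consumes roughly 81% of the month's budget,
# which would trigger the feature freeze described above.
print(f"{budget_consumed(35):.0f}% of error budget consumed")
```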
Our Error Budget should also be recorded and monitored to allow the engineering teams to keep track of how much of the error budget is being consumed.
Kudos’s SLO journey
After starting at Kudos as a Site Reliability Engineer (SRE), I was keen to establish SLOs and Error Budgets for all of the micro-services that comprise the Kudos platform. As an SRE I find that SLOs are a fundamental tool for managing services and an excellent implementation of the DevOps philosophy, so they were my first port of call. I also wanted to automate their creation as much as I could, to help adoption of the concept and tooling.
We started by auditing all the services in production: what they are used for and, most importantly, how resilient and reliable they need to be. We also created a pre-launch checklist to ensure SLOs, alerting and disaster recovery are considered at the inception of a new service. This was then added as a prerequisite before setting any service live.
We decided to create all of our SLOs against our two week sprint cycle which gave us a fixed time-box for the Error Budget.
We started with request-driven services such as HTTP servers and API endpoints. We decided that our SLI should be remote probes sent every minute, measuring both the availability and latency of our services.
We chose Pingdom for monitoring our externally accessible services due to its geographic checking, mobile app and API. We wrote a small service that periodically calls the Pingdom API and pushes all of the probe results into our Elasticsearch cluster for reporting and alerting.
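The core of that shipper can be sketched roughly as below. The endpoint path and response fields follow the public Pingdom 3.1 API, but treat them (and the document layout) as assumptions to verify against the current documentation:

```python
import json
import urllib.request

PINGDOM_API = "https://api.pingdom.com/api/3.1"

def fetch_checks(api_token: str) -> list:
    """Fetch the current state of every Pingdom check on the account."""
    req = urllib.request.Request(
        f"{PINGDOM_API}/checks",
        headers={"Authorization": f"Bearer {api_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["checks"]

def to_es_doc(check: dict) -> dict:
    """Flatten one Pingdom check into a document for the SLO reporting index."""
    return {
        "check_id": check["id"],
        "name": check["name"],
        "status": check["status"],               # e.g. "up" or "down"
        "last_response_ms": check.get("lastresponsetime"),
        "probe_success": check["status"] == "up",
    }
```

Each document is then indexed into Elasticsearch (for example via the official client’s `index` call), where the Kibana dashboards and Watches described below consume it.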
We then automated the creation of the Pingdom check, Kibana visualisation, Kibana dashboard and the Elasticsearch Watch for alerts.
We store the SLO in Google Datastore and have a Kubernetes CronJob that runs once a day. It fetches the sprint start and end dates from Pivotal Tracker and creates or updates any dashboards and alerts in Kibana. This keeps the time ranges and alerts in Elasticsearch and Kibana up to date at the beginning of each sprint without a person needing to go through and change the dashboards.
We then use Elasticsearch Heartbeat to monitor our internal-only APIs and services. The SLOs for the internal services are also stored in Google Datastore and updated by the same Kubernetes CronJob to keep the dashboards consistent.
For each service’s SLO, we evaluated performance based on a few different sources of information: previous time-series data, conversations with the product team about what customers expect, and the dependencies of the service.
We now have SLOs and Error Budgets for all of our request-based services, with dashboards and alerting in place to inform us if we consume too much of our Error Budget. Our next step is to look at asynchronous processes such as batch jobs and processing pipelines and develop SLOs for those services as well.
Service Level Objectives and Error Budgets are extremely important to us at Kudos and we will continue to review them periodically to ensure that we are meeting expectations and reliability requirements.
UPDATE: I recently gave a talk on this topic at a DevOps meet up. https://www.youtube.com/watch?v=KmVDkBmnb4U
If all this sounds interesting to you, why not consider joining Kudos? We also have a primer on what you can expect your first day to be like.