Health and availability in computer systems

Are you alive? Sure you are, but are you operative?

Operating a system is a complex discipline. Claiming that our system is doing well means nothing if users can perceive an outage. Thus, establishing that a system is working properly requires reconciling two points of view: what you can say about your system vs what your users can say about your system.

If you visit a doctor for a medical check-up, the doctor will run a set of tests: electrocardiogram, blood pressure check, cholesterol level check, etc. After those tests, the doctor will determine whether you are healthy or not based on the results.

Health in computer systems is not so different. A probe is sent to determine whether a server is able to process an input or not (e.g. to process an HTTP request or consume a message from a queue). Such a probe considers two dimensions that describe the basic functioning of the service and its current ability to accept work: liveness and readiness.

Liveness: Indicates whether the service is running. If the liveness probe fails, there is a good chance the service has died or needs to be restarted. Liveness checks are similar to vital signs, as they indicate the status of the body’s vital (life-sustaining) functions. They answer questions like “is there a heartbeat?” or “is the patient breathing?”.

Readiness: Indicates whether the service is ready to receive input. If the readiness probe fails, then whatever controls the traffic to that service (e.g. a load balancer) should stop sending traffic to it to avoid failures. Readiness checks are like reflexes: they verify the integrity of the nervous system. A readiness test is supposed to answer questions like “are the pupils reactive?” or “is the plantar reflex present?”.
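To make the two dimensions concrete, here is a minimal sketch of a service exposing them as separate HTTP probes, using only the Python standard library. The paths `/live` and `/ready`, the `HealthState` class and the `probe` helper are illustrative assumptions, not a convention the text prescribes; liveness is deliberately simplified to "the process can still answer".

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthState:
    """Tracks readiness for one service instance (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = False          # not ready until start-up work is done
        self._shutting_down = False

    def mark_ready(self):
        with self._lock:
            self._ready = True

    def begin_shutdown(self):
        with self._lock:
            self._shutting_down = True

    def is_ready(self):
        with self._lock:
            return self._ready and not self._shutting_down


def probe(state, path):
    """Map a probe path to an (HTTP status, body) pair."""
    if path == "/live":
        # Simplification: if this code runs at all, the process is alive.
        return (200, "alive")
    if path == "/ready":
        return (200, "ready") if state.is_ready() else (503, "not ready")
    return (404, "unknown probe")


STATE = HealthState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe(STATE, self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start answering probes; call this once start-up work has finished."""
    STATE.mark_ready()
    HTTPServer(("127.0.0.1", port), ProbeHandler).serve_forever()
```

Keeping the two probes separate lets the caller react differently: a failing `/live` suggests a restart, while a failing `/ready` only suggests routing traffic elsewhere for now.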

The upstream-to-downstream health checking can happen from point to point (let’s say by enabling a circuit breaker) or via some control system (if a load balancer needs to determine whether to forward requests to an instance or not). A service may choose to reply “unhealthy” because it is not ready to take requests, a third-party dependency is down, it has reached the maximum number of database connections, it is shutting down, or for some other reason. The client can act accordingly if the response is not received within some time window or the response indicates an unhealthy service.
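The client side of that exchange can be sketched as follows: a probe that treats both a missed time window and a non-2xx answer as "unhealthy". The function names, the one-second default and the idea of filtering a list of instance URLs are assumptions made for illustration; a real load balancer or circuit breaker would add retries, thresholds and back-off.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout_s):
    """Return the HTTP status of a GET to url; raises on timeout or connection failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth inspecting.
        return err.code


def is_healthy(url, timeout_s=1.0, probe=http_probe):
    """True only if the probe answers with a 2xx status inside the time window."""
    try:
        return 200 <= probe(url, timeout_s) < 300
    except OSError:
        # URLError, timeouts and refused connections all derive from OSError.
        return False


def routable(instances, timeout_s=1.0, probe=http_probe):
    """Keep only the instances that should still receive traffic."""
    return [url for url in instances if is_healthy(url, timeout_s, probe)]
```

The `probe` parameter exists so the decision logic can be exercised without a network, for example by passing a stub that raises `TimeoutError` for a slow instance.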

As you can tell, this is a very Nagios-ish monitoring style: a service is either up or down and, although we have two dimensions, from the client’s standpoint it is a binary answer: the service works or it does not.

The missing piece of the puzzle

Software engineers are pretty good at borrowing concepts from other disciplines. If we look at the WHO’s definition of “health”, we will find some similarities to the one we have in software:

Health is a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity. […]

However, our definition of health lacks one important piece: “social well-being”. Indeed, we are not considering how our services interact with clients or upstream services as part of our definition of health; we only look for the “absence of disease or infirmity” from our point of view. Along those lines, we need to add another idea to our definition of health, one that represents our system as a social being: availability.

Availability is the state of the system described from the user perspective, coupled to specific use cases: error responses in an endpoint, timeouts or latencies in a workflow, etc.

Google’s Site Reliability Engineering book introduced the crucial concepts of service levels and ways to quantify the availability of a service: service level indicators (SLIs), objectives (SLOs), and agreements (SLAs):

[…] These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service […]

On top of that, JBD describes availability as:

Availability level is described by a set of SLI values and it comes into play when SLIs go beyond SLOs (e.g. when an acceptable error rate, the SLO, is 7%, but the actual error rate, the SLI, is 10%). Most services consider request latency, error rate or system throughput as key SLIs.

Continuing the analogy, for availability we can define SLIs such as:

  • Can you stand on one leg?
  • How many steps can you take before you go breathless?
  • How many words can you remember and repeat afterwards?

If you cannot take two steps due to breathlessness, then there is clearly something wrong. The reason may or may not be correlated with your vital signs or reflexes, but from the patient’s perspective something is definitely up. You then need to measure your availability levels to determine whether they are within the acceptable range or not.

It is important to choose appropriate metrics to drive the right action if something goes wrong. SLIs should arise from meaningful user experience, and detrimental changes in user experience should directly influence one or more SLIs at a similar pace.
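As a worked illustration of an SLI being compared against its SLO, the sketch below computes an error-rate SLI from a batch of requests and checks it against the 7% objective used in the quote above. The `Request` record, the choice to count only 5xx statuses as errors and the function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    endpoint: str
    status: int


def error_rate(requests):
    """Error-rate SLI: share of requests answered with a 5xx status."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def breaches_slo(sli, slo):
    """For an error rate, the objective is an upper bound."""
    return sli > slo


# 100 requests, 10 of them failing: a 10% SLI against a 7% SLO.
window = [Request("/checkout", 500)] * 10 + [Request("/checkout", 200)] * 90
sli = error_rate(window)
print(f"error rate {sli:.0%}, SLO breached: {breaches_slo(sli, 0.07)}")
```

Note that every instance behind `/checkout` could be passing its liveness and readiness probes while this check fails, which is exactly the gap between health and availability.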

Conclusion

In summary, availability is as important as health because it reconciles engineering’s quantifiable model of system health with the user’s perception of it. Although the two focus on different aspects of the system, each is meaningless without the other.

Health helps you focus on static checks of the system itself: Is my service alive? Can it accept requests? This sort of reasoning can be automated with health endpoints.

Availability is use-case centric:

  • What is the error rate for requests in this endpoint?
  • What is the average processing time for this message type, for this consumer?

It helps you understand system trends: an increase in latency for certain endpoints, or an increase in error rates at an operational level.
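The two questions above can be answered per use case with a small aggregation. This is only a sketch under assumed inputs: each sample is a hypothetical `(endpoint, status_code, latency_ms)` tuple, and a status of 500 or above counts as an error.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Group (endpoint, status_code, latency_ms) samples and compute per-endpoint SLIs."""
    by_endpoint = defaultdict(list)
    for endpoint, status, latency_ms in samples:
        by_endpoint[endpoint].append((status, latency_ms))

    summary = {}
    for endpoint, rows in by_endpoint.items():
        errors = sum(1 for status, _ in rows if status >= 500)
        summary[endpoint] = {
            "error_rate": errors / len(rows),
            "avg_latency_ms": mean(latency for _, latency in rows),
        }
    return summary
```

Running the same summary over consecutive time windows and comparing the results is one simple way to spot the trends mentioned above.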