System health checks: A lesson from our cars

Jon Menzies-Smith
Salesforce Engineering
6 min read · Oct 31, 2016
Image: vecteezy.com

Have you ever got into your car, turned the key and noticed a strange light turn red on the dashboard? These little lights are something most drivers take for granted. It's easy to forget how they reassure us that the vehicle is safe to drive when they are off, and how they act as valuable indicators to help us diagnose issues when they light up red. Properly observed, these forewarnings of potential failures can save us money by catching problems before they damage our cars; they may even be lifesaving.

There are some lessons we can learn from our car dashboards that should be applied to the systems we own.

In my previous post, I talked about the start of our team's transition from a monolith to a microservice architecture and highlighted the need for supporting infrastructure around each service. One of the most valuable parts of this infrastructure has been the health checks we built into each service. Not only do they give us instant reassurance that everything is working fine; they also help us gain a better understanding of what goes wrong. System health checks can be our very own car dashboard warning lights.

Typically, services are set up to expose status endpoints, which a load balancer or service discovery system uses to determine whether a particular instance of a service is available and whether to direct traffic to it. Often these systems require nothing more than a 200 OK response or an OK message in the response body. While this can serve as a useful way of indicating whether a service is available, there is an opportunity here to do much more.

OK/Warn/Fail

While building out our first few services, we noticed that sometimes performance would degrade, but the service would still function as expected. While symptoms such as performance degradation do not necessarily require immediate attention, it's valuable to know when a system is under stress or some non-critical area of functionality is not behaving as expected.

Rather than just relying on a simple 200 OK, we devised a response message that provides deeper insight into the behaviour of our systems. At the heart of this response message is a Status value, which can be OK (to indicate all is fine), Warn (to indicate that the service may be close to failure) or Fail (to indicate the service is broken).

Along with Status, our response includes some other information:

  • Name — the name of the service
  • Machine — the name of the machine on which the service is running
  • DateUTC — the UTC date on that machine
  • Version — the version number of the deployed service

This basic information adds a little more context to the response and allows us to confirm that the correct version of the service is running.
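As a rough illustration, a payload like the one described above could be assembled in Python as follows. The service name and version are placeholders, and the field names follow the response described in this post rather than any particular framework:

```python
import datetime
import json
import platform

def build_status_response(status, tests=None):
    """Assemble a health-check payload with basic context about the instance."""
    return {
        "name": "My Service",          # hypothetical service name
        "machine": platform.node(),    # the machine this instance runs on
        "dateUTC": datetime.datetime.utcnow().isoformat() + "Z",
        "version": "2.0.14",           # deployed version number (illustrative)
        "status": status,
        "tests": tests or [],
    }

payload = build_status_response("OK")
print(json.dumps(payload, indent=2))
```

Because the payload includes the machine name and version, a glance at the response confirms both which instance answered and whether the expected build is deployed there.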

Tests

An OK/Fail message on its own is of limited use, though. It's a bit like turning the key in our car: either the engine starts or it doesn't. The status is either on or off. But our car is much more helpful than that. If something is wrong, it tells us what. We can do the same in our services.

In order to determine whether the service is OK or not, we defined a set of tests relevant to the behaviour of each service. Each individual test returns its own OK/Warn/Fail status, and the most severe status across all of the tests determines the overall service status.
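The "most severe wins" aggregation can be sketched in a few lines of Python; the severity ordering is an assumption, since the post doesn't spell one out:

```python
# Severity ordering for the three statuses: the overall service status is
# the most severe status reported by any individual test.
SEVERITY = {"OK": 0, "Warn": 1, "Fail": 2}

def overall_status(test_statuses):
    """Return the most severe status, or OK when there are no tests."""
    if not test_statuses:
        return "OK"
    return max(test_statuses, key=lambda s: SEVERITY[s])

print(overall_status(["OK", "Warn", "OK"]))   # Warn
print(overall_status(["OK", "Fail", "Warn"])) # Fail
```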

The set of tests that contribute to the overall service status are defined separately for each service and may cover things like database connectivity, dependent API statuses or message queue availability. In some cases our tests will even execute code and assert that the results are correct. Should any of these tests take a long time to execute, we typically return a Warn status.
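One way to implement a single test runner along these lines is to time each check and downgrade a slow pass to Warn. This is a sketch under assumptions: the threshold value and function names are hypothetical, not taken from the post.

```python
import time

SLOW_THRESHOLD_MS = 500  # assumed threshold; the post doesn't specify one

def run_check(name, category, fn):
    """Run one health-check test, timing it and downgrading slow passes to Warn."""
    start = time.monotonic()
    try:
        fn()
        status, message = "OK", None
    except Exception as exc:
        status, message = "Fail", str(exc)
    duration_ms = int((time.monotonic() - start) * 1000)
    if status == "OK" and duration_ms > SLOW_THRESHOLD_MS:
        status = "Warn"  # the check passed, but took suspiciously long
    return {"name": name, "category": category, "status": status,
            "message": message, "duration": duration_ms}

result = run_check("DB Connection", "Database", lambda: None)
```

Catching the exception inside the runner matters: a failing check should produce a Fail entry in the status response, not bring down the status endpoint itself.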

By using these inline tests that run directly inside our deployed services we can go much further than simply checking whether a service is available. Knowing that our service can test itself gives us much more confidence in the system being reliable.

But these tests don't just help us understand whether our service is working or not; just like the car dashboard, we can use them to instantly diagnose issues. All we had to do was return each test result in our status message.

{
  "name": "My Service",
  "machine": "ENV-01",
  "status": "Fail",
  "dateUTC": "2016-10-03T16:18:05.2637424Z",
  "version": "2.0.14",
  "tests": [
    {
      "name": "DB Connection",
      "category": "Database",
      "status": "OK",
      "message": null,
      "duration": 1
    },
    {
      "name": "Some Function test",
      "category": "Functionality",
      "status": "Fail",
      "message": "An error occurred when executing ...",
      "duration": 7
    }
  ]
}

I’ve already lost count of how many times these tests have helped us quickly diagnose which service or component is at fault.

Alerting & Logging

As well as informing a load balancer or service discovery system, these service health checks can be used to power dashboards, making the service status and underlying tests highly visible to the entire team.

We also monitor our service status endpoints with an alerting system and send messages to our Slack channels or PagerDuty when a Fail status appears, so that we can quickly investigate and rectify critical issues before they affect our customers.

We quickly realised that, though these alerts were handy, if the system recovered before we could view the status page, we couldn't find out which test contributed to the Fail status. So we started logging our status endpoints and test results to Prometheus.
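Prometheus metrics are numeric, so each OK/Warn/Fail status needs a numeric encoding before it can be scraped. The sketch below renders per-test statuses in the Prometheus text exposition format by hand; the metric name and the 1.0/0.5/0.0 mapping are assumptions, and a real service would more likely use a Prometheus client library.

```python
# Assumed numeric encoding for the three statuses.
STATUS_VALUE = {"OK": 1.0, "Warn": 0.5, "Fail": 0.0}

def to_metric_lines(service, tests):
    """Render test results in the Prometheus text exposition format."""
    lines = []
    for t in tests:
        lines.append(
            'service_test_status{service="%s",test="%s"} %s'
            % (service, t["name"], STATUS_VALUE[t["status"]])
        )
    return "\n".join(lines)

print(to_metric_lines("My Service", [
    {"name": "DB Connection", "status": "OK"},
    {"name": "Some Function test", "status": "Fail"},
]))
```

Keeping one labelled time series per test is what makes it possible to look back later and see exactly which check failed, even after the service has recovered.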

Service status logging to Prometheus and displayed in Grafana

We can now look back historically at service failures and isolate exactly which test resulted in the Fail service status. We can use this to monitor the uptime of service instances as well as individual components, and identify the least reliable parts of each system.

By creating a more detailed service health check, we can quickly gain the reassurance that our services are up and running by glancing at a dashboard. And when something goes wrong, we will often know exactly why and when it happened.

Lessons learned

If you have not implemented service health checks in your applications, I strongly recommend doing so. Even if you don't have a microservice architecture, you can still create status endpoints in your monolith and separate them by domain.

If you do decide to do so, here are a few lessons we learned along the way:

  • Standardise your message format — all services need to return a consistent message format so that you can analyse each service in the same way and aggregate their results on a dashboard.
  • Consistent endpoint location — try to stick to using the same URI across all services so that the status is easily discoverable.
  • Test for connectivity to dependent services — If your service depends on other APIs or services, include tests to check that they are accessible and working. This helps to quickly diagnose whether the fault lies inside your service, across the network, or in a broken dependency.
  • Test core functionality — If your service does just one or a few specific things, include these pieces of functionality in the tests — just as you would with an integration or end-to-end test.
  • Add tests for common failures — If you notice a particular point of failure that occurs frequently and is not covered by your tests, consider adding a new test to help quickly identify when the issue recurs.
  • Cache intensive checks — If you need to create tests that can take a long time to run, or end up writing data, cache the results for a short duration to minimise impact on the system.
  • Make the results visible — Ensure that all services are monitored and made visible to the whole team so that issues can be quickly spotted, analysed and resolved.
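The caching lesson above can be sketched as a small wrapper that reuses a test's result for a short time-to-live, so that frequent polling of the status endpoint doesn't hammer an expensive check. The class and parameter names are illustrative:

```python
import time

class CachedCheck:
    """Wrap an expensive health-check test and reuse its result for ttl seconds."""

    def __init__(self, fn, ttl=30.0):
        self.fn = fn
        self.ttl = ttl
        self._result = None
        self._expires = 0.0

    def __call__(self):
        now = time.monotonic()
        if self._result is None or now >= self._expires:
            self._result = self.fn()      # run the real check
            self._expires = now + self.ttl
        return self._result               # otherwise serve the cached result

calls = 0
def expensive_check():
    global calls
    calls += 1
    return "OK"

check = CachedCheck(expensive_check, ttl=60)
check(); check(); check()  # the underlying check runs only once
```

A short TTL keeps the status reasonably fresh while bounding the load the status endpoint can place on the system it is meant to protect.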

Exposing detailed service status information is easy to overlook when rolling out new services, and while a simple OK/Fail message may help your systems understand service availability, a little extra effort can bring so much more.

The upfront investment of creating a detailed service status response could save you valuable time when it really matters and provide the constant reassurance that your services are behaving as you intended.

After all, we don't want to drive a car without a dashboard and warning lights, so why should we accept that for our services?
