Health Checks like a Pro
We’re very excited to announce go-sundheit, our shiny new health checks library for Go, designed for high-scale services and large-scale deployments.
The go-sundheit project is named after the German word Gesundheit which means ‘health’, and it is pronounced /ɡəˈzʊntˌhaɪ̯t/.
We started this project because at AppsFlyer, as in other fast-growing companies, we have a large operation which is managed by practicing continuous delivery. It is vital that our deployments and runtime are safe. This means that you have to know as soon as possible when your deployment has gone bad or when a resource that your service depends on is in poor shape. We need this so we can sleep well at night.
To achieve this level of safety, deployment orchestration systems such as Kubernetes, and discovery systems such as Consul, require you to implement endpoints that define the readiness and liveness of your service. These endpoints are called upon deployment to verify that it succeeded, and then periodically to ensure the liveness and health of the service.
The main challenge is that you want these endpoints to be implemented correctly and to work well at scale. Both scale and correctness are sometimes overlooked. What I’ve seen many developers do is implement an endpoint returning 200 OK that looks more or less like so:
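A minimal sketch of such an endpoint, using only the Go standard library. The names `pingHandler` and `statusOf` are illustrative assumptions, and the handler is exercised in-process rather than on a real port:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// pingHandler always answers 200 OK, whatever state the service's
// dependencies are in.
func pingHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "OK")
}

// statusOf invokes a handler in-process and returns the status code it wrote.
func statusOf(h http.HandlerFunc) int {
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusOf(pingHandler)) // 200
}
```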
The problem with this implementation is that it only represents the availability of the service, or its responsiveness; it doesn’t tell you whether the service is able to serve API requests. This endpoint actually resembles a /ping rather than a /health endpoint. For example, imagine what would happen if Service-A in the diagram below relies on the DB to serve its requests, and the DB becomes unavailable. With the ping strategy, the service claims to be healthy, but it’s actually unable to serve requests. This is why the health API must reflect our ability to serve requests.
The next step in the evolution of your infrastructure could be a health endpoint that, upon request, runs a series of checks and returns 200 OK if they pass, or an error status otherwise. While this approach works well in many cases, and is not that hard to implement, it has a significant flaw. Because the checks run on each request to the health endpoint, you can easily bring the service to its knees if you call the endpoint too often, and you may also transitively create unnecessary pressure on the downstream dependencies. This scaling issue is often overlooked. At this point you can introduce all sorts of caching mechanisms, but in most cases you will still have requests that take longer than they should due to the synchronous nature of the checks’ execution.
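To make the scaling problem concrete, here is a rough sketch of such an on-demand health endpoint. Everything in it (`checkFunc`, `countingCheck`, `syncHealthHandler`, `probe`, `errDBDown`) is a hypothetical name chosen for illustration. The counter stands in for load on a downstream dependency: every health request costs one check execution.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// checkFunc is a hypothetical check signature used only in this sketch.
type checkFunc func() error

var errDBDown = errors.New("db unreachable")

// countingCheck returns a check that bumps *calls on every execution and
// then reports err. The counter models pressure on a dependency.
func countingCheck(calls *int, err error) checkFunc {
	return func() error {
		*calls++
		return err
	}
}

// syncHealthHandler runs every check on each incoming request.
func syncHealthHandler(checks ...checkFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for _, c := range checks {
			if err := c(); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
	}
}

// probe sends n in-process requests and returns the status of the last one.
func probe(h http.Handler, n int) int {
	code := 0
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
		code = rec.Code
	}
	return code
}

func main() {
	calls := 0
	h := syncHealthHandler(countingCheck(&calls, nil))
	fmt.Println(probe(h, 1000), calls) // 200 1000
}
```

A thousand health requests translate into a thousand check executions, so the rate at which the dependency is tested is dictated by whoever calls the endpoint.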
Is there a way out of this?
This is where go-sundheit comes into play.
The go-sundheit library allows you to define health checks that will check your service, your dependencies, or any other resource. These checks are then registered and scheduled to be executed periodically. The idea behind scheduling the checks, rather than executing them on demand, is to make the /health API responsive, and to allow you to tune the rate at which you test your downstream dependencies.
Once you have registered your checks, you can register an HTTP endpoint that exposes your service health. This endpoint is intended for systems like Kubernetes and Consul, and can be consumed at whatever rate suits the consumer. Since the checks run in the background, the endpoint returns the last known result and never blocks. The go-sundheit library also exposes metrics that are designed to be consumed by your alerting system, or to be used to build dashboards for your system’s health.
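The idea can be sketched with the standard library alone. This is not go-sundheit's implementation or API; `monitor`, `newMonitor`, `runOnce`, `start`, and `status` are hypothetical names, and reporting "unhealthy" before the first run is a design choice made for this sketch. A background goroutine executes the check on a ticker, and the HTTP handler only reads the stored result.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

var (
	errNotRunYet = errors.New("check has not run yet")
	errDBDown    = errors.New("db unreachable")
)

// monitor stores the most recent result of a single check.
type monitor struct {
	mu      sync.RWMutex
	lastErr error
	check   func() error
}

func newMonitor(check func() error) *monitor {
	return &monitor{check: check, lastErr: errNotRunYet}
}

// runOnce executes the check outside the lock, then records the result,
// so readers only ever wait for a pointer-sized write.
func (m *monitor) runOnce() {
	err := m.check()
	m.mu.Lock()
	m.lastErr = err
	m.mu.Unlock()
}

// start runs the check once synchronously, then once per period in the
// background until stop is closed.
func (m *monitor) start(period time.Duration, stop <-chan struct{}) {
	m.runOnce()
	go func() {
		t := time.NewTicker(period)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				m.runOnce()
			case <-stop:
				return
			}
		}
	}()
}

// ServeHTTP reports the last recorded result; it never executes the check.
func (m *monitor) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	m.mu.RLock()
	err := m.lastErr
	m.mu.RUnlock()
	if err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// status issues one in-process request and returns the response code.
func status(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	return rec.Code
}

func main() {
	m := newMonitor(func() error { return nil })
	fmt.Println(status(m)) // 503: no result recorded yet
	stop := make(chan struct{})
	defer close(stop)
	m.start(10*time.Second, stop)
	fmt.Println(status(m)) // 200: served from the stored result
}
```

With this shape, the number of check executions depends only on the period, no matter how often the endpoint is called.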
Let’s see some code examples.
First we need to define a health check. Let’s write a DNS resolve check that will verify that some host name can be resolved:
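A minimal sketch of such a check, written against the standard library only. The type `hostResolveCheck`, its constructor, the `Execute` signature (optional details plus an error), and the `fixedLookup` helper are assumptions made for illustration, not the library's own code. The lookup function is injectable so that the demo does not depend on the network.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookupFunc has the same shape as (*net.Resolver).LookupHost, so a fake
// can be swapped in.
type lookupFunc func(ctx context.Context, host string) ([]string, error)

// hostResolveCheck is a hypothetical check type for this sketch.
type hostResolveCheck struct {
	host       string
	timeout    time.Duration
	minResults int
	lookup     lookupFunc
}

const defaultTimeout = 200 * time.Millisecond

func newHostResolveCheck(host string, timeout time.Duration, minResults int) *hostResolveCheck {
	return &hostResolveCheck{
		host:       host,
		timeout:    timeout,
		minResults: minResults,
		lookup:     net.DefaultResolver.LookupHost,
	}
}

// Execute resolves the host under a deadline. It returns the addresses as
// details, and an error if resolution fails or yields too few results.
func (c *hostResolveCheck) Execute() (interface{}, error) {
	ctx, cancel := context.WithTimeout(context.Background(), c.timeout)
	defer cancel()

	addrs, err := c.lookup(ctx, c.host)
	if err != nil {
		return nil, fmt.Errorf("resolving %q: %w", c.host, err)
	}
	if len(addrs) < c.minResults {
		return addrs, fmt.Errorf("%q resolved to %d address(es), want at least %d",
			c.host, len(addrs), c.minResults)
	}
	return addrs, nil
}

// fixedLookup returns a fake lookup that always yields addrs.
func fixedLookup(addrs ...string) lookupFunc {
	return func(context.Context, string) ([]string, error) { return addrs, nil }
}

func main() {
	c := newHostResolveCheck("example.com", defaultTimeout, 1)
	c.lookup = fixedLookup("192.0.2.1") // remove this line to query real DNS
	details, err := c.Execute()
	fmt.Println(details, err) // [192.0.2.1] <nil>
}
```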
Please note that there’s no need to copy the above example. It’s just a slightly simplified version of the built-in NewHostResolveCheck check that can be found here: https://github.com/AppsFlyer/go-sundheit/blob/master/checks/dns.go#L14
Once we have defined our checks we can register them to be executed asynchronously as seen below. The example registers the built-in check that is already predefined for you:
This will create and schedule a DNS check for the example.com domain. The check will run every 10 seconds, require at least 1 resolved result, and time out after 200 milliseconds if we fail to get a response to our DNS query.
After we registered our checks, we can register our health endpoint:
The health endpoint can be queried like so:
Or by calling the more compact version:
The health endpoint will return a 503 Service Unavailable response code when any check fails.
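From the consumer's side, this status-code convention is all that needs to be interpreted. The sketch below is an assumption-laden illustration, not part of go-sundheit: `isHealthy`, `probe`, and `stubHealthServer` are hypothetical names, the stub server stands in for a real service, and treating only 200 as healthy is a simplification (real orchestrators may accept a wider range of success codes).

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// isHealthy treats 200 OK as healthy and everything else, notably
// 503 Service Unavailable, as unhealthy. This is a simplification.
func isHealthy(statusCode int) bool {
	return statusCode == http.StatusOK
}

// probe issues a GET against a health URL and reports whether the service
// considers itself healthy.
func probe(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return isHealthy(resp.StatusCode), nil
}

// stubHealthServer starts a local test server that always answers with the
// given status code.
func stubHealthServer(status int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
}

func main() {
	srv := stubHealthServer(http.StatusServiceUnavailable)
	defer srv.Close()

	ok, err := probe(srv.Client(), srv.URL+"/health")
	fmt.Println(ok, err) // false <nil>
}
```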
What else do you get from go-sundheit?
- A set of built-in checks: HTTP endpoint checks, DNS checks, database checks, etc.
- A set of built-in OpenCensus metrics (see here: https://github.com/AppsFlyer/go-sundheit#metrics)
- Easily defined custom checks. Basically, any function returning optional details and an error can be used as a custom check. This allows you to provide custom health checks from any library without coupling it to go-sundheit.
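The decoupling described in the last bullet can be sketched as a small adapter. The `check` interface and `funcCheck` type below are hypothetical stand-ins for whatever the library expects, and `queueDepth` is an invented piece of application code; the point is only that the function itself knows nothing about any health library.

```go
package main

import (
	"errors"
	"fmt"
)

// check is a hypothetical interface used only in this sketch.
type check interface {
	Name() string
	Execute() (details interface{}, err error)
}

// funcCheck adapts a plain function to the check interface.
type funcCheck struct {
	name string
	fn   func() (interface{}, error)
}

func (c funcCheck) Name() string                  { return c.name }
func (c funcCheck) Execute() (interface{}, error) { return c.fn() }

var errQueueFull = errors.New("queue depth over limit")

// queueDepth is ordinary application code: it returns a function that
// reports the depth as details, and an error once the limit is exceeded.
func queueDepth(depth, limit int) func() (interface{}, error) {
	return func() (interface{}, error) {
		if depth > limit {
			return depth, errQueueFull
		}
		return depth, nil
	}
}

func main() {
	var c check = funcCheck{name: "queue.depth", fn: queueDepth(3, 10)}
	details, err := c.Execute()
	fmt.Println(c.Name(), details, err) // queue.depth 3 <nil>
}
```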
go-sundheit is available on GitHub under the Apache License 2.0.
go-sundheit allows you to easily define periodic checks for your required resources and dependencies in a safe manner. In addition, go-sundheit allows you to easily expose the health status to be consumed by tools such as Consul and Kubernetes. go-sundheit makes it easy to create your own checks, and provides a set of pre-built checks you can use (more will come).
We hope you enjoy using this library.