Health Checks like a Pro

Introducing go-sundheit

Eran Harel
Aug 6, 2019 · 5 min read

We’re very excited to announce go-sundheit, our shiny new health checks library for Go, designed for high-scale services and large-scale deployments.

The go-sundheit project is named after the German word Gesundheit which means ‘health’, and it is pronounced /ɡəˈzʊntˌhaɪ̯t/.

We started this project because at AppsFlyer, as in other fast-growing companies, we have a large operation which is managed by practicing continuous delivery. It is vital that our deployments and runtime are safe. This means that you have to know as soon as possible when your deployment has gone bad or when a resource that your service depends on is in poor shape. We need this so we can sleep well at night.

go-sundheit logo

In order to achieve this level of safety, deployment orchestration systems such as Kubernetes, and discovery systems such as Consul, require you to implement endpoints that define the readiness and liveness of your service. These endpoints are called upon deployment to verify that the deployment succeeded, and are then called periodically to ensure the liveness and health of the service.

The main challenge is that you want these endpoints to be implemented correctly, and work well at scale. Both scale and correctness are sometimes overlooked. What I’ve seen many developers do is implement an endpoint returning 200 OK that looks more or less like so:

The problem with this implementation is that it only represents the availability of the service, or its responsiveness. It doesn’t tell you whether the service is able to serve the API requests. This endpoint actually resembles a /ping rather than a /health endpoint. For example, imagine what would happen if Service-A in the diagram below relies on the DB for its serving, and the DB becomes unavailable. With the ping strategy, the service claims to be healthy, but it’s actually unable to serve the requests. This is why the health API must reflect our ability to serve requests.

The broken dependency problem

The next step in the evolution of your infrastructure could be a health endpoint that upon request runs a series of checks, and returns 200 OK if those pass, or an error status otherwise. While this approach works well in many cases, and is not that hard to implement, it has a significant flaw. The fact that the checks run on each request to the health endpoint means that you can easily bring the service down to its knees if you call the endpoint too often, and you may also transitively create unnecessary pressure on the downstream dependencies. This scaling issue is often overlooked. At this point you can introduce all sorts of caching mechanisms, but in most cases you will still have requests that will take longer than they should due to the synchronous nature of the checks’ execution.

Is there a way out of this?

This is where go-sundheit comes into play.

The library allows you to define health checks that will check your service, your dependencies, or any other resource. These checks are then registered and scheduled to be executed periodically. The idea behind the scheduling of the checks, rather than having them executed on demand is to make the /health API responsive, and to allow you to tune the rate at which you test your downstream dependencies.


Once you have registered your checks, you can register an HTTP endpoint that will expose your service health. This endpoint is intended to be used by systems like Kubernetes and Consul, and to be consumed at a rate suitable for the consumer. Since the checks run in the background, the endpoint returns the last known result, and never blocks. The library also exposes metrics that were designed to be consumed by your alerting system, or to be used to build dashboards for your system’s health.

Let’s see some code examples.

First we need to define a health check. Let’s write a DNS resolve check, that will verify that some host name can be resolved:

Please note that there’s no need to copy the above example. It’s just a slightly simplified version of the built-in NewHostResolveCheck check that ships with the library.

Once we have defined our checks we can register them to be executed asynchronously as seen below. The example registers the built-in check that is already predefined for you:

This will create and schedule a DNS check for the domain. The check will run every 10 seconds, require at least 1 resolved result, and will time out after 200 milliseconds if we fail to get a response to our DNS query.

After registering our checks, we can register our health endpoint:

The health endpoint can be queried like so:

Or by calling the more compact version:

The health endpoint will return a 503 Service Unavailable response code upon checks failure.

What else do you get from go-sundheit?


go-sundheit is available on GitHub under the Apache License 2.0.

go-sundheit allows you to easily define periodic checks for your required resources and dependencies in a safe manner. In addition, it allows you to easily expose the health status to be consumed by tools such as Consul and Kubernetes. It also makes it easy to create your own checks, and provides a set of pre-built checks you can use (more will come).

We hope you enjoy using this library.


AppsFlyer Engineering
