Getting started with HTTP health checks

Steve Boak
Published in The Opsee Blog
Apr 26, 2016

In this guide I’ll introduce you to some of the ways you can use Opsee’s HTTP health checks to monitor your environment. Great health checks can save your dev team a ton of time by alerting you whenever a service is misbehaving or not responding the way you expect. You can learn more about the check types and targets we offer in our docs, but there are basically two kinds of targets for HTTP health checks in Opsee:

1. AWS resources (e.g. Security Groups, ELBs, EC2 instances)
2. URLs (any public, or private, URL or IP address that’s accessible from our EC2 instance)

With different combinations of targets and assertions (the rules determining a passing check) you can create lots of useful health checks in Opsee. The goal of this guide is to introduce you to our checks and inspire you to create checks that will save you time and help you find problems faster.

The examples I’m going to cover in this guide are:

  1. Elastic Load Balancer (ELB) service response
  2. Website availability with latency (opsee.com, backed by CloudFront CDN)
  3. Website content & defacement (google.com)
  4. Elasticsearch
  5. Consul
  6. Dropwizard

In each example I’ll list both the target of the check and the assertions used. These can be combined as well: for example, a single health check can cover the availability, performance, and content of your website. If you have any suggestions for this guide, or just want to get in touch, send an email to support@opsee.com.

Elastic Load Balancer (ELB) service response

Targeting a health check at a load balancer normally masks what’s going on behind it, but Opsee automatically tracks ELB membership and targets instances directly to ensure not only that the ELB is functioning but that everything running behind it is as well. As you scale up and down there’s no need to update your check, since Opsee is continuously tracking changes to the ELB definition.

Target: Elastic Load Balancer (ELB)

Assertions:

  • Status Code equal to 200
  • Header–Content-Type equal to application/json
  • Header–Access-Control-Allow-Methods equal to GET,POST,DELETE,HEAD
  • JSON Body–status equal to OK

Here, we’re checking that all of the backing instances are returning a 200 status code, that the content type is JSON, that the correct methods for the service are exposed, and that a specific key in the JSON response matches our expected value.

This check will work the same way when targeting a Security Group, Auto Scaling Group, or just about any other dynamic resource in AWS.
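Opsee evaluates these assertions for you, but it can help to see the logic spelled out. Here’s a minimal Python sketch, assuming you’ve already collected each backing instance’s status code, headers, and body. The `Response` type, the helper names, and the instance IDs are all hypothetical, not part of Opsee’s API. The check passes only when every instance passes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Response:
    """A captured HTTP response from one backing instance (hypothetical type)."""
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def get_header(resp: Response, name: str):
    """Case-insensitive header lookup; returns None if the header is absent."""
    for key, value in resp.headers.items():
        if key.lower() == name.lower():
            return value
    return None


def elb_failures(resp: Response) -> list:
    """Evaluate the four ELB assertions for one instance; [] means it passed."""
    failures = []
    if resp.status != 200:
        failures.append(f"status {resp.status} != 200")
    if get_header(resp, "Content-Type") != "application/json":
        failures.append("Content-Type is not application/json")
    if get_header(resp, "Access-Control-Allow-Methods") != "GET,POST,DELETE,HEAD":
        failures.append("unexpected Access-Control-Allow-Methods")
    try:
        status_field = json.loads(resp.body).get("status")
    except (ValueError, AttributeError):
        # Not JSON at all, or JSON that isn't an object (e.g. a list).
        status_field = None
    if status_field != "OK":
        failures.append("JSON body status != OK")
    return failures


def check_elb(responses: dict) -> dict:
    """Map instance ID -> failures; the overall check passes only if this is empty."""
    results = {}
    for instance_id, resp in responses.items():
        failures = elb_failures(resp)
        if failures:
            results[instance_id] = failures
    return results
```

With one healthy instance and one returning a 503, `check_elb` reports failures only for the bad instance’s ID. The header comparisons are exact, mirroring “equal to”, so a `Content-Type` of `application/json; charset=utf-8` would fail; relax that if your service adds parameters.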

Website availability with latency (opsee.com, backed by CloudFront CDN)

One of the most basic ways to track the health of your website is an availability check. Our website, opsee.com, is backed by AWS’s CloudFront CDN, and like most CDN-backed sites it resolves to multiple IP addresses. Opsee will track all of the DNS entries for the site and make sure they’re all responding. Add in a latency assertion and you have a basic picture of performance too.

Target: https://opsee.com

Assertions:

  • Status Code equal to 200
  • Response Body is not empty
  • Metric–Round-Trip Time less than 1,000ms

The status code ensures there are no service-level errors, and the response body assertion adds an extra check that content is actually coming back, not just headers. Sub-second RTT is a pretty conservative latency threshold for a website, so be sure to calibrate it against your expected service level.
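Opsee handles the DNS tracking and timing for you, but here’s a rough Python sketch of the moving parts: resolving every address a hostname maps to, and evaluating the three assertions against one response. The function names and the 5-second timeout are assumptions for illustration, and RTT here is simply the wall-clock time of one request, which may not match exactly how Opsee measures it.

```python
import socket
import time
import urllib.error
import urllib.request


def resolve_addresses(host: str, port: int = 443) -> list:
    """Return every distinct IP address the hostname currently resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def availability_failures(status: int, body: bytes, rtt_ms: float,
                          max_rtt_ms: float = 1000.0) -> list:
    """Evaluate status == 200, non-empty body, and RTT < max_rtt_ms."""
    failures = []
    if status != 200:
        failures.append(f"status {status} != 200")
    if not body:
        failures.append("response body is empty")
    if rtt_ms >= max_rtt_ms:
        failures.append(f"round-trip time {rtt_ms:.0f}ms >= {max_rtt_ms:.0f}ms")
    return failures


def probe(url: str, timeout: float = 5.0) -> list:
    """Fetch the URL once, time it, and evaluate the assertions."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status and body worth asserting on.
        status, body = exc.code, exc.read()
    except OSError as exc:
        # DNS failures, refused connections, and timeouts.
        return [f"request failed: {exc}"]
    rtt_ms = (time.monotonic() - start) * 1000
    return availability_failures(status, body, rtt_ms)
```

Calling `probe("https://opsee.com")` returns an empty list when all three assertions pass. The probe fetches the URL once rather than hitting each resolved address separately; doing that properly means connecting to a specific IP while still sending the right Host header and TLS server name, which is beyond a quick sketch.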

Website content & defacement (google.com)

It’s nasty business getting your website hacked. Let’s make sure you find out fast if it happens by putting assertions in place for the response body and Content-Type header, as well as the status code.

Target: https://www.google.com

Assertions:

  • Response Body contains <title>Google</title>
  • Header–Content-Type equal to text/html; charset=ISO-8859-1
  • Status Code equal to 200

Validating specific content in the response is the most granular way to ensure your site’s content is right. A title tag is pretty basic, so use something specific to your site. Headers sometimes get changed by mistake too, so verifying them along with the status code helps ensure a healthy and accurate website response.
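Here’s a rough Python sketch of those three assertions, generalized to a list of required markers so you can pin several strings that are specific to your own site. The function names and parameters are illustrative assumptions, not an Opsee API.

```python
def get_header(headers: dict, name: str):
    """Case-insensitive header lookup; returns None when the header is missing."""
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value
    return None


def content_failures(status: int, headers: dict, body: str,
                     required_markers: list, expected_content_type: str) -> list:
    """Evaluate the content, header, and status assertions for one page."""
    failures = []
    # Every marker must appear somewhere in the body; a defaced page usually drops some.
    for marker in required_markers:
        if marker not in body:
            failures.append(f"body is missing {marker!r}")
    content_type = get_header(headers, "Content-Type")
    if content_type != expected_content_type:
        failures.append(f"Content-Type is {content_type!r}")
    if status != 200:
        failures.append(f"status {status} != 200")
    return failures
```

Checking the Google example against a page whose title has been swapped out yields a single “body is missing” failure, while the status and header assertions still pass.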

Elasticsearch

Elastic documents their health check endpoint well (https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html). Just target the /_cluster/health route and use the following assertions to ensure ES is in perfect health.

Target: http://localhost:9200/_cluster/health

Assertions:

  • JSON Body–status equal to green
  • JSON Body–unassigned_shards equal to 0
  • JSON Body–active_shards_percent_as_number equal to 100
  • Status Code equal to 200
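If you want to sanity-check these assertions yourself, here’s a minimal Python sketch that evaluates them against the JSON returned by /_cluster/health. The function and constant names are hypothetical; the field names (`status`, `unassigned_shards`, `active_shards_percent_as_number`) are the ones used in the assertions above.

```python
import json

# Expected values for the three JSON Body assertions.
EXPECTED_HEALTH = {
    "status": "green",
    "unassigned_shards": 0,
    "active_shards_percent_as_number": 100,
}


def es_health_failures(status_code: int, body: str) -> list:
    """Evaluate the /_cluster/health assertions; an empty list means healthy."""
    failures = []
    if status_code != 200:
        failures.append(f"status code {status_code} != 200")
    try:
        health = json.loads(body)
    except ValueError:
        return failures + ["body is not valid JSON"]
    if not isinstance(health, dict):
        return failures + ["body is not a JSON object"]
    for key, expected in EXPECTED_HEALTH.items():
        actual = health.get(key)
        # 100.0 == 100 in Python, so a float percentage still matches.
        if actual != expected:
            failures.append(f"{key} is {actual!r}, expected {expected!r}")
    return failures
```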

At the shard level, a “red” status means the specific shard is not allocated in the cluster, “yellow” means the primary shard is allocated but its replicas are not, and “green” means all of the shard’s copies are allocated. The index-level status is controlled by the worst shard status, and the cluster status is controlled by the worst index status.
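The roll-up rule in the paragraph above boils down to “take the worst”. Here’s a small Python sketch of it, modeling each shard as a primary plus its replicas. The `Health` enum and the function names are illustrative, not part of any Elasticsearch client, and treating an index or cluster with no shards as green is an assumption of the sketch.

```python
from enum import IntEnum


class Health(IntEnum):
    """Ordered so that a larger value means a worse status."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def shard_status(primary_assigned: bool, replicas_assigned: bool) -> Health:
    """Shard-level status from whether its primary and replicas are allocated."""
    if not primary_assigned:
        return Health.RED
    if not replicas_assigned:
        return Health.YELLOW
    return Health.GREEN


def index_status(shards: list) -> Health:
    """An index is as healthy as its worst shard (no shards counts as green here)."""
    return max(shards, default=Health.GREEN)


def cluster_status(indices: dict) -> Health:
    """The cluster is as healthy as its worst index."""
    return max((index_status(s) for s in indices.values()), default=Health.GREEN)
```

A single unassigned replica anywhere is enough to turn the whole cluster yellow, which is exactly why the `status equal to green` assertion is such a compact signal.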

Consul

There are a few useful health check endpoints available to Consul users. In this example we’ll make sure nothing is in the critical state by checking that the /v1/health/state/critical endpoint returns an empty array.

Target: demo.consul.io/v1/health/state/critical

Assertions:

  • Body equal to []
  • Status Code equal to 200

We’re simply ensuring that the service is responding and that there are no entries in the array of critical health checks. More detailed service- and node-level health endpoints are available too.
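Here’s a minimal Python sketch of the Consul assertions. Instead of comparing the body to the literal string `[]`, it parses the JSON, which tolerates harmless whitespace differences and lets the failure message name what’s critical. The function name is hypothetical; `Node`, `CheckID`, and `ServiceName` are fields Consul includes on each health check entry, but double-check them against the API docs for your Consul version.

```python
import json


def consul_critical_failures(status_code: int, body: str) -> list:
    """Pass (empty list) only on HTTP 200 with an empty JSON array of critical checks."""
    if status_code != 200:
        return [f"status code {status_code} != 200"]
    try:
        checks = json.loads(body)
    except ValueError:
        return ["body is not valid JSON"]
    if not isinstance(checks, list):
        return ["body is not a JSON array"]
    failures = []
    for check in checks:
        if isinstance(check, dict):
            # Node-level checks have an empty ServiceName, so fall back to CheckID.
            target = check.get("ServiceName") or check.get("CheckID")
            failures.append(f"critical: {check.get('Node')}/{target}")
        else:
            failures.append(f"critical: {check!r}")
    return failures
```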

Dropwizard

Dropwizard provides a convention for defining health checks: you extend the HealthCheck class. These checks can be pretty simple, returning just an “OK” or “FAIL” state based on the result of an internal check.

Target: http://localhost:8080

Assertions:

  • Body equal to OK
  • Status Code equal to 200

It’s good practice to verify any hard dependencies in a Dropwizard health check. If your service relies on a database, for example, running a simple `select 1;` against it in the health check will confirm that the database is connected and responding.
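In Dropwizard itself, this logic lives in the `check()` method of a `HealthCheck` subclass, which returns `Result.healthy()` or `Result.unhealthy(...)`. To keep this guide’s examples in one language, here’s the same dependency-check pattern sketched in Python, with an in-memory SQLite database standing in for your real one. The function name and the choice of 500 as the failing status code are assumptions.

```python
import sqlite3


def database_health(conn: sqlite3.Connection) -> tuple:
    """Run a trivial query against a hard dependency; return (body, status code)."""
    try:
        row = conn.execute("SELECT 1").fetchone()
    except sqlite3.Error:
        # Closed connection, lost connection, locked database, and so on.
        return "FAIL", 500
    return ("OK", 200) if row == (1,) else ("FAIL", 500)
```

A health endpoint built on this returns `OK` with a 200 while the database answers, which is exactly what the two assertions above look for, and flips to `FAIL` with a 500 as soon as the query can’t run.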

How do you health check your services?

We’d love to know how you health check your services. Write us a response, and tell us more about the services you’re checking and how you check them. We’re always looking for ways to make Opsee better.
