You can’t improve what you don’t measure

Dave North
Signiant Engineering
May 11, 2017

How do you know if something is broken? How do you define broken? Is my service really that bad?

Some time ago I read this excellent interview with Ben Treynor and was taken with the idea of “error budgets”. The basic idea is that no service ever has 100% uptime because there are so many variables, so you pick a reasonable target for “no errors” (say, 99.95%) and “spend” the remaining 0.05% error rate on moving fast and running experiments.
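To make the arithmetic concrete, here is a minimal sketch of an error budget expressed in requests. The function names and the one-million-request window are illustrative assumptions, not something from the interview or from our tooling.

```python
def error_budget(total_requests: int, target: float = 0.9995) -> int:
    """Failed requests we can 'spend' in a window and still meet the target."""
    # round() guards against floating-point noise in (1 - target).
    return round(total_requests * (1 - target))


def budget_remaining(total_requests: int, failed_requests: int,
                     target: float = 0.9995) -> int:
    """Budget left after the failures seen so far (negative means overspent)."""
    return error_budget(total_requests, target) - failed_requests


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 120 of them failed.
    print(error_budget(1_000_000))
    print(budget_remaining(1_000_000, 120))
```

With a 99.95% target, a window of one million requests leaves room for 500 failures, so 120 failures would leave 380 still to spend.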

At Signiant, we’re on an ongoing mission to move to more and more microservices for our SaaS offerings, and we started to think about how we could apply this concept to our environment.

In our model, all of our microservices are deployed to AWS using the EC2 Container Service (ECS). Each microservice has its own load balancer, so we can get the stats from each load balancer. We’ve defined the error rate in our case as the ratio of HTTP 5xx responses to total HTTP requests. While AWS CloudWatch can provide these metrics individually, it can’t display or graph a calculated metric (unless you push it back as a custom metric). So we needed a tool for this.
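As a rough sketch of that calculation: given two counts for the same time window, the error rate is just a guarded division. The counts here are made up; in practice they might come from load balancer metrics such as the sums of RequestCount and HTTPCode_Backend_5XX, but which metrics feed the ratio is an assumption on the part of this example.

```python
def error_rate(request_count: int, http_5xx_count: int) -> float:
    """Fraction of requests in a window that returned an HTTP 5xx response."""
    if request_count == 0:
        # No traffic in the window: report zero rather than dividing by zero.
        return 0.0
    return http_5xx_count / request_count


if __name__ == "__main__":
    # Hypothetical window: 20,000 requests, 10 of which were 5xx responses.
    rate = error_rate(20_000, 10)
    print(f"{rate:.4%}")
```

Ten server errors in 20,000 requests is 0.05%, which is exactly on the 99.95% target used earlier.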

Enter Librato

We’d looked at Librato years ago but didn’t have a need for it at the time. This time, it turns out it can do exactly what we need (and more!) using a feature called composite metrics, which lets you manipulate metrics and either graph the result right away or save it as a new metric.

We then started adding graphs to Librato for each microservice, but we’re currently adding new microservices at a rate of around one per sprint, so this would quickly become a time sink. In addition, when we re-create a service we get a new load balancer (since we use AWS CloudFormation to set up all the services), and we didn’t want to revisit Librato each time.

Enter aws-elb-to-librato

Yeah, it’s not a fancy name (maybe we can code-name it figet or something). However, it’s fairly neat as it accomplishes a few things:

  1. For our ECS-based microservices, it detects new microservices added to an ECS cluster and automatically adds the metrics and graphs to Librato. Further, if the load balancer for a service changes, it updates the metric in Librato.
  2. For our Elastic Beanstalk-based ‘heavier weight’ services, we add graphs to Librato too. However, here we do blue/green deploys and have the concept of a live and a standby environment. In this case, aws-elb-to-librato dynamically updates the graph so that it only shows the live environment.
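The first behaviour above boils down to comparing what is currently running against what has already been set up. The sketch below illustrates that idea only; it is not the actual aws-elb-to-librato code, and the function name, the service names and the plain dictionaries standing in for AWS and Librato lookups are all hypothetical.

```python
from typing import Dict, Tuple


def plan_updates(
    discovered: Dict[str, str],
    tracked: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Compare service -> load balancer mappings and decide what to change.

    discovered: what is running now (hypothetically read from the cluster).
    tracked:    what the graphs currently point at (hypothetically read back
                from the dashboard tool).
    Returns (to_add, to_update), each mapping service name -> load balancer.
    """
    # Services with no graph yet.
    to_add = {svc: lb for svc, lb in discovered.items() if svc not in tracked}
    # Services whose load balancer changed, e.g. after being re-created.
    to_update = {
        svc: lb
        for svc, lb in discovered.items()
        if svc in tracked and tracked[svc] != lb
    }
    return to_add, to_update


if __name__ == "__main__":
    discovered = {"billing": "elb-b2", "search": "elb-s1", "upload": "elb-u1"}
    tracked = {"billing": "elb-b1", "search": "elb-s1"}
    to_add, to_update = plan_updates(discovered, tracked)
    print(to_add)
    print(to_update)
```

In this made-up run, "upload" is new and needs a graph, "billing" has a new load balancer and needs its metric repointed, and "search" is left alone.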

We ended up with dashboards that looked like this (in this case, the red services are here to illustrate the concept only!)

ECS Microservices
Elastic Beanstalk Services

What’s the Impact?

Being able to visualize these metrics has been pretty helpful in a few ways:

  1. We log/alert on server errors to a Slack channel. It can seem like there are a lot of errors when, in fact, our error rate is very low. Graphing the ratio helps us measure how much of a problem we really have.
  2. When we do have a problem, Librato lets us expand these “bignumber” metrics into a time series graph so we can zero in on a time and then go right to the logs for that time in Papertrail.
  3. Feedback. Our ops team can give objective feedback to the development teams about how their service is doing, and teams can now monitor their own services.

The source mentioned here is available in our GitHub repo.
