Implementing Semantic Monitoring

2016 was an exciting year. I don’t think there was another period in the almost ten years I’ve been in the software industry where I learned so much about web infrastructure. Building and running Jimdo’s PaaS has taught me a thing or two about reliability, scalability, usability, and other software “-ilities”. Today I want to write about one particular monitoring/testing strategy that has been invaluable to the success of our PaaS.

The difficulty of monitoring microservices

Our platform is composed of two dozen microservices sitting on top of Amazon Web Services (rumor has it we’re one of the largest users of ECS). When deploying a service via our CLI tool, the API request first hits AWS API Gateway, which will forward it to the corresponding microservice — our Deployer API in this case. The Deployer then enqueues a new job that will be picked up by a worker, which in turn deploys the service with the support of other microservices.

Breaking a system up into smaller services has a lot of benefits, but it also makes it harder to verify that the system as a whole is working correctly. For example, just because the services report themselves as healthy doesn’t necessarily mean the integration points between them are fine too.

To verify that the Deployer works as expected, we execute a number of unit and integration tests as part of the CI pipeline. In addition, all of our services are fronted by a load balancer, and backend replicas are replaced automatically at runtime when they become unhealthy. If that doesn’t help, PagerDuty will send us a friendly alert.

So we run some tests before deploying a new version of a microservice into production, and we use health checks to ping the service while it’s doing its job, e.g. processing API requests from impatient users. That’s a common way to validate production services. Unfortunately, it’s also a missed opportunity.

There are several problems with this approach:

  • We run the service-specific test suite only once and stop using it altogether when the service goes into production.
  • Apart from one-off integration tests, we don’t test the interaction between microservices continually.
  • Compared to CI tests, health check endpoints are usually dumb, often merely indicating if a service is running at all.
  • Additional low-level metrics like CPU utilization or response time are useful to pinpoint the cause of trouble, but they won’t give us a holistic view either.

There’s obviously a lot of room for improvement here. This is where semantic monitoring comes in.

Semantic monitoring combines test execution and realtime monitoring to continuously verify the behavior of applications. It lends itself in particular to validating microservices and how they interact at runtime.

Implementing semantic monitoring

How do you implement semantic monitoring? The short answer: by feeding the results of end-to-end tests (consumer-driven contracts) into your existing monitoring solution. Those tests typically mimic user actions via fake events, e.g. a synthetic deployment, to ensure that the system does what users expect it to do, not merely that it is up (hence the name, semantic monitoring). What follows is a summary of the implementation we’re using at Jimdo.

At the heart of our setup is a set of black box tests. Those tests, which are written in Go, communicate directly with our API, just as users would do. Among other things, we have tests to ensure services and periodic jobs can be deployed and deleted successfully, both in staging and in production.

Due to the distributed nature of such systems, most tests boil down to “X should do A, B, C within T minutes”. As one API request can create a chain of downstream calls and events that are handled asynchronously, it’s a good idea to pass along a correlation ID. We output the unique resource IDs created during testing to be able to trace events through our systems should something go wrong.
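A check of the form “X should do A within T minutes” is essentially a polling loop with a deadline. Our actual tests are written in Go, so the shell sketch below only illustrates the shape. The helper name wait_until and the commented-out deploy_service and service_is_running commands are hypothetical placeholders, not part of our test suite.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have passed.
# Returns 0 as soon as COMMAND succeeds, 1 on timeout.
wait_until() {
    wu_timeout=$1
    wu_interval=$2
    shift 2
    wu_deadline=$(( $(date +%s) + wu_timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$wu_deadline" ]; then
            echo "timed out after ${wu_timeout}s waiting for: $*" >&2
            return 1
        fi
        sleep "$wu_interval"
    done
}

# Hypothetical usage (deploy_service and service_is_running are placeholders
# for real API calls). Printing the resource ID makes a failed run traceable.
#
#   RESOURCE_ID="system-test-$(date +%s)-$$"
#   echo "resource id: $RESOURCE_ID"
#   deploy_service "$RESOURCE_ID"
#   wait_until 300 10 service_is_running "$RESOURCE_ID"
```

The command is always tried at least once, even with a timeout of zero, and the timeout message goes to stderr so it shows up in the build log next to the resource ID.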

To execute those tests continuously, we’ve configured a Jenkins job called “System-Tests”, which will run every hour and notify us about any failures in Slack. This, in fact, used to be the whole story for a long time: a couple of black box tests executed by Jenkins and Slack notifications that were easy to miss. That is, until Prometheus entered the picture.

At some point last year, we committed to using Prometheus for monitoring all the things from cluster instances to microservices running on our PaaS. And with Prometheus came the Pushgateway, which allows ephemeral and batch jobs to expose metrics in an easy way.

We created a generic Docker image to push a “freshness” metric to the Pushgateway. This metric — a timestamp plus some labels — can be used to determine whether a job was executed within a certain period or not.
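The image itself isn’t shown in this article. As a rough stand-in, a freshness metric can be pushed with nothing but curl, because the Pushgateway accepts samples in the Prometheus text format on /metrics/job/<job>/<label>/<value>/... . The sketch below assumes that PUSHGATEWAY holds a base URL such as http://pushgateway.example.org:9091 (a made-up address); the function names are illustrative and this is not necessarily how our image works internally.

```shell
#!/bin/sh
# freshness_payload TIMESTAMP
# Prints a single "freshness" sample in the Prometheus text exposition format.
freshness_payload() {
    printf '# TYPE freshness gauge\nfreshness %s\n' "$1"
}

# freshness_url BASE_URL JOB NAME BRANCH STATE
# Builds the push URL. The labels in the path form the grouping key, so
# "started" and "success" are stored as separate series.
freshness_url() {
    printf '%s/metrics/job/%s/name/%s/branch/%s/state/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# push_freshness JOB NAME BRANCH STATE
# Pushes the current Unix time as the metric value.
# Assumes PUSHGATEWAY is set to the Pushgateway base URL.
push_freshness() {
    freshness_payload "$(date +%s)" |
        curl --silent --fail --data-binary @- \
            "$(freshness_url "$PUSHGATEWAY" "$1" "$2" "$3" "$4")"
}

# Example: push_freshness Jenkins System-Tests master success
```

Because label values end up in the URL path, this simple version breaks for values containing a slash, such as some branch names; those would need to be encoded first.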

We then added the freshness image to the System-Tests Jenkins job to write a freshness metric before and after running the tests:

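set -e  # stop at the first failing command, so "success" is only pushed if the tests pass

# Push a freshness metric with the given state ("started" or "success").
# FRESHNESS_IMAGE is a placeholder for our freshness image; its name is not shown here.
push_build_metrics() {
    docker run -t --rm -e PUSHGATEWAY="$PUSHGATEWAY_ADDR" \
        "$FRESHNESS_IMAGE" job=Jenkins \
        name=System-Tests branch="$BRANCH" state="$1"
}

push_build_metrics started
make test
push_build_metrics success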

Afterward, we configured Prometheus to alert us if the job fails three times in a row, i.e. if the last successful run is more than 3.5 hours old (at this point, we don’t trust Jenkins enough for this check to trigger PagerDuty at night):

ALERT SystemTestsFailing
  IF time() - freshness{name="System-Tests",branch="master",state="success"} > 3.5*60*60
  LABELS { wonderland_env = "prod" }
  ANNOTATIONS { summary = "System-Tests Jenkins job wasn't successful for 3 hourly runs in a row." }
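To see why a threshold of 3.5 hours corresponds to three failed hourly runs, it helps to replay the alert expression on a small timeline. The sketch below mirrors the comparison from the rule in plain shell arithmetic; is_stale and THRESHOLD_SECONDS are names made up for this illustration, and the timeline assumes runs that start exactly on the hour.

```shell
#!/bin/sh
# Same threshold as the alert rule: 3.5*60*60 = 12600 seconds.
THRESHOLD_SECONDS=$(( 7 * 60 * 60 / 2 ))

# is_stale LAST_SUCCESS NOW
# Succeeds if the last success is older than the threshold, which is the
# condition under which the alert would fire.
is_stale() {
    [ $(( $2 - $1 )) -gt "$THRESHOLD_SECONDS" ]
}

# Last success at minute 0, then failed runs at +60, +120 and +180 minutes.
last_success=0
for minutes in 60 120 180 210 211; do
    now=$(( last_success + minutes * 60 ))
    if is_stale "$last_success" "$now"; then
        echo "+${minutes}min: alert fires"
    else
        echo "+${minutes}min: still fresh"
    fi
done
```

The check stays quiet through the three failed runs and only trips once the 210-minute mark has passed, which leaves the third run half an hour to finish before anyone is notified.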

And that’s the story of how we ended up implementing semantic monitoring on the cheap, based on building blocks already in place. All of the mentioned components for testing and monitoring can be used on their own. But by combining them, we can merge two separate but important verification techniques to monitor not only our microservices but also the integration points between them.

(If you want to learn more about monitoring microservices, I highly recommend reading Building Microservices by Sam Newman.)

P.S. This article first appeared on my Production Ready mailing list.