Alert policy metric thresholds with Stackdriver and OpenCensus

Published in Kudos Engineering, Dec 4, 2018

To check whether a distributed system is running correctly we can use tools such as testing (unit, integration and end-to-end tests), monitoring (health checks) and tracing. These tools are a great help when debugging or profiling why something didn’t work as expected.

However, this often means that insight into how the system performs arrives late. This is something that we have been working on within the Kudos system. To explain the concept in a domain that is easier to understand outside of Kudos, we’re going to use the example of a fictional takeaway shop.

The takeaway shop should have certain objectives tied to the business need, such as whether orders get delivered (availability), how long it takes to get an order (latency) and whether the order is correct (correctness).

Fitness functions, as defined in the book ‘Building Evolutionary Architectures’ by Neal Ford, Rebecca Parsons and Patrick Kua, can help with getting feedback on how a system performs. The book defines a fitness function as:

An architectural fitness function provides an objective integrity assessment of some architectural characteristic(s).

If we were to implement a system for a takeaway shop we could define a fitness function from the start of the project to check the availability of our system. The function would need two measures and would take their ratio to estimate how many orders have been fulfilled.

The first measure can be the total number of orders, whilst the second could be the number of dispatched orders. Given these two measures, a good indicator of success is the ratio of dispatched orders to total orders. Any value under 1 would mean that not all orders were fulfilled, while a value greater than 1 would mean that more deliveries were dispatched than orders received.

This is how we record the measures with OpenCensus:

First we define a Stackdriver exporter, as the goal is to create an alert policy on the aggregation of two metrics. The exporter is then registered with the view package, and we register a view which defines an aggregation of type Count and references our measure. Data is recorded against the measure by calling stats.Record.

The above code is what’s required to send data, however there are two pitfalls we’ve set aside to make the code easier to understand.

The first arises from the fact that OpenCensus operates asynchronously and sends the data to the backend in batches, while our program could end as soon as we reach the bottom of the file. To address this, we found that calling Flush on the exporter forces the data to be sent.

The second pitfall comes from the fact that our code is simplified. Calling stats.Record(ctx, dM.M(1)) only once is not going to help us track all orders: the context and the measure need to be exposed outside of our defined function, and stats.Record needs to be called every time an order is placed or a delivery is made.
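To make the first pitfall concrete, here is a small standard-library model of a batching exporter. It is only a sketch: `batchExporter` and its methods are hypothetical stand-ins and not the Stackdriver exporter itself, which buffers and sends data on its own schedule.

```go
package main

import (
	"fmt"
	"sync"
)

// batchExporter is a hypothetical stand-in for an asynchronous exporter.
// Points are buffered in memory and only count as sent once Flush runs.
type batchExporter struct {
	mu      sync.Mutex
	pending []int64
	sent    []int64
}

// Export buffers a point without sending it.
func (e *batchExporter) Export(v int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending = append(e.pending, v)
}

// Flush moves every buffered point to the sent list.
func (e *batchExporter) Flush() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.sent = append(e.sent, e.pending...)
	e.pending = nil
}

// Sent returns how many points have been sent so far.
func (e *batchExporter) Sent() int {
	e.mu.Lock()
	defer e.mu.Unlock()
	return len(e.sent)
}

func main() {
	e := &batchExporter{}
	e.Export(1)
	e.Export(1)
	// If the program exited here, both points would be lost.
	fmt.Println("sent before Flush:", e.Sent())
	e.Flush()
	fmt.Println("sent after Flush:", e.Sent())
}
```

In a short-lived program, one way to apply this is to defer the call to Flush right after the exporter is created, so that it runs before main returns.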

Once the measures are taken we can use an aggregation from Stackdriver to create an alert policy for our metrics.

The important bit is the ConditionThreshold, which triggers a failure if we have fewer deliveries than orders (that is, if the ratio of the two is less than 1). The ConditionThreshold defines the numerator (the deliveries) and the denominator (the orders) by specifying which metrics are targeted (Filter and DenominatorFilter) and how they are aggregated.

The code above is far from perfect: every time it’s executed it will create a new policy, and the alert policy might be better managed with something like Terraform.

NOTE: Once this code is executed it will create the policy in Stackdriver, but as this feature is still in beta it’s currently not possible to visualise our condition as a graph. This means we can receive alerts but we are not able to check what’s happening in the interface.

Having a metric threshold is still something we are exploring, and there are a few questions we would like to answer. For example, given that our shop opens for a certain amount of time during the day, we know we can define our alignment period to match the opening time, but what if we would like constant feedback during the day to check that orders are being continuously delivered? And in the case of data processing, which is less critical than delivering takeaways, how can we define a system that decides when to alert engineers that the system is not performing? This comes down to alerting at the first anomaly versus alerting when the system is degrading.
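To make the last question more concrete, here is one possible way to model it in plain Go. It is a sketch under our own assumptions and not a Stackdriver feature: each `Window` stands for the counts aggregated over one alignment period, and the hypothetical `ShouldAlert` function fires once a given number of consecutive windows fall below the threshold. With a value of 1 it alerts at the first anomaly, while a larger value only alerts on sustained degradation.

```go
package main

import "fmt"

// Window holds the counts aggregated over one alignment period.
type Window struct {
	Orders     int64
	Deliveries int64
}

// belowThreshold reports whether the deliveries/orders ratio of w is under
// threshold. A window without orders has no defined ratio and is treated
// as healthy in this sketch.
func belowThreshold(w Window, threshold float64) bool {
	if w.Orders == 0 {
		return false
	}
	return float64(w.Deliveries)/float64(w.Orders) < threshold
}

// ShouldAlert reports whether at least `consecutive` windows in a row are
// below threshold. Values of consecutive under 1 are treated as 1.
func ShouldAlert(windows []Window, threshold float64, consecutive int) bool {
	if consecutive < 1 {
		consecutive = 1
	}
	run := 0
	for _, w := range windows {
		if belowThreshold(w, threshold) {
			run++
			if run >= consecutive {
				return true
			}
		} else {
			run = 0
		}
	}
	return false
}

func main() {
	// Made-up hourly windows for one day of trading.
	day := []Window{{10, 10}, {8, 6}, {12, 12}, {9, 7}, {9, 8}}

	fmt.Println("alert at first anomaly:", ShouldAlert(day, 1, 1))
	fmt.Println("alert on 3 bad windows in a row:", ShouldAlert(day, 1, 3))
}
```

With these figures the first check fires on the second window, while the second check stays quiet because the longest run of bad windows is two.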

If all this sounds interesting to you, why not consider joining Kudos? We also have a primer on what you can expect your first day to be like.


Written by Kévin Etienne