OpenCensus and SLOs

The OpenCensus page Incident Debugging Workflow gives a roadmap for using OpenCensus and related open source tools to prepare for supporting applications in production, respond swiftly to incidents, assess scope and severity, and identify the cause. The page includes a short discussion on monitoring service level objectives with OpenCensus that I want to examine in more detail here. The example application used in the page is provided in the Go HTTP Server Integration guide. The nice thing about the OpenCensus HTTP integration is that you get basic instrumentation of your application code just by initializing the exporter and registering the HTTP integration views. Prometheus is used as the monitoring backend to store and visualize the metrics data collected by OpenCensus. Other monitoring backends, such as Stackdriver, could just as well be used with this application through the pluggable OpenCensus export libraries or the agent exporters. However, Prometheus has a handy query language whose expressions can be used to compute a service level indicator. That is what I will demonstrate here.

The Google Site Reliability Engineering book defines a service level objective (SLO) as “a target value or range of values for a service level that is measured by an SLI.” A service level indicator (SLI) is a “quantitative measure of some aspect of the level of service that is provided.” The example given on the OpenCensus page is the ratio of successful HTTP responses to total responses, computed over a one-minute interval:

SLI = QPS_200 / QPS

where QPS_200 is requests (queries) per second with a 2xx (successful) response and QPS is the total number of requests per second. This is a very simple SLI and has limitations. As mentioned in the OpenCensus page, a blackbox probe would be a better choice to more closely represent user experience. In any case, let’s go with success rate measured on the server for now and see how it works with an SLO violation.
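As a small worked example of the formula, the sketch below computes the success ratio from two counters and compares it with a target. The function names, the per-mille target parameter, and the request counts are all hypothetical and chosen only for illustration; none of them come from the OpenCensus guide.

```go
package main

import "fmt"

// successRatio returns the SLI as the fraction of successful responses.
// ok is false when there was no traffic, because the ratio is undefined.
func successRatio(success, total uint64) (ratio float64, ok bool) {
	if total == 0 {
		return 0, false
	}
	return float64(success) / float64(total), true
}

// meetsTarget reports whether success/total is at least targetPerMille/1000.
// Integer arithmetic avoids floating-point rounding right at the boundary.
// With no traffic (total == 0) it reports true.
func meetsTarget(success, total, targetPerMille uint64) bool {
	return success*1000 >= total*targetPerMille
}

func main() {
	// Hypothetical one-minute window: 60,000 requests, 75 of which failed.
	success, total := uint64(59925), uint64(60000)
	if r, ok := successRatio(success, total); ok {
		fmt.Printf("SLI = %.5f\n", r)
	}
	// A target of 999 per mille corresponds to 99.9%.
	fmt.Println("meets 99.9% target:", meetsTarget(success, total, 999))
}
```

With these made-up numbers the ratio is 59925 / 60000 = 0.99875, which falls just short of a 99.9% target.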

Simulating an SLO Violation

Let’s suppose that we set the SLO at 99.9%. That is, if fewer than 99.9% of requests in a one-minute window are returned successfully, then we are out of SLO. The example program in the guide always returns successful results, which is not very exciting, so let’s change it to generate some errors. We could generate errors from a uniform random distribution, but that is not how a real outage unfolds. Low levels of background errors can easily be handled by retries and do not generally cause outages. And if we were out of SLO for only one minute, the impact on the business would not be too severe. Most outages are due to bad configuration pushes, software bugs, failures of dependent services, or localized failures that are not contained.

Let’s consider localized failures that are not contained. These can cascade into major incidents with a substantial business impact. For example, suppose that a server failure shifts traffic to a second server. The second server is already carrying substantial load, so it fails too. With two servers lost, the suddenly shifting load takes down more servers. Rather than a uniform random distribution of errors, we can model this with a random walk. A random walk in two dimensions is like an insect wandering randomly on the ground, drifting unpredictably away from where it started. We can simulate a random walk in a single dimension by accumulating random increments in a variable. The aspect of this that is like real life is that we get random variation that can gradually wander away from normal, past the point that is acceptable, to values from which it may not return to the original, acceptable state for a long time.

Starting from the example code in the OpenCensus Go HTTP Server Integration guide, we add a type that simulates a random walk. Once the walk reaches a given threshold, we return a failure. The first modification to the example Go code goes just after the import statements:

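// randomWalkBool models a one-dimensional random walk.
type randomWalkBool struct {
  maxInc, intVal, threshold int
}

// nextValue takes one random step and reports whether the walk is
// still below the threshold. It needs "math/rand" in the import list.
func (rw *randomWalkBool) nextValue() bool {
  rw.intVal += rand.Intn(rw.maxInc) - rw.maxInc/2
  return rw.intVal < rw.threshold
}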

The intVal field in the struct stores an accumulated value that changes randomly with every call to nextValue(). We can instantiate the struct in the main function with a maximum increment, a starting value, and a threshold like this:

func main() {
  rwSuccess := randomWalkBool{maxInc: 999, intVal: 0, threshold: 400}

It took a little experimenting to come up with parameter values that simulate a failure, because I did not go into the statistical theory of random walks. When we receive an HTTP request, we return a server error if the walk has reached the threshold:

originalHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
  if !rwSuccess.nextValue() {
    http.Error(w, "You got unlucky", http.StatusServiceUnavailable)
    return
  }

Charting the Service Level Indicator in Prometheus

The request rate over a one-minute interval can be computed in the Prometheus query language (PromQL) with the expression:

rate(ochttp_tutorial_opencensus_io_http_server_response_count_by_status_code[1m])

A view of the chart for the request rate is shown below.

Screenshot: View of Request Rate in the Prometheus Console

The green line is the rate of errors and the red line is the rate of successful requests. A simple computation of the SLI is given by an expression directly translating the SLI formula:

rate(ochttp_tutorial_opencensus_io_http_server_response_count_by_status_code{http_status=~"2.*"}[1m]) / rate(ochttp_tutorial_opencensus_io_http_server_response_count_by_status_code[1m])

The chart for this is shown below.

Screenshot: Simple Calculation of SLI

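The result is a little disappointing: a constant value of 1.0, broken by gaps with no values. The division matches series by identical label sets, so each 2xx series in the numerator is divided only by itself, and the error series never enter the ratio. The Prometheus best practices page on Histograms and Summaries suggests a better formula, which sums across series before dividing: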

sum(rate(ochttp_tutorial_opencensus_io_http_server_response_count_by_status_code{http_status=~"2.*"}[1m])) / sum(rate(ochttp_tutorial_opencensus_io_http_server_response_count_by_status_code[1m]))

This generates the chart below:

Screenshot: Better Calculation of SLI

That looks a lot better. But wait. We are out of SLO! The next step would be to set up alarms so that you can detect that you are out of SLO and respond rapidly.
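In practice, alerting would more likely be done with Prometheus alerting rules and Alertmanager. Purely as an illustration of the check itself, here is a minimal sketch that parses the JSON body of a Prometheus instant query (the /api/v1/query endpoint of the HTTP API) and compares the SLI with the 99.9% target. The sample body is hand-written rather than captured from a real server, and the queryResponse and sliFromResponse names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// queryResponse mirrors only the parts of an instant-query response
// that this sketch needs.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			// Value is a [unix timestamp, "sample value as a string"] pair.
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// sliFromResponse extracts the single sample from an instant-vector result.
// It returns an error unless the response holds exactly one series.
func sliFromResponse(body []byte) (float64, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, err
	}
	if resp.Status != "success" || len(resp.Data.Result) != 1 {
		return 0, fmt.Errorf("unexpected response: status=%q, %d series",
			resp.Status, len(resp.Data.Result))
	}
	s, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("sample value is not a string")
	}
	return strconv.ParseFloat(s, 64)
}

func main() {
	// Hand-written sample body; the SLI value 0.9971 is made up.
	body := []byte(`{"status":"success","data":{"resultType":"vector",` +
		`"result":[{"metric":{},"value":[1546300800,"0.9971"]}]}}`)
	sli, err := sliFromResponse(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	const slo = 0.999 // the 99.9% target used in this article
	if sli < slo {
		fmt.Printf("out of SLO: SLI %.4f is below %.3f\n", sli, slo)
	}
}
```

The aggregated sum(...) / sum(...) expression yields a single series with no labels, which is why the sketch insists on exactly one result.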