SRE: Uptime: Probing 101 — Using Google’s Cloudprober

dm03514 · Published in Dm03514 Tech Blog
Jul 22, 2018 · 10 min read


Blackbox uptime probes are simple checks, easy to operate and maintain, that add a solid first level of system observability. Probes help answer two questions:

  • Is the system reachable?
  • How long does it take to reach?

Uptime probing is about availability in the strictest sense: given a request, is there a route to a service, and how long does that route take? Blackbox uptime probes work very similarly to ICMP pings in that they are a lightweight, easy way to determine whether a service is reachable. By providing a historic view of latencies and reachability, they serve as a first level of debugging and make for an excellent data point when correlating issues.

This article aims to explore what probes are and how to use them. We’ll also walk through an example of probing using Google’s open source prober, cloudprober, and show how to store the results in Postgres.

Description

Uptime probes are a class of active black box checks supporting multiple levels of probing: network (ping), transport (UDP) and application (HTTP, DNS). Active checks are checks configured and initiated by a prober service such as cloudprober: a daemon that issues probes using the specified protocols and reports on the results. Centralizing checks in the prober, and having the prober issue them (hence "active"), lets the prober keep track of how many checks were made and the results of those checks. Knowing the total number of checks makes it easy to calculate aggregate availability. While there are other checks which provide deeper system observability, probes are a lightweight way to determine what is broken, by providing insight into whether a service is reachable or not.
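
As a mental model only (a minimal sketch, not cloudprober's implementation), an active HTTP prober boils down to a loop that issues a request on an interval, counts total and successful checks, and derives aggregate availability from those counters. The target URL, interval and success criterion below are placeholder assumptions for illustration.

package main

import (
    "log"
    "net/http"
    "time"
)

func main() {
    // Hypothetical probe target; the tutorial's test server listens on localhost:5000.
    const target = "http://localhost:5000/"
    client := &http.Client{Timeout: 5 * time.Second}

    var total, success float64
    for range time.Tick(10 * time.Second) {
        total++
        resp, err := client.Get(target)
        if err == nil && resp.StatusCode < 500 {
            success++
        }
        if resp != nil {
            resp.Body.Close()
        }
        // Aggregate availability is simply successful checks over total checks.
        log.Printf("total=%.0f success=%.0f availability=%.4f", total, success, success/total)
    }
}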

Probing: Surface Level Checks

Black box checks take place from a client’s perspective and provide insight into system reachability as a client sees it. Services like Pingdom and New Relic Synthetics offer this for publicly visible systems.

Load balancers frequently use probes as their health checks to determine the reachability of downstream services. These are simple active probes that usually make decisions based on a binary reachable/not-reachable state; they also specify a maximum tolerable latency (through a timeout) and a number of retries before a downstream service is marked unhealthy.

The value that a tool like cloudprober provides is a historic view of the data collected by probes, which allows for cause correlation when system issues do occur. The diagram below shows a prober that checks for service uptime (service level) through a service’s public interface. If there are redundant instances running behind the load balancer, the probe should only fail when no instance is available. Of course there is a chance of false positives from probe failure or network issues between the prober and the load balancer, but the visibility into a service potentially not being available to an actual client far outweighs the false positives.

Contrast this with probing at the instance level. The diagram below shows a prober probing each individual instance, which allows per-instance reachability (instance level) to be recorded. In a container-based infrastructure, probing instances needs to be combined with some sort of target discovery, since most schedulers can preempt and reschedule instances, resulting in frequently changing target IPs.

There’s also no reason instance- and service-level probing can’t be combined. Choosing the scope of probing is one of the crucial decisions in rolling out probing. If you’re just getting started, service-level probing may be considerably easier to get going with (since load balancers are statically addressable, i.e. usually through DNS), while still providing extremely high value observability into system health.

What probes aren’t

Probes are not synthetic transactions. They don’t provide much insight into what an application is doing or whether it is functionally able to fulfill its business value. There are many situations where a service could be reachable but not doing useful work.

Probes are not the only tool necessary for full system observability. While probes are more useful than no observability at all, they are most useful in conjunction with synthetic transactions, white-box monitoring, CI service tests and application logging. They are significant in helping to establish a full, multi-dimensional system view. Google defines availability as “whether a system is able to fulfill its intended function at a point in time.” Probes can’t provide full insight into this, but they are a critical layer.

Probes are not a good fit for services that aren’t being asked anything. For request/response oriented services, reachability is a prerequisite for functionality. For a reactive, queue-based service the same no longer holds: in most cases clients are not making requests directly to the service. Certain frameworks and container schedulers (Consul/Kubernetes) may have health checks, but overall processing throughput or queue-based metrics are a better indicator of system health.

Why use it?

Service-level probing gives visibility into whether a service is reachable from a client’s perspective. Having this insight is absolutely crucial, because if a request/response service isn’t reachable it will be unable to accept new requests or provide clients with any value.

During an incident, establishing whether a service is reachable is a primitive but critical first step. If clients are reporting issues with a service, knowing whether it is reachable lets probes partition the problem space, invalidate a whole class of issues, and help accelerate remediation.

Probing allows us to classify a service as available or unavailable. While this binary simplicity doesn’t give a complete view of overall system health, it does provide critical first-level insight by classifying something as “reachable” or “not reachable”.

How to start probing

  • Determine whether you have a request/response service: Probing tells you whether a service is reachable or not. Probing an asynchronous service may not be that valuable: if a service consumes off a queue and is keeping up with it, reachability is not a great indicator of health. Compare this to a request/response service, where a client that cannot reach the service gets no value from it.
  • Choose your protocol: Ping has less overhead but indicates less than HTTP, while HTTP can exercise the HTTP application handling inside a service. Testing DNS lookups requires the DNS probe type.
  • Determine the client: Are we checking for instance uptime or service uptime? Service-level uptime is more client centric, while instance level is more ops centric. Both are helpful and easy, but service level has a lower barrier to entry since it can require only a single probe against a load balancer.
  • Choose appropriate probe rates: Suppose we probe over HTTP at the instance level every 100ms, our framework establishes a db connection and issues a query on each request, and there are 10 instances; we have just caused 10 probes/second * 10 instances = 100 MySQL queries/second. Cloudprober uses a rate of ~5 seconds in many of its examples. A probe rate of ~1/minute may be a good number to get started with (see the example configuration after this list).
  • Determine how the probing data will be actionable: In the case of an incident, how will the data be queried? How will probes be correlated to events? Google’s SRE book covers the organizational aspects of this in depth. Metrics that are collected but never used are a waste of resources. Cloudprober supports exporting probe metrics to a number of different backends.
  • Establish a baseline internal SLO to guide collection: Even if this means choosing an arbitrary number within your team, it will force a plan for probing, collecting, visualizing, and acting on the data.
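
As a concrete starting point, below is a sketch of what a service-level probe at roughly one probe per minute could look like in cloudprober's config format. The hostname and port are placeholders for your own load balancer; only configuration fields that also appear later in this article are used.

# Service-level probe against a load balancer, roughly once per minute.
probe {
  name: "someservice"
  type: HTTP

  targets {
    host_names: "someservice.company.com"
  }

  interval_msec: 60000
  timeout_msec: 5000

  http_probe {
    protocol: HTTP
    port: 80
  }
}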

Uses

Below are a couple of patterns which are enabled by probing and can provide strong indicators of issues. All are trivially achievable using a probing system such as cloudprober, and each can provide a valuable first level of insight into system health, reachability and performance.

SLA

Probes generate the data needed to calculate a service reachability SLO, such as: a service should be reachable at least 99.99% of the time. The image below shows the aggregate availability (successful requests over total requests) for a test prober and service. The test server has errors injected in order to make the graph a little more interesting than a flat line.

Objective automated metrics such as uptime are one of the cornerstones of SRE, serve as a proxy for quality, and help orient teams and product towards a common goal. Even though this is a superficial measurement it is still largely better than nothing. Google does a much better job of explaining the power of these techniques in their SRE Book and regularly on their Cloud Platform blog.

Suppose we establish a service level objective specifying that a service (someservice.company.com) will be available to probes 87% of the time.

With the probe data collected, we can easily measure this and alert as soon as we have, or are approaching, an infraction.
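
As a sketch of what that measurement could look like using the Postgres backend introduced later in this article (and assuming, as the queries later in the article suggest, that "total" and "success" are cumulative counters), availability over a recent window can be computed from counter deltas and compared against the 87% objective. The probe name and window below are placeholders.

-- Availability over the last hour for the 'test_server' probe, compared to an 87% SLO.
SELECT (max(success) - min(success)) / NULLIF(max(total) - min(total), 0) AS availability,
       (max(success) - min(success)) / NULLIF(max(total) - min(total), 0) >= 0.87 AS within_slo
FROM (SELECT CASE WHEN metric_name = 'total'   THEN value END AS total,
             CASE WHEN metric_name = 'success' THEN value END AS success
      FROM metrics
      WHERE labels ->> 'probe' = 'test_server'
        AND metric_name IN ('total', 'success')
        AND time > now() - interval '1 hour') windowed;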

Connectivity

Using probes to determine connectivity can also provide powerful insight. Since probers like cloudprober expose binary success and failure counts, they are easy to analyze at the beginning of debugging. Timeouts could indicate client-side failures.

Below are the total and successful probe rates. Failures can be either timeouts or explicit failures (no status code, broken connections, etc.).

Total & Successful Probes

The failures, combined with the timeouts, can help characterize problems as network level or application level. Additionally, cloudprober provides HTTP status code information for its HTTP prober.

Rate of timeouts

The above example shows probe timeouts. These are client-configured timeouts, using Go’s http.Client timeout, and they indicate that from the client’s point of view the deadline was exceeded. While a spike in timeouts could be a configuration issue (timeouts that are unreasonably low), with appropriately scoped timeouts, and in combination with HTTP response status codes, they can help indicate network-level slowness.
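
For reference, here is a small Go sketch of how a client can distinguish a timeout from other failures; it mirrors the mechanism described above (http.Client's Timeout) rather than cloudprober's actual code, and the target URL is a placeholder.

package main

import (
    "errors"
    "fmt"
    "net"
    "net/http"
    "time"
)

// classify reports whether a probe failure was a client-side timeout
// or some other failure (refused connection, DNS error, etc.).
func classify(err error) string {
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return "timeout"
    }
    return "failure"
}

func main() {
    // 90ms mirrors the timeout_msec used in the tutorial's probe config.
    client := &http.Client{Timeout: 90 * time.Millisecond}

    resp, err := client.Get("http://localhost:5000/")
    if err != nil {
        fmt.Println(classify(err))
        return
    }
    defer resp.Body.Close()
    fmt.Println("success:", resp.StatusCode)
}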

Latency Analysis — Impending failures

Using latency information from probes is one of my favorite patterns, as it helps characterize the client’s perspective of a service and is a strong indicator of issues.

Having a prober that keeps track of latencies as a histogram is critical. We can easily see what tail-end latencies look like and track latencies over time in order to correlate or predict issues. Visualizing this as a time series can provide critical insight into system behavior leading up to an event. The above graph shows a test service which is artificially injecting latencies.
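
As an illustration only (cloudprober maintains its own latency distributions), here is a tiny Go sketch of how a tail-end latency such as p99 can be derived from raw probe latency samples; the samples are made up for the example.

package main

import (
    "fmt"
    "math"
    "sort"
    "time"
)

// percentile returns the nearest-rank percentile (0-100) of the latency samples.
func percentile(samples []time.Duration, p float64) time.Duration {
    if len(samples) == 0 {
        return 0
    }
    sorted := append([]time.Duration(nil), samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    idx := int(math.Ceil(float64(len(sorted))*p/100.0)) - 1
    if idx < 0 {
        idx = 0
    }
    return sorted[idx]
}

func main() {
    // Hypothetical probe latencies collected over some window.
    samples := []time.Duration{
        12 * time.Millisecond, 15 * time.Millisecond, 11 * time.Millisecond,
        14 * time.Millisecond, 250 * time.Millisecond, // one slow outlier drives the tail
    }
    fmt.Println("p50:", percentile(samples, 50))
    fmt.Println("p99:", percentile(samples, 99))
}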

Getting Started with Cloudprober + Postgres

Cloudprober is a black box monitoring tool from Google. It’s a free, open source application which supports many different protocols and reporting backends. It emits Prometheus metrics natively but is easily pluggable with many other backends. Additionally, it supports some more advanced features like target discovery (for instance-level probes). It can be thought of as a self-hosted Pingdom, allowing the uptime of internal network resources to be monitored.

Cloudprober comes with a Postgres data backend, which makes it extremely accessible to introduce black box monitoring into an organization. At my own organization many teams and engineers are extremely familiar with Postgres (and RDBMSs in general) but less familiar with Prometheus, the Prometheus server, Grafana, and querying data in those. Cloudprober with Postgres strikes a good balance between ease of setup and a familiar backend.

This tutorial will walk through configuring cloudprober to collect metrics over HTTP and save them to a Postgres backend. We’ll then go through some examples of how to use those metrics to calculate basic uptime. All code is contained in the sre-tutorials GitHub repo. In addition to cloudprober and Postgres, it contains a test server that can be configured to inject errors in order to give us more variable data.

  • Start the test server (the probe target) with a failure rate of 1/100 requests:
(sretutorials) 1 vagrant@ubuntu-xenial:/vagrant_data/go/src/github.com/dm03514/sre-tutorials/availability/probing_101⟫ go run probetestserver/main.go -request-failure-rate 100
DEBUG: 2018/07/18 13:53:30 main.go:123: starting test server on: "127.0.0.1:5000"
  • Configure cloudprober to probe the local test server (the config is already in the repo). The probe interval is artificially low in order to generate data quickly:
# Probe the local test server over HTTP every 100ms (artificially low for demo data).
probe {
  name: "test_server"
  type: HTTP

  targets {
    host_names: "localhost"
  }

  interval_msec: 100
  timeout_msec: 90

  http_probe {
    protocol: HTTP
    port: 5000
  }
}

# Write probe results to the local Postgres instance.
surfacer {
  type: POSTGRES
  postgres_surfacer {
    connection_string: "postgresql://root:root@localhost/cloudprober?sslmode=disable"
    metrics_table_name: "metrics"
  }
}
  • Start cloudprober and Postgres; cloudprober will immediately begin probing the test server:
vagrant@ubuntu-xenial:/vagrant_data/go/src/github.com/dm03514/sre-tutorials/availability/probing_101⟫ make start-postgres-stack

Current Availability

SELECT (success.value / total.value) AS current_availability
FROM (SELECT value
      FROM metrics
      WHERE labels ->> 'probe' = 'test_server'
        AND metric_name = 'total'
      ORDER BY time DESC
      LIMIT 1) total,
     (SELECT value
      FROM metrics
      WHERE labels ->> 'probe' = 'test_server'
        AND metric_name = 'success'
      ORDER BY time DESC
      LIMIT 1) success;

Executed using the project as:

docker exec -it postgres psql -d cloudprober -c "select (success.value / total.value) as current_availability FROM (select value from metrics WHERE labels->> 'probe' = 'test_server' and metric_name='total' order by time DESC limit 1) total, (select value from metrics WHERE labels->> 'probe' = 'test_server' and metric_name='success' order by time DESC limit 1) success"

 current_availability
----------------------
    0.935623378228098
(1 row)

This works out to a current uptime of roughly 93.6%.

Rates/Visualization/SLA

For more complicated calculations and data visualization, Grafana is an amazing platform which supports a Postgres integration and multiple visualization formats out of the box. While Postgres + cloudprober is an extremely accessible combination, storing metrics in Prometheus allows for hooking into Prometheus’ rich ecosystem.
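
As a sketch of the kind of query a Grafana Postgres panel could run (same schema and cumulative-counter assumptions as the availability queries above), a per-minute availability time series can be computed from counter deltas:

-- Per-minute availability for the 'test_server' probe.
SELECT date_trunc('minute', time) AS minute,
       (max(CASE WHEN metric_name = 'success' THEN value END)
          - min(CASE WHEN metric_name = 'success' THEN value END))
       / NULLIF(max(CASE WHEN metric_name = 'total' THEN value END)
          - min(CASE WHEN metric_name = 'total' THEN value END), 0) AS availability
FROM metrics
WHERE labels ->> 'probe' = 'test_server'
  AND metric_name IN ('total', 'success')
GROUP BY 1
ORDER BY 1;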

Conclusion

Probing is a lightweight, reliable way to begin introducing SRE concepts and observability into any organization, and tools like cloudprober further reduce the barrier to entry for this kind of data collection. I hope this article provides an overview of what probing is, why it is valuable and how to use it. I appreciate you reading and would love any feedback. Thank you.
