Observability: Dashboard Patterns: Aggregate View

dm03514, Dm03514 Tech Blog, Mar 6, 2019

Providing a uniform, consistent experience is critical to quickly debugging issues; getting the necessary information should be low friction and familiar. One way to accomplish this is to have each service provide a uniform, standardized view into its operation. An aggregate view can be used to characterize the overall health of a system by answering: How much work is the system doing? What is the result of that work? And how long is it taking to perform the work? I think it’s critically important to find a convention for service dashboards in order to facilitate quick responses and uniform views into our systems, and this is a convention that I personally like and have found success with.

Why views?

During incidents all time is critical, since the client experience, money, and reputation are at stake. To facilitate fast, client-centric debugging, uniform views into a system can be used. If views were OOP primitives they would be interfaces. Views should be parametrizable (by service, environment, etc.) and show the same structure of metrics. While parametrizable dashboards are ideal for reusability and uniformity, they are not always achievable; in that case views are conventions. Each service dashboard should expose these aggregates in order to create a more uniform experience. Anyone who understands how to interpret and correlate these aggregate signals will be able to do so for any service.
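As a loose illustration of the “views as interfaces” idea, here is a minimal Python sketch; the class and metric names are hypothetical and not tied to any particular dashboarding tool:

```python
# Hypothetical sketch: a view defines a fixed structure of panels,
# parametrized by service and environment, so every service gets the same
# shape of dashboard with only the parameters changing.
from dataclasses import dataclass


@dataclass
class AggregateView:
    service: str
    environment: str

    def panels(self) -> list[str]:
        prefix = f"{self.service}.{self.environment}"
        return [
            f"throughput: {prefix}.requests_per_second",
            f"availability: {prefix}.success_percentage",
            f"latency: {prefix}.p99_by_operation",
        ]


# The same "interface" rendered for two different services:
print(AggregateView("checkout", "production").panels())
print(AggregateView("search", "production").panels())
```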

There are many different views a service might benefit from having:

  • Aggregate view (discussed here)
  • SLO View — Visualize the SLOs for a given service
  • Component View — Language runtime stats: GC, event loop, ticks, OS threads; Go: goroutines, heap, GC. Client libraries.
  • Service View — Service-specific metrics: queue depths, branch rates, implementation stats, etc.
  • System View — Resources of the system executing the service: CPU, disk, memory, network; could be Docker container resources or virtual instance resources
  • Resources View — Load balancers, external queues, etc., which could also be expressed as their own Service Views.

Views are consistent windows into a particular dimension of an application. Full visibility into a system would be seen through the sum of its views. Views also closely match levels of abstraction. It’s often helpful to choose a single level of abstraction and view the system in terms of that. Sometimes it is important to correlate events across levels of abstraction: consider an event where client latencies are increasing; correlating this with machine network latencies or disk latencies may be helpful. Views are consistent, uniform, and standardized. They are a primitive for viewing a specific dimension of an application.

So what’s an Aggregate View?

An Aggregate view represents a top-level view into the system and provides a starting point for all debugging. Aggregate views help to inform whether there is currently a client problem and help to determine if there is an online incident or an offline incident. Aggregate views should be the uppermost sections of dashboards, containing 3 of the 4 Golden Signals: Throughput, Availability, and Latency.

Throughput

Throughput is the amount of work a service is performing in a given interval. Throughput should be normalized to requests per second for easy correlation with all other metrics. If the units were to default to metric count per interval, it could be misleading when comparing to other dashboards, which may have different intervals based on the sample length. Requests per second is the lingua franca rate unit.
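As a small, hedged illustration of why normalization matters, the following Python sketch converts raw counts sampled over different intervals into a common requests-per-second rate (the function and numbers are made up for illustration):

```python
# Minimal sketch: convert a raw count observed over a sample interval into a
# per-second rate, so dashboards with different sample intervals line up.
def requests_per_second(count: float, interval_seconds: float) -> float:
    return count / interval_seconds


# Two services sampled at different intervals report the same true rate:
print(requests_per_second(600, 10))    # 60.0 rps from a 10s sample interval
print(requests_per_second(3600, 60))   # 60.0 rps from a 60s sample interval
```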

This is the per-second rate of requests over the entire service; it answers “how much work is the service performing?” Since the point of a service is to do something, this is a foundational metric that other metrics are built on. This should be intuitive, because latency is a by-product of work: we must work in order to be latent. The same goes for availability: in order to produce a result, work has to be performed, making throughput a foundational metric in understanding service health and crafting a proportional incident response.

During an incident, almost every insight is enriched by correlating it with work and understanding how much work the service was performing at any given point in time. This is why this metric is the first metric on a dashboard.

Additionally, some metric systems allow easy expansion over machines. Datadog, for example, allows graphing the rate per machine and summing it, so the graph shows not only the aggregate request rate per second for the entire service but also the per-host rate, allowing for easy identification of hot machines.

Availability

Availability is the percentage of successful requests over the total number of requests. Availability is expressed as a percentage: 99.9% of requests succeed.

The inverse of availability is error rate: 0.1% of requests failed. Availability is reported from the service’s perspective in the logical sense, not the physical sense: it indicates the availability of successful responses, not necessarily whether the service was reachable. Availability answers: What is the result of the work the service is doing?
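As a quick, illustrative sketch (the counts below are made up), availability and its inverse, error rate, can be computed from success and total counts like this:

```python
# Illustrative sketch: availability as the percentage of successful requests
# over the total number of requests; error rate is its complement.
def availability_percent(success_count: int, total_count: int) -> float:
    if total_count == 0:
        return 100.0  # no work performed; nothing has failed
    return 100.0 * success_count / total_count


total_requests = 100_000
failed_requests = 100
avail = availability_percent(total_requests - failed_requests, total_requests)
print(f"availability: {avail:.1f}%")        # 99.9%
print(f"error rate:   {100 - avail:.1f}%")  # 0.1%
```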

Availability helps to visualize an increase in error rates for a service. Availability also helps to put incidents in perspective by qualifying how big an impact the results of work are having on a client. If the client is saying their requests are failing but the general request rate and error rate are consistent, it helps to characterize the response. Contrast this with an event where the request rate is consistent but the error rate is increasing. This is a much stronger indicator of an issue and would most likely require a different response than the first case. The important thing here is that decisions and the proportionality of responses are based on data and not intuition.

Latency

The final aggregate is latency: the amount of time it takes to perform work. This is generally represented as a percentile (p99) latency per operation, which says that 99 percent of requests for a given operation took less than the time graphed. It helps to characterize trends in latency for a given endpoint and allows correlating a client expectation with service performance.

It’s also a stronger indicator of client experience and is critical in enforcing client expectations through alerts: if there’s a sustained rate of high latencies, then clients are being impacted. This needs to be graphed relative to a given operation or it could hide data. Consider a service with two operations: fast and slow. If latency were shown as a single max latency for the server, the slow operation would dominate that number, hiding the fast operation’s latency. But suppose the fast operation comprised 90% of the work done by the server: a latency increase in the fast operation could easily have a higher impact on the client than an increase in the slow operation, which is why latency is graphed per operation.
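To make the fast/slow example concrete, here is a small Python sketch with made-up latency samples showing how a single service-wide p99 is dominated by the slow operation, while per-operation p99s keep the fast operation visible:

```python
import random

random.seed(0)

# Made-up latency samples: the fast operation is 90% of the traffic.
samples = {
    "fast": [random.uniform(0.005, 0.020) for _ in range(900)],
    "slow": [random.uniform(0.200, 0.800) for _ in range(100)],
}


def p99(values):
    # Nearest-rank style 99th percentile over the recorded samples.
    ordered = sorted(values)
    return ordered[int(0.99 * (len(ordered) - 1))]


# A single service-wide p99 is dominated by the slow operation...
print(f"service-wide p99: {p99(samples['fast'] + samples['slow']):.3f}s")
# ...while per-operation p99s would surface a regression in the fast one.
for operation, values in samples.items():
    print(f"{operation} p99: {p99(values):.3f}s")
```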

Conclusion

When Aggregate views are exposed for each service, they provide a uniform experience for getting the pulse of how a service is performing. When the same view is shared across services, it makes it possible for any responder to quickly understand and characterize an event. It’s important to note that Aggregate views are a starting point for debugging, not an end; they require supporting metrics to explain the why and how of the data they expose.

Thank you
