The Single Pain of Glass

Jamie Allen
Site Reliability Engineering Leadership
4 min read · Apr 22, 2023

How do we create better dashboards?

Yes, the title is an intentional joke about “pane” versus “pain.” I frequently hear leaders say they wish they had a “single pane of glass” (aka SPOG) to visualize all of the possible metrics that could impact a service, an application, or their entire business. The result is usually a lot of tiny graphs all mashed together into a single dashboard, or a bank of monitors mounted on a big wall.

Frankly, it becomes difficult to see anything, and I think there is a better way. At one large company where I worked, many application or service teams had a very large monitor on a stand close to their team area, each displaying a bunch of metrics. It certainly looked very cool. However, nobody paid any attention to it. Teams would get up, go to lunch, and walk right past the dashboard without noticing whether an incident was in progress. The large screen and the contents of the dashboard felt like they were largely for show, there to impress others who walked by.

Can a Single Pane of Glass Really Exist?

If we look at a mature, global deployment, how does even a single metric displayed on a dashboard help you? Let’s take a simple graph for Error Rate, where we’re tracking the number of unsuccessfully handled requests to a service. These requests might fail because of a database timeout, or a 500 error in the middleware, or whatever. And let’s imagine that the service is deployed in AWS to us-east-2a and us-east-2b (Ohio region, 2 availability zones) and us-west-2a and us-west-2b (Oregon region, 2 availability zones). This is a relatively simple deployment, because it has regional and data center redundancy, with no local zones or POPs to consider as well.

Assume we have an SLO of 99.9% of requests expected to be successful (999 out of 1000, also not crazy high). If we break our SLO and have 5 errors out of 1000 show up in a us-east-2b deployment, how do we display that in our SPOG? We’d have to have at least 4 widgets for the 4 possible deployment footprints of the service. And that still doesn’t take into account the tenancy model of the deployment (single-tenant deployments, or multi-tenant deployments in SaaS). Assuming the service has set 2–3 SLIs, and has built a SPOG showing all deployments of all hosts in all clusters in all data centers in all regions, this adds up to a lot of dashboard widgets fast. And all of them will be ignored, because they’re too small and do not grab your attention when there is a problem.
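To make the arithmetic concrete, here is a minimal sketch (the footprint names and counts are illustrative, not from any real deployment) of a 99.9% availability SLO evaluated independently for each of the four footprints:

```python
# Hypothetical per-footprint SLO check: a 99.9% availability SLO,
# evaluated separately for each (region, availability zone) pair.
SLO_TARGET = 0.999

# Sample request counts per footprint: (total requests, failed requests)
footprints = {
    ("us-east-2", "a"): (1000, 0),
    ("us-east-2", "b"): (1000, 5),  # 5 errors / 1000 = 99.5% -- SLO broken
    ("us-west-2", "a"): (1000, 1),  # 999 / 1000 = exactly 99.9% -- still OK
    ("us-west-2", "b"): (1000, 0),
}

def slo_met(total: int, errors: int) -> bool:
    """True when the success ratio meets or exceeds the SLO target."""
    return (total - errors) / total >= SLO_TARGET

for (region, az), (total, errors) in footprints.items():
    status = "OK" if slo_met(total, errors) else "SLO BROKEN"
    print(f"{region}{az}: {(total - errors) / total:.3%} {status}")
```

Even in this simple four-footprint case, each SLI needs its own check per footprint, which is exactly how the widget count explodes.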

A Perfect Dashboard

I think that we need a better approach. What if you had a dashboard that simply showed a green box when all observed metrics were nominal, turned amber when any one of them came within some tolerance of an SLO, and turned red when an SLO was broken? Nothing else displayed, just this color (with maybe a flashing annotation for those who are color blind). You would then click through the red/flashing status to see WHERE there is a problem: is it systemic, or local to a single host, cluster, data center, or region? From there you would click through the map to see WHY there is a problem, for example, latency has spiked in a Kubernetes cluster or pod in us-east-2a. Secondary and tertiary dashboards and traces would then show you why latency in that cluster has spiked, such as database slowness or a jump in network response time. Each SLO should have its own secondary and tertiary dashboards behind the current status explaining why it is not performing as expected, not lumped together with metrics and views for other measures. For example, do not put the secondary dashboards for Error Rate in the same view as the secondary dashboards for Latency; the whole thing is meant to be a directed tree of information.
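The top-level roll-up described above can be sketched in a few lines. This is my own illustration, not any particular tool’s API, and the amber tolerance is an arbitrary example value: each SLI maps to green, amber when within tolerance of the SLO, or red once the SLO is broken, and the single pane shows the worst color.

```python
# Sketch of a single-status roll-up: each SLI becomes green/amber/red,
# and the top-level pane shows the worst status across all of them.
GREEN, AMBER, RED = "green", "amber", "red"
SEVERITY = {GREEN: 0, AMBER: 1, RED: 2}

def sli_status(success_ratio: float, slo: float,
               tolerance: float = 0.0005) -> str:
    """Green when comfortably above the SLO, amber when within
    `tolerance` of it, red once the SLO is broken."""
    if success_ratio < slo:
        return RED
    if success_ratio < slo + tolerance:
        return AMBER
    return GREEN

def overall(statuses):
    """The single pane shows the worst status of any SLI."""
    return max(statuses, key=SEVERITY.__getitem__)

# Example: latency is drifting toward its SLO, so the pane goes amber.
slis = {"availability": 0.9998, "latency": 0.9991, "errors": 0.9996}
statuses = {name: sli_status(v, slo=0.999) for name, v in slis.items()}
print(statuses, "->", overall(statuses.values()))
```

Clicking through the amber pane would then take you to the WHERE and WHY views for the one SLI that produced it.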

Another benefit is that, with this approach, we can also discern when an issue is occurring for a specific kind of tenant in our footprint, which is a big deal for anyone running a multi-tenant SaaS application/service with varying deployment models. Imagine you have Tier 1 users of your service, and they have their own distinct deployment footprints for maximum performance and SLA attainment. Then imagine you have Tier 2 users in shared deployment footprints with lower SLAs (see my post about Tiered SLAs). You can see that you have a latency problem in a specific region and availability zone, and then see it’s impacting the hosts and clusters for your most important customers, which is extremely valuable.
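One way to sketch that tenant-aware view (the labels and records here are hypothetical): tag every SLO breach with its footprint and tenant tier, then group by tier so a single query shows whether your Tier 1 customers are affected.

```python
# Sketch: rolling up SLO breaches by tenant tier (labels are hypothetical).
# Each record tags a breached SLI with its footprint and tenant tier,
# so one grouping shows whether Tier 1 customers are impacted.
from collections import defaultdict

breaches = [
    {"sli": "latency", "region": "us-west-2", "az": "a", "tier": "tier1"},
    {"sli": "latency", "region": "us-west-2", "az": "a", "tier": "tier2"},
    {"sli": "errors",  "region": "us-east-2", "az": "b", "tier": "tier2"},
]

by_tier = defaultdict(list)
for b in breaches:
    by_tier[b["tier"]].append((b["sli"], f'{b["region"]}{b["az"]}'))

for tier, issues in sorted(by_tier.items()):
    print(tier, issues)
```

Here the latency spike in us-west-2a shows up under both tiers, immediately flagging that dedicated Tier 1 footprints are among those affected.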

The goal of a dashboard is to provide actionable information quickly. Trying to push too much information into a Single Pane of Glass is a recipe for it being ignored entirely. Instead, create dashboard views that tell you, in a few targeted clicks, when you have a problem, where the problem is, what the problem is, and why it is occurring.

SRE CTO. Ex-Software engineering leader behind Starbucks Rewards and MOP. Ex-Facebook SRE leader.