Visual Patterns to Improve Monitoring Dashboards

Patrick Sudhaus
6 min read · Nov 30, 2022

Monitoring is essential to ensuring a system is running well in production. Without monitoring, you are driving a car with a blindfold on. It can go well up to a certain point, but inevitably, you will have a fender bender or even a deadly crash.

The whole purpose of monitoring is to detect issues before your users do. With thousands or millions of active users, people are quick to complain when something is not working correctly, which hurts the experience and the brand significantly.

This story focuses on visual monitoring dashboards and leaves everything else aside, namely alerting, anomaly detection, and instrumentation, among others. The monitoring technology stack is also irrelevant, as everything stated here can be applied to every popular visualization platform. Well-designed dashboards provide data that can be understood easily and interpreted at a glance to decide whether further action is necessary and, ideally, even provide hints on where to start looking.

What to Monitor

The decision of what to monitor will affect the entire observability process. It is a double-edged sword: monitor too much, and important information may be lost due to information overload. Monitor too little, and relevant data may be missed, causing issues to be detected too late.

A good starting point is the outside: think about scenarios that affect users directly. For example, an API endpoint responding with HTTP 500 errors or answering extremely slowly does not identify the root cause, but it certainly brings visibility to an issue that must be investigated. In any event-based system, the dead letter topic is a good initial indicator that something is going wrong.
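To make this concrete, here is a rough sketch of what such outside-in signals could look like as Prometheus-style queries. The metric and topic names (http_requests_total, http_request_duration_seconds_bucket, orders.dlq) follow common conventions but are placeholders; substitute whatever your stack actually exposes.

```python
# Hypothetical outside-in signals for a single service. All metric and
# topic names below are placeholders following common Prometheus conventions.
USER_FACING_SIGNALS = {
    # Share of requests answered with HTTP 5xx over the last 5 minutes.
    "error_rate": (
        'sum(rate(http_requests_total{status=~"5.."}[5m]))'
        " / sum(rate(http_requests_total[5m]))"
    ),
    # 99th percentile latency; "extremely slow" is as user-visible as an error.
    "p99_latency": (
        "histogram_quantile(0.99,"
        " sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
    ),
    # Messages parked on the dead letter topic; anything above zero is a hint.
    "dead_letter_depth": 'sum(dlq_messages{topic="orders.dlq"})',
}
```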

We don’t want (at least in the beginning) to obsess over technical metrics, e.g., the throughput of database operations. Sure, there could be an issue here, but the technical nature of this metric makes it difficult to tell good from evil, and who has service level objectives (SLOs) on DB writes anyway? A database that stops answering will also be apparent in every upstream metric and will be investigated as a result.

Slicing Monitoring Dashboards

Similar to slicing microservices, you can cut monitoring dashboards along functional or aspect-oriented seams.

  • Functional dashboards cover all the information required to understand whether a component (e.g., a microservice) is working as expected. They cover every aspect of that component: are API requests being answered correctly and quickly, are cron jobs running regularly, are asynchronous tasks being processed in a timely manner?
  • Aspect-oriented dashboards cover a single aspect of the entire system. For example, in an event-driven system, it may be helpful to see all queues with their number of unprocessed messages, the oldest message age, and the corresponding dead letter topics, giving a good overview of that one aspect across the whole system.

A mix of both provides the best observability, as functional and aspect-oriented dashboards go hand in hand when analyzing an issue. The dashboard design patterns below apply to functional and aspect-oriented dashboards alike.
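As a rough sketch, assuming made-up service and queue names, the two slicing styles could be modeled like this:

```python
# A minimal sketch of the two slicing styles as plain data. The service
# name, queue names, and panel identifiers are invented for illustration.

# Functional: everything needed to judge one component, across all aspects.
functional_dashboard = {
    "title": "order-service",
    "panels": ["api_error_rate", "api_latency", "cron_last_success",
               "async_task_backlog"],
}

# Aspect-oriented: one aspect of the system, across all components.
aspect_dashboard = {
    "title": "event-queues",
    "panels": [
        {"queue": q, "metrics": ["depth", "oldest_message_age", "dlq_depth"]}
        for q in ["orders", "billing", "shipping"]
    ],
}
```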

Designing Dashboards

I was once on a call with a UX designer and, for a short moment, showed her one of our dashboards. Immediately, she pointed out how user-unfriendly it was: too much data at once, dozens of data visualization types, a lack of structure, no visual cues, and no easy-to-understand indication of what is going well and what is not.

These kinds of dashboards provide all the necessary information but make it challenging to extract value from them, especially for developers who are new to the team or don’t look at the dashboards often.

Colors

Colors are the premier way to indicate what is going well and what is not. Besides the traditional traffic light representation, blue is an excellent color for information that is rated neither good nor bad.

  • Green: The system aspect works as designed and requires no action.
  • Yellow: The system aspect is still within the expected behavior but is creeping towards a boundary. This may be an early indicator of an issue but does not always mean that there is an issue.
  • Red: The system aspect is having issues and needs to be investigated. The shade of red may indicate how critical a metric is.
  • Blue: The system aspect is left unrated. For example, neither a low nor a high throughput on an API endpoint means that something is going wrong per se. If there are known limits, it could turn yellow at a certain threshold, but otherwise it is left unjudged.
Service dashboard overview section using colors to pull attention to issues.

This summary section of a calculation service dashboard allows viewers to understand which calculation types are working and which are not. The top blue row states throughput without judging its values. Rows two and three carry the service level objectives (SLOs) in their names and show a color-coded representation of the current value compared to the SLO. Immediately, we can see that processing in the second column is struggling. The last row shows a gauge representing the percentage of successful calculations, and clearly, the second gauge is screaming at us. Something is going wrong!
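The underlying color logic can be sketched in a few lines. This is only an illustration of the thresholds described above, assuming a metric where higher is worse (such as an error ratio) and a warning band at 80% of the SLO; the function and parameter names are my own:

```python
# A sketch of the color logic described above, assuming a metric where
# higher is worse (e.g., an error ratio) and a warning band before the SLO.
def status_color(value, slo, warn_fraction=0.8, rated=True):
    """Map a current value against its SLO to a traffic-light color.

    rated=False mirrors the blue row: the value is shown without judgment.
    """
    if not rated:
        return "blue"    # informational only, e.g., raw throughput
    if value >= slo:
        return "red"     # SLO breached: investigate
    if value >= slo * warn_fraction:
        return "yellow"  # creeping toward the boundary
    return "green"       # working as designed

# Example: a 1% error budget with the current error ratio at 0.9%.
print(status_color(0.009, slo=0.01))  # -> "yellow"
```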

Consistent Components

As in UX design, users are accustomed to specific patterns they already know how to interact with. Applied to dashboard design, this means that one type of representation should only be used for one kind of metric. Nothing is worse than some time series graphs being stacked and others not: the same value looks very different and will confuse, at least at first glance.

State timeline depicting errors of a REST API over time. Issue duration and recovery can be understood at a glance.

A state timeline is one of the best ways to depict the different issue types that occurred in a given time period. Use this visualization for all system components, and anyone will immediately know when to look into excessive issues. Depicting something else with it, like throughput, would be harmful, as high throughput would look like a high error rate, even though the two are unrelated.

Visualization variations are a good solution when multiple metric types are best visualized using the same visualization type. They ensure each component has only one meaning while not limiting the functionality of the dashboard.

Visualizing data using different types of the same visualization.
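One way to keep this discipline is to make the metric-kind-to-visualization mapping explicit instead of leaving it to each dashboard author. A minimal sketch, with made-up metric kinds and visualization labels:

```python
# A sketch of a one-meaning-per-visualization registry: every metric kind
# is bound to exactly one visualization type, and variations (bars vs.
# lines) stay within that type. All names are illustrative.
VISUALIZATION_FOR = {
    "issue_timeline": "state-timeline",  # only ever used for error states
    "throughput": "time-series/bars",    # a variation, still a time series
    "latency": "time-series/lines",      # same type, different variation
    "slo_attainment": "gauge",
}

def panel_for(metric_kind: str) -> str:
    # Fail loudly instead of letting an ad-hoc visualization sneak in.
    return VISUALIZATION_FOR[metric_kind]
```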

Consistent Sections

Consistency applies not only to individual elements but also to sections. For example, say you own multiple microservices that publish events and subscribe to others. Although customized dashboards for different event types are not bad per se, a consistent layout with consistent data makes the most sense, as it can be understood more quickly. Adding further, specific information on top is fine because the viewer has already understood the high-level overview and knows the context.

A message queue consumption dashboard section to be reused across services. Consistency helps compare and contrast different queue processes and understand them at a glance.

The dashboard section above describes all of the essential information for the consumption of events. The first row represents the active queue (green because we are up to date), and the second row shows the dead letter topic (red because we never like events that fail). The bottom half then goes into more detail: how quick processing is, what our throughput looked like, and finally, which message-processing errors we encountered, shown in the state timeline.
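Such a section is a natural candidate for templating: generate it from the queue name, and layout and meaning stay identical across every service. A sketch, with invented metric identifiers:

```python
# A sketch of generating the same consumption section for every queue so
# that layout and meaning stay identical across services. The queue names
# and metric identifiers are placeholders.
def queue_consumption_section(queue: str) -> dict:
    return {
        "title": f"Consumption: {queue}",
        "rows": [
            # Top half: at-a-glance state of the queue and its dead letters.
            [f"{queue}.depth", f"{queue}.oldest_message_age"],
            [f"{queue}.dlq.depth", f"{queue}.dlq.oldest_message_age"],
            # Bottom half: processing speed, throughput, and error timeline.
            [f"{queue}.processing_time", f"{queue}.throughput",
             f"{queue}.error_state_timeline"],
        ],
    }

sections = [queue_consumption_section(q) for q in ("orders", "billing")]
```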

Absolute Numbers

Personally, I am on the fence about whether absolute numbers have a place in monitoring dashboards designed for system observability, as such dashboards should not creep into business intelligence territory. However, absolute numbers are the most effective when talking with stakeholders outside the technical domain. “We were not able to process 5,000 orders in the past 24 hours” is received with significantly higher urgency than “we are not processing 0.X% of orders.”

Dashboard section depicting failed processings in absolute numbers.
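The difference between the two framings is trivial to compute but lands very differently; a tiny sketch, where the total order count is invented for illustration:

```python
# Contrasting the two framings of the same failure data. The failure count
# matches the example above; the total is made up for illustration.
failed_last_24h = 5_000
total_last_24h = 2_000_000

relative = failed_last_24h / total_last_24h
print(f"We are not processing {relative:.2%} of orders.")  # -> 0.25%
print(f"We were not able to process {failed_last_24h:,} orders "
      "in the past 24 hours.")  # lands with far more urgency
```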

Feel free to add techniques you utilize to make monitoring dashboards more developer-friendly and, thus, more useful for everyone involved.
