You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring to track Service Level Indicators (SLIs) and make sure you fulfill your SLAs.
And now you want to create monitoring dashboards to visualize the metrics you collect and understand your product’s behavior. How should you do that? What dashboards should you create? What metrics should you add to each dashboard?
Strategy 1: Yeah, I do not know. We have metrics, we plot metrics
- All metrics, one dashboard: One image is worth 1 word, so you add 1000 small charts in one dashboard. Wonder why metrics or trends get missed.
- No correlation or flow between metric charts/panels: Dashboard cohesion does not matter. Plot metrics as you remember them, in any order, from any level and layer. Force your team to spend time scanning across the dashboard for related metric panels placed very far apart. Wonder why nobody is using the dashboard.
- Do not aggregate metrics: Plot a separate line on each panel for each process/server/service instance. Do you have 100 service instances reporting response time? Plot 100 lines on your response time panel. Wonder why it’s hard to assess the overall system behavior.
- Mix metrics from different levels in the same panel: Mix infrastructure and application level metrics in the same panel. Plot service instance count and response time in the same panel. Plot the number of processes and error rate in the same panel. Wonder why panels are hard to read.
- No variables, no drill-down: It is what it is. Do not provide any variable parameters or drill-down options to select and view metrics for a certain client/service/environment. Assume that the dashboard will be cloned and any such selections hardcoded in each new clone. Wonder why it’s becoming more and more difficult to find the original dashboard.
- No overview dashboard: Create a separate dashboard for each metric level. Create a separate dashboard for each component. Make your team go through 3 dashboards and mentally link business to application to infrastructure metrics to determine overall system health. Wonder why onboarding new team members takes forever.
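For contrast with the "do not aggregate" habit above, here is a minimal sketch of collapsing per-instance response times into three summary series (median, 95th percentile, max) before plotting. The function names, the input shape (one equally long, time-aligned list per instance), and the choice of a nearest-rank percentile are assumptions made for illustration; most monitoring backends offer their own aggregation functions, which you would normally use instead.

```python
import math
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate_across_instances(samples_by_instance):
    """Collapse per-instance series into median, p95, and max series.

    samples_by_instance maps a (hypothetical) instance id to a list of
    response times in ms; all lists are assumed to share the same timestamps.
    """
    summary = {"median": [], "p95": [], "max": []}
    series = list(samples_by_instance.values())
    # zip(*series) yields, for each timestamp, the values of all instances.
    for values_at_t in zip(*series):
        summary["median"].append(median(values_at_t))
        summary["p95"].append(percentile(values_at_t, 95))
        summary["max"].append(max(values_at_t))
    return summary


# 100 instances, two timestamps each: instance i reports i ms, then 2*i ms.
samples = {f"instance-{i}": [i, 2 * i] for i in range(1, 101)}
summary = aggregate_across_instances(samples)
```

With this shape, the response time panel shows three lines instead of 100, and the per-instance view can move to a drill-down dashboard.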
Strategy 2: Overview. Top-down. Left-right. Cohesive. Consistent.
- Overview dashboard: Build a dashboard that gives a quick overview of the health of your system. Provide one top panel trumping everything, showing the highest level metric indicating system performance (or whatever you are tracking). A single glance at that panel should indicate whether things are OK with your system.
- Top-down structure: Structure dashboard panels in rows and columns. Start from the highest level metrics at the top rows, and go down in metric level as you add additional rows. E.g. business-impacting metrics would be among the top panel rows, then application metrics, and infrastructure would be among the last.
- Column per component: If possible, reserve one column of panels per component or processing phase. Plot the same metric for each component/phase in separate horizontally arranged panels, on the same row. At a first glance at the dashboard you should see if the system has a problem. At a second glance you should see which component/processing phase has a problem. At a third glance you should see the problem source.
- Left-right structure: As you navigate your dashboard left to right, if possible, the panels should plot metrics according to the data flow in your system. You should be able to quickly identify where in the data flow there is a problem.
- Variable parameters and drill-down: Provide the option to restrict the display of metrics to particular clients, services, or environments, using variable parameters and drill-down menu options. This is extremely useful when problems appear only with individual clients or services, and enables an engineer to focus on them while debugging.
- Individual detailed dashboards: Provide individual dashboards for debugging individual services and components. They should enable deep dives and debugging problems with specific services or components. Keep the same top-down left-right structure for your individual dashboards. Consistency across dashboards decreases debugging effort.
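The layout rules above (overview panel on top, metric levels going top-down, components going left to right in data-flow order) can be sketched as a small grid-placement routine. This is an illustrative sketch only: the `Panel` type, the level names, and the component names `ingest`, `process`, and `store` are hypothetical and not tied to any particular dashboard tool.

```python
from dataclasses import dataclass

# Assumed ordering: highest-level metrics first (top rows).
LEVELS = ["business", "application", "infrastructure"]
# Hypothetical components, listed in data-flow order (left to right).
COMPONENTS = ["ingest", "process", "store"]


@dataclass(frozen=True)
class Panel:
    title: str
    row: int
    col: int
    width: int = 1  # number of grid columns the panel spans


def build_layout(metrics, overview_title="System health"):
    """Place panels on a grid from a list of (metric, level, component) tuples.

    Row 0 holds a single overview panel spanning every column. Each
    (level, metric) pair then gets its own row, ordered by level, and each
    component gets its own column.
    """
    row_keys = []
    for name, level, _component in metrics:
        if (level, name) not in row_keys:
            row_keys.append((level, name))
    # Stable sort: rows are ordered by level, first-seen order within a level.
    row_keys.sort(key=lambda key: LEVELS.index(key[0]))
    row_of = {key: index + 1 for index, key in enumerate(row_keys)}

    panels = [Panel(overview_title, row=0, col=0, width=len(COMPONENTS))]
    for name, level, component in metrics:
        panels.append(
            Panel(
                title=f"{component}: {name}",
                row=row_of[(level, name)],
                col=COMPONENTS.index(component),
            )
        )
    return sorted(panels, key=lambda panel: (panel.row, panel.col))


metrics = [
    ("cpu usage", "infrastructure", "store"),
    ("error rate", "application", "process"),
    ("error rate", "application", "ingest"),
    ("orders per minute", "business", "ingest"),
]
layout = build_layout(metrics)
```

Because the same metric for different components shares a row, a problem in one processing phase shows up as one panel that looks different from its neighbors. Reusing the same `LEVELS` and `COMPONENTS` ordering for the detailed dashboards keeps them consistent with the overview.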