Building a Dashboard for a data processing pipeline using the Stackdriver Dashboard API
This is part 1 of a two-part series.
- Part 1: covers identifying recommended Stackdriver Monitoring metrics to use for a data processing pipeline
- Part 2: covers how to use the Stackdriver Dashboards API to implement the charts and dashboard described in part 1 from a JSON template.
What should go in the Dashboard
I spend a lot of time talking to DevOps and SRE teams that are using or are considering how to use Stackdriver. One of the consistent questions that I receive is guidance on what metrics should be monitored.
The best overall guidance that I’ve seen on this topic comes from the Site Reliability Engineering book under Chapter 6 — Monitoring Distributed Systems. In the chapter, they discuss the “Four Golden Signals” which you should consider monitoring in your system.
The Four Golden Signals are:
- Latency — the time it takes for your service to fulfill a request
- Traffic — how much demand is directed at your service
- Errors — the rate at which your service fails
- Saturation — a measure of how close your service’s resources are to being fully utilized
You can use these monitoring categories when considering the important metrics to monitor in your system, or even in a data processing pipeline. For the purposes of monitoring, you can treat the pipeline as the “service” to be monitored. This means that you can consider the metrics in each product component and how those metrics map to the “Four Golden Signals”. In the remainder of this post, I cover how you can map metrics to charts for an example data processing pipeline deployed in GCP.
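As a sketch, the mapping from the four signals to the Stackdriver Monitoring metric types discussed in the rest of this post can be captured in a small lookup table:

```python
# Mapping of the Four Golden Signals to the Cloud Monitoring metric
# types covered in the sections below (illustrative, not exhaustive).
GOLDEN_SIGNAL_METRICS = {
    "traffic": [
        "pubsub.googleapis.com/topic/send_request_count",
        "dataflow.googleapis.com/job/element_count",
        "bigquery.googleapis.com/storage/uploaded_bytes",
    ],
    "latency": [
        "dataflow.googleapis.com/job/data_watermark_age",
        "dataflow.googleapis.com/job/system_lag",
    ],
    "saturation": [
        "dataflow.googleapis.com/job/data_watermark_age",
        "pubsub.googleapis.com/topic/oldest_unacked_message_age_by_region",
    ],
    "errors": [
        "logging.googleapis.com/log_entry_count",
    ],
}

for signal, metrics in GOLDEN_SIGNAL_METRICS.items():
    print(f"{signal}: {len(metrics)} metric(s)")
```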
The sample data processing pipeline
I extended a sample data processing pipeline based on this reference guide, “Monitoring a Compute Engine footprint with Cloud Functions and Stackdriver”, by adding a simple Cloud Dataflow template that reads from a Pub/Sub topic and writes to BigQuery. The architecture described in the reference guide builds an inventory of Compute Engine instances and then writes the results to Stackdriver Monitoring.
In this post, the Cloud Functions are simply a data source for our Pub/Sub, Cloud Dataflow and BigQuery data processing pipeline. The architecture consists of a series of Cloud Functions connected via Pub/Sub which write their results to Stackdriver Monitoring. The addition of the Cloud Dataflow component allowed me to write the results to BigQuery in addition to Stackdriver Monitoring for the purposes of this pipeline example.
For the purposes of this example, I am focusing on the metrics to monitor the Pub/Sub entry point, Cloud Dataflow and BigQuery components which are highlighted in the red boxes in the diagram below.
You can generalize this pipeline to the following steps:
- Send metric data to a Pub/Sub topic
- Receive data from a Pub/Sub subscription in Cloud Dataflow
- Write the results to BigQuery
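The middle two steps can be wired together with the Google-provided Pub/Sub-to-BigQuery Dataflow template. As one hedged illustration (the project, subscription and table names below are placeholders), the launch-request body for that template might look like this:

```python
import json

# Hypothetical names -- substitute your own project, subscription and table.
PROJECT = "my-project"
SUBSCRIPTION = f"projects/{PROJECT}/subscriptions/metric-events"
OUTPUT_TABLE = f"{PROJECT}:pipeline_demo.metric_events"

# Launch-request body for the Google-provided Pub/Sub-subscription-to-
# BigQuery Dataflow template (dataflow.projects.templates.launch).
launch_body = {
    "jobName": "pubsub-to-bigquery-demo",
    "parameters": {
        "inputSubscription": SUBSCRIPTION,
        "outputTableSpec": OUTPUT_TABLE,
    },
    "environment": {"zone": "us-central1-f"},
}

print(json.dumps(launch_body, indent=2))
```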
With the “Four Golden Signals” monitoring framework and this simple data processing pipeline in mind, the next sections recommend specific metrics to chart for each of the monitoring signals.
The screenshot below shows the six charts on a dashboard that together provide coverage for the data processing pipeline.
Traffic
Traffic represents how many requests are being serviced over a given time; a common way to measure traffic is requests/second. I chose to build three different charts for the three technologies in the data processing pipeline architecture (Pub/Sub, Cloud Dataflow and BigQuery) because the y-axis scales turned out to be orders of magnitude apart for each metric, which makes a combined chart hard to read. You may choose to include them on a single chart for simplicity.
Dataflow traffic chart
Stackdriver Monitoring provides many different metrics for Cloud Dataflow, which you can find in the metrics documentation. Broadly, they are categorized into overall job metrics like job/total_vcpu_time and processing metrics like job/element_count.
Since we’re looking to monitor the traffic through Cloud Dataflow, job/element_count, which represents “The number of elements added to the pcollection so far”, aligns well with measuring the amount of traffic. Importantly, the metric increases as the volume of traffic increases. Thus, it’s a reasonable metric to use to understand the traffic coming into a pipeline.
The screenshot below captures the Cloud Dataflow traffic chart in the dashboard.
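For reference, a Monitoring filter for this metric, as it might appear in a chart definition or a `timeSeries.list` call, could be built like the following sketch (the job name is a placeholder, and the label name is an assumption based on the `dataflow_job` monitored-resource type):

```python
DATAFLOW_TRAFFIC_METRIC = "dataflow.googleapis.com/job/element_count"

def dataflow_traffic_filter(job_name: str) -> str:
    """Build a Cloud Monitoring filter for Dataflow element counts."""
    return (
        f'metric.type="{DATAFLOW_TRAFFIC_METRIC}" '
        'AND resource.type="dataflow_job" '
        f'AND resource.label.job_name="{job_name}"'
    )

print(dataflow_traffic_filter("pubsub-to-bigquery-demo"))
```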
Pub/Sub traffic chart
Stackdriver Monitoring metrics for Pub/Sub are categorized into topic, subscription and snapshot metrics. Both subscription and topic metrics may be used to chart traffic, since they represent the two sides of a message published to Pub/Sub.
Since I want to see the amount of incoming traffic, looking at the metrics for the inbound topics that receive the data is a reasonable choice. Specifically, topic/send_request_count, which represents the “Cumulative count of publish requests, grouped by result”, aligns well with measuring the amount of traffic.
The screenshot below captures the Pub/Sub traffic chart in the dashboard.
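As a preview of the JSON template covered in part 2, the chart’s data set can be sketched in the shape the Dashboards API expects; here `ALIGN_RATE` converts the cumulative request count into requests/second (a sketch, not the exact template from this dashboard):

```python
# Sketch of one xyChart dataSet for the Monitoring Dashboards API.
pubsub_traffic_dataset = {
    "timeSeriesQuery": {
        "timeSeriesFilter": {
            "filter": (
                'metric.type="pubsub.googleapis.com/topic/send_request_count" '
                'AND resource.type="pubsub_topic"'
            ),
            # ALIGN_RATE turns the cumulative count into a per-second rate.
            "aggregation": {
                "alignmentPeriod": "60s",
                "perSeriesAligner": "ALIGN_RATE",
            },
        },
    },
    "plotType": "LINE",
}
```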
BigQuery traffic chart
Stackdriver Monitoring metrics for BigQuery are categorized into bigquery_project, bigquery_dataset and query metrics.
Since I would like to see the amount of incoming traffic, looking at the metrics related to uploaded data is a reasonable choice. Specifically, storage/uploaded_bytes aligns well with measuring incoming traffic to BigQuery.
The screenshot below captures the BigQuery traffic chart in the dashboard.
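A hedged sketch of the filter and aggregation for this chart: since uploaded_bytes is a delta metric, summing within each alignment period and grouping by dataset (label name assumed from the `bigquery_dataset` resource type) gives bytes uploaded per minute per dataset:

```python
BQ_TRAFFIC_FILTER = (
    'metric.type="bigquery.googleapis.com/storage/uploaded_bytes" '
    'AND resource.type="bigquery_dataset"'
)

# uploaded_bytes is a delta metric, so sum it per alignment period and
# across series, grouped by dataset (sketch; label name assumed).
BQ_TRAFFIC_AGGREGATION = {
    "alignmentPeriod": "60s",
    "perSeriesAligner": "ALIGN_SUM",
    "crossSeriesReducer": "REDUCE_SUM",
    "groupByFields": ["resource.label.dataset_id"],
}
```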
Latency
Latency represents how long it takes to service a request over a given time. A common way to measure latency is the time required to service a request, in seconds. In this sample architecture with Pub/Sub, BigQuery and Cloud Dataflow, useful latency metrics may indicate how long data takes to move through the Cloud Dataflow pipeline (or its individual steps), how long a message remains unacknowledged in Pub/Sub, and how long it takes to insert records into BigQuery.
System lag chart
Since I’d like to see the amount of time that it takes to service requests, the metrics related to processing time and lag are reasonable choices. Specifically, job/data_watermark_age, which represents “The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline”, and job/system_lag, which represents “The current maximum duration that an item of data has been awaiting processing, in seconds”, align well with measuring the time taken to process data through the Cloud Dataflow pipeline.
The screenshot below captures the Cloud Dataflow system lag chart in the dashboard.
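Because both latency metrics are Dataflow gauges measured in seconds, they can share one chart; a sketch of a two-dataSet chart definition (Dashboards API shape assumed, as above):

```python
def dataflow_latency_dataset(metric_suffix: str) -> dict:
    """One chart dataSet per Dataflow latency metric (sketch)."""
    return {
        "timeSeriesQuery": {
            "timeSeriesFilter": {
                "filter": (
                    f'metric.type="dataflow.googleapis.com/job/{metric_suffix}" '
                    'AND resource.type="dataflow_job"'
                ),
                "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                },
            },
        },
        "plotType": "LINE",
    }

# Both metrics are in seconds, so they can share one y-axis.
latency_chart = {
    "title": "Dataflow system lag and watermark age",
    "xyChart": {
        "dataSets": [
            dataflow_latency_dataset("system_lag"),
            dataflow_latency_dataset("data_watermark_age"),
        ],
    },
}
```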
Saturation
Saturation represents how utilized the resources that run your service are. The point is to monitor metrics that show when the system may begin to be constrained. In this sample architecture with Pub/Sub, BigQuery and Cloud Dataflow, useful saturation metrics are the age of the oldest unacknowledged messages (if processing slows down, messages remain in Pub/Sub longer) and, in Cloud Dataflow, the watermark age of the data (if processing slows down, messages take longer to get through the pipeline).
Since I’d like to see when the service is approaching its provisioned capacity, one assumption I can make is that the time to process a given message slows down as the system approaches full utilization of its resources. This may not always hold for data processing pipelines in general, but since I am processing asynchronously with Pub/Sub and Cloud Dataflow, the assumption is a reasonable one.
Specifically, job/data_watermark_age, which we used above, and topic/oldest_unacked_message_age_by_region, which represents “Age (in seconds) of the oldest unacknowledged message in a topic”, align well with measuring increases in Cloud Dataflow processing time and in the time for the pipeline to receive and acknowledge input messages from Pub/Sub.
The screenshot below captures the Saturation chart for Pub/Sub and Cloud Dataflow in the dashboard.
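For the Pub/Sub side of the saturation chart, a backlog is a worst-case signal, so taking the maximum age rather than an average is the natural aggregation; a sketch (the per-region metric label name is an assumption):

```python
SATURATION_FILTER = (
    'metric.type="pubsub.googleapis.com/topic/'
    'oldest_unacked_message_age_by_region" '
    'AND resource.type="pubsub_topic"'
)

# Backlog age is a worst-case signal: align and reduce with MAX so a
# single slow region is not averaged away (sketch; label name assumed).
SATURATION_AGGREGATION = {
    "alignmentPeriod": "60s",
    "perSeriesAligner": "ALIGN_MAX",
    "crossSeriesReducer": "REDUCE_MAX",
    "groupByFields": ["metric.label.region"],
}
```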
Errors
Errors represent application errors, infrastructure errors or failure rates. The point here is to monitor a metric that shows an increased error rate when errors are encountered. In this sample architecture with Pub/Sub, BigQuery and Cloud Dataflow, the metrics that may be useful to understand errors are the errors reported in the logs for Pub/Sub, Cloud Dataflow and BigQuery.
Data processing pipeline errors chart
Since I’d like to see the error rate for the service, I can look at the errors reported in the logs for the services included in the architecture. Specifically, log_entry_count, which represents the “Number of log entries”, filtered separately for each of the three services, aligns well with measuring increases in the number of errors.
The screenshot below captures the Errors chart for Pub/Sub, Cloud Dataflow and BigQuery in the dashboard.
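One way to build the three per-service data sets is to reuse a single filter builder and vary only the monitored-resource type; a sketch (the BigQuery resource-type name is an assumption):

```python
def error_log_filter(resource_type: str) -> str:
    """Filter for ERROR-severity log entries of one resource type (sketch)."""
    return (
        'metric.type="logging.googleapis.com/log_entry_count" '
        f'AND resource.type="{resource_type}" '
        'AND metric.label.severity="ERROR"'
    )

# One errors dataSet per service in the pipeline (resource types assumed).
ERROR_FILTERS = [
    error_log_filter(rt)
    for rt in ("pubsub_topic", "dataflow_job", "bigquery_resource")
]
```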
Using the dashboard
In this post, I have described an approach for selecting metrics for a data processing pipeline based on the “Four Golden Signals”. You can easily build this dashboard yourself by hand in the Dashboards section of the Stackdriver Monitoring console. However, an even better approach is to use a dashboard template. Read part 2 of this series to learn how to deploy this dashboard from a JSON template.
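As a small preview of part 2, the Dashboards API accepts a JSON document shaped like the sketch below; this skeleton shows one widget, and the full template would wire in all six charts described above (names and filters here are illustrative, not the exact template):

```python
import json

# Minimal Monitoring Dashboards API (v1) skeleton -- a sketch only.
dashboard = {
    "displayName": "Data processing pipeline",
    "gridLayout": {
        "columns": "2",
        "widgets": [
            {
                "title": "Pub/Sub traffic",
                "xyChart": {
                    "dataSets": [
                        {
                            "timeSeriesQuery": {
                                "timeSeriesFilter": {
                                    "filter": (
                                        'metric.type="pubsub.googleapis.com'
                                        '/topic/send_request_count"'
                                    ),
                                },
                            },
                            "plotType": "LINE",
                        }
                    ],
                },
            },
            # ...five more widgets, one per chart described above...
        ],
    },
}

# The template is plain JSON, so it round-trips cleanly to a file.
print(json.dumps(dashboard)[:80])
```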