“Prevention is Better than Cure” — Measuring the Health of Software Applications
“Prevention is better than cure” ~ Erasmus
Erasmus’ quote, though written 500 years ago, applies just as well to software applications. Once an application fails in production, it can severely impact dependent teams and lead to revenue loss. This can be avoided if we prevent production failures.
Much like the human body, an application can show symptoms and signals before it crashes. We have to treat these as early warnings from our application.
When you create a software application, can you take the pulse of your app to detect any abnormalities in its signals or metrics? You might think you could leave that to Application Performance Monitoring (APM) products. But APM products only monitor standard operating system metrics and runtime-specific metrics (e.g., the Java Virtual Machine).
If we could capture the signals that impact our application directly and expose them as health metrics, then we would be able to detect any abnormality in our applications much earlier and prevent production incidents.
Our team recently designed and developed a streaming Extract, Transform, Load (ETL) pipeline that consumes events from Kafka and writes them to Google BigQuery. This pipeline is critical to PayPal, as most analytical readouts are based on the data in BigQuery. It handles approximately 30–55 billion events per day.
As this is a streaming ETL application, events are written to BigQuery in real time. We cannot afford any downtime due to failures, as that would impact downstream applications and analytical readouts.
Our team decided to capture the following health metrics from our application so that we get warnings before a production incident occurs.
Health Metrics
BQ Streaming Latency: Events are written to BigQuery in batches of 100 events. This streaming goes from on-premise data centers to Google Cloud Platform (GCP), over PayPal’s interconnect between the two. It is important to have visibility into normal latency and to be alerted if latency goes above the acceptable threshold for any single machine.
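As a rough sketch of how this latency can be captured, here is one way to time a batch insert with the google-cloud-bigquery Java client. The dataset and table names and the metrics hook are hypothetical, not our production code.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;

import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class StreamingLatencyMetric {
    private final BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();
    private final TableId table = TableId.of("analytics_dataset", "events"); // hypothetical names

    /** Streams one batch of rows and returns the observed insert latency in milliseconds. */
    public long insertAndMeasure(List<Map<String, Object>> batch) {
        InsertAllRequest.Builder request = InsertAllRequest.newBuilder(table);
        batch.forEach(request::addRow);

        long start = System.nanoTime();
        InsertAllResponse response = bigQuery.insertAll(request.build());
        long latencyMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);

        // Placeholder metrics hook: report per-instance latency so an alert
        // can fire when it crosses the acceptable threshold.
        System.out.printf("bq.streaming.latency_ms=%d errors=%b%n", latencyMs, response.hasErrors());
        return latencyMs;
    }
}
```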
BQ Streaming Retry: When a batch of events is written to BigQuery, the response error codes are classified as retryable or non-retryable. Retryable errors are retried until the events are successfully written to BigQuery. Hourly retry counts are captured, and the percentage of retries is calculated and shown on our dashboard.
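A minimal sketch of the classification and the retry-percentage bookkeeping. The set of retryable error reasons below is an illustrative assumption; tune it to the errors you actually observe.

```java
import com.google.cloud.bigquery.BigQueryError;

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

public class RetryClassifier {
    // Reasons commonly treated as transient for streaming inserts (assumption).
    private static final Set<String> RETRYABLE_REASONS =
            Set.of("backendError", "internalError", "rateLimitExceeded", "timeout");

    private final AtomicLong batches = new AtomicLong();
    private final AtomicLong retriedBatches = new AtomicLong();

    /** Returns true if every failed row in the insert response is safe to retry. */
    public boolean isRetryable(Map<Long, List<BigQueryError>> insertErrors) {
        return insertErrors.values().stream()
                .flatMap(List::stream)
                .allMatch(error -> RETRYABLE_REASONS.contains(error.getReason()));
    }

    public void recordBatch(boolean wasRetried) {
        batches.incrementAndGet();
        if (wasRetried) retriedBatches.incrementAndGet();
    }

    /** Hourly retry percentage shown on the dashboard (counters reset each hour). */
    public double retryPercentage() {
        long total = batches.get();
        return total == 0 ? 0.0 : (100.0 * retriedBatches.get()) / total;
    }
}
```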
Peak Throughput Per Sec: After benchmarking our application on a two-core machine, we measured a throughput of 8,000 events per second for a single instance. We reserved 50% of that as catch-up capacity, so when traffic goes above 4,000 events per second, we get an alert. We can then add more instances to our application pool or upgrade from two-core to four-core machines.
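The check itself is a per-second counter compared against the reserved threshold; a sketch, with the alerting hook left as a placeholder:

```java
import java.util.concurrent.atomic.LongAdder;

public class ThroughputGuard {
    private static final long CAPACITY_PER_SEC = 8_000;               // benchmarked single-instance capacity
    private static final long ALERT_THRESHOLD = CAPACITY_PER_SEC / 2; // 50% reserved as catch-up capacity

    private final LongAdder window = new LongAdder();

    public void onEvent() {
        window.increment();
    }

    /** Invoked once per second by a scheduler; resets the counting window. */
    public void tick() {
        long perSec = window.sumThenReset();
        if (perSec > ALERT_THRESHOLD) {
            // Stand-in for a real alerting hook.
            System.err.println("ALERT: throughput " + perSec + "/s exceeds 50% of benchmarked capacity");
        }
    }
}
```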
Active Consumers vs Total Consumers: Each of our application instances is a rebalancing Kafka consumer. Because rebalancing consumers are highly available, a single instance failure would go unnoticed: the remaining consumers rebalance and start consuming from the orphaned partitions. It is therefore important to monitor the total number of consumers versus the active consumers in each of our Kafka consumer groups.
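One way to capture this metric is Kafka’s AdminClient, comparing the group’s live members against the deployed pool size. The bootstrap server, group id, and expected count below are hypothetical:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

import java.util.Properties;
import java.util.Set;
import java.util.concurrent.ExecutionException;

public class ConsumerCountMetric {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical

        String groupId = "etl-consumer-group"; // hypothetical group id
        int expectedConsumers = 12;            // size of the deployed application pool (assumption)

        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(Set.of(groupId))
                    .describedGroups().get(groupId).get();

            int activeConsumers = group.members().size();
            System.out.printf("group=%s active=%d expected=%d%n", groupId, activeConsumers, expectedConsumers);
            if (activeConsumers < expectedConsumers) {
                System.err.println("ALERT: consumer group is running with fewer instances than deployed");
            }
        }
    }
}
```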
Delayed Partitions: This metric captures whether consumption from any one Kafka partition is lagging behind the other partitions. The root cause could be an issue with either the Kafka consumers or the Kafka brokers; the metric triggers an alert for further investigation.
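A sketch of that comparison using Kafka’s AdminClient: compute per-partition lag, then flag partitions far above the rest. The 5x-median threshold is an illustrative assumption, not our production heuristic.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

public class DelayedPartitionDetector {
    /** Flags partitions whose lag is far above the rest of the group. */
    public static List<TopicPartition> findDelayed(AdminClient admin, String groupId)
            throws ExecutionException, InterruptedException {
        // Committed offsets for every partition the group consumes.
        Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets(groupId)
                .partitionsToOffsetAndMetadata().get();

        // Latest (end) offsets for the same partitions.
        Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
        committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

        // Per-partition lag, then flag outliers relative to the median.
        Map<TopicPartition, Long> lag = new HashMap<>();
        committed.forEach((tp, meta) -> lag.put(tp, latest.get(tp).offset() - meta.offset()));

        long median = lag.values().stream().sorted().skip(lag.size() / 2).findFirst().orElse(0L);
        return lag.entrySet().stream()
                .filter(e -> e.getValue() > Math.max(5 * median, 1_000)) // illustrative threshold
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```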
Pipeline Delay: The pipeline is marked as delayed if even a single partition is delayed, because 100% of the data is not available until that partition catches up. The pipeline status therefore only advances to the date and time up to which 100% of the data is available for consumption.
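In other words, the pipeline status is the minimum of the per-partition watermarks, along these lines:

```java
import java.time.Instant;
import java.util.Collection;

public class PipelineWatermark {
    /**
     * The pipeline is only "complete" up to its slowest partition, so the
     * status timestamp is the minimum of the per-partition watermarks.
     */
    public static Instant pipelineStatus(Collection<Instant> partitionWatermarks) {
        return partitionWatermarks.stream().min(Instant::compareTo).orElse(Instant.EPOCH);
    }
}
```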
Received vs Processed Events Count: These counts let us reconcile the number of events received/consumed from Kafka each hour with the number of events written to BigQuery. An Hourly Integrity Percentage is derived from these two metrics.
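The derivation is simple arithmetic:

```java
public class IntegrityMetric {
    /** Hourly integrity: the share of events consumed from Kafka that reached BigQuery. */
    public static double integrityPercentage(long receivedFromKafka, long writtenToBigQuery) {
        return receivedFromKafka == 0 ? 100.0 : (100.0 * writtenToBigQuery) / receivedFromKafka;
    }
}
```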
Streaming Inserts Quota Usage: BigQuery (BQ) limits the amount of data that can be inserted into BQ tables under a GCP project. The default limit is 1 GB per second, and it can be increased by creating a Google support ticket. We requested and received a higher quota for our GCP project, but it is important to track our usage so we can request a further increase once we reach 70% of the quota.
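A sketch of the usage check; the quota constant below reflects the default rather than our actual granted quota, and the alert is a placeholder:

```java
import java.util.concurrent.atomic.LongAdder;

public class QuotaUsageMetric {
    private static final long QUOTA_BYTES_PER_SEC = 1L << 30; // assumption: the default quota

    private final LongAdder bytesThisSecond = new LongAdder();

    public void onBatchInserted(long batchSizeBytes) {
        bytesThisSecond.add(batchSizeBytes);
    }

    /** Invoked once per second; alerts when usage crosses 70% of the quota. */
    public void tick() {
        long used = bytesThisSecond.sumThenReset();
        double usagePct = (100.0 * used) / QUOTA_BYTES_PER_SEC;
        if (usagePct >= 70.0) {
            System.err.printf("ALERT: streaming insert quota at %.1f%%, time to request an increase%n", usagePct);
        }
    }
}
```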
Some of the above health metrics were easy to capture, but metrics like Delayed Partitions were a little trickier: identifying the delay is only possible by comparing each consumer against the others, so we had to make design changes to support it. But it is worth the effort, as “prevention is better than cure.”
Conclusion
Software development does not end when the application starts to work as expected. It is the developer’s responsibility to gather the health metrics needed to detect abnormal behavior caused by changes in the environment. We have shared some of the health metrics for our application above, and we will keep adding to this list as we learn more about our application’s behavior.
Our Team: Shobana Neelakantan, Archit Agarwal, Vignesh Raj K, Govinda Raj