Autonomous application performance management (AAPM) is a new paradigm of application performance management (APM) which automates operational tasks such as failure rule configuration, failure rule tuning, failure triage, and correlation detection for root cause analysis. The volume, variety, and speed of application changes, especially driven by the growing adoption of microservices, warrant AI-enabled intelligent automation in lieu of traditional manual management or rule-based programming.
First, what is application performance management (APM)?
APM tools help software engineers monitor and troubleshoot failure risks. They typically have two functional components: Digital Experience Monitoring (DEM) and Application Discovery, Tracing, and Diagnosis (ADTD).
Digital experience monitoring (DEM) focuses on failure risk monitoring. It entails real-time monitoring of IT service metrics to evaluate performance and availability. Typical KPIs are API response time (latency), traffic, HTTP/transaction errors, resource utilization, and synthetic metrics such as Apdex. Metrics are often aggregated over the transaction space to reduce monitoring overhead and make monitoring more actionable. For example, instead of monitoring the latency of every service request, it is common practice to track the 99th-percentile latency.
Application Discovery, Tracing, and Diagnosis (ADTD) focuses predominantly on discovering the interactions among service components and on using distributed traces and event logs for troubleshooting.
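The two DEM aggregates mentioned above, a tail-latency percentile and Apdex, can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular APM tool computes them: the latency samples and the 0.5-second Apdex threshold are hypothetical, and the percentile uses the simple nearest-rank method.

```python
import math

def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def apdex(latencies_s, threshold_s):
    """Apdex = (satisfied + tolerating / 2) / total.

    satisfied:  latency <= T
    tolerating: T < latency <= 4T
    Everything slower counts as frustrated and contributes zero.
    """
    if not latencies_s:
        raise ValueError("latencies_s must not be empty")
    satisfied = sum(1 for t in latencies_s if t <= threshold_s)
    tolerating = sum(1 for t in latencies_s if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(latencies_s)

# Hypothetical request latencies in seconds: mostly fast, with a slow tail.
latencies = [0.1] * 90 + [0.8] * 8 + [3.0] * 2

p50 = percentile_nearest_rank(latencies, 50)   # the median hides the slow tail
p99 = percentile_nearest_rank(latencies, 99)   # the 99th percentile exposes it
score = apdex(latencies, threshold_s=0.5)      # assumed threshold T = 0.5 s

print(p50, p99, round(score, 2))
```

With these samples the median stays at 0.1 s while the 99th percentile lands on 3.0 s, which is why tail percentiles are usually the more actionable aggregate.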
Why is AI needed for modern application performance management?
Kubernetes is driving mainstream enterprise adoption of microservices. The key characteristic of microservice-based applications is a large number of interdependent metrics with high variance and multiple scales. The large metric count and the interdependency follow directly from the architecture, which consists of a large number of interdependent components. Let me shed some light on high variance and multiple scales.
Metric data often exhibit high variance, which can be attributed to stochastic application features such as uncertain user traffic, dynamic interdependencies across services, and infrastructure performance. For example, the px_volume_writethroughput metric in Figure 1 has a distribution that changes rapidly with time. Failure risk monitoring for such a non-stationary metric is hard because a static health rule cannot be written without compromising error margins. Figure 2 illustrates this distribution shift more vividly for the px_volume_writethroughput metric on an hourly basis.
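To make the static-rule problem concrete, here is a small sketch on synthetic data. Everything in it is an assumption for illustration: the series is made up (it is not the px_volume_writethroughput data from the figures), and the rolling mean-plus-k-sigma baseline is just one simple adaptive rule, not the approach this article advocates.

```python
import statistics

def static_alerts(series, threshold):
    """Indices where a fixed threshold is exceeded."""
    return [i for i, v in enumerate(series) if v > threshold]

def rolling_alerts(series, window, k):
    """Indices where a sample exceeds mean + k * std of the trailing window."""
    alerts = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        limit = statistics.fmean(history) + k * statistics.pstdev(history)
        if series[i] > limit:
            alerts.append(i)
    return alerts

def synthetic_metric():
    """Hypothetical metric, one sample per minute for two hours.

    Hour 1 hovers near 100, hour 2 near 300 (a baseline shift),
    and minute 100 contains one genuine spike.
    """
    values = []
    for i in range(120):
        level = 100 if i < 60 else 300
        values.append(level + (i % 5 - 2) * 2)  # small deterministic jitter
    values[100] = 600
    return values

series = synthetic_metric()
static = static_alerts(series, threshold=150)   # rule tuned on hour 1
rolling = rolling_alerts(series, window=20, k=3)

print(len(static), rolling)
```

The static rule tuned on the first hour fires on all 60 samples of the second hour, burying the genuine spike in noise. The rolling baseline fires only around the level change and at the spike, at the cost of briefly losing sensitivity while its window contains the spike.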
In a microservice architecture, a large number of services work in unison to form an application. Metrics from different services can have different length and frequency scales. Multiple scales are more common in distributed environments such as Kubernetes and Portworx. Figure 3 shows an example of interrelated metrics of multiple scales, with length scales ranging from 1 (px_disk_stats_progess_io) to 10e+7 (px_pool_stats_pool_written_bytes). Each individual metric often has different frequency scales too, as shown in Figure 4 for an AWS CPU utilization metric. These large variations in length and frequency scales make the traditional compare-and-contrast of metrics for root cause analysis hard. This is simply a problem beyond the human scale.
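The length-scale gap can be illustrated with two made-up series, one of order 1 and one of order 1e7. The sketch below applies z-score standardization, a common normalization chosen here only as an example, so that both can be compared on the same axis. The variable names and values are hypothetical and are not taken from the figures.

```python
import statistics

def zscore(series):
    """Standardize a series to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return [0.0 for _ in series]
    return [(v - mean) / std for v in series]

# Hypothetical samples: an in-progress I/O count (order of 1)
# and a bytes-written counter (order of 1e7).
io_in_progress = [1, 2, 1, 3, 2, 4, 3, 5]
written_bytes = [1.0e7, 2.1e7, 0.9e7, 3.2e7, 1.8e7, 4.1e7, 2.9e7, 5.0e7]

# On the raw scale the second metric dwarfs the first.
raw_gap = max(written_bytes) / max(io_in_progress)

# After standardization both series live on the same dimensionless scale.
z_io = zscore(io_in_progress)
z_bytes = zscore(written_bytes)
max_diff = max(abs(a - b) for a, b in zip(z_io, z_bytes))

print(f"raw gap: {raw_gap:.0e}, largest gap after z-scoring: {max_diff:.2f}")
```

Plotted raw, the first series would look like a flat line next to the second. After standardization the two shapes nearly overlap, which is what a human or an algorithm needs in order to compare them at all. Normalization handles only the length scale, not differences in frequency scale.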
In summary, AI/ML is warranted in failure monitoring because of high-variance metrics, and in root cause analysis because of the multiple scales in metrics.
Why is autonomous APM a hard AI problem?
An autonomous APM needs to automate failure rule configuration, failure rule tuning, correlation detection, and correlation relevancy detection. This is a multi-scale transient problem with rapidly changing metrics data distribution and metrics of diverse length scales.
Rapidly changing time-series distributions lead to covariate shift, which makes failure risk monitoring by traditional anomaly monitoring hard. Covariate shift is a shift in the time-series data distribution between training data and deployment data. It makes applying traditional AI/ML (e.g., support vector machines and deep learning) really hard. Avoiding this problem would require ensemble learning, which entails careful data curation that is nearly impossible for a production software application.
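One simple way to quantify how far a deployment window has drifted from the training window is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical distribution functions. The sketch below is an illustrative measurement only, with hypothetical data. It is neither a remedy for covariate shift nor the method used by any product named in this article.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)  # fraction of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Hypothetical metric values seen while the model was trained.
training = [float(v) for v in range(100, 200)]

# Deployment window A: practically the same distribution.
deployment_similar = [v + 0.5 for v in training]

# Deployment window B: the whole distribution has moved up by 80 units.
deployment_shifted = [v + 80 for v in training]

ks_similar = ks_statistic(training, deployment_similar)
ks_shifted = ks_statistic(training, deployment_shifted)

print(round(ks_similar, 2), round(ks_shifted, 2))
```

A statistic near 0 means the deployment data still look like the training data, and a value near 1 means the two barely overlap. Here the similar window scores 0.01 and the shifted window 0.8, the kind of gap at which a model trained on the first window is operating out of distribution.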
Scale variance is a critical challenge for correlation detection. Traditional pointwise correlation detection incurs significant error in the face of scale variance. It also obscures correlation relevancy across a large number of metrics (often, even 10 metrics are too onerous for software engineers).
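One way multiple frequency scales can mislead a pointwise correlation is sketched below. Two synthetic metrics share the same slow trend, but each carries its own large fast oscillation. The signals, their periods, and the moving-average smoothing are all assumptions made for illustration, and smoothing is only one simple way to look at a single time scale.

```python
import math
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def moving_average(series, window):
    """Mean of each full window of consecutive samples."""
    return [fmean(series[i:i + window]) for i in range(len(series) - window + 1)]

n = 400
# Shared slow component (period of 200 samples).
slow = [math.sin(2 * math.pi * t / 200) for t in range(n)]
# Each metric adds its own fast component (periods of 7 and 11 samples)
# with three times the amplitude of the shared trend.
metric_a = [s + 3 * math.sin(2 * math.pi * t / 7) for t, s in enumerate(slow)]
metric_b = [s + 3 * math.cos(2 * math.pi * t / 11) for t, s in enumerate(slow)]

# Pointwise correlation on the raw samples is dominated by the fast components.
raw_r = pearson(metric_a, metric_b)

# A 77-sample window spans whole periods of both fast components (77 = 7 * 11),
# so the averaging cancels them and leaves the shared slow trend.
smooth_r = pearson(moving_average(metric_a, 77), moving_average(metric_b, 77))

print(f"raw: {raw_r:.2f}, smoothed: {smooth_r:.2f}")
```

The raw correlation comes out weak even though the two metrics are driven by the same underlying trend, while the smoothed series correlate almost perfectly. In practice the relevant time scales are not known in advance, which is part of what makes the problem hard.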
Traditional AI brings high operating friction from feature engineering (for non-deep-learning algorithms) and from the careful learning orchestration needed to eliminate the out-of-distribution (OOD) problem caused by covariate shift. Moreover, the hyper-parameter tuning needed to reach satisfactory model fidelity can also be a serious drag. All these problems add up to prohibitively expensive data and processing overhead.
The lack of contextuality in traditional AI algorithms poses another big challenge for AI adoption in the APM world. Software engineers have to deal with rapid baseline changes. For example, a CPU utilization spike that was an anomaly in the morning might not be considered an anomaly in the afternoon, after a successful feature rollout. Likewise, a traffic surge that is normal on Black Friday could be abnormal otherwise. Assigning context for anomaly monitoring and ranking metric relevancy for root cause analysis are big challenges. The lack of context often leads to a high rate of false alarms, improper error margin usage, and, worst of all, catastrophic cascading failures.
Finally, is autonomous APM just another buzzword, or does it have practical business relevance?
In any organization, especially in tech, there is an inevitable conflict between innovation and reliability. Developers want to push innovative features fast and stay ahead of the competition. Site reliability engineers, on the other hand, prefer to move slowly without wasting precious error margin. An autonomous APM will mitigate this conflict and bring about two key benefits for organizations:
- Lower likelihood of application degradation which costs $10K/min (https://www.atlassian.com/incident-management/kpis/cost-of-downtime)
- Lower likelihood of alarm floods, which lead to serious productivity loss for software engineers, estimated at a 25% loss and ~50K per annum per engineer.
We at AdeptDC strongly believe we have made significant progress in realizing our vision of autonomous application performance management. Established APM players like Dynatrace, AppDynamics, and New Relic are working towards AAPM. We believe our home-brew AI algorithm differentiates us through higher accuracy with minimal data and maintenance overhead. Please feel free to give us a shout at firstname.lastname@example.org. Better yet, you can get started with our freemium version from our website: adeptdc.com