Of needles and haystacks (aka a root cause analysis primer)

Arjun Dutt
Towards Application Data Monitoring
4 min read · Jan 21, 2021

With the adoption of microservices, not only do systems become wider and more distributed, they also become deeper.

The width comes from large monolithic systems being broken into more discrete units of functionality. The depth comes from the layered architecture of these services: a front-end service calls an ingress controller, which calls a specific microservice to perform a task, which in turn might call a database or a third-party vendor application.

It’s easy to see how these types of architectures, while resilient and (mostly) decoupled, become impossible to understand over time and increasingly difficult to manage. Debugging systems and fixing problems occupy almost 40% of a developer’s time on average. Further, as software applications have grown more complex, so have the technologies and tools for observing and monitoring them.

Today, the term application monitoring is considered synonymous with Application Performance Monitoring (APM), a field that deals with understanding application performance by tracking metrics like latencies and errors, and helping detect and resolve performance issues and outages. APM has become a well-established product category with successful companies like Datadog and Splunk.

But ask any developer or devops engineer whether they can quickly and reliably understand system dependencies, identify root causes, or get early warning signals about issues, and they will tell you that they still face many visibility gaps.

Even academic researchers are starting to dedicate more time to understanding the most efficient ways to perform root cause analysis. In their 2020 paper, Qiu et al. recognize several challenges with modern microservices architectures and call out the variety and complexity of the monitoring metrics leveraged by APM tools, as well as the rapid iteration cycles that are part and parcel of modern CI/CD processes. They note that traditional anomaly diagnosis methods are mainly based on KPI thresholds.
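To make that last point concrete, here is a minimal sketch (in Python, with hypothetical KPI names and threshold values that are not drawn from the paper) of what threshold-based anomaly diagnosis amounts to:

```python
# Minimal sketch of threshold-based anomaly detection on KPIs.
# KPI names and threshold values below are illustrative assumptions.
KPI_THRESHOLDS = {
    "p99_latency_ms": 500,    # flag if 99th-percentile latency exceeds 500 ms
    "error_rate": 0.01,       # flag if more than 1% of requests fail
    "cpu_utilization": 0.90,  # flag if CPU usage exceeds 90%
}

def breached_kpis(snapshot: dict) -> list:
    """Return (kpi, value, threshold) for every KPI above its static threshold."""
    return [
        (kpi, snapshot[kpi], limit)
        for kpi, limit in KPI_THRESHOLDS.items()
        if kpi in snapshot and snapshot[kpi] > limit
    ]

# One service's current metrics
print(breached_kpis({"p99_latency_ms": 730, "error_rate": 0.002, "cpu_utilization": 0.55}))
# [('p99_latency_ms', 730, 500)]
```

Static rules like these tell you that something crossed a line, but not why, which is exactly the gap the categories below try to close.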

Further, they group root cause analysis methods into four distinct categories:

Event correlation analysis — statistical inference of related anomalies and events has been studied for the better part of a decade, with some interesting work coming out of VMware back in 2013. As the machine learning toolkit broadens, there is increasing scope for improvement in using statistical methods to detect causality.
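As a rough illustration of the idea (a simplified sketch, not the VMware approach or any particular tool’s method), one basic form of event correlation is to bucket each service’s anomaly events into time windows and measure how strongly the resulting series move together:

```python
# Rough sketch of event correlation: bucket anomaly timestamps (seconds) per
# service into fixed windows, then compute the Pearson correlation of the
# per-window counts. Service names and timestamps are invented.
from collections import Counter
from math import sqrt

def bucket(events, window_s=60):
    """Count anomaly events per time window."""
    return Counter(int(t // window_s) for t in events)

def correlation(events_a, events_b, window_s=60):
    a, b = bucket(events_a, window_s), bucket(events_b, window_s)
    windows = sorted(set(a) | set(b))
    xs = [a.get(w, 0) for w in windows]
    ys = [b.get(w, 0) for w in windows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y) if var_x and var_y else 0.0

# Error spikes in checkout closely track error spikes in payments
checkout_errors = [10, 15, 70, 200, 205, 207]
payment_errors  = [12, 18, 72, 202, 206, 209]
print(round(correlation(checkout_errors, payment_errors), 2))  # 1.0
```

A score like this only flags candidate relationships; establishing which anomaly actually caused which is the harder problem that the newer statistical and machine learning techniques aim at.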

Log analysis — this is the most common pattern that we’ve observed in our client work. Although log analysis tools are becoming more powerful, typically they still require significant manual work and rely heavily on judgement, which in turn requires knowledge and expertise of the systems being monitored. Log clustering work was presented at IEEE in 2016 and academic research on the use of log clustering for intrusion detection root cause analysis dates back to 2003.
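For intuition, here is a toy version of log clustering (the log lines below are invented): mask the variable parts of each message so that lines emitted by the same code path collapse onto one template, then look at which templates dominate or suddenly spike.

```python
# Toy sketch of log clustering: mask variable tokens so that lines from the
# same code path collapse onto one template, then count lines per template.
# The log lines are invented examples.
import re
from collections import Counter

def to_template(line: str) -> str:
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<ID>", line)  # mask long hex ids
    line = re.sub(r"\b\d+\b", "<NUM>", line)          # mask numbers
    return line

logs = [
    "timeout calling payments after 5000 ms",
    "timeout calling payments after 5003 ms",
    "user 1842 not found",
    "timeout calling payments after 4998 ms",
]

for template, count in Counter(to_template(l) for l in logs).most_common():
    print(count, template)
# 3 timeout calling payments after <NUM> ms
# 1 user <NUM> not found
```

Real tools use far more sophisticated template mining, but deciding which cluster actually matters, and why, still falls largely to the engineer.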

Execution path mining — essentially longhand for tracing, this approach is becoming more popular for debugging modern applications. Sigelman et al.’s widely cited 2010 work on Dapper became the foundation of OpenTracing, and today the domain is hotly contested by the largest APM companies. While tracing techniques preserve the sequence of calls for any given transaction, one of the challenges is the level of instrumentation required to track call sequences with high fidelity.
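As a minimal sketch of the concept (the span data below is made up, and this is not any particular tracer’s format): each span records which parent called it, so the call path for a request can be reconstructed by walking those parent/child links.

```python
# Minimal sketch of execution path mining: reconstruct the call sequence of one
# request from spans that record their parent span. The span data is made up.
from collections import defaultdict

spans = [
    {"id": "a", "parent": None, "service": "frontend",    "duration_ms": 412},
    {"id": "b", "parent": "a",  "service": "ingress",     "duration_ms": 398},
    {"id": "c", "parent": "b",  "service": "checkout",    "duration_ms": 380},
    {"id": "d", "parent": "c",  "service": "payments-db", "duration_ms": 350},
]

children = defaultdict(list)
for span in spans:
    children[span["parent"]].append(span)

def print_path(parent_id=None, depth=0):
    """Depth-first walk of the span tree, printing the call sequence."""
    for span in children.get(parent_id, []):
        print("  " * depth + f"{span['service']} ({span['duration_ms']} ms)")
        print_path(span["id"], depth + 1)

print_path()
# frontend (412 ms)
#   ingress (398 ms)
#     checkout (380 ms)
#       payments-db (350 ms)
```

The catch the paragraph alludes to is that every hop has to propagate and record these ids; any uninstrumented service leaves a gap in the reconstructed path.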

Dependency graph mining — this is a topic that we are very passionate about. If service anomalies or failures can be combined with a graph depicting the dependencies between the various services in a system, it takes only a short leap to recognize that reliable root cause analysis is within reach. While the authors of the paper focus on framing root cause analysis as a path-search problem and mainly address infrastructure issues like CPU burnout, we believe that broadening the lens is vital in real applications. If our objective is to understand why a service is suddenly returning error responses, we need to understand not only the dependencies affecting that service but also a) the data flowing into that service and b) the recent changes to related services, both upstream and downstream.
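Here is a hedged sketch of that framing with a hypothetical service graph: given a map of which services depend on which, and a set of services currently flagged as anomalous, a likely root cause is the anomalous dependency furthest downstream, i.e. the one with no anomalous dependencies of its own.

```python
# Hedged sketch of root cause analysis as a search over a dependency graph.
# The graph and the set of anomalous services are hypothetical.
DEPENDENCIES = {
    "frontend":    ["ingress"],
    "ingress":     ["checkout", "search"],
    "checkout":    ["payments", "inventory"],
    "payments":    ["payments-db"],
    "search":      [],
    "inventory":   [],
    "payments-db": [],
}

ANOMALOUS = {"frontend", "ingress", "checkout", "payments", "payments-db"}

def candidate_root_causes(alerting_service: str) -> list:
    """Walk the dependencies of the alerting service; an anomalous service with
    no anomalous dependencies of its own is a likely root cause."""
    candidates, stack, seen = [], [alerting_service], set()
    while stack:
        service = stack.pop()
        if service in seen:
            continue
        seen.add(service)
        deps = DEPENDENCIES.get(service, [])
        if service in ANOMALOUS and not any(d in ANOMALOUS for d in deps):
            candidates.append(service)
        stack.extend(deps)
    return candidates

print(candidate_root_causes("frontend"))  # ['payments-db']
```

In the richer view argued for above, the same walk would also pull in the data flowing into each candidate and its recent change history before pointing a finger.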

In fact, our team is hard at work providing true situational awareness to developers, devops and SRE teams. We combine the power of the dependency graph that the authors are rightly focused on with deep visibility into the data layer and full context of the change history across the system. We believe that this is the future of observability for modern stacks.


If you’re interested in learning more, please drop us a note via layer9.ai, or follow us on Twitter @layer9ai or on LinkedIn at Layer 9 AI.

Co-founder and CEO of Layer 9, the Application Data Monitoring company.