Reimagining APM for the Cloud-Native World: Introducing SignalFx Microservices APM

Published in

signalfx

Nov 15, 2018

We are very excited today to introduce SignalFx Microservices APM, the newest component of the SignalFx platform for monitoring of microservices-based applications. At SignalFx, we strive to deliver the industry’s most powerful cloud monitoring solution to accelerate our customers’ journey to cloud-native. With monoliths and legacy applications getting re-engineered as service-oriented architectures, enterprises need a new class of application monitoring tools to get visibility into transactions that now travel through complex paths across distributed services.

Traditional APM is a misfit for monitoring microservices. In a multi-part blog series, we outlined the shortcomings of the existing APM solutions and the need to fundamentally transform application monitoring to support cloud-native architectures.

Industry analysts have reported similar challenges among their clients. In a recent research report, Gartner said:

“Most APM solutions were designed for a prior generation of applications that were monolithic and long-lived. These approaches are ill-suited to the dynamism, modularity and scale of today’s emerging microservice-based applications”

Practitioners also see the complexity of microservices. This tweet, which went viral because it rings true, is a particularly humorous take on the challenge:

No doubt. Troubleshooting microservices in the middle of the night with traditional APM solutions seems like solving a murder mystery because:

There is no guarantee that you will see anomalous traces when you begin troubleshooting, because traditional APM solutions sample traces at random
There is no end-to-end distributed system view showing all the services and their interdependencies, or any correlation with highly ephemeral and dynamic infrastructure environments
There is no guided troubleshooting, which forces you to manually examine individual traces and find common patterns which may be causing the system-wide performance issue
There is no way to know what constitutes normal performance behavior based on historical trends

In this blog, we will lay out how SignalFx Microservices APM addresses these gaps with a unique set of capabilities and features:

NoSample™ Architecture

Traditional APM vendors and open-source solutions only capture a small and arbitrary portion of your transactions via probabilistic sampling, leaving you blind to actual issues. To quote one of our customers: “Sampling is the elephant in the war room!”

We tackled one of the biggest shortcomings of traditional APM with a unique approach to sampling trace data. SignalFx Microservices APM is built on a unique architecture that we refer to as NoSample™. Unlike other APM tools, SignalFx analyzes 100% of transactions throughout your distributed services and intelligently captures errors, anomalies, and outliers. This approach — also known as ‘tail-based sampling’ — is implemented via the SignalFx Smart Gateway, a highly scalable and intelligent relay that lives in the customers’ environment.

The chart below shows the stark difference between the head-based sampling strategy used by most APM vendors and the SignalFx NoSample Architecture approach. The NoSample approach captures all outliers so that you don’t miss crucial trace data when you are troubleshooting a performance issue. Our early testing with customers shows that visibility into anomalous and long-tail traces increases by 10x using a tail-based sampling approach.

Fig 1: Head-based sampling vs SignalFx NoSample Architecture approach
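The difference between the two strategies can be illustrated with a small simulation. This is only a sketch on synthetic data, not the Smart Gateway's implementation: the function names, the 1% head-sampling rate, the 500 ms threshold, and the error and slow-trace rates are all assumptions chosen for illustration.

```python
import random

def head_sample(trace_id: int, rate: float) -> bool:
    # Head-based: the keep/drop decision is made when the trace starts,
    # before its latency or error status is known.
    return random.Random(trace_id).random() < rate

def tail_sample(trace: dict, slow_ms: float) -> bool:
    # Tail-based: the decision is made after the trace has completed,
    # so errors and slow traces can always be kept.
    return trace["error"] or trace["duration_ms"] > slow_ms

def make_traces(n: int = 10_000, seed: int = 7) -> list:
    # Synthetic workload: roughly 1% slow traces and 0.5% errors (made-up rates).
    rng = random.Random(seed)
    traces = []
    for i in range(n):
        slow = rng.random() < 0.01
        duration = rng.uniform(800, 2000) if slow else rng.uniform(20, 200)
        traces.append({"trace_id": i, "duration_ms": duration,
                       "error": rng.random() < 0.005})
    return traces

traces = make_traces()
# Every trace the tail-based rule keeps is, by definition, an anomalous one.
anomalous = [t for t in traces if tail_sample(t, slow_ms=500)]
# Head-based sampling keeps a random 1% regardless of outcome.
head_kept = [t for t in traces if head_sample(t["trace_id"], rate=0.01)]
head_anomalous = [t for t in head_kept if tail_sample(t, slow_ms=500)]

print(f"anomalous traces kept by tail-based rule: {len(anomalous)}")
print(f"anomalous traces kept by head-based rule: {len(head_anomalous)}")
```

Because the head-based decision ignores the outcome, the anomalous traces it retains are just whatever fraction happens to fall inside the random sample, while the tail-based rule retains all of them.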

The SignalFx NoSample Architecture ensures that you will have the trace data when you need it the most to troubleshoot end-user issues.
Next, you need to narrow down to the right traces quickly to begin incident resolution.

End-to-End Observability in a Single Pane of Glass

Determining whether a performance issue is caused by infrastructure or application code can be like looking for a needle in a haystack. Traditional APM tools require you to manually correlate performance issues across different layers of the application stack, resulting in higher MTTR, siloed troubleshooting, war room scenarios, and finger-pointing.

Fig 2: Service and Endpoint dashboard with infrastructure correlation

SignalFx Microservices APM provides an intuitive, end-to-end service map to quickly isolate the service which is causing the latency spike. You get pre-built dashboards for every service and all endpoints. Built-in infrastructure correlation helps immediately identify the root cause of a performance issue and engage the right team for resolution.

“SignalFx acts as a single source of truth for our teams. Service dashboards reduce our mean time to engage as they quickly narrow down the performance issue to code or infrastructure and help us engage the relevant team quickly”

Senior DevOps Manager, Manufacturing Design SaaS Firm

Directed Troubleshooting with SignalFx Outlier Analyzer™

Tagging metrics and traces with dimensional key-value pairs and labels is a common practice in modern monitoring systems. However, as the number of dimensions grows, traditional APM solutions struggle to search and filter data without incurring performance penalties.

SignalFx provides a multi-dimensional data model and the industry’s best high-cardinality analytics capabilities, giving you the infinite flexibility to slice and dice trace data and quickly isolate relevant traces and spans.

Cloud-native deployments can be extremely complex to debug and troubleshoot because of the increased number of individual components backing an application. There can be many factors causing the latency of a transaction to go up. Where do you start your troubleshooting efforts? When using existing APM solutions, our customers told us they needed to examine each and every outlier trace and manually correlate among the traces to determine a pattern before starting troubleshooting.

SignalFx solves this challenge for our customers by using the latest innovations in data science.
Outlier Analyzer uncovers patterns relating trace tags to trace durations, highlighting possible explanations for degraded system performance (or slowness in steady state). It can automatically answer questions such as:

Are the long tail traces coming from a particular customer segment (whose requests might be large or somehow malformed)?
Do the slow traces tend to pass through the same (possibly overloaded or misconfigured) load balancer?

Fig 3: Outlier Analyzer surfacing most commonly represented patterns in the long tail transactions
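One simple way to think about relating trace tags to trace durations is to ask which tag values are over-represented among slow traces compared with all traces. The sketch below illustrates that idea only; it is not how Outlier Analyzer is implemented, and the `tag_lift` function, the trace dictionary layout, the threshold, and the sample data are all hypothetical.

```python
from collections import Counter

def tag_lift(traces, slow_ms, min_support=5):
    """Rank (tag, value) pairs by how over-represented they are in slow traces.

    lift = share of slow traces carrying the pair / share of all traces carrying it.
    A lift well above 1 suggests the pair is associated with slowness.
    """
    slow = [t for t in traces if t["duration_ms"] > slow_ms]
    if not slow:
        return []
    all_counts = Counter(pair for t in traces for pair in t["tags"].items())
    slow_counts = Counter(pair for t in slow for pair in t["tags"].items())
    ranked = []
    for pair, n_slow in slow_counts.items():
        if n_slow < min_support:
            continue  # too few slow traces to say anything about this pair
        share_slow = n_slow / len(slow)
        share_all = all_counts[pair] / len(traces)
        ranked.append((pair, share_slow / share_all))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical data: every slow trace passes through load balancer "lb-2",
# while slow traces are spread evenly across customers.
traces = []
for i in range(200):
    lb = "lb-2" if i % 4 == 0 else "lb-1"
    duration = 900.0 if i % 8 == 0 else 50.0
    traces.append({"duration_ms": duration,
                   "tags": {"lb": lb, "customer": f"cust-{i % 5}"}})

for pair, lift in tag_lift(traces, slow_ms=500):
    print(pair, round(lift, 2))
```

In this toy data set the load balancer tag stands out while the customer tags do not, which corresponds to the second question above rather than the first.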

Outlier Analyzer offers prescriptive insights to significantly reduce MTTR. One of our customers put it simply: “Before Outlier Analyzer we used to open 50 tabs and try to understand patterns manually.”

Know the Normal: Validate Code Releases with Span and Trace Metricization

When a particular span contributes most to the latency of a trace, how do you determine whether this is normal behavior or whether a bug was introduced in a canary version of your code?

Other APM solutions capture RED metrics at the service level, or at best provide metrics at the root, originating span, giving you a very partial view of your environment.

The SignalFx Smart Gateway observes every single transaction across distributed services, assembles the traces, and automatically converts all of your traces and spans into metrics. Additionally, it keeps the distribution of performance for each trace execution path, as well as at the span level.

Fig 4: Span performance details with historical comparison alongside infrastructure correlation — all within the trace context
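To make the idea of metricization concrete, the sketch below groups raw spans by service and operation and derives request, error, and duration (RED) figures for each group. It is an illustration under assumed names and data shapes (`metricize`, the span dictionary fields, nearest-rank percentiles), not a description of the Smart Gateway's internals.

```python
import math
from collections import defaultdict

def percentile(sorted_values, p):
    # Nearest-rank percentile on an already sorted, non-empty list.
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def metricize(spans):
    """Turn raw spans into per-(service, operation) RED metrics."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[(span["service"], span["operation"])].append(span)

    metrics = {}
    for key, group in grouped.items():
        durations = sorted(s["duration_ms"] for s in group)
        errors = sum(1 for s in group if s["error"])
        metrics[key] = {
            "requests": len(group),
            "errors": errors,
            "error_rate": errors / len(group),
            "p50_ms": percentile(durations, 50),
            "p99_ms": percentile(durations, 99),
        }
    return metrics

# Hypothetical spans from two operations.
spans = [
    {"service": "checkout", "operation": "POST /pay", "duration_ms": 10, "error": False},
    {"service": "checkout", "operation": "POST /pay", "duration_ms": 20, "error": False},
    {"service": "checkout", "operation": "POST /pay", "duration_ms": 30, "error": True},
    {"service": "checkout", "operation": "POST /pay", "duration_ms": 40, "error": False},
    {"service": "cart", "operation": "GET /items", "duration_ms": 5, "error": False},
]

for key, m in metricize(spans).items():
    print(key, m)
```

Keeping the sorted durations (or a compact histogram of them) per group is what makes it possible to compare distributions later rather than only averages.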

Metricization provides you out-of-the-box, real-time visibility into the health of microservices deployed, as well as the historical performance trends at the span level. You can quickly determine how a new code release performs compared to historical baselines and automatically identify what is contributing the most to the latency of your transactions — down to the specific line of code.

In short, span-level metricization enables you to understand what constitutes normal performance behavior for any span or trace.
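As a rough illustration of "knowing the normal", a new release's span durations can be checked against a historical baseline for the same span. The rule below (flag the release if its mean duration exceeds the baseline mean by more than k standard deviations) is a deliberately simple assumption for this sketch; the function name, the threshold k, and the sample numbers are hypothetical and do not describe SignalFx's actual comparison logic.

```python
from statistics import mean, stdev

def is_regression(baseline_ms, current_ms, k=3.0):
    """Return True if current durations look slower than the historical baseline.

    baseline_ms: historical durations for one span (needs at least two values).
    current_ms:  durations for the same span observed in the new release.
    """
    threshold = mean(baseline_ms) + k * stdev(baseline_ms)
    return mean(current_ms) > threshold

# Hypothetical durations (ms) for one span across releases.
baseline = [100, 102, 98, 101, 99]
canary_slow = [130, 128, 135]
canary_ok = [101, 99, 100]

print(is_regression(baseline, canary_slow))
print(is_regression(baseline, canary_ok))
```

A production system would more likely compare full distributions or percentiles over matching time windows, but the principle is the same: the baseline defines what normal looks like for that span.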

SignalFx does all of these things to expedite the incident response process and significantly reduce MTTR, while giving our customers complete flexibility in instrumentation so they can remain vendor-neutral. You can choose any one or a combination of the following instrumentation methods:

Open Instrumentation Standards: OpenTracing, Zipkin, OpenCensus
Service mesh such as Istio, Envoy or Linkerd

Additionally, SignalFx Auto Instrumentation agents and libraries, built upon open standards, provide automatic instrumentation for the most commonly used open source packages and frameworks.

Conclusion

It’s no secret that every company is a software company today, and software is driving new digital business initiatives. It is also true that distributed systems are much more complex compared to monolithic environments. Today, we have taken a huge step toward helping our customers successfully adopt cloud-native architectures by cutting through microservices complexity with SignalFx Microservices APM.

Learn more

SignalFx Microservices APM is feature-packed. We’ve just scratched the surface here and can’t wait to show you all the features we’ve built that will accelerate your journey to cloud-native.

Learn how our customers are already leveraging Microservices APM to diagnose the root cause of their issues and drive down MTTR.
See it live at AWS re:Invent: we’ll be at re:Invent (Booth #1613) demonstrating the only cloud monitoring platform with streaming analytics and NoSample™ tail-based distributed tracing. See a live demo. Meet our executive team. Win an experience of a lifetime.