A Microscope on Microservices
At Netflix we pioneer new cloud architectures and technologies to operate at massive scale — a scale which breaks most monitoring and analysis tools. The challenge is not just handling a massive instance count but to also provide quick, actionable insight for a large-scale, microservice-based architecture. Out of necessity we’ve developed our own tools for performance and reliability analysis, which we’ve also been open-sourcing (e.g., Atlas for cloud-wide monitoring). In this post we’ll discuss tools that the Cloud Performance and Reliability team have been developing, which are used together like a microscope switching between different magnifications as needed.
Request Flow (10X Magnification)
We’ll start at a wide scale, visualizing the relationship between microservices to see which are called and how much time is spent in each:
Using an in-house dapper-like framework, we are able to layer the request demand through the aggregate infrastructure onto a simple visualization. This internal utility, Slalom, allows a given service to understand upstream and downstream dependencies, their contribution on service demand, and the general health of said requests. Data is initially represented through d3-based Sankey diagrams, with a detailed breakdown on absolute service demand and response status codes.
This high-level overview gives a general picture of all the distributed services that are composed to satisfy a request. The height of each service node shows the amount of demand on that service, with the outgoing links showing demand on a downstream service relative to its siblings.
Double-clicking a single service exposes the bi-directional demand over the time window:
The macro visualization afforded by Slalom is limited based on the data available in the underlying metrics that are sampled. To bring into focus additional metric dimensions beyond simple IPC interactions we built another tool, Mogul.
Show me my bottleneck! (100X)
The ability to decompose where time is spent both within and across the fleet of microservices can be a challenge given the number of dependencies. Such information can be leveraged to identify the root cause of performance degradation or identify areas ripe for optimization within a given microservice. Our Mogul utility consumes data from Netflix’s recently open-sourced Atlas monitoring framework, applies correlation between metrics, and selects those most likely to be responsible for changes in demand on a given microservice. The different resources evaluated include:
- System resource demand (CPU, network, disk)
- JVM pressure (thread contention, garbage collection)
- Service IPC calls
- Persistency-related calls (EVCache, Cassandra)
- Errors and timeouts
It is not uncommon for a mogul query to pull thousands of metrics, subsequently reducing to tens of metrics through correlation with system demand. In the following example, we were able to quickly identify which downstream service was causing performance issues for the service under study. This particular microservice has over 40,000 metrics. Mogul reduced this internally to just over 2000 metrics via pattern matching, then correlated the top 4–6 interesting metrics grouped into classifications.
The following diagram displays a perturbation in the microservice response time (blue line) as it moves from ~125 to over 300 milliseconds. The underlying graphs identifies those downstream calls that have a time-correlated increase in system demand.
Like Slalom, Mogul uses Little’s Law — the product of response time and throughput — to compute service demand.
My instance is bad … or is it? (1000X)
Those running on the cloud or virtualized environments are not unfamiliar with the phrase “my instance is bad.” To evaluate if a given host is unstable or under pressure it is important to have the right metrics available on-demand and at a high resolution (5 seconds or less). Enter Vector, our on-instance performance monitoring framework which exposes hand-picked, high-resolution system metrics to every engineer’s browser. Leveraging the battle tested system monitoring framework Performance Co-Pilot (pcp) we are able to layer on a UI that polls instance-level metrics between every 1 and 5 seconds.
This resolution of system data exposes possible multi-modal performance behavior not visible with higher-level aggregations. Many times a runaway thread has been identified as a root cause of performance issues while overall CPU utilization remains low. Vector abstracts away the complexity typically required with logging onto a system and running a large number of commands from the shell.
A key feature of Vector and pcp is extensibility. We have created multiple custom pcp agents to expose additional key performance views. One example is a flamegraph generated by sampling the on-host Java process using jstack. This view allows an engineer to quickly drill on where the Java process is spending CPU time.
Next Steps… To Infinity and Beyond
The above tools have proved invaluable in the domain of performance and reliability analysis at Netflix, and we are looking to open source Vector in the coming months. In the meantime we continue to extend our toolset by improving instrumentation capabilities at a base level. One example is a patch on OpenJDK which allows the generation of extended stack trace data that can be used to visualize system-through-user space time in the process stack.
It quickly became apparent at Netflix’s scale that viewing the performance of the aggregate system through a single lens would be insufficient. Many commercial tools promise a one-stop shop but have rarely scaled to meet our needs. Working from a macro-to-micro view, our team developed tools based upon the use cases we most frequently analyze and triage. The result is much like a microscope which lets engineering teams select the focal length that most directly targets their dimension of interest.
As one engineer on our team puts it, “Most current performance tool methodologies are so 1990s.” Finding and dealing with future observability challenges is key to our charter, and we have the team and drive to accomplish it. If you would like to join us in tackling this kind of work, we are hiring!
Originally published at techblog.netflix.com on February 18, 2015.