Low Overhead Continuous Contextual Production Profiling

Ravi Pulle
7 min read · Mar 21, 2023


It is common to investigate performance bottlenecks such as CPU and memory in production applications to find and fix their root causes. To get visibility into a problem, we collect metrics and logs that monitor resource utilization (CPU, memory) and application-specific latencies. This data collection does not degrade application performance, but it also does not give visibility at the code level: methods, socket operations, thread states, and so on. Profilers provide that visibility, but collecting profiles introduces overhead at application run time, and the profiles themselves are large and expensive to store and visualize. In this blog, we discuss low-overhead, high-frequency profilers; attaching application context to profile samples; collecting profiles from tens of thousands of application deployments; efficiently storing and visualizing stacks; a new approach to comparative analysis; and aggregating profiles over longer time periods and across a large number of similar deployments.

Why is production profiling critical for root cause analysis?

The usage patterns of a production environment vary widely, making it difficult to reproduce some production incidents in test environments for triage and root cause analysis. We also often cannot access production environments directly, for reasons such as security and trust. When an incident happens, we need to narrow down the problem quickly to unblock customers, which demands that production profiling and diagnostics data be available at all times.

Selecting a profiler
Note: Though profiling concepts are generic across languages, we focus on JVM profiling in this blog.

Profilers can provide many data dimensions, and it is impossible to capture all of them due to performance and data-volume constraints. The first step is to identify what we need to analyze production incidents: at a minimum, method profiles to detect CPU regressions, socket I/O profiles to analyze network-related issues, monitor/lock profiles to detect deadlocks and waits, and memory profiles.

Various profiling techniques are available, but each has limitations, and we must be careful when matching one to a given use case. For example, statistical sampling is not entirely accurate but has low application overhead, whereas instrumentation profiling is more accurate but disruptive to application run times. Also, when it comes to Java, profilers that depend on the JVM Tool Interface (JVMTI) need to wait for all threads to reach a safepoint before collecting a stack trace sample, causing application pause time; in contrast, profilers that use the HotSpot JVM AsyncGetCallTrace interface do not incur safepoint bias. It is also essential to consider licensing costs, which makes an open-source profiler a great fit. Considering these points, we opted for Java Flight Recorder (JFR) and Async Profiler, as both support sampling-based profiling and have no safepoint bias.

The following are a few observations from evaluating JFR and Async Profiler with an application holding several thousand request-serving threads, a very large heap, and tens of thousands of classes.

  • JFR method profiling at 10ms frequency did not show any noticeable application degradation.
  • JFR Socket I/O profiling generated too many events with the default 20ms threshold.
  • JFR Java native profiling did not show any noticeable application degradation; however, it added significantly to the profile size.
  • JFR and Async Profiler both showed application performance degradation while collecting memory profiles, and did not seem fit for continuous production profiling.
  • JFR Java Monitor wait/park generated too many events causing data overhead.
  • Async Profiler method profiling did not show any noticeable application degradation. However, on-demand agent attachment took a few minutes when heavy class loading and unloading was in progress.
  • Async Profiler native profiling did not show any noticeable application degradation.

The results can differ for another application, but a similar evaluation approach is critical to finalizing a profiler and determining what data can be collected without disrupting the application. After weighing the pros and cons, we arrived at the following configuration (a programmatic sketch of such a recording follows the list).

  • JFR Continuous method profiling at 10ms frequency to analyze CPU regressions.
  • JFR Continuous Socket I/O profiling with a 200ms threshold (traces any individual socket operation that takes more than 200 ms).
  • JFR Continuous Java native profiling at 10ms frequency, which includes socket sampling (vs. tracing above).
  • JFR on-demand live object samples to analyze memory bottlenecks.
  • Thread dumps (which incur safepoint bias) collected with jstack once every minute to analyze long-running transactions and deadlocks.
  • Async Profiler on-demand method and native profiles to analyze issues in calls made to the operating system.
  • Async Profiler on-demand memory profiling to analyze memory bottlenecks.
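As a rough illustration (not the exact setup used in production), a continuous recording with roughly these settings can be configured programmatically through the standard jdk.jfr API. The event names and thresholds below mirror the configuration described above; the rolling max-age and dump path are illustrative assumptions.

import jdk.jfr.Recording;
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;

public class ContinuousRecording {
    public static Recording start() {
        Recording recording = new Recording();
        // Method (execution) samples every 10 ms to analyze CPU regressions
        recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(10));
        // Java native method samples every 10 ms (includes socket sampling)
        recording.enable("jdk.NativeMethodSample").withPeriod(Duration.ofMillis(10));
        // Trace only individual socket operations that take longer than 200 ms
        recording.enable("jdk.SocketRead").withThreshold(Duration.ofMillis(200));
        recording.enable("jdk.SocketWrite").withThreshold(Duration.ofMillis(200));
        // Keep a rolling on-disk buffer; 10 minutes matches the aggregation window used later
        recording.setToDisk(true);
        recording.setMaxAge(Duration.ofMinutes(10));
        recording.start();
        return recording;
    }

    public static void dump(Recording recording) throws IOException {
        // Periodically dump the rolling buffer for collection and post-processing
        recording.dump(Path.of("/tmp/app-profile.jfr")); // illustrative path
    }
}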

Handling large profile data volumes
One of the critical challenges in collecting profiling data is volume. Profiles tend to be very large (5 minutes of collection from a single application can result in a few hundred MB), making them difficult to store, retrieve, and visualize later. We solved this problem in three ways: 1. storing profiles aggregated over 10-minute windows, 2. hashing frame strings, and 3. applying a threshold to the aggregations. Aggregation removes duplicates, and the threshold drops stack traces whose occurrence count is too low to be a significant contributor. After several experiments, we devised the data structure shown below and used object serialization to generate aggregated JSON blobs. The data structure is built with indexes to enable faster queries, such as fetching a given stack or navigating the profile. Together, these techniques reduced profile size by about 10 times.

Data structure (refer to the open source perfGenie project for more implementation, scale, and S3 storage details; a simplified insertion sketch follows the structure):

class HotspotBlob {
    Node trie;                                  // trie structure that holds stack paths
    Long start;                                 // stack trace sample start time
    Long end;                                   // stack trace sample end time
    Map<Integer, String> nodevalueMap;          // frame value hash -> frame string
    Map<Integer, List<StackidTime>> tidMap;     // thread id -> stack trace sample timeline
    Map<Integer, List<CustoEvent>> custoevents; // pid -> list of custom events
}

class Node {
    int hash;                 // frame value hash for this node
    int sz = 0;               // node occurrence count
    int sf = 0;               // did a path end at this node?
    List<Node> nodes;         // child nodes (paths)
    Map<Integer, Integer> sm; // index of unique stack traces; denotes start and end of a path
}

class StackidTime {
    Integer hash;             // stack id hash
    Integer time;             // sample time
}

class TidTime {
    Integer tid;              // thread id
    Integer time;             // sample time
}
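To illustrate how frame hashing, aggregation, and thresholding fit together, here is a simplified, hypothetical sketch of inserting one stack trace into the trie and pruning low-occurrence paths before serialization. It is not the actual perfGenie implementation; it assumes the blob's trie, lists, and maps are already initialized, and findChild/applyThreshold are illustrative helpers.

import java.util.ArrayList;
import java.util.List;

class TrieAggregator {
    // Insert one stack trace (root-first frames) into the blob's trie.
    void addStackTrace(HotspotBlob blob, List<String> frames) {
        Node current = blob.trie;
        for (String frame : frames) {
            int hash = frame.hashCode();                // hash the frame string
            blob.nodevalueMap.putIfAbsent(hash, frame); // store each frame string only once
            Node child = findChild(current, hash);
            if (child == null) {
                child = new Node();
                child.hash = hash;
                child.nodes = new ArrayList<>();
                current.nodes.add(child);
            }
            child.sz++;                                 // count occurrences of this path prefix
            current = child;
        }
        current.sf++;                                   // a complete stack ended at this node
    }

    Node findChild(Node parent, int hash) {
        for (Node n : parent.nodes) {
            if (n.hash == hash) return n;
        }
        return null;
    }

    // Prune sub-paths whose occurrence count is below the threshold before serialization.
    void applyThreshold(Node node, int minCount) {
        node.nodes.removeIf(child -> child.sz < minCount);
        for (Node child : node.nodes) {
            applyThreshold(child, minCount);
        }
    }
}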

Attaching application context to profile samples
In large monolithic applications, transactions can come from many features, users, or organizations, so it is critical to narrow profiles down to a specific segment, organization, or user when investigating a given problem. However, for profilers collecting samples at high frequency, attaching contextual data (feature, organization, user, etc.) to every sample would hurt application performance. To solve this, we took an alternative approach and attached the context to transaction spans emitted as JFR custom events. A post-processing task then attaches the context to each sample by mapping sample time and thread id to the transaction span events. Once samples are associated with application context, we can filter them any way we want.
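As a rough sketch, a transaction span can be emitted as a JFR custom event like the one below; the event name and context fields are illustrative, not the actual perfGenie event definitions. JFR records the thread and the start/end timestamps of the event automatically, which is what the post-processing join on time and thread id relies on.

import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;

@Name("app.TransactionSpan")          // illustrative event name
@Label("Transaction Span")
class TransactionSpanEvent extends Event {
    @Label("Organization") String organization;
    @Label("Feature")      String feature;
    @Label("User")         String user;
}

// Usage around a transaction: JFR captures the thread and start/end times,
// so profile samples on the same thread within this window inherit the context.
class TransactionRunner {
    void handle(Runnable transaction, String org, String feature, String user) {
        TransactionSpanEvent span = new TransactionSpanEvent();
        span.organization = org;
        span.feature = feature;
        span.user = user;
        span.begin();               // mark span start
        try {
            transaction.run();
        } finally {
            span.commit();          // record the event with duration and thread
        }
    }
}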

The following are some screenshots of the context filtering features available in the open source project perfGenie.

Context filters support using JFR custom event spans.

Profile view for a single request transaction

Comparative analysis
Collected CPU and memory profiles are typically visualized as an aggregated call tree, backtrace tree, or flame graph for a given time window. However, these visualizations do not provide a timeline of stack trace changes, so they cannot tell precisely which code path is contributing to a CPU or memory spike at a given time. The standard techniques for finding regressions are comparing against a baseline profile and visualizing a tree diff or flame graph diff, both of which are supported in perfGenie. A newer approach that adds the time dimension to profile visualization helps identify regressions more directly: perfGenie has two innovative time-dimension visualizations, the river view and the hotspot surface view, which help correlate CPU and memory trends with profile changes and identify regressions.
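As an illustration of the baseline comparison idea (not the perfGenie diff implementation itself), a tree or flame graph diff essentially subtracts per-path sample counts of a baseline aggregation from the current one. A minimal sketch over the Node trie defined earlier might look like this, assuming both tries were built with the same frame hashing and have initialized child lists; the path-hash combination scheme is an assumption for the example.

import java.util.HashMap;
import java.util.Map;

class ProfileDiff {
    // Flatten a trie into path hash -> occurrence count (path hash combines parent and frame hashes).
    void flatten(Node node, int parentHash, Map<Integer, Integer> counts) {
        int pathHash = 31 * parentHash + node.hash;
        counts.merge(pathHash, node.sz, Integer::sum);
        for (Node child : node.nodes) {
            flatten(child, pathHash, counts);
        }
    }

    // Positive delta: a path gained samples relative to the baseline (potential regression).
    Map<Integer, Integer> diff(Node current, Node baseline) {
        Map<Integer, Integer> currentCounts = new HashMap<>();
        Map<Integer, Integer> baselineCounts = new HashMap<>();
        flatten(current, 17, currentCounts);
        flatten(baseline, 17, baselineCounts);
        Map<Integer, Integer> delta = new HashMap<>(currentCounts);
        baselineCounts.forEach((path, count) -> delta.merge(path, -count, Integer::sum));
        return delta;
    }
}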

Aggregation of profiles over a longer period and across a large number of similar deployments
Visualizing profiles over a longer time range and combining a large number of profiles collected from similar deployments is critical to understanding the breadth of an application's impact. However, this becomes more challenging as profile sizes grow. The blob structure and the threshold approach helped solve this problem: perfGenie can select multiple hosts and an extended time range and visualize the merged profile.
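Because frame strings are hashed consistently, aggregated blobs from different hosts or time windows can be merged by walking their tries together. The following is a simplified sketch of that idea, not the actual perfGenie merge code; it assumes both blobs use the same hashing and have initialized fields.

class ProfileMerger {
    // Merge the 'other' trie into 'target'; matching children are identified by frame hash.
    void merge(Node target, Node other) {
        target.sz += other.sz;
        target.sf += other.sf;
        for (Node otherChild : other.nodes) {
            Node targetChild = null;
            for (Node candidate : target.nodes) {
                if (candidate.hash == otherChild.hash) {
                    targetChild = candidate;
                    break;
                }
            }
            if (targetChild == null) {
                target.nodes.add(otherChild);       // new path: adopt the whole subtree
            } else {
                merge(targetChild, otherChild);     // shared path: merge counts recursively
            }
        }
    }

    // The frame-string dictionaries merge the same way: hash -> string entries are unioned.
    void mergeBlobs(HotspotBlob target, HotspotBlob other) {
        other.nodevalueMap.forEach(target.nodevalueMap::putIfAbsent);
        merge(target.trie, other.trie);
        target.start = Math.min(target.start, other.start);
        target.end = Math.max(target.end, other.end);
    }
}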

Conclusion
Continuous production profiling is critical for solving a variety of problems in application deployments. The volume of data and the difficulty of visualizing it are the main challenges; perfGenie addresses them with compressed blobs, thresholds, and profile visualizations that include a time dimension.

What is next?
As profiling solutions evolve, we need zero-overhead profiling to understand applications' CPU and memory behavior. It will be interesting to see how JFR event streaming can help solve some of these challenges. And although the new timeline views make it possible to identify regressions, AI correlation models should be explored to automate the reduction of noise.

Credits to: Ravi Pulle, Eric Gross, Laksh Venka, Yuhan and Ajay


Ravi Pulle

Ravi Pulle is a Principal Member of Technical Staff at Salesforce.com and a Senior Member of IEEE and MBCS.