How to Continuously Profile Tens of Thousands of Production Servers

Salesforce Engineering
Jan 21, 2020

by Paul Howden and Paymon Teyer


The Salesforce CRM application is a multi-tenant monolith running on JVMs on tens of thousands of servers throughout our production environments. These hosts serve billions of requests per day, handling reporting queries and various synchronous or asynchronous tasks for different tenants. Thousands of engineers make changes to the mono-repository source code daily. In addition, a wide range of changes are made to the underlying data, as well as to user-defined extensions built on top of Salesforce CRM, which can result in non-uniform behavior across different servers and clusters in production.

Challenges and Solutions

When features break or performance regresses, determining exactly what was happening at critical points during the day is crucial for debugging non-optimal behavior. Because a single monolithic application handles so many varieties of requests for different tenants, a wealth of contextual data (e.g., tenant IDs, users, URLs) is attached to each worker thread and must be maintained in order to quickly pinpoint a problem in production.

We are the team at Salesforce responsible for ensuring that the right diagnostic data is available from every production server at all times. We have developed a fully in-house Application Performance Management system that continuously captures profiling and diagnostic data from all production servers.

We faced a number of problems that may be unique to the Salesforce application; however, we believe the solutions to these problems may be useful to engineers building similar systems. This post briefly describes some of the main challenges and the solutions we implemented to address them.

Scalability

Challenge

The Agent runs on tens of thousands of JVMs. The combination of a high host count per data center, a large amount of data captured per host, and cross-data-center network latency means that we cannot efficiently persist all data to a single centralized storage solution. Across all servers, this amounts to over 3 million writes per second, each containing a few kilobytes of data, totaling more than 1GB per second. These rates are too high for a single network or storage solution to handle at a reasonable cost.

Solution

Distribute the load across multiple data centers, and then coordinate retrieval from a centralized hub site that has access to all DCs and knows how to route requests to a specific data center. User requests specify which cluster of hosts they want profiling data for, and the hub site routes requests through to the corresponding servers within the correct data center. This gives investigating engineers a seamless experience: they rely on a single hub site to view profiling data across all data centers. We maintain a routing lookup table in storage (which can be modified at runtime by system admins) to map clusters to their corresponding data centers.
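To make the routing layer concrete, here is a minimal sketch in Java of a hub-side lookup, assuming a hypothetical ClusterRouter class and illustrative cluster/endpoint names; the real routing table lives in storage and is richer than this in-memory map.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the hub-side routing lookup: each cluster maps to the
 * data center that holds its profiling data. Names are illustrative.
 */
public class ClusterRouter {

    // cluster name -> data center endpoint; admins can modify this at runtime
    private final Map<String, String> clusterToDataCenter = new ConcurrentHashMap<>();

    /** Update a route without restarting the hub. */
    public void upsertRoute(String cluster, String dataCenterEndpoint) {
        clusterToDataCenter.put(cluster, dataCenterEndpoint);
    }

    /** Resolve which data center should serve a profiling query for a cluster. */
    public Optional<String> resolve(String cluster) {
        return Optional.ofNullable(clusterToDataCenter.get(cluster));
    }

    public static void main(String[] args) {
        ClusterRouter router = new ClusterRouter();
        router.upsertRoute("cluster-na42", "https://dc-west.example.internal/profiling");
        System.out.println(router.resolve("cluster-na42").orElse("unknown cluster"));
    }
}
```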

Fault-tolerance

Challenge

Many of the most important or interesting time periods for JVM profiling are during times of extreme conditions, either as causes or symptoms. Memory and CPU skyrocket when slow requests pile up, customer workload patterns change, or greedy jobs allocate too many large objects. In times of crisis, it is likely that JVMs will die or be terminated, or network connectivity may be lost; therefore, the profiling data is likely to be lost if it’s simply buffered in memory and is not immediately persisted into some form of permanent storage.

Solution

Buffering data in a resilient manner is required to avoid losing the most important data while maintaining the ability to persist data in batches. Keeping the buffer in memory would be prone to data loss when the application is under duress, so we implemented an on-disk circular buffer. Samples (thread dumps and associated contexts) are persisted to local disk immediately after they are captured from the JVM. This prevents losing buffered data should the server or network go down before it can be forwarded to centralized storage. The circular nature of the buffer prevents it from exhausting disk space during long outages: the buffer is overwritten based on configurable intervals.
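The sketch below illustrates the on-disk circular buffer idea, assuming a hypothetical DiskRingBuffer class with an illustrative segment-file layout and rotation policy; the production buffer and its serialization format are not shown in this post.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Minimal sketch of an on-disk circular buffer: samples are appended to a
 * fixed ring of segment files, and the oldest segment is overwritten once the
 * ring wraps around. Names and limits are illustrative.
 */
public class DiskRingBuffer {

    private final Path dir;
    private final int segmentCount;
    private final long maxSegmentBytes;
    private int currentSegment = 0;

    public DiskRingBuffer(Path dir, int segmentCount, long maxSegmentBytes) throws IOException {
        this.dir = Files.createDirectories(dir);
        this.segmentCount = segmentCount;
        this.maxSegmentBytes = maxSegmentBytes;
    }

    /** Persist one serialized sample (e.g., a thread dump plus its context). */
    public synchronized void append(String serializedSample) throws IOException {
        Path segment = dir.resolve("segment-" + currentSegment + ".buf");
        if (Files.exists(segment) && Files.size(segment) >= maxSegmentBytes) {
            // Advance the ring; deleting the next segment overwrites the oldest data.
            currentSegment = (currentSegment + 1) % segmentCount;
            segment = dir.resolve("segment-" + currentSegment + ".buf");
            Files.deleteIfExists(segment);
        }
        Files.write(segment,
                (serializedSample + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```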

Multi-language Runtime Support

Challenge

Salesforce employs a custom interpreter for the Apex programming language that customers use to add custom business logic specific to their organization. Being able to profile user-defined extensions is of great value when dealing with production issues, trying to reduce cost-to-serve, or trying to reduce response times. Our solution must be able to capture and represent profiling and observability data regardless of the underlying language. Additionally, Salesforce runs a variety of JVM-based services, and many of these are good candidates for profiling. Our solution must therefore fit a broad range of JVMs, rather than working only with the CRM monolith.

Solution

The design of our system makes no assumptions about the programming language being profiled. The underlying implementation language is attached as just another piece of metadata on all profiling data points, allowing our users to query based on the language. Additionally, the data structures used to abstract stack traces and metadata are generic and flexible enough to support different languages and environments, which is what makes this multi-language support possible.
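As a rough illustration of that language-neutral representation, here is a sketch of a hypothetical ProfileSample type (field names and example values are ours, not the production schema) in which the language is simply one more metadata field and frames are plain strings.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of a language-agnostic sample: the runtime language is just
 * another piece of metadata, and frames are plain strings rather than
 * JVM-specific StackTraceElements. Field names are illustrative.
 */
public record ProfileSample(
        Instant capturedAt,
        String language,            // e.g. "java", "apex"
        List<String> frames,        // language-neutral frame descriptions
        Map<String, String> metadata) {

    public static void main(String[] args) {
        ProfileSample apexSample = new ProfileSample(
                Instant.now(),
                "apex",
                List.of("MyTrigger.handleBeforeInsert", "AccountService.recalculate"),
                Map.of("tenantId", "00Dxx0000000001"));
        System.out.println(apexSample.language() + ": " + apexSample.frames());
    }
}
```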

Contextual Metadata

Challenge

More often than not, investigations and debugging are triggered by some domain-specific context: a report is taking longer than it should, a request failed or took too long, there is a tenant-specific performance issue, or a page loaded with malformed data. Allowing the debugging engineer to use this domain-specific context to drive their investigation can greatly speed the process towards a conclusion. If the engineer already knows the URL that took too long to load, searching for stack traces relevant to that URL allows them to quickly narrow their view to just the relevant data. This cuts out many of the common beginning steps of investigations, like searching logs and cross-referencing URLs against thread IDs.

Solution

Our implementation gathers domain-specific context for each thread and stack trace it collects. We designed the process to be generic enough so that each profiled service can determine what relevant information is mapped to each thread sample: a JVM that handles requests may attach URLs, HTTP methods, and request parameters; a JVM that runs batch jobs may attach job names, IDs, and job types. Domain-specific context is stored alongside the standard profiling metadata, is fully indexed, and allows the appropriate filters to be added when queried, to avoid additional noise.
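A minimal sketch of how a per-service context hook could look, assuming a hypothetical ThreadContextProvider interface and hard-coded example values; the actual enrichment reads live request or job state inside each profiled service.

```java
import java.util.Map;

/**
 * Minimal sketch of per-service context enrichment: each profiled service
 * plugs in its own provider that maps a sampled thread to domain-specific
 * key/value pairs. Interface and names are illustrative.
 */
public interface ThreadContextProvider {

    /** Return the domain-specific context for one sampled thread, keyed by field name. */
    Map<String, String> contextFor(long threadId);

    /** Example provider for a request-serving JVM. */
    class RequestContextProvider implements ThreadContextProvider {
        @Override
        public Map<String, String> contextFor(long threadId) {
            // A real agent would read thread-local request state here;
            // values are hard-coded purely for illustration.
            return Map.of(
                    "url", "/reports/run",
                    "httpMethod", "GET",
                    "tenantId", "00Dxx0000000001");
        }
    }
}
```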

Additionally, we allow deep search into stack traces, where users can look up samples that contain frames matching a regular expression. This has been one of the most useful and popular query parameters because, in many cases, developers are only interested in certain code paths as opposed to samples from all the code executing in the application. Feature teams know the entry point (interface/API) for their modules, so this knowledge allows them to validate how their feature behaves in production, providing a feedback loop that identifies further optimization opportunities.
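In essence, deep search filters samples to those with at least one frame matching a pattern. The sketch below shows that filtering step in Java, reusing the hypothetical ProfileSample type from the earlier sketch; the production query path runs against the indexed store rather than in memory.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Minimal sketch of "deep search": keep only samples whose stack contains at
 * least one frame matching the user-supplied regular expression.
 */
public class DeepSearch {

    public static List<ProfileSample> matching(List<ProfileSample> samples, String frameRegex) {
        Pattern pattern = Pattern.compile(frameRegex);
        return samples.stream()
                .filter(s -> s.frames().stream().anyMatch(f -> pattern.matcher(f).find()))
                .collect(Collectors.toList());
    }
}
```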

High Thread Counts

Challenge

The Salesforce application maintains a large number of active thread pools to allow surges in requests or activity to be handled in a timely manner. This results in a certain percentage of live threads that are consistently idle and not doing any appreciable amount of work. Various similar circumstances multiply this effect across servers, and as such, any given thread dump contains at least 1500 individual threads. Data from these threads would quickly overwhelm our storage infrastructure.

Solution

Throw out the data captured for those idle threads! Our goal is not a perfect representation of each thread at a given moment, but rather a representation of the work that is being done at a given moment. Filtering out threads that are waiting idly for work, or sleeping between checking a value, allows us to profile more often and persist the data for longer. In a given JVM we are able to filter out up to 99% of the initial thread dump. For reference: over our entire production landscape we throw away hundreds of millions of stack traces per minute on average, and only keep 5 million per minute to persist later.
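A rough sketch of the filtering idea follows, using the standard ThreadMXBean API; the idle-frame list and the "top frame" heuristic here are illustrative assumptions, not the production filter.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Minimal sketch of idle-thread filtering: drop threads that are parked in a
 * thread pool or sleeping between polls, keeping only threads doing real work.
 * The idle-frame patterns below are illustrative, not the production list.
 */
public class IdleThreadFilter {

    private static final List<String> IDLE_TOP_FRAMES = Arrays.asList(
            "sun.misc.Unsafe.park",
            "jdk.internal.misc.Unsafe.park",
            "java.lang.Thread.sleep",
            "java.lang.Object.wait");

    public static List<ThreadInfo> busyThreads() {
        ThreadInfo[] all = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        return Arrays.stream(all)
                .filter(t -> t.getStackTrace().length > 0)
                .filter(t -> !IDLE_TOP_FRAMES.contains(
                        t.getStackTrace()[0].getClassName() + "." + t.getStackTrace()[0].getMethodName()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("Busy threads: " + busyThreads().size());
    }
}
```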

Compression and Deduplication

Challenge

Code bases change often, but as a percentage of the whole, they don't change by a large amount. Additionally, the most commonly executed code is largely unchanged in any given time period. As such, thread dumps contain a large percentage of duplicate data. This presented a challenge: storing profiling data flat, exactly as captured, would result in storage requirements too large to justify.

Solution

Splitting the data into discrete parts and keeping track of the relationships between those parts allowed us to store thread dumps in a format that minimizes duplication on disk as much as possible. Choosing the right storage solution and the right storage schema was of utmost importance in constructing a system that could reduce duplication enough to make storing this data feasible. We constructed related tables using Apache Phoenix on top of HBase. Individual stack frames are stored only once per data center and are indexed for fast lookup. Each unique stack trace is likewise stored only once, as a list of its frame IDs, and is indexed by its hashed ID value. From a given thread dump, individual thread samples are stored as time-series events with the stack trace recorded as a reference to that hashed ID. All of this combines to allow us to store the stack traces (easily the largest part of any given thread dump) in a much smaller footprint.
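To make the deduplication structure concrete, here is a minimal sketch of the ID derivation, assuming SHA-256 hashing and a hypothetical StackDedup helper; the actual Phoenix/HBase schema and hash choice are not shown in this post.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Minimal sketch of the deduplication idea: a frame is identified by a hash of
 * its text, a stack trace by a hash of its ordered frame IDs, and a thread
 * sample only needs to reference the trace's hashed ID.
 */
public class StackDedup {

    /** ID for a single frame row, stored once per data center. */
    public static String frameId(String frame) {
        return sha256(frame);
    }

    /** ID for a stack trace row, which stores only the ordered list of frame IDs. */
    public static String traceId(List<String> frames) {
        String joinedFrameIds = frames.stream()
                .map(StackDedup::frameId)
                .collect(Collectors.joining("|"));
        return sha256(joinedFrameIds);
    }

    private static String sha256(String input) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(digest.digest(input.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```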

To reduce our write load on the HBase cluster, we keep configurable caches of stack traces and frames. Rather than writing every stack trace or frame, we first check these caches to see whether the item has already been written. These caches eliminate over 99% of frame writes and 75% of stack trace writes.
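The sketch below shows one way such a write-side cache could work, assuming a hypothetical WrittenIdCache class with simple LRU eviction; the production caches and their sizing are configurable and not detailed here.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Minimal sketch of the write-side cache: before writing a frame or stack
 * trace row, check a bounded set of recently written IDs and skip the write
 * on a hit. The eviction policy and size bound are illustrative.
 */
public class WrittenIdCache {

    private final Set<String> recentlyWritten;

    public WrittenIdCache(int maxEntries) {
        // LRU eviction keeps the cache bounded even over long uptimes.
        this.recentlyWritten = Collections.newSetFromMap(
                new LinkedHashMap<String, Boolean>(maxEntries, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > maxEntries;
                    }
                });
    }

    /** Returns true if the caller should actually issue the write. */
    public synchronized boolean markForWrite(String hashedId) {
        return recentlyWritten.add(hashedId);
    }
}
```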

Regression Identification

Challenge

Regression detection and identification after making changes to an application (for example, a new release) is one of the main problems we want to address. Regressions can be due either to an increase in the frequency of calls to certain code paths or to degradation in the runtime performance of a subsystem.

Solution

By persisting and maintaining profiling data for longer periods of time (we currently have a TTL of 90 days), our system provides support for doing comparative analysis between different releases, library and platform upgrades, and other changes to the software that take effect over longer periods of time.

Comparative analysis can be done at runtime for short time periods (less than one hour). The data is represented as a flame graph and a tree diff. These visualizations allow engineers to easily identify differences in code paths, making it easier to pinpoint culprits during the time frames they’re investigating.
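For intuition, a tree diff boils down to aggregating the two windows' stacks into call trees and comparing per-path sample counts. The sketch below shows that aggregation and diff in Java under our own simplified assumptions; it is not the production flame graph or diff code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of the tree-diff idea: samples from two time windows are
 * aggregated into call trees (one node per frame on a path, with a sample
 * count), and the diff reports the count delta per path.
 */
public class CallTreeDiff {

    /** Aggregate stacks (outermost frame first) into path -> sample count. */
    public static Map<String, Long> aggregate(List<List<String>> stacks) {
        Map<String, Long> counts = new HashMap<>();
        for (List<String> stack : stacks) {
            StringBuilder path = new StringBuilder();
            for (String frame : stack) {
                path.append('/').append(frame);
                counts.merge(path.toString(), 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Per-path delta between a baseline window and a comparison window. */
    public static Map<String, Long> diff(Map<String, Long> baseline, Map<String, Long> current) {
        Map<String, Long> delta = new HashMap<>(current);
        baseline.forEach((path, count) -> delta.merge(path, -count, Long::sum));
        return delta;
    }
}
```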

Additionally, for queries that span longer time periods, since going through this massive amount of data to identify regressions is not feasible at runtime, we have developed a general-purpose jobs and reporting framework in which users can define and schedule jobs to be executed over all data sets; the results of these jobs are persisted and can be viewed from the UI. The jobs framework has built-in support for interacting with other internal tools at Salesforce, making it possible to correlate our profiling data with logs, system- and application-level monitoring data, Site Reliability tools, and more.
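A user-defined job in such a framework could look roughly like the hypothetical interface below (names and shape are ours, not the actual framework API): the framework schedules the job, hands it a time window of persisted profiling data, and stores whatever summary it returns for later viewing.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Minimal sketch of a jobs-framework contract: a user-defined job runs over a
 * time window of persisted profiling data and returns a result that the
 * framework persists for the UI. Interface and names are illustrative.
 */
public interface ScheduledProfilingJob {

    /** How often the framework should run this job. */
    Duration schedule();

    /** Analyze data in [windowStart, windowEnd) and return a persistable summary. */
    String run(Instant windowStart, Instant windowEnd);
}
```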

Future Work

We are looking into integration with Java Flight Recorder to add CPU profiling and support for a much higher sampling rate. This will allow us to accurately measure the CPU cycles spent on different code paths, which can help us associate a cost-to-serve value with individual components in the system. We are also looking into integration with Async Profiler to provide better support for non-JVM applications.

Our next blog post will cover our approach to capturing observability data with our diagnostics data collection agent. The blog post will detail how we designed a low overhead, configurable agent to collect system and application diagnostics data from production servers.

Finally, if you’re interested in working with us on challenging engineering problems like this, feel free to check out our open positions in software engineering.

Credits go to the APM team at Salesforce: Brian Toal, Paymon Teyer, Paul Howden, Dean Tupper, Alexander Kouthoofd, Laksh Venka.
