Engineering High-Throughput, Low-Latency Machine Learning Services
Authors: Shirsh Bansal and Divyanshu Goyal
Adobe’s User Response Prediction Service is a flexible AI-based service that predicts business outcomes based on a variety of user signals. Consumers of this service can create custom prediction models, taking inputs such as user spatial and temporal data, and user device information. The models can then be used to recommend products, optimize customer experiences, or rank applications. The underlying Prediction Engine service must process about 3,500 requests per second with a round-trip time below 5 milliseconds (99th percentile). This translates to an in-app processing time of only 1 or 2 milliseconds. Because we have so little time to work with, when building this service we came up with the following constraints that need to be carefully tracked:
1. Required API throughput and latency
2. Algorithm’s time and space complexity
3. Maximum model size (latency is directly proportional to model size) and maximum number of models
4. API contract (this is really crucial as it governs the maximum payload size, content type, and protocol)
5. Infrastructure used (host size, capacity, and network bandwidth)
6. How to generate miscellaneous events for logging, monitoring, feedback, and model operationalization
Each of the above constraints posed unique challenges for us. In this post, we will discuss many of these, but will skip infrastructure. Although it had a significant impact on the service design, we wanted to focus on the other constraints.
First off, we did performance benchmarking for various webservers and data transfer protocols. This helped us in determining the optimal tools required for the task and finalizing the API contracts. We decided to design a J2EE application and use REST APIs over HTTP with JSON payloads for data transfer. We finalized to deploy these on Tomcat to serve the application.
Code optimization and benchmarking
By setting up our constraints up front, it was clear to us that we would need to focus on optimizing the code. This was needed to manage the algorithm’s time and space complexity and the model size. This made granular code profiling and load simulation imperative for us, but the question remained on how to gauge the code performance at such granularity.
We took the following steps to help benchmark and optimize our code:
1. Code profiling: We collected profile data containing stack traces using samplers like perf (Linux) and DTrace (Mac OS X). We then visualized them using flame graphs, which we found to be an effective way to quickly identify hot code-paths. There are different types of flame graphs. We used CPU Flame Graphs to visualize the CPU cost paid for each method call in the app. The results encouraged us to use correct data structures and algorithms, dump unnecessary libraries, and run the code in microseconds. Below images are the glimpse of what we achieved from this.
The marked section of the images represents the actual algorithm execution while the rest represents miscellaneous Tomcat and JVM activities. We did multiple iterations of code profiling to trace through these stacks at each level to improve the code performance. The decrease in width of the marked rectangle in the image denotes the improvement in performance.
2. Micro-benchmarking: We used JMH for micro-benchmarking as it took care of things like JVM warmup and code optimization paths for us. This proved to be a much better way for comparing the impact of small code changes by providing performance numbers, instead of ascertaining the same thing via full scale load test and simulation which just gives a holistic view. For example, we were able to compare the performance of Java’s random number generation implementations, Random and ThreadLocalRandom, and decided which were best for our solution.
3. Load test and simulation: Apache JMeter helped us to simulate the traffic load patterns we were expecting to receive from the client side. This helped us to conclude average round-time latency, 99th percentile latency, and various other statistics. Some of the salient features were adjusting the number of requesting threads, connection reuse, random requests, and constant throughput timer. Knowing our requirements upfront, and designing the API contracts ahead of time, was paramount in successfully running the benchmark. It would have been impossible to simulate the right amount of traffic or even commence the load test without knowing about payload content type, max payload size, and model size in advance. We finalized changes to the API contract as we completed iterations of load simulation and testing.
4. Metrics: For visualizing both system and application metrics we used New Relic, an enterprise product for software analytics. Its integration with our app had almost negligible impact on garbage collection (GC) and network.
Besides our code to calculate predictions, we were also generating lots of events for logging, feedback, and model evaluation. Our event generation rate is very high, as we create an event for every request. Addtionally, there are logging events in case of exceptions, model refresh, and service initialization. Processing all these events inline would have put back-pressure on our main task of prediction calculation because event processing could be both CPU and I/O intensive. In order to avoid this situation, we needed a queue or buffer with below features:
1. Lock free: to provide fast message-sharing capability between threads.
2. Scalable: to handle back pressure due to slow consumption or failures.
3. Reusable: to prevent GC overhead.
We found all the desired features in the form of a ring buffer in the disruptor library, inspired by the principle of mechanical sympathy (i.e., understanding the hardware to create more efficient software). The implementation helped us to overcome the concurrency hazards like false sharing and had a great positive impact on our app’s performance. To handle the processing of so many events, we decided to process them in batches. This allowed us to efficiently decouple event processing threads from HTTP handling threads. We used two different consumer pools to independently service these two types of events. With these changes, we can now maintain the required throughput for both the producer and consumer. The figure below illustrates the resulting service design:
Garbage collection tuning
Our service is not just CPU and network intensive but also memory intensive which leads to garbage collection concerns. The processing logic creates a large number of short-lived objects. This affects the application in two ways: 1) time spent in memory allocation and then deallocation, and 2) Stop the World GC Pauses. GC pauses can cripple any Java application’s performance.
We worked around these garbage collection concerns by doing the following:
1. Being very thrifty with object creation: Wherever possible, objects have been reused by declaring them as Threadlocals, maintaining elegant code while at the same time using proper design patterns.
2. Finalizing the max limits in API contract: Parameters like max request payload size, max number of ML models, and max size of ML models govern our application’s RAM requirements. Hence finalizing such max limits in the API contract gave us a better idea of the number of objects we have to support in the worst case. Consequently, we were also able to do more granular algorithm space complexity analysis. This further helped us in fixing important GC parameters like Java’s Generation sizes.
3. Having GC-free implementations: Certain implementations, like event generation and processing, are completely GC free. We achieved this by predetermining and configuring the character buffer sizes based on requirement analysis and sharing their contents between threads via array copy instead of creating multiple short-lived String instances. We implemented our own JSON map builder to create JSON objects instead of using libraries. Building our own facilitated swift GC-free JSON conversion. While pushing messages to external systems, we are directly writing byte-by-byte to the OutputStream of HttpUrlConnection. Such measures prevented the creation of multiple short-lived objects.
4. Fine tuning GC parameters: Most of the tenured objects were surviving until the end of the application. Since we create a large number of such tenured objects, a small Young Generation size led to frequent GC pauses when we would start the app. To address this we increased the size of the Young Generation. Accordingly, we also increased the size of Tenured Generation. We found that increasing the size of the Young Generation delayed the Memory Allocation Failure time with only a minor increase in application pause time and hence reduction in the GC pauses. After multiple iterations of load testing, we brought down our cumulative GC pauses for every 20 seconds to ~30 milliseconds!
5. Using memory profiling tools: Tools like JMap were very handy in GC tuning. JMap let us exactly calculate the counts and sizes of tenured objects and estimated sizes for the various Java Generations.
After defining our constraints upfront, and then tuning our code accordingly, we can now serve the prediction requests at 3500 queries/second/node with app side latencies of 0.3–0.6 ms (99th percentile)!
This required multiple iterations, but during each we used optimal data structures and algorithms for swift code execution, lock-free implementations, optimal memory usage, and GC tuning. To ascertain performance improvement, we concluded each of the iterations with performance benchmarks and code profiling. Improvements are likely still possible, by exploring other areas like network settings, different protocols, and GC algorithm comparisons.