Engineering High-Throughput, Low-Latency Machine Learning Services

May 28, 2019

Authors: Shirsh Bansal and Divyanshu Goyal

The Challenge

We had to design our service around the following constraints:

1. Required API throughput and latency

2. Algorithm’s time and space complexity

3. Maximum model size (latency is directly proportional to model size) and maximum number of models

4. API contract (this is really crucial as it governs the maximum payload size, content type, and protocol)

5. Infrastructure used (host size, capacity, and network bandwidth)

6. How to generate miscellaneous events for logging, monitoring, feedback, and model operationalization

Each of the above constraints posed unique challenges for us. In this post, we will discuss many of these, but will skip infrastructure. Although it had a significant impact on the service design, we wanted to focus on the other constraints.

Design choices

Code optimization and benchmarking

We took the following steps to help benchmark and optimize our code:

1. Code profiling: We collected profile data containing stack traces using samplers like perf (Linux) and DTrace (Mac OS X). We then visualized them using flame graphs, which we found to be an effective way to quickly identify hot code paths. There are different types of flame graphs; we used CPU Flame Graphs to visualize the CPU cost paid for each method call in the app. The results pushed us to use the right data structures and algorithms, drop unnecessary libraries, and bring algorithm execution down to microseconds. The images below give a glimpse of what we achieved.

Figure-1 Before Code Optimization
Figure-2 After Code Optimization

The marked section of the images represents the actual algorithm execution while the rest represents miscellaneous Tomcat and JVM activities. We did multiple iterations of code profiling to trace through these stacks at each level to improve the code performance. The decrease in width of the marked rectangle in the image denotes the improvement in performance.

2. Micro-benchmarking: We used JMH for micro-benchmarking, as it took care of things like JVM warmup and code optimization paths for us. This proved a much better way to compare the impact of small code changes, since it provides concrete performance numbers rather than the holistic view you get from a full-scale load test and simulation. For example, we were able to compare the performance of Java’s random number generation implementations, Random and ThreadLocalRandom, and decide which was best for our solution (a minimal JMH sketch appears after this list).

3. Load test and simulation: Apache JMeter helped us simulate the traffic load patterns we expected to receive from the client side. This let us determine average round-trip latency, 99th percentile latency, and various other statistics. Salient features included adjusting the number of request threads, connection reuse, randomized requests, and the constant throughput timer. Knowing our requirements upfront, and designing the API contracts ahead of time, was paramount in successfully running the benchmark. It would have been impossible to simulate the right amount of traffic, or even commence the load test, without knowing the payload content type, max payload size, and model size in advance. We finalized changes to the API contract as we completed iterations of load simulation and testing.

4. Metrics: For visualizing both system and application metrics we used New Relic, an enterprise product for software analytics. Its integration with our app had almost negligible impact on garbage collection (GC) and network.
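Referring back to the micro-benchmarking step above, here is a minimal sketch of the kind of JMH benchmark we are describing for Random versus ThreadLocalRandom. The class and method names are ours for illustration, and the thread count, fork, and iteration settings are placeholders rather than the values we actually used:

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

/**
 * Compares a single shared java.util.Random instance (contended across
 * threads) with ThreadLocalRandom. JMH handles JVM warmup and forking for
 * us; returning the generated value keeps the JIT from treating the call
 * as dead code and optimizing it away.
 */
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
@Threads(8)
@State(Scope.Benchmark)
public class RandomBenchmark {

    private final Random sharedRandom = new Random();

    @Benchmark
    public int javaUtilRandom() {
        return sharedRandom.nextInt();
    }

    @Benchmark
    public int threadLocalRandom() {
        return ThreadLocalRandom.current().nextInt();
    }
}
```

Under multiple threads, the shared Random instance contends on a single atomic seed, which is exactly the kind of effect a full-scale load test tends to hide but a micro-benchmark makes obvious.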

Event generation

Every request generates events for logging, monitoring, feedback, and model operationalization, and these have to be handed off from the HTTP threads without slowing them down. We needed an inter-thread messaging mechanism that was:

1. Lock free: to provide fast message-sharing capability between threads.

2. Scalable: to handle back pressure due to slow consumption or failures.

3. Reusable: to prevent GC overhead.

We found all the desired features in the ring buffer of the disruptor library, which is inspired by the principle of mechanical sympathy (i.e., understanding the hardware to create more efficient software). The implementation helped us overcome concurrency hazards like false sharing and had a great positive impact on our app’s performance. To handle the processing of so many events, we decided to process them in batches, which allowed us to efficiently decouple event-processing threads from HTTP-handling threads. We used two different consumer pools to independently service these two types of events. With these changes, we can now maintain the required throughput for both the producer and the consumers. The figure below illustrates the resulting service design, and a minimal sketch of the pattern follows the figure:

Figure-3 Prediction Engine Service Design
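As a rough illustration of how this fits together, the sketch below pre-allocates reusable event slots in a Disruptor ring buffer, publishes into them from HTTP handler threads, and consumes them in batches on a separate thread. The class names, buffer sizes, and single consumer are our simplifications for illustration; as noted above, the real service used two consumer pools and sized its buffers from the API contract’s limits.

```java
import java.util.concurrent.Executors;

import com.lmax.disruptor.EventFactory;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;

/** A mutable, reusable event slot; the ring buffer pre-allocates these once. */
final class LogEvent {
    final char[] buffer = new char[4096]; // placeholder size, fixed by requirement analysis
    int length;

    static final EventFactory<LogEvent> FACTORY = LogEvent::new;
}

/** Consumes events in batches; endOfBatch signals when to flush downstream. */
final class LogEventHandler implements EventHandler<LogEvent> {
    @Override
    public void onEvent(LogEvent event, long sequence, boolean endOfBatch) {
        // Append event.buffer[0..length) to an output batch and flush it
        // to the external system when endOfBatch is true.
    }
}

public final class EventPipeline {
    private final RingBuffer<LogEvent> ringBuffer;

    public EventPipeline(int ringSize) { // ringSize must be a power of two
        Disruptor<LogEvent> disruptor = new Disruptor<>(
                LogEvent.FACTORY, ringSize, Executors.defaultThreadFactory());
        disruptor.handleEventsWith(new LogEventHandler());
        this.ringBuffer = disruptor.start();
    }

    /** Called from HTTP handler threads: copy the message into a pre-allocated slot. */
    public void publish(String message) {
        long seq = ringBuffer.next();              // claim the next slot (lock-free)
        try {
            LogEvent event = ringBuffer.get(seq);
            event.length = Math.min(message.length(), event.buffer.length);
            message.getChars(0, event.length, event.buffer, 0);
        } finally {
            ringBuffer.publish(seq);               // make the slot visible to consumers
        }
    }
}
```

Because the slots are allocated once and reused, publishing an event creates no new objects on the hot path, and the claim/publish protocol replaces locks with sequence counters.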

Garbage collection tuning

Long or frequent GC pauses would directly hurt our tail latency, so garbage collection was a constant concern. We worked around these garbage collection concerns by doing the following:

1. Being very thrifty with object creation: Wherever possible, objects are reused by declaring them as ThreadLocals, while keeping the code clean and using proper design patterns (see the sketch after this list).

2. Finalizing the max limits in the API contract: Parameters like max request payload size, max number of ML models, and max size of ML models govern our application’s RAM requirements. Finalizing such max limits in the API contract gave us a better idea of the number of objects we have to support in the worst case, and let us do a more granular analysis of the algorithm’s space complexity. This in turn helped us set important GC parameters like Java’s generation sizes.

3. Having GC-free implementations: Certain implementations, like event generation and processing, are completely GC-free. We achieved this by predetermining the character buffer sizes based on requirement analysis and sharing their contents between threads via array copy instead of creating multiple short-lived String instances. We implemented our own JSON map builder instead of using libraries, which facilitated swift, GC-free JSON conversion. While pushing messages to external systems, we write bytes directly to the OutputStream of HttpURLConnection. Such measures prevented the creation of multiple short-lived objects (the sketch after this list illustrates the pattern).

4. Fine tuning GC parameters: Most tenured objects survived until the end of the application. Since we create a large number of such tenured objects, a small Young Generation led to frequent GC pauses at application start-up. To address this we increased the size of the Young Generation, and accordingly increased the size of the Tenured Generation as well. We found that a larger Young Generation delayed memory allocation failures, with only a minor increase in individual pause time, and hence fewer GC pauses overall. After multiple iterations of load testing, we brought cumulative GC pause time down to ~30 milliseconds per 20 seconds of run time!

5. Using memory profiling tools: Tools like jmap were very handy in GC tuning. jmap let us calculate exact counts and sizes of tenured objects and estimate sizes for the various Java generations.
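A minimal sketch of the reuse pattern behind points 1 and 3 is below. The class and method names and the 8 KB buffer size are our illustrative placeholders (the real buffer sizes come from the API contract’s max limits), and real code would also handle escaping and non-ASCII characters; the point is that the per-thread buffer is allocated once and the JSON bytes go straight to the connection’s OutputStream without intermediate String objects:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Illustrative only: builds a small, flat JSON object into a reusable
 * per-thread byte buffer and writes it straight to the connection's
 * OutputStream, avoiding short-lived String/byte[] garbage on the hot path.
 */
public final class EventWriter {

    // One pre-sized buffer per thread, reused across requests (point 1).
    // 8 KB is a placeholder for a size derived from the API contract's limits.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[8 * 1024]);

    public void send(URL endpoint, String key, String value) throws IOException {
        byte[] buf = BUFFER.get();
        int len = 0;

        // Hand-rolled JSON building (point 3): {"key":"value"}
        len = append(buf, len, '{');
        len = appendQuoted(buf, len, key);
        len = append(buf, len, ':');
        len = appendQuoted(buf, len, value);
        len = append(buf, len, '}');

        HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setFixedLengthStreamingMode(len);     // body length known up front, no internal buffering
        try (OutputStream out = conn.getOutputStream()) {
            out.write(buf, 0, len);                // write the bytes directly, no String built
        }
        conn.getResponseCode();                    // read the status to complete the request
    }

    private static int append(byte[] buf, int pos, char c) {
        buf[pos] = (byte) c;
        return pos + 1;
    }

    private static int appendQuoted(byte[] buf, int pos, String s) {
        pos = append(buf, pos, '"');
        for (int i = 0; i < s.length(); i++) {
            pos = append(buf, pos, s.charAt(i));   // assumes ASCII with no characters needing escapes
        }
        return append(buf, pos, '"');
    }
}
```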


Getting there took multiple iterations, and in each we worked on optimal data structures and algorithms for swift code execution, lock-free implementations, optimal memory usage, and GC tuning. To ascertain the performance improvement, we concluded each iteration with performance benchmarks and code profiling. Improvements are likely still possible by exploring other areas such as network settings, different protocols, and GC algorithm comparisons.

Follow the Adobe Tech Blog for more developer stories and resources, and check out Adobe Developers on Twitter for the latest news and developer products.
