Rocket Growth’s ML platform: Handling high traffic and more than 20 models cost-efficiently

How Coupang simplified and extended its large-scale ML platform with an orchestration layer

Coupang Engineering
Coupang Engineering Blog
8 min read · Sep 23, 2022


By Harsha Konda

This post is also available in Korean.

Rocket Growth is a Coupang Fulfillment business that enables sellers to offer a best-in-class delivery and post-purchase experience to customers. As more sellers leverage the power of Rocket Delivery and sales analysis with Rocket Growth, some of the manual processes, such as quality checks when registering new products, were quickly becoming bottlenecks in the system.

In an effort to remove this bottleneck and improve system scalability, we applied machine learning (ML) techniques to automate manual decisions and speed up decision making. Along the way, we tackled numerous problems, such as latency, fault tolerance, and scalability, with improvements to the system architecture.

We will take a step-by-step approach to discuss how the architecture evolved for each problem at hand, and hopefully document a few common patterns that you can leverage for your existing ML services.

Table of contents

· Initial architecture
Advantages
Shortcomings
· Orchestrating high latency ML models
· Speeding up model inference
Benchmarks
· Conclusion

Initial architecture

The Rocket Growth ML platform was tailored to mimic the human auditing process, which typically involves checking multiple features of a seller’s item. For instance, a quality check for an item entails verifying the brand’s authenticity, detecting the certification logo, extracting text from the item’s image to filter out any inappropriate text, and more.

We leverage multiple ML models to gather data and make the quality check judgement automatically. A client or feature calls the models of interest, waits for the responses, and makes a decision based on the model outputs.

We set the following three key requirements for our ML platform:

  1. Handle high traffic: The ML modules were projected to serve 5 million requests per day.
  2. Bounded latency: Each model should take anywhere from 1 second to 2 minutes to process a request.
  3. Fault tolerance: The setup should be able to handle intermittent failures with retries.

To meet the above requirements, each ML module was exposed through a Kafka topic, to which the client or feature publishes its request. The module consumes messages from this topic, calls the ML model for the inference output, and finally publishes the output as a message to a result Kafka topic. Clients listen to the result topic, filter out irrelevant messages using the request ID, and aggregate the relevant responses.

Figure 1. The initial architecture of our ML module platform
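
To make this flow concrete, here is a minimal sketch of what a single ML module’s consume-infer-publish loop could look like in Java. The topic names, configuration values, and the infer() placeholder are illustrative assumptions rather than Coupang’s actual implementation.

// Minimal sketch of one ML module worker: consume requests from the module's own
// topic, run inference, and publish the output to a shared result topic.
// Topic names and the infer() call are hypothetical placeholders.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class LogoDetectionWorker {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "logo-detection");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {

            consumer.subscribe(List.of("logo-detection-requests"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The record key carries the requestId so clients can filter the result topic.
                    String output = infer(record.value());
                    producer.send(new ProducerRecord<>("ml-results", record.key(), output));
                }
            }
        }
    }

    private static String infer(String requestJson) {
        // Placeholder: a real module would call the ML model here and return its output.
        return requestJson;
    }
}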

Advantages

This initial ML platform architecture had two main advantages.

First, it was asynchronous. Each feature could depend on an average of 5–10 models, and we needed to account for the overall latency of the combined responses. Serving these as REST services meant the client needed to figure out how to call each of the dependent modules, wait for a response (which could typically take 5 minutes), and handle retries in case of failures. Wrapping these models with a Kafka topic solved the issue of retries, because each failure at the model level would be retried within the consumer. In addition, it shifted the method of communication from synchronous to asynchronous: the client does not have to wait for all the requests to be completed but can simply listen to the result topic.

Second, it was loosely coupled. Each ML module was decoupled from the others by having its own Kafka topic that a client can interact with. Using this pattern, the team could independently deploy each model running on a container orchestration system. Furthermore, production issues were isolated within the respective components.

Shortcomings

Although the initial architecture fulfilled requirements 1 and 2 discussed above, it had a few key shortcomings:

  • Weak abstraction layer: By decoupling the modules, job orchestration was delegated to the clients. This meant that to add a new feature, the client had to write logic to publish to all the Kafka topics of interest and figure out a scheme to aggregate the responses.
  • Cumbersome to add new models: Creating a new ML model also meant creating a new topic and figuring out how to publish to the result topic.
  • Increased latency: Requests are routed through Kafka topics, which adds an extra delay for publishing to and consuming from each topic downstream.

The following sections discuss how we solved these shortcomings in a cost-efficient way.

Orchestrating high latency ML models

To improve the client’s experience, we introduced an orchestration layer that manages the client interactions with the ML models. A client calls the orchestration layer to submit a request, which is then routed to the respective models. Finally, the responses are aggregated and returned to the client.

In the initial architecture, building a new ML model meant that the engineer had to create a new Kafka topic, consume from it, and publish to a result topic. We wanted to simplify this experience. One way to do this would be to fall back on REST services to interact with the models. However, to avoid the high I/O cost associated with the high latency of REST services, we introduced a non-blocking I/O (NIO) orchestration layer.

The orchestrator handles the request routing that had previously been handled by Kafka topics, eliminating the need for the per-module topics entirely. The orchestrator figures out which features to call for a given client, waits for the responses, handles retries, and sends the combined responses back to the client.

Here’s how the client and orchestrator work together:

  • Client calls the orchestrator: The client submits a request with a request ID, execution type, detections, and data. The detections property indicates which ML models the client wants to utilize. The data property provides the orchestrator with all the corresponding input data that the models should use. A typical request is shown below.
{
  "requestId": "01a66e3c-c168-407d-bfbe-9729f5ebbbd1",
  "execution": "sync",
  "detections": ["keyword_detection", "brand_detection", "logo_detection"],
  "data": [{"imageurl": "http://url-link1"}]
}
  • Orchestrator executes the client request: Depending on the execution property given in the request, the orchestrator decides whether to keep the request connection alive (sync) or respond immediately and publish the outcome to a result topic (async). Retries can be handled using Reactor Core abstractions such as Mono and Flux; a sketch follows Figure 2.
Figure 2. For low latency and improved client experience, we added an orchestrator to our ML module platform.
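
Below is a rough, hypothetical sketch of the fan-out and retry logic described above, using Spring WebFlux’s WebClient together with Reactor’s Mono, Flux, and Retry. The endpoint URLs, the string payloads, and the class and method names are assumptions; the post only confirms that Reactor abstractions are used for retries.

// Sketch of the orchestrator fan-out: call each requested detection model with
// non-blocking I/O, retry transient failures, and aggregate the responses.
// Endpoint URLs and payload types are illustrative assumptions.
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.util.retry.Retry;

import java.time.Duration;
import java.util.List;
import java.util.Map;

public class DetectionOrchestrator {

    private final WebClient webClient = WebClient.create();

    // Hypothetical mapping from detection name to model endpoint.
    private final Map<String, String> endpoints = Map.of(
            "keyword_detection", "http://keyword-detection/infer",
            "brand_detection", "http://brand-detection/infer",
            "logo_detection", "http://logo-detection/infer");

    /** Fan out to all requested detections and aggregate the responses. */
    public Mono<List<String>> orchestrate(List<String> detections, String dataJson) {
        return Flux.fromIterable(detections)
                .flatMap(detection -> callModel(detection, dataJson))   // calls run concurrently
                .collectList();                                         // aggregate into one response
    }

    private Mono<String> callModel(String detection, String dataJson) {
        return webClient.post()
                .uri(endpoints.get(detection))
                .bodyValue(dataJson)
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofMinutes(2))                         // models can take up to ~2 minutes
                .retryWhen(Retry.backoff(3, Duration.ofSeconds(2)));    // retry transient failures
    }
}

For the sync execution type, the orchestrator could return this Mono directly from a WebFlux controller; for async, it could subscribe to it and publish the aggregated result to the result topic instead.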

Using this pattern, we expose a clean interface to clients and significantly reduce the I/O cost of waiting on responses from the ML models. In the next section, we will address how we improved the speed of these models.

Speeding up model inference

We initially chose a container orchestration system as the computation platform for all our ML models. It has advantages such as ease of deployment and mature monitoring tools. However, general-purpose CPU instances are ill-suited for running models that involve a plethora of matrix computations, such as computer vision models. Hence, utilizing GPU instances to run these models was an obvious choice, even considering the additional cost.

For cost efficiency in the production setup, we compartmentalized the ML modules into two buckets: GPU-based modules that perform better on discrete GPUs, and CPU-based modules that are cheaper to operate on general-purpose instances. The orchestrator automatically maps each module to its respective bucket.

With this system, we were able to leverage powerful GPU instances while keeping costs relatively low. For instance, computer vision models such as optical character recognition and logo detection benefited from running on GPUs, whereas some NLP models, such as keyword detection and age detection, were more efficient on CPUs.

Figure 3. The compartmentalized ML module system that helped us speed up calculations at relatively low costs.
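
As a hedged illustration of this mapping, the sketch below routes each detection to a hypothetical GPU or CPU instance pool. The bucket assignments mirror the examples above; the enum, hostnames, and lookup are assumptions, not the actual routing logic.

// Illustrative sketch: the orchestrator maps each detection to a compute bucket
// so GPU-heavy models and CPU-friendly models run on different instance pools.
import java.util.Map;

public class ComputeBucketRouter {

    enum Bucket { GPU, CPU }

    // Hypothetical assignments based on the examples in this post.
    private static final Map<String, Bucket> BUCKETS = Map.of(
            "ocr", Bucket.GPU,
            "logo_detection", Bucket.GPU,
            "keyword_detection", Bucket.CPU,
            "age_detection", Bucket.CPU);

    /** Resolve the instance pool (e.g., a load balancer hostname) for a detection. */
    static String resolveHost(String detection) {
        Bucket bucket = BUCKETS.getOrDefault(detection, Bucket.CPU);
        return bucket == Bucket.GPU ? "gpu-models.internal" : "cpu-models.internal";
    }
}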

Here are some key findings that helped us improve the compartmentalized structure:

  • New-generation Graviton instances. Using AWS Graviton helped us significantly reduce the cost of instances compared with Intel-based P3 instances.
  • Spot instances. Spot instances are significantly cheaper than regular on-demand instances. However, they can be terminated unexpectedly. To avoid potential outages, we set up an auto-scaling group with multiple instance types, where you can configure different fallback rules if the desired spot capacity is unavailable. For more details, refer to the official AWS guide.
  • Single-threaded server. When running a web server on a GPU instance for interacting with ML models, we found that we got the best performance when setting the number of worker threads to 1. Based on our investigation, more than one worker caused far more concurrent operations, which resulted in repeated allocation and deallocation of GPU memory for tensor weights. This led to more memory copies, dramatically increasing request time. A minimal sketch of this idea follows below.
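
The post does not say which server framework the GPU-backed models use, so the following is only a minimal sketch of the single-worker idea, using the JDK’s built-in HttpServer: one worker thread handles every inference request, so requests are processed one at a time on the GPU. The endpoint path and the infer() placeholder are assumptions.

// Illustration of the single-worker pattern: the HTTP server fronting the GPU model
// uses exactly one worker thread, so inference requests are serialized and GPU memory
// is not churned by concurrent tensor allocations.
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class SingleWorkerModelServer {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/infer", exchange -> {
            byte[] input = exchange.getRequestBody().readAllBytes();
            byte[] output = infer(input);                       // placeholder for the GPU model call
            exchange.sendResponseHeaders(200, output.length);
            try (OutputStream body = exchange.getResponseBody()) {
                body.write(output);
            }
        });
        // The key line: a single worker thread serializes all inference requests.
        server.setExecutor(Executors.newSingleThreadExecutor());
        server.start();
    }

    private static byte[] infer(byte[] requestBody) {
        // Placeholder: a real server would run the model here.
        return "{}".getBytes(StandardCharsets.UTF_8);
    }
}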

Benchmarks

  • Latency: Running on GPUs with some of the optimizations mentioned above resulted in 20–40x speed improvements. Below is one of the benchmarks for the latency of the optical character recognition model on CPU vs. GPU, in seconds.
Figure 4. Benchmark for the latency of optical character recognition model on CPU vs GPU in seconds
  • Cost: The graph below depicts cost against latency in seconds for a set of GPU instance configurations that we experimented with. In our tests, we found that G5 instances delivered the best performance at the lowest cost. By combining some of the methods outlined in this post, we saved 95% compared to our initially projected cost.
Figure 5. Cost against latency in seconds for a set of GPU instance configurations

Conclusion

To summarize, this post covered several patterns:

  • Orchestrating high-latency models with non-blocking I/O
  • Optimizing servers running on GPUs, which involves picking the right instance based on a performance-vs.-cost analysis and running with a single worker

Using these methods, our ML platform scaled to millions of daily requests and integrated 20 ML models in the span of 2 months. However, there are more challenges ahead, such as distributing these models across instances to optimize for memory capacity and developing a feedback loop that can improve the models.

If any of the issues raised in this post interest you, come join our team!


Coupang Engineering is also on Twitter. Follow us for the latest updates!

