Meet Coupang’s Machine Learning Platform

How Coupang’s ML Platform accelerates ML development for Coupang products

Sep 8, 2023


By Hyun Jung Baek, Hara Ketha, Jaideep Ray, Justina Min, Mohamed Sabbah, Ronak Panchal, Seshu Adunuthula, Thimma Reddy Kalva, and Enhua Tan

This post is also available in Korean.

Introduction

Coupang is reimagining the shopping and delivery experience to wow customers from the instant they open the Coupang app to the moment an order is delivered to their door. Beyond e-commerce, Coupang offers a range of consumer services, from Coupang Eats for food delivery and Coupang Play for video streaming to Coupang Pay for payments and Coupang Grocery for fresh products, among others.

Machine Learning (ML) impacts every aspect of the e-commerce experience for Coupang customers: the product catalog, search, pricing, robotics, inventory, and fulfillment. As Coupang ventures into new markets, ML continues to play an ever more important role.

ML helps power search and discovery across Coupang websites and apps, price products and services, streamline logistics and delivery, optimize content for streaming, rank ads, and much more.

Therefore, we strive to scale machine learning development at all ML lifecycle stages, including ad-hoc exploration, training data preparation, model development, and robust production deployment of models.

Table of Contents

· ML @ Coupang
· Motivation
1. Reduce time to production
2. Incorporate CI/CD in ML development
3. Scale ML compute efficiently
· Core offerings of Coupang ML Platform
1. Notebooks & ML Pipeline Authoring
2. Feature Engineering
3. Model Training
4. Model Inference
5. Monitoring & Observability
6. Training & Inference Clusters
· Success Stories
1. Training Ko-BERT to understand search queries better
2. Real-time price forecasting of products

ML @ Coupang

ML teams at Coupang are actively developing models in Natural Language Processing (NLP), Computer Vision (CV), Recommendations, and Forecasting. NLP is used to understand search queries, product listings, and ads content. Computer vision-enabled image understanding categorizes similar products and ads. Recommendation models rank content for product search, videos in Coupang Play, and product ads. Forecasting techniques help us understand supply, demand, and pricing for millions of products.

This post introduces Coupang’s internal ML platform and describes how the platform supports the increasing scale and diversity of workloads across ML frameworks, programming languages, different model architectures, and training & serving paradigms.

Motivation

The motivation behind Coupang ML Platform is to provide ‘batteries-included’ services to accelerate ML development through improved developer productivity.

Core services include managed notebooks (Jupyter), a pipeline SDK, a feature store, model training, and model inference. ML teams can use the services independently to compose their ML pipelines. Our focus areas are as follows:

1. Reduce time to production

Before the Coupang ML Platform, authoring and training an ML model required hours of non-trivial setup and boilerplate code to prepare data and features and to write trainer code. Tasks like scaling model training across GPUs with distributed training demanded deep engineering work, and each team ended up duplicating that stack.
Deploying an ML model to serve real-time traffic took weeks of effort, with logic for model benchmarking, auto-scaling, security, and rollback replicated across teams. These were blockers to product groups adopting ML at a larger scale. By leveraging the ML Platform's lifecycle services, teams can now train, debug, and deploy models, from simple to complex, to production within days in a standardized way.

2. Incorporate CI/CD in ML development

ML development can quickly incur heavy technical debt. To make it easier for ML teams to build, deploy, and maintain models, we provide integration-tested, prepackaged containers with popular ML libraries.
Moreover, we provide libraries to validate models, add canary stages to model deployments, and monitor primary metrics during serving.

3. Scale ML compute efficiently

There is surging demand for compute at Coupang: GPUs for deep learning training, storage for large datasets, and network bandwidth for distributed training. Cloud costs are high, given the large fleet of models training on the platform. The Coupang ML Platform team therefore manages a hybrid setup, with compute and storage clusters both on-premises and on AWS. The on-premises setup provides more customization and a powerful GPU cluster at lower cost, while the cloud setup scales on demand when on-premises resources are insufficient.

Figure 1. Coupang ML Platform overview

Core offerings of Coupang ML Platform

1. Notebooks & ML Pipeline Authoring

The ML Platform provides a hosted, containerized notebook service for developers to iterate on their ideas. Notebooks can be launched with custom or standard containers on CPUs or GPUs.

The platform team maintains a set of standard Docker containers with popular ML libraries such as TensorFlow, PyTorch, scikit-learn, and Hugging Face Transformers. These containers avoid dependency conflicts and make pipelines repeatable.

For pipeline authoring, the platform provides a set of Python SDKs for data fetching, feature-store, training, and inference.
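Coupang's pipeline SDKs are internal, so purely as an illustration of the authoring pattern, here is a minimal pipeline written with the open-source Kubeflow Pipelines (kfp v2) SDK, which follows a similar containerized model; the step bodies, names, and image choices are placeholders, not the platform's actual API.

```python
# Illustration only: a containerized two-step pipeline in kfp v2.
# Coupang's internal SDKs differ; step bodies here are placeholders.
from kfp import compiler, dsl

@dsl.component(base_image="python:3.10")
def prepare_data(source: str) -> str:
    # Placeholder step: fetch and clean training data
    return f"prepared:{source}"

@dsl.component(base_image="python:3.10")
def train_model(dataset: str) -> str:
    # Placeholder step: fit a model on the prepared dataset
    return f"model-trained-on:{dataset}"

@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(source: str = "catalog_clicks"):
    data = prepare_data(source=source)
    train_model(dataset=data.output)

if __name__ == "__main__":
    # Emit a spec that a Kubernetes-backed pipeline runner can execute
    compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```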

2. Feature Engineering

Coupang ML Platform offers a feature store that makes prepared features easy to access in both offline and online modes. It is built on top of the popular open-source project Feast.

  • The offline feature store is used to share prepared features and to build training datasets. We are working with teams to onboard foundational features, such as customer insights, that multiple downstream teams can consume.
  • The online feature store serves features with low latency during inference. It acts both as a feature source for models and as a prediction-response cache for compute-intensive models.
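Since the feature store builds on Feast, a minimal sketch of Feast's open-source Python API illustrates the two access modes; the feature names, entity keys, and repo path below are invented for illustration.

```python
# Sketch of offline vs. online feature access with Feast.
# Feature names and entity keys are illustrative placeholders.
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # path to a Feast feature repo

# Offline mode: point-in-time-correct joins for building training sets
entity_df = pd.DataFrame({
    "customer_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2023-09-01", "2023-09-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "customer_insights:recent_views",
        "customer_insights:avg_order_value",
    ],
).to_df()

# Online mode: low-latency lookups at inference time
online_features = store.get_online_features(
    features=["customer_insights:recent_views"],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```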

3. Model Training

ML teams at Coupang use different modeling frameworks, from popular ones such as PyTorch, TensorFlow, scikit-learn, and XGBoost to niche ones such as Prophet for forecasting.

The training stack is framework-agnostic. User-written pipelines are containerized and launched on the Kubernetes cluster, where a batch scheduler places jobs on the desired hardware. Users can configure their jobs to run on any CPU or GPU type available in the cluster. This is very useful because jobs benefit from different CPU and GPU types depending on their characteristics, improving return on investment; for example, users can configure model training and batch inference to run on different GPU types, trading off speedup against GPU cost.
The scheduler follows an all-or-nothing (gang) resource allocation strategy. The training stack supports distributed training strategies (distributed data parallel and fully sharded data parallel) to train large models; the sketch below shows the data parallel flavor. Multi-GPU training has sped up model training workloads significantly across Coupang.
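As a concrete illustration of distributed data parallel training, here is a minimal PyTorch DDP loop; the model, data, and hyperparameters are placeholders. Fully sharded data parallel training wraps the model analogously with PyTorch's FSDP module.

```python
# Minimal PyTorch distributed data parallel (DDP) sketch, launched as:
#   torchrun --nproc_per_node=<num_gpus> train.py
# The model, data, and hyperparameters are placeholders.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = DDP(torch.nn.Linear(128, 2).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()

    for _ in range(100):  # placeholder loop over per-rank data shards
        x = torch.randn(32, 128, device=device)
        y = torch.randint(0, 2, (32,), device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```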

It requires significant effort to tune trainer parameters to efficiently train deep learning models. As the platform team, we benchmark trainers for popular model architectures used internally and share the most effective techniques and best practices amongst all groups who use the platform.

4. Model Inference

After training, a model is deployed to an experimentation or production environment to serve real traffic. We use the Seldon platform on Kubernetes for model inference. Seldon integrates with serving libraries such as TFServing and Triton and also supports custom Python wrappers, so it covers a wide range of model frameworks, runtimes, and hardware (CPU and GPU serving).
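Seldon Core's custom Python wrapper contract is simply a class exposing a predict method, as in the minimal sketch below; the artifact name and scikit-learn-style model are placeholders.

```python
# Minimal Seldon Core custom Python wrapper. Seldon serves this class
# with `seldon-core-microservice Model`. The artifact is a placeholder.
import joblib
import numpy as np

class Model:
    def __init__(self):
        # Load the trained artifact once at container startup
        self._model = joblib.load("model.joblib")  # placeholder artifact

    def predict(self, X, features_names=None):
        # Seldon routes REST/gRPC prediction requests to this method
        return self._model.predict_proba(np.asarray(X))
```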

Each ML model can be deployed as a standalone service with autoscaling. Deploying each model as its own service provides isolation and allows integration with standard CI/CD infrastructure. Model deployment jobs run multiple validation tests (model size, training-prediction skew, etc.) before moving into a canary phase. If the canary results are successful, the model is gradually rolled out. Developers need minimal effort (adding hooks for model validation and canary-result verification) to safely get their models serving production traffic.
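The validation hooks themselves are internal; purely as a hypothetical sketch, a model-size check might look like the following, with the threshold and function name invented for illustration.

```python
# Hypothetical pre-deployment validation hook; the threshold and
# function name are invented for illustration, not Coupang's API.
import os

MAX_MODEL_BYTES = 2 * 1024**3  # example limit: 2 GiB

def validate_model_size(artifact_path: str) -> None:
    """Fail the deployment job if the model artifact is too large."""
    size = os.path.getsize(artifact_path)
    if size > MAX_MODEL_BYTES:
        raise ValueError(
            f"{artifact_path} is {size} bytes, exceeding the "
            f"{MAX_MODEL_BYTES}-byte deployment limit"
        )
```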

To serve compute-intensive features, such as embeddings, in real time with low latency, we use the online feature store mentioned above. For very large models (LLMs, multimodal models), we are investing in batch and real-time GPU-based serving, which provides higher throughput than CPU serving.

Figure 2. Training workflow
Figure 3. Serving workflow

5. Monitoring & Observability

All Coupang ML Platform services have monitoring enabled. The training cluster has resource and job monitoring dashboards (GPUs, CPUs, and memory in use), with GPU and CPU utilization metrics per workload.
The inference service has runtime monitoring for memory usage and prediction scores. We plan to introduce data quality checks (anomaly detection, drift monitoring) across feature and model serving.
Developers use cluster usage dashboards to understand resource allocations and scheduling delays. For debugging, application and resource usage logs are collected from the clusters and surfaced to developers through dashboards. Alerts are also set up for events such as stuck or idle training jobs, failures to launch training or serving instances, and memory spikes.

Figure 4. Serving monitoring
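The post does not name a specific monitoring stack; purely as an illustration, an inference service could export such runtime metrics with the open-source prometheus_client library, with all metric names invented here.

```python
# Illustration only: exporting inference runtime metrics with
# prometheus_client. Metric names are invented; Coupang's actual
# monitoring stack is not specified in this post.
import resource
from prometheus_client import Gauge, Histogram, start_http_server

PREDICTION_SCORE = Histogram(
    "model_prediction_score", "Distribution of model prediction scores"
)
MEMORY_BYTES = Gauge(
    "model_process_memory_bytes", "Peak resident memory of the process"
)

def record_prediction(score: float) -> None:
    PREDICTION_SCORE.observe(score)
    # ru_maxrss is reported in kilobytes on Linux
    MEMORY_BYTES.set(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024)

if __name__ == "__main__":
    start_http_server(9090)  # expose /metrics for a scraper to collect
```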

6. Training & Inference Clusters

In the era of large datasets and deep learning models, hardware (especially accelerators such as GPUs) plays a crucial role in ML development. Through active collaboration with the cloud infrastructure engineers at Coupang, we provide compute and storage clusters in our on-premises data center and on AWS.

Figure 5. Monitoring GPU utilization of the training cluster

Training requires instances with large memory, accelerators such as GPUs, high bandwidth connection between nodes for distributed training, and a shared storage cluster to store training data and output artifacts such as model checkpoints.

Serving requires high I/O throughput machines for performance and availability. We have a dedicated set of machines optimized for serving in multiple availability zones. Autoscaling ensures that the cluster can handle traffic spikes.

Success Stories

Through our partnership with ML teams at Coupang, we can systematically scale solutions that have been proven in one domain and generalize to others.

Here are two recent customer success stories supported by the Coupang ML Platform:

1. Training Ko-BERT to understand search queries better

ML developers working in search and recommendations launched embedding-based retrieval to augment classical term-matching retrieval. Multi-GPU distributed training on A100 GPUs provided a 10x speedup for BERT training compared to the previous generation of GPUs and training strategy.

After the success of BERT, the developers are experimenting with fine-tuned large language models (LLMs) to improve search quality across different surfaces. LLM fine-tuning exercises many parts of the ML platform: efficient cluster usage, distributed training strategies, high-throughput GPU-based inference, and more.
We have been fairly successful in adopting and democratizing new ML innovations through our platform.

2. Real-time price forecasting of products

Data science teams in Customer and Growth model various time series data to forecast price, demand, and page views, among other signals. The team moved its entire suite of pricing models from a custom inference stack to ML Platform serving. It no longer has to maintain its own deployment cluster and can focus entirely on developing better models.

Even though we are still early in our journey, we see good traction among customers using the services as building blocks in their ML pipelines. Over the past year, there have been 100K+ workflow runs on the platform, spanning 600+ ML projects. We have seen a massive increase in the size of models being experimented with, resulting in several quality wins across Coupang services. All major ML groups at Coupang use one or more Coupang ML Platform services.
We see developers building domain-specific toolkits, such as language modeling and AutoML, on top of the Coupang ML Platform. There has also been strong interest in, and adoption of, CI/CD and best-practice features such as the online feature store and monitoring.

Upcoming posts will describe Coupang's core services and the applications they support in more detail. If you are interested in tackling machine learning and infrastructure challenges that enable developers to solve hundreds of business problems and improve the customer experience, consider applying for a role on our team!
