Data platform 2022: Global expansion in petabytes

Part 1 of a series about Coupang’s data platform, focusing on the data ingestion, machine learning, and experimentation platforms.

Coupang Engineering Blog · Jun 17, 2022


By Youngwan Lim, Michael Sommers, Eddard (Hyo Kun) Park, Thimma Reddy Kalva, Ibrahim Durmus, Martin (Yuxiang) Hu, Enhua Tan

This post is also available in Korean.

Coupang delivers millions of products each year at rocket speed and scale for a growing number of customers. Behind the magic of our game-changing services, such as next-day delivery, is a complex data platform that is constantly evolving under the careful supervision of our talented engineers across the globe.

Table of contents

· Data platform 2022
· Ingestion platform
· Machine learning platform
· Experimentation platform
· Conclusion

Data platform 2022

Coupang data platform architecture
Figure 1. The overall architecture of our current data platform.

Our vision for the data platform is to provide robust and user-friendly tools that empower internal users to transform large raw datasets into analytics that help us make critical and time-sensitive business decisions. The users of the data platform are from approximately 50 different teams and include everyone from engineers and business analysts to C-level executives and external users such as suppliers and advertisers.

As you can imagine, the data platform supports hundreds of users and petabytes of data. The platform processes over five thousand jobs and extracts over 2 TB of data from approximately 70 different sources every day. On top of that, we expect severalfold increases in these numbers as we expand and grow our business.

In this post, we discuss how our data platform has evolved since 2019, focusing on the ingestion, machine learning, and experiment platforms. For details about the analytics platform, check out part 2 of this series.

Ingestion platform

Let’s start off by examining our ingestion platform, the gateway between our data sources and data platform. Its goal is to efficiently ingest raw data in varying schemas from a wide range of data sources into a single data lake.

Challenges

Previously, we relied on hand-crafted, source-specific data ingestion pipelines built on an ad-hoc basis. However, with the enormous growth in data volume and analytical workloads, building new extraction and load pipelines for each source became difficult to manage at scale and limited the on-demand analytics needed to make urgent business decisions.

In addition, these extraction pipelines required significant maintenance effort whenever upstream schemas changed. Such dependence on manual maintenance also increased the chance of failures due to sudden data surges and skews.

Although this system worked for us in the early stages of the business, it became error-prone and inefficient with our massive data growth.

Universal data ingestion: Source-agnostic and automated

To solve the challenges mentioned above and to move away from a high-touch service model, we introduced a scalable and self-service data ingestion platform.

We built the universal data ingestion (UDI) framework to be source-agnostic, fully automated, and user-driven (not data engineer-driven). The framework also aims to standardize ingestion processes for our 70 different data sources. UDI was designed with the three principles that we refer to as the 3Vs: Velocity, Variety, and Volume.

Velocity
To increase velocity, UDI automatically orchestrates ingestion processes that were previously manual and time-consuming. With UDI, ingestion is simple and requires little to no engineering time.

For example, data batches go through an extensive inspection before ingestion. We check that the source data can be securely accessed according to our security protocols. For structured data, we check the schema so that only relevant sections are pulled. These and other comprehensive checking and filtering processes are automated by UDI.

In addition, UDI automatically estimates the batch size of ingestion. This process is important for RDBMS-based data sources that contain critical business data. To avoid incurring an excessive load, UDI estimates an efficient yet secure amount of data that can be pulled in each batch.
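As an illustration, the sketch below shows one way such a batch-size estimate could be computed. It is a minimal example, not Coupang’s actual heuristic; the function name, byte budget, and row cap are all assumptions.

```python
def estimate_batch_rows(avg_row_bytes: int,
                        per_pull_byte_budget: int = 256 * 1024 * 1024,
                        hard_row_cap: int = 1_000_000) -> int:
    """Hypothetical heuristic: pick a per-batch row count that stays within a
    byte budget for each pull from the source database, capped at a hard limit
    so a single batch can never overload the RDBMS."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    rows_within_budget = per_pull_byte_budget // avg_row_bytes
    return max(1, min(hard_row_cap, rows_within_budget))


# Example: rows averaging ~2 KB each -> roughly 131,000 rows per batch.
print(estimate_batch_rows(avg_row_bytes=2_048))
```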

Variety
Initial pipelines focused on connecting MySQL to Hive, the most common configuration in our production environment. To support a more diverse range of data sources, we built the self-service ingestion as a plug-in framework on top of the Hadoop ecosystem. Users can add support for new data sources at incremental cost and with minimal development effort by adapting our reusable plug-in framework.
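To make the plug-in idea concrete, here is a minimal sketch of what a source connector contract could look like. The interface, method names, and types are illustrative assumptions, not the actual UDI API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator


class SourceConnector(ABC):
    """Illustrative plug-in contract: each new data source implements this
    interface, while the ingestion framework handles scheduling, checks,
    and loading into the data lake."""

    @abstractmethod
    def discover_schema(self) -> Dict[str, str]:
        """Return column name -> type for the source (structured sources only)."""

    @abstractmethod
    def read_batch(self, watermark: str, max_rows: int) -> Iterator[Dict[str, Any]]:
        """Yield rows newer than the watermark, up to max_rows per batch."""


class MySqlConnector(SourceConnector):
    """Toy example of a concrete plug-in for a MySQL source."""

    def __init__(self, dsn: str, table: str):
        self.dsn, self.table = dsn, table

    def discover_schema(self) -> Dict[str, str]:
        # A real connector would query information_schema.columns here.
        return {"order_id": "BIGINT", "updated_at": "TIMESTAMP"}

    def read_batch(self, watermark: str, max_rows: int) -> Iterator[Dict[str, Any]]:
        # A real connector would page through rows with updated_at > watermark.
        yield from ()
```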

Volume
From the initial design phase, UDI was built for high scalability and followed open standards. To efficiently transform our large volumes of data into analytics, we employ the systems below.

  • Data storage using AWS S3. Our data lake is based on S3, which is horizontally scalable and cost efficient. S3 allows us to hyper-scale for our growing data volume.
  • Data consumption on Hive. The main consumption layer is employed on Hive, a well-understood and battle-tested solution for warehousing needs that is widely used.
  • Data processing using Spark. The primary data processing engine used for extract, load, and transform (ELT) is based on Spark, which is faster than other MapReduce-based offerings thanks to its native in-memory processing (see the sketch after this list).
  • Ad-hoc queries using Presto. For interactive and ad-hoc queries where speed is key, we leverage Presto to deliver faster analytics.
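The following PySpark sketch shows the shape of a typical ELT job in this stack: read raw data from the S3 data lake, clean it, and publish it as a partitioned Hive table that Presto can query ad hoc. The bucket, paths, and table names are hypothetical, not our actual layout.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("udi-elt-example")
    .enableHiveSupport()          # write results as Hive tables for downstream consumption
    .getOrCreate()
)

# Hypothetical raw landing zone in the S3 data lake.
raw = spark.read.json("s3a://example-data-lake/raw/orders/dt=2022-06-17/")

cleaned = (
    raw
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .dropDuplicates(["order_id"])
    .withColumn("dt", F.lit("2022-06-17"))
)

# Publish as a partitioned Hive table for the consumption layer.
(
    cleaned.write
    .mode("overwrite")
    .partitionBy("dt")
    .saveAsTable("analytics.orders_cleaned")
)
```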

Next steps

In the future, we want to introduce a streaming solution, including robust event-based ingestion and generic event consumers. Additionally, we want to gradually deprecate the direct change data capture (CDC) sync process to adopt a strongly typed log-based ingestion process. Lastly, we will continue to seek ways to enhance the end-to-end customer experience by improving the self-service functions of the ingestion platform.

Machine learning platform

Now we will go through how we use the ingested data to train machine learning (ML) models that provide us with valuable insights. This section focuses on our internal machine learning platform, used by Coupang data scientists and ML engineers to seamlessly build, train, and deploy ML models at scale.

Challenges

Previously, data scientists and ML engineers across different teams managed their own ML infrastructures. This negatively impacted our engineering efficiency in two ways. First, each team spent large amounts of time setting up similar data science environments utilizing the same packages and tools. Second, each team separately prepared data, built feature pipelines, and trained and deployed models without a standardized process. Due to a lack of standardization and central management of resources, it was difficult for teams to adequately access GPUs. Such redundant engineering efforts resulted in inefficient resource utilization and low ML throughput.

We needed to standardize the model building and GPU utilization processes in a simple and user-friendly platform.

Unified model training platform

The ML platform was built as a scalable training platform that supports popular ML frameworks and tools, such as TensorFlow and PyTorch, for building feature pipelines and offline training. The platform offers data preparation capabilities, interactive notebook environments, distributed training on GPUs, hyperparameter tuning, and model serving.

Our ML workflow consists of data preparation, model training, and predictions, each of which we will detail in the following sections.

Coupang machine learning platform
Figure 2. Numerous teams at Coupang utilize the ML platform to develop and train ML models in an efficient and standardized manner.

Synchronized data preparation at scale

We leveraged our existing data pipeline orchestrator based on Airflow to produce high-quality features and label datasets required for model training. This orchestrator was already fully integrated with our data store and large-scale data processing engines such as Spark, Hive, and Presto. We also developed an SDK to easily synchronize data across the training platform. Together, the orchestrator and data store SDK allow engineers to create, manage, and share large training datasets.
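As a rough illustration (not the actual pipeline), an orchestrated feature pipeline in Airflow might look like the sketch below. The DAG id, task logic, and the sync step into the data store SDK are assumptions made for the example.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def build_features(**context):
    # Placeholder: in practice this step would run a Spark/Hive job that
    # materializes the feature and label tables for a training run.
    pass


def sync_to_training_store(**context):
    # Placeholder: in practice this step would call the internal data store SDK
    # to register and synchronize the dataset across the training platform.
    pass


with DAG(
    dag_id="feature_pipeline_example",      # hypothetical DAG name
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    sync = PythonOperator(task_id="sync_to_training_store", python_callable=sync_to_training_store)
    features >> sync
```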

Quick model training setup and tracking

The model training and development platform was built on open-source tools and aims to standardize training environments for all engineers at Coupang. Here are some of its main features.

  • Container images. We regularly publish container images pre-packaged with popular ML/DL frameworks (TensorFlow, PyTorch, XGBoost, Scikit-Learn, etc.) and data science libraries (Pandas, NumPy, etc.). Using our platform, engineers can start model development by simply launching a Jupyter container.
  • CLI and API endpoint for GPU utilization. For efficiency, the ML platform offers a command line interface (CLI) and an API endpoint through which distributed training and GPU computing can be easily accessed (see the sketch after this list).
  • Hyperparameter tuning. The platform also supports various hyperparameter tuning algorithms that can identify optimal hyperparameters for experiments.
  • Model visualizations. Users can quickly evaluate and check training progress of their models using our experiment tracking and visualization tools.
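For illustration only, submitting a distributed GPU training job through such an API might look like the snippet below. The endpoint URL, job spec fields, and image tag are hypothetical; the real CLI and API are internal to Coupang.

```python
import requests

# Hypothetical job specification for a distributed training run.
job_spec = {
    "name": "ranking-model-finetune",
    "image": "ml-platform/pytorch-gpu:latest",           # pre-packaged container image (assumed tag)
    "command": ["python", "train.py", "--epochs", "10"],
    "workers": 2,
    "gpus_per_worker": 4,
}

# Hypothetical API endpoint exposed by the ML platform.
resp = requests.post(
    "https://ml-platform.internal/api/v1/training-jobs",
    json=job_spec,
    timeout=30,
)
resp.raise_for_status()
print("Submitted training job:", resp.json().get("job_id"))
```
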
Machine learning models trained MoM at Coupang
Figure 3. Since the launch of the ML platform last year, the number of models trained month-over-month (MoM) more than tripled in six months.

Prediction containers for efficient model serving

To further reduce the time between model training and inference, we built model deployment tools and a model serving infrastructure. We utilized our in-house container orchestration platform for Kubernetes and customized open-source tools to support model serving. Engineers can simply write a Jsonnet spec with autoscaling and expose REST and gRPC inference endpoints to serve their models.
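As a loose illustration of what such a spec captures (the actual Jsonnet templates are internal), a rendered model-serving configuration might boil down to something like the structure below; every field name and value is an assumption.

```python
# Hypothetical rendered output of a model-serving spec: a model artifact,
# inference endpoints, and autoscaling bounds for the serving replicas.
serving_spec = {
    "model": {"name": "ranker", "version": "v3", "artifact": "s3://example-models/ranker/v3/"},
    "endpoints": {"rest_port": 8080, "grpc_port": 9000},
    "autoscaling": {"min_replicas": 2, "max_replicas": 20, "target_gpu_utilization": 0.6},
    "resources": {"gpus_per_replica": 1, "memory": "8Gi"},
}
```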

Next steps

Currently, the Search Ranking team uses the ML platform to train BERT and DNN models to improve query understanding, search results, and purchase rates. The Advertising team trains deep interest networks to improve click-through rates and conversion rates. The Marketing team uses it to present customers with personalized recommendations and promotions to improve engagement and conversions to paid memberships.

Our next goal is to improve the model store, where engineers store and manage varying versions of their ML models. We are also collaborating with internal teams to incorporate a feature store and enhance integration of the serving layer with the prediction store.

Experimentation platform

In this last section, we will introduce our experiment platform and how it helps us make data-driven decisions that enhance the customer experience across the globe.

Adding new features to any app is tricky because their impact can be difficult to measure. To establish a scientific and customer-centered approach to feature launches, Coupang uses A/B testing. Every feature in our app, from the delivery preference page to the item recommendations carousel, is born through rigorous A/B testing.

Challenges

Because A/B tests are conducted on actual customers, an ineffective feature can negatively impact sales by blocking customers in the purchase funnel. Our first challenge was to quickly catch and automatically stop tests that could damage our business.

The second challenge we tackled was engineering efficiency in launching A/B test monitoring formulas. It took at least two weeks and significant engineering resources to implement the complex monitoring formulas requested by data scientists, such as jobs that monitor each A/B test for misassigned groups, guardrail latency, flat exposure logs, and missing traffic. Furthermore, engineers needed to restart the monitoring job if test queries were modified midway through an experiment.

Lastly, our experiment platform handles 3 billion events and a thousand live experiments per day. As Coupang expands internationally, we must scale out to service multiple regions, app versions, and app permutations. These additions greatly increased the engineering complexity of calculating metrics for all our A/B tests in an efficient manner.

Detecting and stopping negative experiments

Previously, the test owner was notified if buyer conversion was negatively impacted by a significant margin (p-value ≤ 0.001). If the owner took no action within 12 hours, the A/B test was automatically stopped. Although this system worked, we wanted to detect bad A/B tests in critical domains faster and stop them automatically.

At first, we created a force-stop system that runs in batches every 4 hours to stop bad tests. However, 4 hours was still a lengthy wait, especially with potential sales on the line. This system also still relied on manual intervention from test owners.

To make these decisions faster and automatically, we launched a mission-critical circuit breaker that detects bad experiments and stops them from hurting conversion rates in our main domains in real time.

The circuit breaker system of the Coupang experiment platform for A/B testing
Figure 4. The circuit breaker leverages distributed event and Spark streaming to calculate critical A/B test metrics every two minutes.

In the circuit breaker architecture, each Spark job calculates metrics every 2 minutes and stores them in S3 or MySQL. We decided on a 2-minute interval due to streaming job initialization times. Based on its configuration, the circuit breaker filters and deduplicates data to reduce computation. If the circuit breaker metric is negative by a significant margin (p-value ≤ 0.001) for five consecutive intervals, the A/B test is automatically stopped and an alert is sent to the test owner. With the circuit breaker, negative A/B tests are terminated without any manual intervention in approximately 10 minutes.
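The sketch below illustrates the decision rule described above: a one-sided two-proportion test on conversion per interval, tripping after five consecutive significant, negative intervals. It is a simplified stand-in for the real Spark streaming job; apart from the p-value threshold and interval count from the text, all names and details are assumptions.

```python
import math


def one_sided_p_value(ctrl_conv: int, ctrl_n: int, test_conv: int, test_n: int) -> float:
    """P-value for the hypothesis that the test group's conversion rate is lower
    than control's, using a pooled two-proportion z-test."""
    p_pool = (ctrl_conv + test_conv) / (ctrl_n + test_n)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / ctrl_n + 1 / test_n))
    if se == 0:
        return 1.0
    z = (test_conv / test_n - ctrl_conv / ctrl_n) / se
    return 0.5 * math.erfc(-z / math.sqrt(2))  # Phi(z); small when z is very negative


class ConversionCircuitBreaker:
    """Trips after N consecutive 2-minute intervals with a significantly
    negative impact on conversion (thresholds from the text: p <= 0.001, N = 5)."""

    def __init__(self, p_threshold: float = 0.001, consecutive_required: int = 5):
        self.p_threshold = p_threshold
        self.consecutive_required = consecutive_required
        self.consecutive_bad = 0

    def observe_interval(self, ctrl_conv: int, ctrl_n: int, test_conv: int, test_n: int) -> bool:
        p = one_sided_p_value(ctrl_conv, ctrl_n, test_conv, test_n)
        self.consecutive_bad = self.consecutive_bad + 1 if p <= self.p_threshold else 0
        return self.consecutive_bad >= self.consecutive_required  # True -> stop the A/B test
```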

Figure 5 shows one example where the circuit breaker prevented a significant drop in sales by suspending an A/B test within a few minutes. Without the circuit breaker, conversion rates in the purchase funnel could have dropped by approximately 20% on the cart page for at least 4 hours.

Coupang experiment platform interface and example
Figure 5. An example of an A/B experiment with significantly negative effects. This experiment was stopped within a few minutes by our circuit breaker, preventing sales losses.

Efficiently integrating experiment monitoring jobs

Our next challenge was to add new monitoring jobs without consuming large amounts of engineering resources. To maximize efficiency and flexibility, we designed a query-based monitoring system that exposes all metrics data from MySQL to a separate time series database. By leveraging this new monitoring system, monitoring jobs with complex formulas could be transformed into simple queries.

The detailed architecture of the monitoring system can be found in Figure 6. First, the time series database periodically collects data from the metric-exposer, which contains A/B test metric data. Then, the rule-based monitoring system runs queries every minute to detect A/B tests that meet certain criteria. The rule-based monitoring system passes the detected A/B tests to the message-generator, which acts according to the user config, such as notifying the test owners or halting the A/B test.

New monitoring system of the Coupang experiment platform for A/B testing
Figure 6. The new monitoring system of our experiment platform

With the new monitoring system, engineers can write a simple query like the one below to add a monitoring alert.

  • AVG(test_running_time{status='RUNNING'}) by (experiment_id, result_date) > 1000000
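For illustration, the rule-based evaluation described above could be implemented as a small loop like the one below. It assumes the time series database exposes a Prometheus-style HTTP query endpoint and that the message-generator accepts a simple JSON payload; the URLs, rule names, and payload shape are all assumptions, not the actual internal system.

```python
import time

import requests

TSDB_QUERY_URL = "http://tsdb.internal:9090/api/v1/query"      # assumed Prometheus-style query API
MESSAGE_GENERATOR_URL = "https://monitoring.internal/alerts"   # hypothetical message-generator

# Hypothetical rule set: alert when an experiment has been running too long.
RULES = {
    "long_running_experiment":
        "AVG(test_running_time{status='RUNNING'}) by (experiment_id, result_date) > 1000000",
}


def evaluate_rules_once() -> None:
    for rule_name, expr in RULES.items():
        resp = requests.get(TSDB_QUERY_URL, params={"query": expr}, timeout=10)
        resp.raise_for_status()
        for series in resp.json()["data"]["result"]:
            # Forward each matching experiment to the message-generator, which
            # notifies the owners or halts the A/B test per user config.
            requests.post(MESSAGE_GENERATOR_URL, json={
                "rule": rule_name,
                "labels": series["metric"],   # e.g. {"experiment_id": "...", "result_date": "..."}
                "value": series["value"][1],
            }, timeout=10)


if __name__ == "__main__":
    while True:
        evaluate_rules_once()
        time.sleep(60)  # the monitoring system runs its queries every minute
```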

Scaling A/B tests internationally

Finally, to accommodate our international expansion and business diversification, we employed Apache ZooKeeper. Here’s how we used ZooKeeper to reliably serve our needs.

  • Storing config data. ZooKeeper contains all the configuration data we need for A/B testing, such as platform and app version, country information, A/B group percentages, and so on. When a test owner changes the config on the experiment platform, the change is immediately propagated to 10,000 servers thanks to ZooKeeper, which improved the performance and network latency of local A/B test randomization (see the sketch after this list).
  • Stabilizing leader elections. We added ZooKeeper observer nodes in each region to handle user connections. This frees up ZooKeeper leaders and followers to focus on leader election without user connections, since a stable leader election is critical to reliably running ZooKeeper clusters.
  • Guaranteeing network bandwidth. Network bandwidth between ZooKeeper nodes is a key factor with our data volume. At boot-up time, more than 100 GB of data is sent, and if network bandwidth is not guaranteed, an incident may occur. To prevent incidents due to massive data volumes, we used a network bandwidth guaranteed instance type.
  • Setting up for scalability. As the number of connections rapidly increased with our business growth, we needed to scale out ZooKeeper nodes in a timely manner. We checked the capacity of the ZooKeeper cluster in advance and found that with our usage pattern, each user connection took around 3 MB. We used this memory usage per connection and the number of connections to set JVM options for robust scalability.
  • Improving network latency. For network latency, we set up SDKs to find the nearest observer nodes in the same region based on virtual data center (VDC) config.
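As a minimal sketch of the config-propagation pattern, a server could watch the experiment config through the open-source kazoo client as shown below; the hosts, znode path, and payload format are assumptions made for the example.

```python
import json

from kazoo.client import KazooClient

# Connect to nearby observer nodes (hypothetical hosts); observers serve client
# reads without participating in leader election.
zk = KazooClient(hosts="zk-observer-1.internal:2181,zk-observer-2.internal:2181")
zk.start()

CONFIG_PATH = "/experiments/config"   # hypothetical znode holding A/B test config

current_config = {}


@zk.DataWatch(CONFIG_PATH)
def on_config_change(data, stat):
    """Called once at registration and again whenever the znode changes, so
    config updates reach every server without polling."""
    global current_config
    if data is not None:
        current_config = json.loads(data.decode("utf-8"))
        print("Config updated to version", stat.version)
```

Each server can then randomize A/B assignments locally against current_config, which is why fast, reliable propagation matters.
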
How Coupang uses ZooKeeper for international A/B testing
Figure 7. The modified ZooKeeper architecture for the experiment platform. With this new architecture, we can service multiple regions in a reliable and robust manner.

Next steps

We want to prepare for our international customers and business endeavors by improving our experiment implementation. Our next goal is to reduce implementation complexity by developing feature flags, dynamic config override, and dynamic targeting.

Conclusion

In this post, we discussed the improvements we made to our data platform over the past years to handle our ever-growing customer base and data volume. Check out part 2 of this series, which discusses our analytics platform in more detail.

If building a data platform at a large data-driven company like Coupang excites you, see our open positions.
