Eats data platform: Empowering businesses with data

How Coupang Eats built a configuration-driven data ingestion, processing, and utilization system — Part 1

Coupang Engineering
Coupang Engineering Blog
8 min read · Aug 26, 2022


By Fred Fu

This post is also available in Korean.

Online food ordering and delivery is an extremely competitive market, where speed is key to survival. Coupang Eats (referred to simply as Eats) is the food delivery subsidiary of Coupang, the South Korean e-commerce giant.

In this post, we will elaborate on how the Eats data platform team built a generic and configuration-driven data processing system that accelerates our growth by automating critical business operations.

Table of contents

· Introduction
· Architecture of Eats data platform
· Data processing pipelines
Non-real-time
Near real-time (NRT)
Pure real-time
· Conclusion

Introduction

Eats is a relatively new business that launched in late 2019. At first, we were busy setting up operations and making sure everything ran smoothly. Now, our focus is not only on keeping our applications stable, but also on making them run more efficiently and intelligently.

This means using data automation and complex machine learning (ML) models to accelerate the business, serving more customers at lower engineering cost. For example, we currently train ML models for delivery time estimation, automatically run promotions for customers based on real-time user tags, and create dynamic data services to support data visualizations and metric calculations. You can find some details about how we use data science to overcome food delivery challenges in our previous post.

The foundation of such smart and automated operations is a robust, comprehensive data processing platform that can scale with our customer growth and back all our business needs, whether they be real-time analytics or offline data serving.

In the early stages of our business, we did not have a centralized data platform. Training and serving ML models and other data science services ran on a team-by-team basis with little coordination. Our pain points included inefficient feature engineering, slow metrics monitoring and alerting, a lack of complex data visualizations, and more. Not only was this inefficiency a waste of our engineering resources, but it also became a large bottleneck in our expansion efforts. Follow our account for our next post, where we discuss these pain points in more detail.

Architecture of Eats data platform

A data platform is responsible for managing the lifecycle of the data processing flow from beginning to end. A typical data processing lifecycle is composed of the following stages:

  1. Data ingestion is the first stage, where data is taken in from various sources. To ensure smooth data flow in the following stages, data is prioritized and categorized during this stage.
  2. Data pre-processing consists of filling empty values, running standard formatting, checking for data quality, filtering data, and more. The main purpose of pre-processing is to prepare the data for data science and machine learning tasks.
  3. Data processing involves turning raw data into the desired output types needed for complex data analysis. This stage also includes sinking data to the appropriate data stores.
  4. Data utilization generates analytics from the processed data. In this final stage, data values are gathered and fed into a wide range of Data-as-a-Service (DaaS) offerings that help solve business issues intelligently.
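
To make the pre-processing and processing stages concrete, here is a minimal PySpark sketch. The table names, columns, and quality rule are hypothetical stand-ins chosen for illustration, not the platform's actual schema or code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("eats-preprocess-example").getOrCreate()

# Stage 1 (ingestion): read a raw orders table (hypothetical name and schema).
raw = spark.table("raw_db.eats_orders")

# Stage 2 (pre-processing): fill empty values, standardize formats, check quality.
clean = (
    raw
    .fillna({"delivery_fee": 0})                          # fill empty values
    .withColumn("order_ts", F.to_timestamp("order_ts"))   # standard formatting
    .filter(F.col("order_id").isNotNull())                # basic quality filter
)

# Stage 3 (processing): turn raw rows into an analysis-ready output and sink it.
daily_merchant_orders = (
    clean
    .groupBy("merchant_id", F.to_date("order_ts").alias("order_date"))
    .agg(F.count("*").alias("order_cnt"), F.sum("delivery_fee").alias("fee_sum"))
)
daily_merchant_orders.write.mode("overwrite").saveAsTable("dw_db.daily_merchant_orders")
```

Stage 4, utilization, would then read tables like this one to feed dashboards and other DaaS endpoints.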

The Eats data platform follows the four steps above, and you can see its detailed architecture in Figure 1. The platform is a one-stop system that can support all our data processing needs in an efficient manner. Business analysts (BA), data engineers, and data scientists alike use our flexible platform as an integrated system that provides for all their diverse data needs.

Figure 1. The overall architecture of the Eats data platform

Data processing pipelines

Food delivery is a complex business because it is a three-sided marketplace of customers, Eats delivery partners (EDP), and merchants, and it requires many real-time calculations. For instance, assigning an order to a delivery partner must be completed in a matter of seconds for the fastest delivery to the customer. However, other tasks, such as segmenting users for targeted ads, are not as time sensitive and do not have to be conducted in real time.

Our data platform was designed to accommodate tasks with such drastically different time requirements. In this section, we look at the platform's data processing pipelines in more detail. Its calculation engines are responsible for data processing in non-real-time, near real-time, and pure real-time.

Figure 2. The data processing pipelines

Non-real-time

  • Time efficiency: ≥ 1 hour

The non-real-time pipeline runs in batch mode with a job scheduler. This pipeline supports ML-related feature production, user profiling and tag generation, and data visualization.

Before we developed our data platform, the biggest pain point in the non-real-time pipeline was pushing offline features and signals to online storage. Uploading the massive processed batch datasets was slow, inefficient, and costly to our systems.

To solve this pain point, we developed a configuration-driven pipeline to accelerate the process. With a configuration-driven approach, engineers can simply define new configurations to add a new data source, making the process easy and hands-free. In fact, it only takes the three simple steps defined below, followed by a rough sketch of what such a configuration might look like.

  1. Define data sync information in our metadata management system. We can even define an entire feature group, which can contain many predefined features for a specific business scenario and the mapping relationships between features and Hive table columns.
  2. Use the generic Spark SDK, which reads the Hive data and syncs the features to the online feature store according to the predefined metadata and mapping configuration.
  3. Create a Spark job in the job scheduler that calls the SDK to sync the data to our online storage automatically.
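
For illustration only, the sketch below shows what such a sync configuration and job might look like. The metadata fields, table names, and the commented-out feature store client call are hypothetical stand-ins; the actual work is handled by our internal Spark SDK.

```python
from pyspark.sql import SparkSession

# Hypothetical sync configuration, analogous to what is registered in the
# metadata management system: a feature group mapped to Hive table columns.
SYNC_CONFIG = {
    "feature_group": "merchant_daily_stats",
    "hive_table": "dw_db.daily_merchant_orders",
    "key_column": "merchant_id",
    "feature_mapping": {           # feature name -> Hive column
        "order_cnt_1d": "order_cnt",
        "fee_sum_1d": "fee_sum",
    },
}

def sync_features(spark: SparkSession, config: dict) -> None:
    """Read the configured Hive columns and push them to the online feature store."""
    columns = [config["key_column"], *config["feature_mapping"].values()]
    df = spark.table(config["hive_table"]).select(*columns)

    # The real SDK would batch-write to the online store; this loop only shows
    # the shape of the sync. feature_store.write_row() is a hypothetical method.
    for row in df.toLocalIterator():
        feature_values = {
            feature: row[column]
            for feature, column in config["feature_mapping"].items()
        }
        # feature_store.write_row(config["feature_group"], row[config["key_column"]], feature_values)
        print(config["feature_group"], row[config["key_column"]], feature_values)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("feature-sync-example").getOrCreate()
    sync_features(spark, SYNC_CONFIG)
```

The scheduler then only needs to run a job like this at the desired cadence; adding a new data source is a matter of registering a new configuration entry.
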
Figure 3. The non-real-time data pipeline

Near real-time (NRT)

  • Time efficiency: ≥ 30 seconds

Although real-time data streaming is supported by various frameworks, it places a heavy load on the data infrastructure and requires deep technical expertise to configure. Real-time data streaming also lacks the flexibility to support changing business requirements.

Because of such disadvantages, we designed a near real-time (NRT) pipeline which takes advantage of high-performing OLAP engines. The NRT pipeline can generate signals close to real-time through scheduled jobs that execute SQL scripts. The NRT pipeline supports real-time feature generation for ML prediction, real-time user tags building, and real-time business metric dashboarding and alerting.

The NRT engine, like the non-real-time pipeline, is a generic, configuration-driven pipeline. It can support various types of business through SQL, a language familiar to data engineers, data scientists, and BAs. In the half year since we put the NRT pipeline into service, the efficiency of our feature production has improved significantly.

Here’s how our internal users use the NRT pipeline; a rough sketch of such a job follows the list:

  1. Ingest upstream Kafka messages into the OLAP engine, or read Hive table data from cloud storage.
  2. Execute OLAP engine SQL at scheduled intervals (30 seconds to 1 hour). During this step, wide tables are created by joining multiple table sources ingested from Kafka or cloud storage, both of which store real-time data.
  3. Using the wide tables generated in the previous step, the data scientist creates scheduled OLAP engine SQL jobs to generate near real-time metrics and signals.
  4. The job scheduler executes the SQL at the defined interval and sends the produced signals to Kafka for downstream jobs to consume, or writes them to data storage.
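
As a hedged sketch of what one such scheduled job could look like, the snippet below recomputes a hypothetical 30-minute merchant order count every 30 seconds. The post does not name the OLAP engine, so the table, the query, and the olap_client / kafka_producer helpers are illustrative stand-ins rather than the platform's actual interfaces.

```python
import time

# Hypothetical NRT feature: each merchant's order count over the last 30 minutes,
# recomputed from a wide table in the OLAP engine.
NRT_FEATURE_SQL = """
SELECT merchant_id,
       COUNT(*) AS order_cnt_30m
FROM   rt_orders_wide
WHERE  order_ts >= now() - INTERVAL 30 MINUTE
GROUP  BY merchant_id
"""

def run_once(olap_client, kafka_producer) -> None:
    """Execute the SQL once and publish each produced signal downstream."""
    # olap_client.query() and kafka_producer.send() stand in for the OLAP engine
    # client and Kafka producer that the scheduled job would actually use.
    for row in olap_client.query(NRT_FEATURE_SQL):
        kafka_producer.send("nrt_merchant_signals", value=row)

def run_scheduled(olap_client, kafka_producer, interval_sec: int = 30) -> None:
    """Naive scheduling loop; in practice the platform's job scheduler handles this."""
    while True:
        run_once(olap_client, kafka_producer)
        time.sleep(interval_sec)
```
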
Figure 4. The near real-time data pipeline

Pure real-time

  • Time efficiency: < 1 second

The NRT pipeline could cover about 80% of our real-time feature use cases in production. However, it could not support very high-efficiency, low-latency scenarios such as flood detection and risk control.

Such urgent tasks relied on pure real-time data. However, the pure real-time pipeline uses Spark and another distributed processing engine, and most data users were unfamiliar with writing Spark Streaming code. It was impractical to expect them to continually write code on demand for real-time data tasks.

Again, we developed a configuration-driven pipeline that could be easily utilized for real-time data processing by all users. Below is a description of its process.

  1. Ingest the necessary data into a Kafka topic.
  2. Dynamically partition event data to different downstream operators according to the real-time feature metadata configuration, which is distributed through the broadcast system.
  3. Calculate real-time feature values with the appropriate pre-implemented aggregators, according to the feature metadata definitions. Output the calculated values to Kafka or to a downstream operator either instantly or periodically.

The pure real-time pipeline currently supports about ten frequently used statistical functions, such as SUM, COUNT, UNIQUE COUNT, and TOPN, as well as multi-dimensional statistical functions. With this pipeline, we can define data streaming in a codeless way, which has drastically reduced the cost of calculating real-time features and increased our efficiency.
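
To give a feel for this codeless definition, below is a hypothetical feature-metadata entry written as a Python dict. The field names and aggregator keywords echo the functions mentioned above but are assumptions for illustration, not the platform's actual schema.

```python
# Hypothetical metadata for two real-time features computed from one Kafka topic.
# The broadcast system would push entries like these to the streaming operators,
# which then apply the matching pre-implemented aggregator.
REALTIME_FEATURE_METADATA = [
    {
        "feature_name": "customer_order_cnt_5m",
        "source_topic": "eats_order_events",
        "key_field": "customer_id",
        "aggregator": "COUNT",          # one of SUM, COUNT, UNIQUE COUNT, TOPN, ...
        "window_seconds": 300,
        "sink_topic": "rt_customer_signals",
    },
    {
        "feature_name": "merchant_top_dishes_10m",
        "source_topic": "eats_order_events",
        "key_field": "merchant_id",
        "aggregator": "TOPN",
        "aggregator_args": {"field": "dish_id", "n": 5},
        "window_seconds": 600,
        "sink_topic": "rt_merchant_signals",
    },
]
```

Adding a new real-time feature then amounts to appending an entry like these rather than writing new streaming code.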

Figure 5. The pure real-time data pipeline

Conclusion

We examined the overall architecture of the Eats data platform and its data processing pipelines. Because our business requirements differ in time sensitivity, we set up three pipelines: non-real-time, near real-time, and pure real-time. All three are built to be configuration-driven, which allows even users with minimal coding experience to use them seamlessly.

Follow us and check out our next post, where we discuss the Data-as-a-service (DaaS) of the Eats data platform in depth.

If solving complex business problems with intricate data engineering is a challenge you want to pursue, view our open positions.


Coupang Engineering is also on Twitter. Follow us for the latest updates!
