Our backend strategy to handle massive traffic

How Coupang serves data from our microservices to customers at high availability, high throughput, and low latency — Part 1

Coupang Engineering
Coupang Engineering Blog
8 min read · Aug 12, 2022

By Gogi (Du Hyeong) Kim and Key (Ki Hyeon) Kim

This post is also available in Korean.

Compared to 2018, Coupang has quadrupled in revenue with fast growth in active users. In our never-ending mission to “Wow the Customer”, we are constantly adding new services such as Rocket Delivery and Rocket Fresh. Although such services give us an edge over our competitors, they require us to develop and maintain an increasingly complex network of data and data systems to serve our 18 million customers.

In this post, we will examine how we serve data from our databases to the Coupang e-commerce application with low latency and high availability through our core serving layer. To read about the lessons we learned from operating our core serving layer, read part 2 of this series.

Table of Contents

· Background and challenges
· Core serving layer
Architecture
Unified NoSQL data storage
Cache layer
Realtime data streaming
High availability strategies
Core serving layer template
· Conclusion

Background and challenges

Coupang microservices operating behind the product detail page
Figure 1. The data of a product comes from multiple microservices. Images and brand names have been redacted.

In addition to being a marketplace, Coupang purchases items at wholesale and sells them to customers through Rocket Delivery, our own delivery service. The concept is simple in theory, but the business and data logic involved in such services is far more complicated than that of the average e-commerce platform.

As seen in Figure 1, each product contains various types of data, each marked with a differently colored box. Each data type is managed by a separate microservice in the backend. For instance, the product image and title data are managed by the Catalog microservice, and the guaranteed delivery date data by the Stock and Fulfillment microservice.

Furthermore, each data point is personalized to the customer and can change in real time. For example, if the daily outbound capacity of a fulfillment center (FC) is met, the FC cannot take any new customer orders, and the relevant inventory and guaranteed delivery date information must be updated on the frontend for affected customers.

If every page of the Coupang application fetched data directly from the microservices, each microservice would have to guarantee high availability on its own, and commonly used business logic code would be duplicated across the frontend without centralized management. For these reasons, we needed a single microservice to handle the commonly used business logic and data.

Core serving layer

To address our complex data needs and the challenges mentioned above, the Materialization Platform team created the core serving layer. The core serving layer has two main purposes: to unify business logic code and to serve data to customer pages.

Overall, the goals of the core serving layer are to:

  • ensure high availability of 99.99% and recover as quickly as possible if an incident occurs.
  • serve data with high throughput and low latency to handle high read traffic.
  • ensure consistency and freshness of data aggregated from various sources in real time.
  • unify business logic code to reduce complexity and code redundancy on the frontend.
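To put the 99.99% availability target in perspective, the short calculation below converts an availability percentage into an allowed-downtime budget. The function name and the 30-day and 365-day windows are illustrative choices, not a description of how Coupang actually measures availability.

```python
def downtime_budget_minutes(availability: float, window_days: int) -> float:
    """Return the minutes of downtime allowed in a window at a given availability."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # Illustrative windows: one 30-day month and one 365-day year.
    for days in (30, 365):
        budget = downtime_budget_minutes(0.9999, days)
        print(f"{days} days at 99.99%: {budget:.2f} minutes of downtime allowed")
```

At 99.99%, the budget works out to roughly 4.3 minutes per 30 days, or about 52.6 minutes per year, which is why quick recovery matters as much as preventing incidents.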

Architecture

A simplified architecture of Coupang core serving layer
Figure 2. A simplified view of the architecture of the core serving layer

As shown in Figure 2, the core serving layer is a microservice that is called by the customer pages of the application to provide the necessary data and business logic. In the following sections, we will discuss the features and components of our core serving layer and how they work together to serve data with high availability, high throughput, and low latency.

Unified NoSQL data storage

At Coupang, product domain information is managed by separate microservices in the backend. As shown in Figure 1, images and titles are provided by the Catalog Team, prices by the Pricing Team, stock information by the Fulfillment Team, and so on. By separating the data by microservice, we achieve a high read throughput.

However, the core serving layer is not responsible for keeping up with every data update in all the microservices. Instead, each microservice in the backend sends its updated data to the queue, and the data is saved to the common storage, a NoSQL database. The NoSQL database allows us to leverage eventual consistency and fetch data from all microservices in a single read. By integrating this unified storage into our core serving layer, we were able to significantly reduce I/O and achieve high throughput.
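The idea of a single read across domains can be sketched as follows. This is a minimal in-memory illustration, not the actual implementation: the queue, the dictionary standing in for the NoSQL database, the event shape, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from queue import Queue

# Hypothetical update event: (product_id, source_domain, changed_fields).
update_queue: Queue = Queue()

# Stand-in for the NoSQL common storage: one document per product,
# with one sub-document per source domain.
common_storage: dict = defaultdict(dict)


def publish(product_id: str, domain: str, fields: dict) -> None:
    """Called by a backend microservice when its data changes."""
    update_queue.put((product_id, domain, fields))


def drain_queue() -> None:
    """Apply queued updates to the common storage (eventually consistent)."""
    while not update_queue.empty():
        product_id, domain, fields = update_queue.get()
        common_storage[product_id].setdefault(domain, {}).update(fields)


def read_product(product_id: str) -> dict:
    """A single lookup returns the data of every domain for one product."""
    return common_storage.get(product_id, {})


if __name__ == "__main__":
    publish("p1", "catalog", {"title": "Coffee beans"})
    publish("p1", "pricing", {"price": 12900})
    publish("p1", "fulfillment", {"in_stock": True})
    drain_queue()
    print(read_product("p1"))
```

Until `drain_queue` runs, readers see the previous state of the document, which is the eventual-consistency trade-off described above.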

Cache layer

A simplified architecture of Coupang core serving layer with cache
Figure 3. In addition to the common storage, we added a cache layer to focus on high throughput and low latency.

Although the common storage served as a persistent store and provided stability, we added a read-through cache layer to process more read traffic with high throughput and low latency. Thanks to our high-performance cache layer, we were able to serve data with ten times the throughput and one-third of the latency of the common storage.
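For readers unfamiliar with the pattern, a read-through cache can be sketched in a few lines. The class below is a generic illustration with hypothetical names; it uses an in-memory dictionary and omits the eviction, expiry, and concurrency handling a production cache would need.

```python
from typing import Callable, Optional


class ReadThroughCache:
    """Serve from the cache when possible; otherwise load from storage and remember it."""

    def __init__(self, load_from_storage: Callable[[str], Optional[dict]]):
        self._load = load_from_storage
        self._entries: dict = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[dict]:
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        # Cache miss: fall through to the slower persistent store.
        self.misses += 1
        value = self._load(key)
        if value is not None:
            self._entries[key] = value
        return value


if __name__ == "__main__":
    storage = {"p1": {"title": "Coffee beans"}}  # stand-in for the common storage
    cache = ReadThroughCache(storage.get)
    cache.get("p1")  # miss: loaded from storage
    cache.get("p1")  # hit: served from the cache
    print(cache.hits, cache.misses)
```

Only the first request for a key pays the cost of reading the persistent store; later requests are served from the cache.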

But there is one caveat to caching: data updates in the common storage are not always reflected in the cache layer. As a result, the cache layer may serve stale data to customer pages. For example, even if the catalog microservice changes a product image and saves it in the common storage, this change is not immediately registered in the cache layer.

To solve this issue, we implemented cache invalidation logic. Whenever data is updated in the common storage, a notification is sent to the notification queue, and these signals are used to replace the stale data in the cache layer with the most up-to-date data. This mechanism ensures that the data in the common storage and the cache layer is 99.99% consistent within minutes.
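The invalidation flow can be sketched roughly as below. Everything here is a simplified assumption: the dictionaries stand in for the common storage and the cache layer, the queue carries only the changed key, and the choice to refresh only keys that are already cached is one possible design, not necessarily the one used in production.

```python
from queue import Queue

common_storage: dict = {}  # stand-in for the NoSQL common storage
cache: dict = {}           # stand-in for the cache layer
notification_queue: Queue = Queue()


def write_to_storage(key: str, value: dict) -> None:
    """Persist an update and signal that any cached copy is now stale."""
    common_storage[key] = value
    notification_queue.put(key)


def process_notifications() -> int:
    """Replace stale cache entries with fresh data; return how many were refreshed."""
    refreshed = 0
    while not notification_queue.empty():
        key = notification_queue.get()
        if key in cache:
            cache[key] = dict(common_storage[key])
            refreshed += 1
    return refreshed


if __name__ == "__main__":
    common_storage["p1"] = {"image": "old.png"}
    cache["p1"] = {"image": "old.png"}
    write_to_storage("p1", {"image": "new.png"})
    print(cache["p1"])  # still stale until the notification is processed
    process_notifications()
    print(cache["p1"])  # now matches the common storage
```

The window between `write_to_storage` and `process_notifications` is where stale reads can happen, which is why consistency is described in terms of minutes rather than as immediate.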

Realtime data streaming

The read-through cache layer reflects updates within minutes, but some data needs to be updated in a matter of seconds. One example is stock information. If an out-of-stock product is not immediately updated on the customer end, a customer may purchase the item only to find out later that it is sold out and that they must get a refund instead of the product they wanted. This significantly damages the customer experience and may lower their overall confidence in Coupang.

A simplified architecture of Coupang core serving layer with real-time cache
Figure 4. The real-time cache layer of the core serving layer

To serve such data to our customers without delay, we introduced real-time data stream processing. As illustrated in Figure 4, the real-time data updater reads the changed data from the queue and immediately writes the new data to a separate real-time cache layer. The core serving layer is engineered to read the cache and real-time cache layers simultaneously and to serve the most recent data found in either layer, further improving latency. This second, real-time cache layer ensures high throughput and delivers real-time data to the customer pages.
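Reading both layers at once and keeping the fresher result could look something like the sketch below. The post does not say how recency is decided, so the `updated_at` timestamp, the `Versioned` wrapper, and the use of a thread pool are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Versioned:
    value: dict
    updated_at: float  # hypothetical: epoch seconds of the last update


def read_latest(key: str, cache: dict, realtime_cache: dict) -> Optional[Versioned]:
    """Query both cache layers concurrently and return the most recent entry."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(layer.get, key) for layer in (cache, realtime_cache)]
        results = [future.result() for future in futures]
    candidates = [entry for entry in results if entry is not None]
    return max(candidates, key=lambda entry: entry.updated_at, default=None)


if __name__ == "__main__":
    cache = {"p1": Versioned({"stock": 5}, updated_at=100.0)}
    realtime_cache = {"p1": Versioned({"stock": 0}, updated_at=160.0)}
    print(read_latest("p1", cache, realtime_cache).value)
```

Because the two reads run in parallel, the total latency is bounded by the slower of the two lookups rather than by their sum.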

High availability strategies

The NoSQL common storage and the two cache layers focus on high throughput and low latency. However, the most important feature of our core serving layer is high availability, which means minimizing incidents that could undermine the customer experience under any circumstances.

To achieve high availability, each network I/O point is wrapped with a circuit breaker, which isolates an incident at that point and prevents it from cascading. Then, we manually redirect the I/O of the affected component elsewhere.
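A generic circuit breaker can be sketched as follows. This is a textbook version of the pattern with hypothetical names and thresholds, not the library or settings used in the core serving layer, and it does not cover the manual redirection step.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None                      # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of waiting on a struggling dependency, which is what keeps a local incident from spreading to the rest of the system.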

Another important mechanism we applied for availability is the critical serving path (CSP). Coupang has many customer pages, but only a few have a dramatic impact on the customer experience and sales when they fail. The home, search, and checkout pages are examples of critical pages that must remain available at all costs.

To keep such critical pages highly available, we separated them into a CSP cluster and the non-critical pages into a non-critical serving path (N-CSP) cluster. Because the clusters are independent of one another, an incident in the N-CSP cluster does not affect the CSP cluster. Conversely, when there is an incident in the CSP cluster, the customer pages facing the CSP cluster dynamically shift to face the N-CSP cluster, so the critical customer pages are not interrupted.

Coupang core serving layer for high availability with clusters
Figure 5. The CSP and N-CSP clusters aid in delivering high availability
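The routing rule described above can be expressed as a small function. The page names come from the examples in this post, while the function name, the cluster labels, and the boolean health flag are hypothetical simplifications of whatever health checking a real deployment would use.

```python
# Pages named in this post as examples of critical pages.
CRITICAL_PAGES = {"home", "search", "checkout"}


def choose_cluster(page: str, csp_healthy: bool) -> str:
    """Pick the serving cluster for a page given the current CSP health."""
    if page in CRITICAL_PAGES:
        # Critical pages normally use the CSP cluster and fall back
        # to the N-CSP cluster during a CSP incident.
        return "csp" if csp_healthy else "n-csp"
    # Non-critical pages always stay on the N-CSP cluster, so their
    # incidents cannot spill over into the CSP cluster.
    return "n-csp"


if __name__ == "__main__":
    print(choose_cluster("checkout", csp_healthy=True))
    print(choose_cluster("checkout", csp_healthy=False))
```

Note that the fallback is one-directional: non-critical traffic never moves onto the CSP cluster, which preserves the isolation between the two.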

Core serving layer template

Up until now, we’ve discussed how our core serving layer works and how the microservices of the product domain serve data to the customer pages.

However, we have many other domains that also serve data to Coupang customers. For example, the order domain serves order and shipping information to inform customers about status updates of their orders. Moreover, our other applications such as Coupang Eats have the same data serving needs.

If we were to have separate codebases that perform the same functions as the core serving layer across different domains and applications, code would be redundant and management difficult. We wanted to create a standard way to adapt the core serving layer to multiple domains.

To simplify the implementation of the core serving layer, we created a template that shares its core business logic but can be configured to the specific needs of each domain. Using our template, users simply input basic configuration information, such as the addresses of the common storage, cache, and real-time cache layers, to create a customized core serving layer for their domains.

Domains in Coupang and Coupang Eats applications utilizing the data serving template to handle massive traffic
Figure 6. Instances of domains in our Coupang and Coupang Eats applications utilizing the core serving template
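To make the template idea concrete, here is a rough sketch of what configuring a serving layer per domain might look like. The field names, class names, and example addresses are all invented for illustration; the actual template and its configuration format are not described in this post.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ServingLayerConfig:
    """Hypothetical per-domain settings supplied by the template user."""
    domain: str
    common_storage_address: str
    cache_address: str
    realtime_cache_address: str

    def __post_init__(self):
        # Reject empty values early so misconfiguration fails at startup.
        for field in fields(self):
            if not getattr(self, field.name):
                raise ValueError(f"missing configuration value: {field.name}")


class CoreServingLayer:
    """Shared logic lives in one class; only the configuration differs per domain."""

    def __init__(self, config: ServingLayerConfig):
        self.config = config

    def describe(self) -> str:
        c = self.config
        return (f"{c.domain}: storage={c.common_storage_address}, "
                f"cache={c.cache_address}, realtime={c.realtime_cache_address}")


if __name__ == "__main__":
    # Placeholder addresses, not real endpoints.
    order = CoreServingLayer(ServingLayerConfig(
        domain="order",
        common_storage_address="storage.order.example:9042",
        cache_address="cache.order.example:6379",
        realtime_cache_address="rt-cache.order.example:6379",
    ))
    print(order.describe())
```

Each domain gets its own instance with its own addresses, while fixes to the shared logic reach every domain at once.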

Conclusion

The core serving layer was our solution for providing a unified and systematic approach to serving data from multiple microservice systems to the Coupang application pages at high availability, high throughput, and low latency. To ensure reusability and standardization across domains, we also distributed a core serving layer template that provides the data serving foundation and unified business logic.

Check out part 2 of this series, where we discuss some of the challenges we overcame while operating the core serving layer.

If working with a complex microservice architecture excites you, explore open positions at Coupang to rocket-start your career growth.

Coupang Engineering is also on Twitter. Follow us for the latest updates!
