Lessons learned from handling massive traffic with cache

Solutions to the complex challenges Coupang faced in serving data to application pages amid growing user traffic — Part 2


By Gogi (Du Hyeong) Kim and Key (Ki Hyeon) Kim

This post is also available in Korean.

In 2021 alone, the number of active customers on our Coupang e-commerce app grew by 21%. As server engineers, our job is to make sure every customer gets the same uninterrupted, “Wow” experience at Coupang. To ensure data from our backend services is securely delivered to customers with high availability, high throughput, and low latency, we developed the core serving layer as a microservice in our ecosystem. For more details on the core serving layer, refer to our previous post.

In this post, we will briefly summarize the role of the cache layers within the core serving layer, and walk through the interesting and unique challenges they posed for our engineers.

Table of Contents

· Cache layers in the core serving layer
· Lesson 1: Managing partial cache node failure
· Lesson 2: Quickly recovering failed nodes
· Lesson 3: Balancing traffic between nodes
· Lesson 4: Handling traffic spikes using local cache
· Conclusion

Cache layers in the core serving layer

The core serving layer is composed of two cache layers: a read-through cache layer and a real-time cache layer. To handle the massive amount of customer traffic at Coupang, each cache layer is composed of 60 to 100 nodes and processes up to 100 million requests per minute.

In the core serving layer, 95% of incoming traffic is processed by the cache layers instead of the common storage layer, our persistent storage. In other words, the availability of the core serving layer is highly dependent on that of the cache layers, making them an integral part of the flawless customer experience we strive to provide.

Figure 1. The overall architecture of the core serving layer

Lesson 1: Managing partial cache node failure

The first challenge we faced was managing partial node failures in the cache clusters. Individual node failures are common in the cache layers, which run on cloud servers, and sometimes multiple nodes fail simultaneously due to issues on the underlying host. Node failures can last anywhere from one minute to ten minutes. We needed a way to minimize the destabilizing impact these inevitable node failures had on our system.

Although we had a circuit breaker mechanism that could isolate the component where an incident occurred, a single node failure affected only a small portion of the overall traffic, and the circuit breaker was not sensitive enough to detect it.

Figure 2. Existing methods we had to handle node failures

We first programmed the cache cluster to change the cluster topology when a failed node was discovered during topology refresh. However, this process took several minutes, during which unprocessed requests could time out or pile up in the queue. In addition, the changed cluster topology sometimes failed to match the actual TCP connections in use, so traffic was still directed to failed nodes. Like the circuit breaker, this mechanism did not meet our high availability and low latency demands.

Fast detection of node failure

To enhance system stability, we needed a robust way to re-route traffic away from failed nodes. In addition to monitoring topology refresh, we decided to monitor TCP connection speeds. Because the typical response time of the cache layer is mere milliseconds, any connection that received no response for one second was marked as problematic, and the connections to that node were automatically closed.

By monitoring TCP connection speeds, we were able to guarantee the discovery of failed nodes within seconds, in contrast to the tens of seconds or even minutes it previously took.
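
The idea can be illustrated with a minimal sketch. The watchdog below, a hypothetical example rather than our production code, records the time of the last response per connection and closes any connection that has been silent longer than a threshold; the one-second threshold matches the description above, while the class and method names are assumptions for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: tracks the last response time per cache node connection
// and closes any connection that has been silent past the threshold.
// A production version would also track requests in flight so that
// idle-but-healthy connections are not closed by mistake.
public class ConnectionWatchdog {
    private static final long STALE_THRESHOLD_MS = 1_000; // ~1 second, as described above

    private final Map<String, Long> lastResponseAt = new ConcurrentHashMap<>();
    private final Map<String, Closeable> connections = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public ConnectionWatchdog() {
        // Check every tracked connection every 100 ms.
        scheduler.scheduleAtFixedRate(this::closeStaleConnections, 100, 100, TimeUnit.MILLISECONDS);
    }

    public void register(String nodeId, Closeable connection) {
        connections.put(nodeId, connection);
        lastResponseAt.put(nodeId, System.currentTimeMillis());
    }

    // Called by the client whenever a response arrives from a node.
    public void onResponse(String nodeId) {
        lastResponseAt.put(nodeId, System.currentTimeMillis());
    }

    private void closeStaleConnections() {
        long now = System.currentTimeMillis();
        lastResponseAt.forEach((nodeId, last) -> {
            if (now - last > STALE_THRESHOLD_MS) {
                lastResponseAt.remove(nodeId);
                Closeable connection = connections.remove(nodeId);
                if (connection != null) {
                    try {
                        connection.close(); // traffic is re-routed to healthy nodes
                    } catch (IOException ignored) {
                        // best-effort close
                    }
                }
            }
        });
    }
}
```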

Figure 3. By monitoring TCP connections, we were able to detect failed nodes in a matter of seconds.

Lesson 2: Quickly recovering failed nodes

In addition to detecting node failures, we wanted to quickly recover failed nodes to maintain high throughput and availability.

When a node fails, it must go through a full sync process to recover. During the full sync, write commands arriving at the master node are saved in a buffer. However, because the incoming traffic to the cache layer was higher than what the buffer could hold, we experienced frequent failures during the full sync and node recovery process.

As short-term solutions, we resorted to pausing cache invalidation for five to ten minutes during recovery or replacing the old cluster with an entirely new one. These workarounds required heavy manual intervention from engineers, and we needed a robust, long-term solution.

Finding the root cause

At first, we estimated the number of write commands that would occur during the full sync process to find an appropriate buffer size. We ran tests with various buffer sizes based on these estimations, but the results varied from test to test without definitive success.
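
The sizing logic itself is simple back-of-the-envelope arithmetic. The sketch below uses purely illustrative numbers (not our actual traffic figures) to show how a buffer estimate follows from the write rate, the average command size, and the expected full sync duration:

```java
// Back-of-the-envelope buffer sizing sketch with purely illustrative numbers,
// not Coupang's actual traffic figures.
public class BufferSizeEstimate {
    public static void main(String[] args) {
        long writeCommandsPerSecond = 50_000; // assumed write rate hitting the master during full sync
        long avgCommandSizeBytes = 512;       // assumed average size of one buffered write command
        long fullSyncDurationSeconds = 120;   // assumed time for a replica to load the snapshot

        long requiredBufferBytes = writeCommandsPerSecond * avgCommandSizeBytes * fullSyncDurationSeconds;
        System.out.printf("Replication buffer needs roughly %d MB%n", requiredBufferBytes / (1024 * 1024));
    }
}
```

The problem, as we found out, was that the write rate during full sync was far higher than any such estimate predicted.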

In the process, we found that the number of write commands greatly exceeded our estimations during the full sync. When we dug deeper into the issue, we discovered that new replicas created during the full sync process were serving empty datasets to the core serving platform. This caused the system to assume the cache cluster did not have the appropriate data, ultimately resulting in a rapid increase of write commands.

Figure 4. The root cause of our full sync failures was traffic going to empty replicas

Blocking traffic to defective replicas

Now that we knew the root cause of the cluster recovery failure, we could formulate a long-term solution.

We used the data size or status information of the replica to determine whether it could be used to serve data. If the information showed the replica was not available for serving, we blocked all incoming traffic to that specific replica. By simply blocking traffic to defective replicas, we eliminated the rapid increase in write commands and solved the full sync failures. We applied this customization to all our cache clusters, greatly improving stability.
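
As a rough illustration of that guard (assuming a Redis-style replica that exposes its key count and sync state; the class and field names here are hypothetical, not our actual implementation):

```java
// Hypothetical guard that decides whether a replica may serve read traffic.
// The fields stand in for whatever data size or status information the cache exposes.
public class ReplicaGuard {

    public static class ReplicaStatus {
        final long keyCount;          // how much data the replica currently holds
        final boolean syncInProgress; // whether a full sync is still loading data

        public ReplicaStatus(long keyCount, boolean syncInProgress) {
            this.keyCount = keyCount;
            this.syncInProgress = syncInProgress;
        }
    }

    private static final long MIN_EXPECTED_KEYS = 1_000; // assumed floor; an "empty" replica falls well below this

    // Returns true only if the replica is safe to serve reads.
    public boolean canServe(ReplicaStatus status) {
        if (status.syncInProgress) {
            return false; // still loading the snapshot; it would serve an empty dataset
        }
        return status.keyCount >= MIN_EXPECTED_KEYS;
    }
}
```

If the check fails, the client simply keeps that replica out of its routing until the next check, so reads never hit an empty dataset and the surge of write commands never starts.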

To test our solution, we conducted an experiment on two test cache clusters with 60 nodes handling peak traffic requests. The first cluster was our original cluster configuration and the second was the modified cluster configuration where traffic was blocked to defective replicas.

We shut down 12 nodes in each cluster to test dramatic failure conditions. The first cluster showed P95 latency spikes of around 500 to 1,000 milliseconds, along with application CPU instability caused by minor and full garbage collections (GC). In comparison, the second cluster with our customization showed P95 latency spikes of less than 100 milliseconds and stable CPU utilization with only minor GCs, which were resolved without any other errors.

Figure 5. A comparison of the original and customized clusters

Lesson 3: Balancing traffic between nodes

The third challenge we faced with cache nodes came from imbalanced traffic. The amount of traffic processed by each node differed by a factor of five to ten, which also led to discrepancies in CPU usage. Moreover, because we sized clusters based on the nodes receiving the most traffic, the imbalance led us to create unnecessarily large clusters for our overall traffic volume. This was an inefficient waste of resources.

Cache client reconfiguration

Figure 6. The cache client was directing traffic to the same node repeatedly.

Upon examination, we found that the imbalance in traffic among cache nodes stemmed from the cache client. The cache client found the shard and node with the fastest response time and kept directing traffic to that same node.

To solve this issue, we reconfigured the connection mode of the cache client. Rather than directing traffic to the fixed node with the fastest connection time, the cache client now randomly selects a node for every request, resulting in an even distribution of traffic.
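
A minimal sketch of the difference between the two selection strategies is shown below. The selector functions are hypothetical; in practice this was a connection-mode setting on the cache client rather than hand-written routing code.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative node selection strategies for read traffic within one shard.
public final class NodeSelection {

    // Old behavior: always pick the node that last measured fastest,
    // which pins almost all traffic to a single node per shard.
    static String pickFastest(List<String> nodes, List<Long> lastLatencyMs) {
        int best = 0;
        for (int i = 1; i < nodes.size(); i++) {
            if (lastLatencyMs.get(i) < lastLatencyMs.get(best)) {
                best = i;
            }
        }
        return nodes.get(best);
    }

    // New behavior: pick a random node for every request,
    // spreading traffic evenly across the shard.
    static String pickRandom(List<String> nodes) {
        return nodes.get(ThreadLocalRandom.current().nextInt(nodes.size()));
    }
}
```

Per-request random selection costs a little extra work on every call, which is why CPU usage rose slightly after the change, as described below.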

Figure 7. After reconfiguration of the cache client, the traffic was evenly distributed among shards.

The results of the reconfiguration are shown in the figure below. The point marked 04:30 on the x-axis is when we applied the cache client reconfiguration. As shown in the graph on the left, traffic that used to fluctuate greatly between nodes became almost equalized. As shown in the graph on the right, CPU usage increased slightly due to the dynamic request assignment. However, this was a small price to pay for traffic balance and overall system stability.

Figure 8. After reconfiguring the cache client, we saw that both the traffic and CPU usage were more evenly distributed among nodes.

Lesson 4: Handling traffic spikes using local cache

The last challenge we want to share is related to throughput. Although growing user traffic is good news for business, it may not always be welcomed by server engineers.

There have been several events at Coupang when user traffic went beyond our server capacity, such as our pre-order events for digital devices and mobile phones. Luckily for us, these events were preplanned, and we could prepare for the increases in traffic by securing additional servers in advance.

However, during COVID-19, we frequently saw unexpected spikes in traffic. Because we were unaware of the cause of these spikes, we had no choice but to secure three times the recommended server capacity for application stability.

Traffic analysis

Having three times the recommended server capacity was costly and inefficient. To get to the bottom of these sudden spikes, we stored incoming requests in a message queue and used MapReduce to analyze the user and product information of the requests on a per-minute basis.
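
Conceptually, the analysis boils down to grouping request logs into (minute, product) buckets and counting them. The sketch below shows that grouping with plain Java collections purely for illustration; the actual job ran as MapReduce over the queued request logs, and the record fields here are assumptions.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative aggregation only: in production this ran as a MapReduce job over
// queued request logs, but the grouping idea is the same.
public final class SpikeAnalysis {

    record RequestLog(Instant timestamp, String userId, String productId) {}

    // Count requests per (minute, productId) bucket to surface hot products.
    static Map<String, Long> requestsPerProductPerMinute(List<RequestLog> logs) {
        return logs.stream().collect(Collectors.groupingBy(
                log -> log.timestamp().truncatedTo(ChronoUnit.MINUTES) + "|" + log.productId(),
                Collectors.counting()));
    }
}
```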

Our analysis showed that requests from users for certain products took up a significant portion of the entire traffic. In fact, the analysis revealed that most traffic spikes during COVID-19 were related to face mask restocks!

Local cache

Figure 9. To deal with sudden spikes in traffic, we added a local cache layer that could handle up to 70% of incoming traffic during spikes.

To solve spikes in traffic due to a small number of users and products, we added a local cache layer. Although a local cache layer improves application performance, it can add challenges related to cache invalidation and increasing GC counts and times.

Because there was no separate cache invalidation process for this local cache layer, we limited the cache duration to one minute and only cached data that could be updated with eventual consistency, not strong consistency. To deal with GC issues, we saved data as byte arrays instead of DTOs. This increased serialization costs but reduced GC costs significantly.
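
A minimal sketch of such a local cache is shown below, using Caffeine purely as an example library (the post does not name the one we used), with a one-minute TTL standing in for invalidation and values kept as pre-serialized byte arrays:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;

// Illustrative local cache: one-minute expiry stands in for cache invalidation,
// and values are stored as serialized byte arrays to keep GC pressure low.
public class LocalProductCache {

    private final Cache<String, byte[]> cache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(1)) // entries quietly expire instead of being invalidated
            .maximumSize(100_000)                    // assumed bound on hot keys
            .build();

    public byte[] get(String key) {
        return cache.getIfPresent(key);
    }

    public void put(String key, byte[] serializedValue) {
        cache.put(key, serializedValue);
    }
}
```

Keeping one byte array per entry rather than a graph of DTO objects leaves far fewer live objects for the collector to trace, which is presumably where the GC savings came from, at the price of serializing and deserializing on every access.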

The local cache layer only processed around 35% of the traffic during normal traffic periods, but during sudden traffic spikes, it could cover up to 70% of the incoming traffic.

Non-blocking I/O

Lastly, to improve server stability during traffic spikes, we switched from a blocking I/O-based application to a non-blocking I/O (NIO) application. Since our existing data storage and cache systems already supported NIO, the migration was easy. With this change, we reduced CPU usage by more than 50% by using NIO threads to minimize CPU overhead.
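
As a simplified illustration of the shift (the async client interface here is hypothetical; the point is the shape of the call, not any specific API), a blocking call per request becomes a callback-driven pipeline that frees the thread while I/O is in flight:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical async cache client used only to illustrate the non-blocking call shape.
interface AsyncCacheClient {
    CompletableFuture<byte[]> get(String key);
}

public class ServingHandler {

    private final AsyncCacheClient cache;

    public ServingHandler(AsyncCacheClient cache) {
        this.cache = cache;
    }

    // Non-blocking: the calling thread registers a callback and returns immediately,
    // so a small pool of NIO threads can multiplex many in-flight requests.
    public CompletableFuture<byte[]> serve(String key) {
        return cache.get(key)
                .thenApply(value -> value != null ? value : fallbackFromStorage(key));
    }

    private byte[] fallbackFromStorage(String key) {
        // Placeholder for the persistent storage path; omitted for brevity.
        return new byte[0];
    }
}
```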

Conclusion

At Coupang, no challenge is too big or too small. We tackle all challenges with the same determined mindset of delivering excellence to our customers. The improvements we made to our core serving layer ensured that we could continue serving data to our customer pages at high availability, high throughput, and low latency.

If you are a server engineer with a knack for implementing simple solutions to complex problems, view our open positions and join us.

Coupang Engineering is also on Twitter. Follow us for the latest updates!
