Scaling to 5M RPM

Enes Turgut
Published in Trendyol Tech · Dec 1, 2022 · 6 min read

At Trendyol, our team is responsible for serving content pages and content data. We also serve content data in bulk for favorites, collections, and even checkout pages. In short, we are involved whenever you see the screens below.

What are our needs or expected loads?

As expected, we usually get more traffic during shopping seasons and event/discount periods. While we get around 1M rpm (requests per minute) on a typical day, the load goes up to about 3M rpm at peak times. At those times, we could not fully meet the requirements: our response time doubled, and we had more timeouts than we could ignore.

Our next goal was to handle at least 8M rpm with, at most, the same resources. So the problem was that we could not scale as far as we needed, let alone as far as we expected next.

Who enjoys challenges? We certainly do :)

How did we find the cause of the problem?

So we profiled the application to understand which parts of the code took the most time and used the most resources. First, we used a Java profiler to narrow down which parts to track. After that, we added some custom New Relic traces to the code. When we looked at the results, we saw that one specific part was by far the most time- and resource-consuming.

That part was the conversion logic that transforms a Couchbase document into the data we need to perform business logic and serve in bulk. It ran for every request and for every content. We started to think about whether we could run this conversion asynchronously, once per content, instead of on every request. For most of it, the answer was: YES!

It is not enough to do your best; you must know what to do, and then do your best. — W. Edwards Deming

Can we handle the data asynchronously?

We are using Couchbase as the data source. When we looked at how we could convert our Couchbase data asynchronously, we saw that Couchbase has a Database Change Protocol (DCP). DCP gives us a stream of mutations, so we can listen to our old data source, do the conversion asynchronously, and write the result to our new data source. You can check Ahmet Hatipoglu's article if you are curious about the details.
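As an illustration of this flow, here is a minimal Go sketch, assuming the converted documents are written to the new data source with the gocb SDK. The DCP listener itself is abstracted behind a mutations channel, since the wiring depends on which DCP client library you use; the Mutation type and convertOnce function are hypothetical stand-ins, not our production code.

package main

import (
	"encoding/json"
	"log"

	"github.com/couchbase/gocb/v2"
)

// Mutation is a simplified stand-in for a DCP mutation event,
// as delivered by whichever DCP client library is used.
type Mutation struct {
	Key   string
	Value []byte
}

// convertOnce is a placeholder for the heavy conversion logic that used
// to run on every request; in this design it runs once per mutation.
func convertOnce(raw []byte) (map[string]interface{}, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		return nil, err
	}
	// ... transform the raw content document into the bulk-serving model ...
	return doc, nil
}

// consume listens to the mutation stream, converts each document once,
// and writes the result to the new data source.
func consume(mutations <-chan Mutation, target *gocb.Collection) {
	for m := range mutations {
		converted, err := convertOnce(m.Value)
		if err != nil {
			log.Printf("convert %s: %v", m.Key, err)
			continue
		}
		if _, err := target.Upsert(m.Key, converted, nil); err != nil {
			log.Printf("upsert %s: %v", m.Key, err)
		}
	}
}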

Now, it is time to determine whether we can optimize the data.

Do we need all the data?

We used the same data model for the content page and for bulk requests. For bulk requests, different clients need different parts of the data, so we needed an analysis to understand who needs what. We did a comprehensive analysis together with all of our clients.

After the analysis, we realized that we only needed some of the data, and nearly all of our clients required almost the same parts. But we could not touch the schema, because some of our clients use the same data source directly; we were tightly coupled. In conclusion, separating the data sources might be a step toward a solution.

What do we have so far?

A service that listens to changes from the content data source, performs the heavy conversion, and writes to a new data source. And we know from the analysis that we only need part of the data, so we can omit all the unnecessary parts during the conversion.
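As a concrete (and purely hypothetical) example of omitting unnecessary parts in Go: defining a trimmed model with only the needed fields and unmarshalling the full document into it silently drops everything else. The field names below are illustrative, not the real schema.

package main

import "encoding/json"

// BulkContent is a hypothetical trimmed model that keeps only the fields
// the bulk clients actually need, according to the analysis.
type BulkContent struct {
	ID       string  `json:"id"`
	Name     string  `json:"name"`
	ImageURL string  `json:"imageUrl"`
	Price    float64 `json:"price"`
}

// trim converts a full content document into the smaller bulk model;
// any field not declared on BulkContent is simply dropped.
func trim(fullDoc []byte) ([]byte, error) {
	var c BulkContent
	if err := json.Unmarshal(fullDoc, &c); err != nil {
		return nil, err
	}
	return json.Marshal(&c)
}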

Are you ready? We saved 77% of the data (1.28 TB). What a relief!

What now?

We need something that uses the data we prepared. We wanted to separate the bulk content needs from the product detail page needs, because their clients, scale requirements, and business rules differ. So, it was time to write a new service.

What can be done differently?

Maybe the technologies we use. The old service used Java 8 and Spring Boot. We had stable experience with Quarkus; perhaps we could lower resource usage and boot time to become more scalable. We were also using Flux to fetch documents from Couchbase, and maybe we could use CompletableFuture for a change. After listing our options, we developed many POC applications and ran many load tests. But the results were not good enough, at least not for the effort they would require. The research had to continue.

By the way, I have been a gopher for four years, and I believed we could fetch the documents efficiently with parallel goroutines, so I wanted to try that. Then I looked at the official Couchbase Go SDK, which has a “bulk document get” feature. That was great! Our needs were a good fit for Go, and I wanted to try both options.
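Here is a rough sketch of the two options, assuming the Couchbase Go SDK v2 (gocb): fetching each document in its own goroutine, and fetching them all through the SDK's bulk operation API. Error handling and naming are simplified for illustration.

package main

import (
	"encoding/json"
	"sync"

	"github.com/couchbase/gocb/v2"
)

// getParallel fetches each document in its own goroutine.
func getParallel(col *gocb.Collection, keys []string) []json.RawMessage {
	results := make([]json.RawMessage, len(keys))
	var wg sync.WaitGroup
	for i, key := range keys {
		wg.Add(1)
		go func(i int, key string) {
			defer wg.Done()
			res, err := col.Get(key, nil)
			if err != nil {
				return // failed or missing documents are skipped here
			}
			_ = res.Content(&results[i])
		}(i, key)
	}
	wg.Wait()
	return results
}

// getBulk fetches all keys in a single bulk operation.
func getBulk(col *gocb.Collection, keys []string) ([]json.RawMessage, error) {
	ops := make([]gocb.BulkOp, 0, len(keys))
	for _, key := range keys {
		ops = append(ops, &gocb.GetOp{ID: key})
	}
	if err := col.Do(ops, nil); err != nil {
		return nil, err
	}
	results := make([]json.RawMessage, 0, len(keys))
	for _, op := range ops {
		getOp := op.(*gocb.GetOp)
		if getOp.Err != nil || getOp.Result == nil {
			continue
		}
		var doc json.RawMessage
		if err := getOp.Result.Content(&doc); err == nil {
			results = append(results, doc)
		}
	}
	return results, nil
}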

I ran load tests on both the parallel-goroutine and bulk-get approaches. Response times and resource usage improved significantly with both, and the bulk feature performed better than our simple parallel implementation. But something was off: network usage was four times higher. In the Java SDK, we could enable compression through the Couchbase connection configuration, but the Go SDK does not expose a compression option in its cluster configuration options, so I had missed configuring it. Since compression could not be controlled the same way as in the Java SDK, I looked into the SDK's code, did some painful debugging, and discovered that compression can be controlled via a query parameter at the end of the connection string:

couchbase://{HOST_HERE}?compression=true
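Wiring that connection string into the cluster setup looks roughly like this with gocb; the host, bucket name, and credentials below are placeholders.

package main

import "github.com/couchbase/gocb/v2"

func connect() (*gocb.Collection, error) {
	// Compression is enabled through the connection string itself,
	// since the cluster options do not expose a compression flag.
	cluster, err := gocb.Connect("couchbase://my-couchbase-host?compression=true", gocb.ClusterOptions{
		Username: "username",
		Password: "password",
	})
	if err != nil {
		return nil, err
	}
	return cluster.Bucket("contents").DefaultCollection(), nil
}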

After completing all that I’ve told you, these are the results:

  • We gained a massive amount of RAM: memory usage decreased from 800 MB to 60 MB per pod.
  • Response time was cut in half, from 10 ms to 5 ms, and it is more stable under high throughput.
  • CPU usage decreased drastically. Since the old service needed more pods to hit 5M rpm, the fair comparison is total CPU usage: it is now 130 cores instead of 300.
  • The total document size is now less than a quarter of what it was, down from 1.67 TB to only 386 GB.
  • Naturally, the network load is much lower.
  • Boot time is now 85 milliseconds instead of 12 seconds!

Thanks for reading this far. All feedback is welcome. If you would like to “optimize” with us, you can apply for our open positions.

Thanks to Emre Odabas for his support and encouragement in writing this article.
