Our Approach to Processing 100 Million+ Events Daily

Emre Arslan
Published in Trendyol Tech · 9 min read · Apr 14, 2023

As the Seller Hunt team, we build projects that aim to boost sales for sellers by offering them options related to their products. In these projects, we provide sellers with product data, filters and sorting options to help them find the right products. We aim to make the process seamless and efficient for sellers, allowing them to maximize their sales and grow their businesses.

Product Search Project

Figure 1: Data flow of the product search project

Additionally, we have a product search project in which we listen to product events, as well as events for the other data we use to increase sales on these screens. We write these events to Couchbase, which is connected to Elasticsearch through CBES (Couchbase Elasticsearch Connector). During campaign periods, heavy traffic can push the number of daily events to 200 million and throughput to 700k.

Figure 2: Count of events per day
Figure 3: Product consumer throughput

Efficient resource utilization is critical to any software system, particularly in high-traffic environments where even the smallest inefficiencies can significantly impact performance. Because we recognize this, we strive to develop systems that operate with minimal resource usage while still delivering outstanding performance. Low memory and CPU usage, together with well-tuned Kafka configurations, allow us to process a high volume of events with minimal impact on system performance. By prioritizing efficient resource usage, we ensure that our systems stay reliable and scalable, which is critical for meeting the demands of our users.

Figure 4: Average CPU and memory usage

Currently, our product consumer application runs on 8 pods. Average memory usage for these pods is 900MB and average CPU usage is 150m.

So what did we do to get to this point?

  • Optimizing the Kafka consumer configuration.
  • Using reactive programming with Kotlin coroutines.
  • Using the Java 17 Eclipse Temurin runtime.

These factors have enabled us to develop a high-performing, low-resource application for processing millions of product events. In the following sections, we will explore each factor in more detail and explain how it contributes to this result.

Part I: Kafka Consumer Configuration

Kafka is a distributed streaming platform that provides scalable, high-performance, and fault-tolerant data processing. Consumer configuration settings such as the number of consumer instances, the maximum poll interval and the concurrency count can greatly impact overall performance and resource usage. Optimizing these configurations makes it possible to achieve high-throughput, low-latency data processing while minimizing resource usage.

Understanding the Relationship between Kafka Partitions and Consumer Scaling

In Kafka, each topic is divided into one or more partitions, which are the individual units of parallelism that allow messages to be processed concurrently. Consumers in Kafka subscribe to one or more partitions of a topic and read messages from those partitions. The number of partitions in a topic determines the maximum number of consumers that can effectively read from the topic.

Figure 5: Each consumer is assigned to 2 partitions

If there are fewer consumers than partitions, some consumers must read from multiple partitions and can become overloaded. On the other hand, if there are more consumers than partitions, the extra consumers are left with no partitions to read and sit idle. Therefore, it is important to consider the number of partitions when designing a consumer application, taking into account the expected message throughput and the desired level of parallelism.

Figure 6: Each consumer is assigned to 1 partition (recommended)

In general, it is recommended to have at least as many partitions as there are active consumers and to avoid having significantly more consumers than partitions. Additionally, the partition assignment strategy used by the consumer group can affect the distribution of partitions among consumers. It should be chosen based on the specific requirements of the application.
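Since the number of active consumers is capped by the partition count, it helps to check a topic's partition count before sizing a consumer group. Below is a minimal Kotlin sketch of such a check using the plain Kafka client; it is not from our codebase, and the bootstrap address is a placeholder.

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer
import java.util.Properties

// Returns the number of partitions for a topic, which caps the number of
// active consumers in a group. "localhost:9092" is a placeholder address.
fun partitionCount(topic: String): Int {
    val props = Properties().apply {
        put("bootstrap.servers", "localhost:9092")
        put("key.deserializer", StringDeserializer::class.java.name)
        put("value.deserializer", StringDeserializer::class.java.name)
    }
    KafkaConsumer<String, String>(props).use { consumer ->
        return consumer.partitionsFor(topic).size
    }
}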

How do we approach the relationship here?

Initially, we might think that having 30 active consumers for a topic with 30 partitions would yield the best performance. However, this approach has some issues.

1- Not all of the topics we listen to have 30 partitions, so for topics with fewer partitions many consumers would sit idle.

2- Running that many pods increases the amount of resources required.

3- Even if we split the difference and run around 15 pods, assigning two different partitions to the same pod can cause performance issues.

To address these issues effectively, we can use the concurrency feature found in Kafka client libraries. We use Spring Kafka, whose default concurrency value is 1. By setting the concurrency value to 3, a single instance of our Spring Boot application is assigned three different partitions and consumes them on three threads. But then, we might think that setting concurrency to 30 and running just one pod would be the best option.

Figure 7: Balance is important :)

However, finding the right balance between concurrency, resource requests and limits, and the number of pods is crucial. If the concurrency value is too high, it can negatively impact both performance and resource usage. Therefore, we need to determine the right values for our own needs and perform load testing and a few trials to find the most appropriate ones.
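To show where the concurrency value plugs in, here is a minimal Spring Kafka configuration sketch in Kotlin. It is not our production code; the bootstrap address, group id and concurrency value are illustrative placeholders.

import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory
import org.springframework.kafka.core.DefaultKafkaConsumerFactory

@Configuration
class KafkaListenerConfig {

    @Bean
    fun kafkaListenerContainerFactory(): ConcurrentKafkaListenerContainerFactory<String, String> {
        val consumerProps = mapOf<String, Any>(
            ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG to "localhost:9092",
            ConsumerConfig.GROUP_ID_CONFIG to "product-consumer",
            ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG to StringDeserializer::class.java,
            ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG to StringDeserializer::class.java
        )
        val factory = ConcurrentKafkaListenerContainerFactory<String, String>()
        factory.setConsumerFactory(DefaultKafkaConsumerFactory<String, String>(consumerProps))
        // Three listener threads per application instance, so a single pod
        // can consume from three partitions concurrently.
        factory.setConcurrency(3)
        return factory
    }
}

Spring Kafka creates one listener container per concurrency slot, so setting concurrency higher than the partition count only produces idle threads.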

Currently, our product consumer runs on 8 pods with the average resource usage shown in Figure 4.

Figure 8: Average response time of product consumer

Kafka Consumer Settings: Maximizing Performance and Reliability

Configuring Kafka consumer properties correctly is essential for achieving good performance. A few key properties to consider are max.poll.records and max.poll.interval.ms.

The former determines the maximum number of records a single poll() call returns. If this value is set to 1, the consumer must make a poll() call for every message, which can significantly impact performance. The default value in the Java Kafka client is 500. The latter sets the maximum amount of time allowed between poll() calls. If this limit is exceeded, the consumer is considered failed and a rebalance is triggered. The default value in the Java Kafka client is 300000 (5 minutes).

If the consumer can't process 500 polled records within 5 minutes, it constantly triggers rebalances, which hurts performance. Therefore, it is important to balance these two properties to achieve optimal performance.
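A quick back-of-the-envelope check makes this relationship concrete. The numbers below are purely illustrative, not our production values.

// Rough sanity check: all records returned by one poll() must be processed
// within max.poll.interval.ms, or the broker triggers a rebalance.
fun pollSettingsAreSafe(
    maxPollRecords: Int,
    avgProcessingMsPerRecord: Long,
    maxPollIntervalMs: Long
): Boolean = maxPollRecords * avgProcessingMsPerRecord < maxPollIntervalMs

fun main() {
    // 500 records * 700 ms each = 350,000 ms > 300,000 ms: constant rebalances.
    println(pollSettingsAreSafe(500, 700, 300_000)) // false
    // Lowering max.poll.records to 300 brings a batch back under the limit.
    println(pollSettingsAreSafe(300, 700, 300_000)) // true
}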

In addition to these two properties, there are also heartbeat.interval.ms and session.timeout.ms. The heartbeat.interval.ms property sets the frequency of heartbeat signals sent by the consumer to the broker. The default value in Java Kafka Client is 3000 (3 seconds), meaning the consumer sends a heartbeat signal to the broker every 3 seconds. The session.timeout.ms property sets the time the broker waits for the consumer to send a heartbeat signal before considering it failed. The default value in Java Kafka Client is 45000 (45 seconds).

In a high-load application, it's normal for a few heartbeat signals to be missed. Therefore, it is recommended to wait for at least three missed heartbeat signals before considering a consumer failed. If heartbeat.interval.ms and session.timeout.ms were set to the same value, a single heartbeat missed under high load would be enough for the broker to consider the consumer failed, triggering constant rebalances that negatively impact performance.

So how do we use it?

import org.springframework.boot.context.properties.ConfigurationProperties
import org.springframework.boot.context.properties.ConstructorBinding
import org.springframework.boot.context.properties.NestedConfigurationProperty

// Shared consumer defaults; individual topics override these in the properties file.
open class KafkaConsumerProperties {
    var concurrency = 1
    var enabled = true
    var autoOffsetReset = "earliest"
    var maxPollRecords = 500
    var maxPollIntervalMs = 300000
    var sessionTimeoutMs = 10000
    var heartbeatTimeoutMs = 3000
}

open class KafkaRetryProperties : KafkaConsumerProperties() {
    var backoffInterval = 50
    var backoffMaxAttempts = 2
}

@ConstructorBinding
@ConfigurationProperties(prefix = "kafka")
class KafkaProperties {
    var bootstrapServers: String? = null

    @NestedConfigurationProperty
    var retry = KafkaRetryProperties()

    @NestedConfigurationProperty
    var error = KafkaConsumerProperties()

    @NestedConfigurationProperty
    var fooConsumer = KafkaConsumerProperties()

    @NestedConfigurationProperty
    var barConsumer = KafkaConsumerProperties()
}

Taking these library defaults into account, we set our own defaults. We then review the values for each topic and override specific properties in the properties file for that topic; you can see this properties structure in the code snippet above. Additionally, since the number of products and events keeps growing, we set up alarms and monitor them to ensure these values remain appropriate.
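For illustration, Spring Boot's relaxed binding lets each topic's consumer be tuned from application.yml using the structure above. The topic names and values below are hypothetical:

kafka:
  bootstrap-servers: "broker-1:9092,broker-2:9092"  # placeholder addresses
  foo-consumer:
    concurrency: 3
    max-poll-records: 300          # slower-to-process topic, smaller batches
    max-poll-interval-ms: 300000
  bar-consumer:
    concurrency: 1                 # low-traffic topic, the defaults suffice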

Part II: Why do we use Kotlin?

Figure 9: How do Kotlin Coroutines work?

In recent years, coroutines and reactive programming have become popular paradigms in software development, particularly in the realm of high-performance, event-driven systems. Coroutines allow developers to write asynchronous, non-blocking code more concisely and readably: they provide a way for code to suspend and resume execution at specific points, allowing for more efficient use of system resources and better performance. This makes it easier to write code that handles multiple concurrent tasks without complex thread management.

So let’s first look at the example of callback hell.

getUserInfo(new CallBack() {
    @Override
    public void onSuccess(String user) {
        if (user != null) {
            System.out.println(user);
            getFriendList(user, new CallBack() {
                @Override
                public void onSuccess(String friendList) {
                    if (friendList != null) {
                        System.out.println(friendList);
                        getFeedList(friendList, new CallBack() {
                            @Override
                            public void onSuccess(String feed) {
                                if (feed != null) {
                                    System.out.println(feed);
                                }
                            }
                        });
                    }
                }
            });
        }
    }
});

We can write the above code in a much simpler way with Kotlin:

// Called from inside a coroutine, the asynchronous flow reads top to bottom.
val user = getUserInfo()
println(user)
val friendList = getFriendList(user)
println(friendList)
val feedList = getFeedList(friendList)
println(feedList)

Suspend Function

Suspend functions in Kotlin are functions that can be suspended and resumed at a later time without blocking a thread. They allow for asynchronous programming without the need for callbacks or separate threads, making code easier to read and write. When a suspend function is called, it executes its code until it encounters a suspension point, which can be a call to another suspend function or a coroutine builder. At this point, the suspend function suspends and its execution can be resumed later. This makes it possible to write efficient and scalable concurrent code without the overhead of creating and managing threads.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.withContext

// delay(1000L) represents a request to a server

suspend fun getUserInfo(): String {
    withContext(Dispatchers.IO) {
        delay(1000L)
    }
    return "BoyCoder"
}

suspend fun getFriendList(user: String): String {
    withContext(Dispatchers.IO) {
        delay(1000L)
    }
    return "Tom, Jack"
}

suspend fun getFeedList(list: String): String {
    withContext(Dispatchers.IO) {
        delay(1000L)
    }
    return "{FeedList..}"
}
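To run the example end to end, the suspend functions above can be called from a coroutine builder such as runBlocking. This is a demonstration sketch; a real service would launch coroutines in a scope tied to its lifecycle instead of blocking a thread.

import kotlinx.coroutines.runBlocking
import kotlin.system.measureTimeMillis

fun main() = runBlocking {
    val elapsed = measureTimeMillis {
        val user = getUserInfo()            // suspends for ~1s without blocking
        println(user)
        val friendList = getFriendList(user)
        println(friendList)
        val feedList = getFeedList(friendList)
        println(feedList)
    }
    // Three sequential calls take roughly 3 seconds in total, but the
    // underlying threads stay free for other work while each call is suspended.
    println("Took $elapsed ms")
}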

To understand suspend functions visually, consider the three functions in the code example above. When we call them, the animation below shows how Kotlin coroutines switch between threads. When a coroutine is suspended, it doesn't block the thread it's running on, so that thread can be used for other tasks. This means suspend functions can help reduce the number of threads needed to handle a high load, leading to more efficient resource usage.

Figure 10: Animation of how the Kotlin Coroutines switch between threads

As another benefit, Kotlin, Spring Boot and Spring WebFlux offer a powerful stack for building high-performance, reactive systems. Kotlin’s seamless integration with the Java ecosystem allows developers to use familiar libraries and tools. At the same time, the combination of Spring Boot and Spring WebFlux provides a robust and flexible platform for building reactive applications. By combining these technologies, developers can achieve high-performance systems that are both efficient and easy to maintain.
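As a small, hypothetical illustration of that stack (not from our codebase, and assuming kotlinx-coroutines-reactor is on the classpath), Spring WebFlux can serve a suspend function directly from a controller:

import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.PathVariable
import org.springframework.web.bind.annotation.RestController

data class ProductResponse(val id: String, val name: String)

// Hypothetical service; findProduct would typically wrap a non-blocking call.
interface ProductService {
    suspend fun findProduct(id: String): ProductResponse
}

@RestController
class ProductController(private val productService: ProductService) {

    // WebFlux invokes suspend handlers on its event loop and frees the thread
    // while the downstream call is suspended.
    @GetMapping("/products/{id}")
    suspend fun getProduct(@PathVariable id: String): ProductResponse =
        productService.findProduct(id)
}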

In conclusion

Firstly, we achieved high throughput with low latency and efficient resource usage by properly tuning our Kafka consumer configuration.

Next, Kotlin coroutines offer powerful tools for managing concurrency and asynchronous tasks, providing a versatile and efficient way to build high-performance, event-driven systems. By combining Kotlin coroutines, Spring Boot and Spring WebFlux, developers can achieve a high-performance stack without sacrificing familiarity or ease of use.

Overall, our approach to processing 100 million events daily has been successful. We will continue to improve and iterate on our architecture to ensure the reliability and scalability of our data processing pipeline.

Want to work in this team?

Do you want to join us on the journey of building the e-commerce platform that has the most positive impact?

Have a look at the roles we’re looking for!
