Optimizing Kafka Performance Through Data Compression

Murat Özkanlı
Published in Trendyol Tech

In distributed messaging systems like Confluent Kafka, data compression is a vital tool for improving system efficiency and overall performance. Managing such a large infrastructure brings great responsibility, so we try to apply best practices across our clusters based on the experience we have gained. To understand how to improve overall server and client performance while optimizing data flow, we conducted a study comparing the compression algorithms we considered using in our application. In this article, we will discuss the details of that study.

As the Trendyol messaging team, we manage the messaging platforms used by all domain teams. In doing so, we strive to offer our customers an end-to-end automated experience. Currently, we maintain over 40 Kafka clusters across three different data centers. Petabytes of data flow through the systems we manage every day.

To take a closer look at what we do, you can also check out the article written by my colleague Mehmetcan Güleşçi, titled “Enhancing Kafka Cluster Integrity: Introducing the Topic Creation Policy,” which introduces a best practice for solving misconfiguration issues at our Confluent Kafka Cluster.

My colleague Taner Sayım and I decided to implement compression for the *stock transfer application* before deploying it to the production environment. There are two ways to implement compression: on the client side or on the server side. Since this cluster contains thousands of topics and there is no general rule enforcement, we apply compression on the client side. This also keeps any potential issues contained within a single application.
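For illustration, client-side compression requires only a change to the producer configuration, while the server-side alternative would be to set the compression.type configuration on the topic or broker instead. Below is a minimal sketch using the Confluent.Kafka package, with placeholder server names:

var config = new ProducerConfig
{
    BootstrapServers = "servers...",        // placeholder
    CompressionType = CompressionType.Gzip  // or Snappy, Lz4, Zstd
};
// No topic or broker changes are needed as long as the topic's compression.type
// stays at its default value ("producer"), which preserves the producer's codec.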

Introduction

Kafka’s ability to handle high data throughput and massive volumes of information makes it ideal for real-time applications where speed and efficiency are critical. By compressing messages, Kafka reduces the size of data transmitted and stored, which significantly impacts network bandwidth, storage needs, and overall system performance. This optimization allows Kafka to handle more messages without overloading the network, thus preventing congestion and bottlenecks, which is crucial when network resources are limited or expensive.

Compression not only reduces network load but also improves disk I/O performance. With smaller messages, more data can be read from or written to the disk faster, enhancing the throughput of Kafka brokers. This becomes especially important in large-scale data ingestion scenarios or real-time analytics applications such as high-frequency trading systems or log aggregation. Moreover, reducing the data size translates into lower storage costs, as less disk space is needed to store the same amount of information.

Compression also plays a key role in Kafka’s replication process, ensuring data reliability and fault tolerance by replicating messages across brokers. Compressed data reduces the load on the replication process, speeding up data consistency across the cluster and ensuring high availability even during broker failures.

Various compression algorithms, such as Gzip, Zstd, Lz4, and Snappy, provide different trade-offs between compression ratios, speed, and resource usage. Choosing the appropriate compression algorithm can balance the benefits of data size reduction with the computational overhead of compression and decompression. The right choice of algorithm and level can significantly affect the performance of both Kafka producers and consumers.

Problem Definition

The challenge lies in selecting the most appropriate compression algorithm for a Confluent Kafka-based system. The goal is to optimize both speed and compression efficiency while minimizing the impact on system resources. Each algorithm comes with unique advantages and drawbacks. For example, Gzip offers high compression ratios but is CPU-intensive, while Snappy offers faster compression at the cost of lower compression efficiency. Balancing these trade-offs is essential to maximize Kafka’s performance and scalability.

Given that Kafka is often deployed in environments where real-time data processing and low latency are critical, finding an optimal compression configuration becomes even more important. Additionally, Kafka’s role as a high-throughput messaging system places significant pressure on network and disk resources, making efficient compression a necessity for maintaining performance under heavy loads.

Approach

To address this, we conducted a detailed performance benchmark of different compression algorithms at various levels, including Gzip, Zstd, Lz4, and Snappy. The benchmarks focused on measuring average compression duration, CPU usage, memory consumption, and compression ratio under realistic conditions. Our testing involved a Kafka topic used in a logistics system, with a message size of 1 MB and a produce count of 1,000.
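To give a sense of how the duration measurements were taken, below is a simplified sketch of a benchmark helper, not our exact tool, that assumes the Confluent.Kafka package. It measures only the end-to-end produce time for a given algorithm and level; CPU usage, memory consumption, and compression ratio were collected from client and broker metrics.

using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Confluent.Kafka;

public static class CompressionBenchmark
{
    // Produces `count` copies of `payload` to `topic` with the given codec and level
    // and returns the elapsed time. A simplified sketch of the measurement loop.
    public static async Task<TimeSpan> RunAsync(
        string bootstrapServers, string topic, string payload,
        CompressionType codec, int level, int count = 1000)
    {
        var config = new ProducerConfig
        {
            BootstrapServers = bootstrapServers,
            CompressionType = codec,
            CompressionLevel = level // usable level range depends on the codec
        };

        using var producer = new ProducerBuilder<Null, string>(config).Build();
        var stopwatch = Stopwatch.StartNew();

        for (var i = 0; i < count; i++)
        {
            await producer.ProduceAsync(topic, new Message<Null, string> { Value = payload });
        }

        producer.Flush(TimeSpan.FromSeconds(30)); // make sure all batches were delivered
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}

We ran this kind of loop for each algorithm and level combination, using the 1 MB sample message and the produce count of 1,000 mentioned above.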

Below is the summary of our benchmark results:

[Image: Comparative analysis of compression algorithms through performance testing]

Benchmark Analysis

  • Gzip: Showed a high compression ratio but had the highest CPU usage and longer compression times. Increasing the compression level further increased the resource usage, making it more suitable for environments where compression efficiency is more important than speed.
  • Zstd: Offered the highest compression ratio at higher levels while maintaining a reasonable balance between CPU usage and memory consumption. This algorithm was most effective in environments requiring high compression ratios with moderate resource usage.
  • Lz4: Provided faster compression times and lower CPU usage but delivered lower compression ratios compared to Gzip and Zstd. Lz4 may be more appropriate in scenarios where speed is prioritized over compression efficiency.
  • Snappy: Achieved the fastest compression times with minimal CPU usage but had the lowest compression ratio. Snappy is best suited for applications where speed and low CPU impact are more critical than compression efficiency.

Based on the benchmark data, Zstd at compression level 3 was selected as the optimal choice for our application. This level offered a high compression ratio while keeping resource consumption, particularly CPU and memory, at manageable levels. As a result of this study, a reduction of around 70% in message size was achieved; this rate may vary depending on the message type. The application we use for this task is a .NET 5 application, and we use version 2.3.0 of the Confluent.Kafka NuGet package for handling Kafka-related operations. Below is a sample of the message format, followed by a code snippet illustrating how we implemented Zstd compression in our client:

Sample message (JSON format):


{
  "Array1": [
    {
      "Field1": 20446,
      "Field2": 20245,
      "Field3": "b2a068a1-deb6-459b-a38b-58cfcc9c731d"
    },
    {
      "Field1": 16070,
      "Field2": 19474,
      "Field3": "ad2b0f29-c26a-41ff-8b95-e0ccac0f71eb"
    }
  ],
  "Array2": [
    {
      "Field1": 1255014359,
      "Field2": "Code6efc136e-0e0a-47cd-b73f-094229935131",
      "Field3": 1538404700,
      "Array3": [
        {
          "Field1": 843614680,
          "Field2": 1259663652,
          "Field3": "7747a69b-63b6-4e2f-a393-7002fbd883fa"
        },
        {
          "Field1": 1634694303,
          "Field2": 2115776113,
          "Field3": "c421e3c3-388c-47fc-a6fb-36ec7212fab0"
        }
      ]
    },
    {
      "Field1": 673076286,
      "Field2": "ec04718e-98cb-419e-8b41-6a1d4937728c",
      "Field3": 1074623070,
      "Array3": [
        {
          "Field1": 966769826,
          "Field2": 613694778,
          "Field3": "fa7c560b-5bcd-41f3-8d77-ecfdf57225c4"
        }
      ]
    },
    {
      "Field1": 1431764173,
      "Field2": "5a52522e-a4cb-4570-97d0-21cf259b3e2e",
      "Field3": 2041344137,
      "Array3": [
        {
          "Field1": 1444531729,
          "Field3": "3220562b-2131-4811-bf6c-8a753431aab4"
        }
      ]
    }
  ],
  "Field10": 1669373932,
  "Field11": "072ce457-26ae-4179-87ce-93222358ab3e",
  "Field12": 1959699302,
  "Field13": 1617737239
}

using Confluent.Kafka;

public async Task Produce(string topic, string key, string value)
{
    var config = new ProducerConfig
    {
        BootstrapServers = "servers...",
        // Client-side compression: Zstd at level 3, as selected from the benchmark.
        CompressionType = CompressionType.Zstd,
        CompressionLevel = 3
    };

    using var producer = new ProducerBuilder<string, string>(config).Build();

    await producer.ProduceAsync(topic, new Message<string, string>
    {
        Key = key,
        Value = value
    });
}
[Image: topic metrics at time t, before compression]
[Image: topic metrics at time t, after compression]

A Kafka client that performs compression and a client that performs decompression do not need to use the same version. Kafka supports backward compatibility for compression and decompression, so clients on different versions can still communicate with each other.

However, there are a few important points to consider:

1. Supported Compression Algorithms: Both clients must support the same compression algorithm (e.g., gzip, snappy, lz4, zstd). If one client doesn’t support a specific algorithm, it won’t be able to decompress the data.

2. Compatibility: As long as Kafka’s backward compatibility policies are followed, different Kafka client versions should be able to handle compressed data correctly.
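On the consumer side, no compression-related settings are needed: the client library reads the codec from each record batch and decompresses it transparently, as long as the client build supports that codec. A minimal sketch, again assuming the Confluent.Kafka package, with a hypothetical group id and topic name:

using System;
using Confluent.Kafka;

var config = new ConsumerConfig
{
    BootstrapServers = "servers...",          // placeholder, as in the producer example
    GroupId = "stock-transfer-consumers",     // hypothetical consumer group
    AutoOffsetReset = AutoOffsetReset.Earliest
};

// Note that no CompressionType is set here: the consumer detects the codec from
// the record batch attributes and decompresses the messages automatically.
using var consumer = new ConsumerBuilder<string, string>(config).Build();
consumer.Subscribe("stock-transfer-topic");   // hypothetical topic name

var result = consumer.Consume(TimeSpan.FromSeconds(5));
if (result != null)
{
    Console.WriteLine($"Consumed {result.Message.Value.Length} characters, already decompressed.");
}
consumer.Close();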

Conclusion

When optimizing Kafka for real-time, high-throughput environments, selecting the right compression algorithm is crucial. Our benchmarking led us to choose Zstd at level 3, which delivered a good balance of compression efficiency and resource usage. However, it’s essential to recognize that each compression algorithm comes with trade-offs between speed, compression efficiency, and resource demands.

Compression is a powerful tool in Kafka for ensuring scalability, reliability, and performance. Still, one should remember that Kafka excels at data in motion and real-time processing, not long-term data storage or batch processing. Selecting the right compression strategy will enhance Kafka's ability to handle larger datasets more efficiently, reduce operational costs, and maintain high performance under load.

“Kafka is not meant for archiving data, nor for batch processing”

About Us

Want to know more? Or contribute by researching it yourself?

We’re building a team of the brightest minds in our industry. Interested in joining us? Visit the pages below to learn more about our open positions.
