Benchmarking standards for Kafka compression algorithms

Prasanta Kumar Mohanty
Aug 10, 2020


Kafka Producer

Before we begin to understand compression performance on Kafka producers, let's first understand how producers work inside Kafka.

Producer Record

A message that is to be written to Kafka is referred to as a ProducerRecord. A producer record must have the name of the topic it should be written to and the value of the record. Other fields like partition, timestamp and key are optional.
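For example, here is a minimal sketch of creating producer records; the topic name, key and value shown are illustrative only:

import org.apache.kafka.clients.producer.ProducerRecord;

// Required fields only: topic and value.
ProducerRecord<String, String> minimal =
        new ProducerRecord<>("stock-prices", "IBM,120.50");

// With the optional fields as well: partition, timestamp and key.
ProducerRecord<String, String> full =
        new ProducerRecord<>("stock-prices", 0, System.currentTimeMillis(), "IBM", "IBM,120.50");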

Workflow

The workflow of a producer involves five important steps:

  1. Serialize: Both the key and the value are serialized using the serializers passed to the producer. Available serializers include the String, ByteArray and ByteBuffer serializers.
  2. Partition: Decides which partition of the topic the message will be written to; by default this follows the murmur2 algorithm. A custom partitioner can also be passed to the producer to control which partitions messages are written to.
  3. Compress: By default, compression is not enabled in the Kafka producer. Compression enables faster transfer not only from producer to broker but also during replication, and helps achieve better throughput, lower latency, and better disk utilization.
  4. Accumulate records: Records are accumulated in a buffer per partition of a topic and grouped into batches based on the producer batch size property. Each partition in a topic gets a separate accumulator/buffer.
  5. Group by broker and send (sender thread): The batches in the record accumulator are grouped by the broker they are to be sent to, and are sent based on the batch.size and linger.ms properties: a batch is sent when either the defined batch size or the defined linger time is reached. (A configuration sketch covering these steps follows this list.)
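To make these five steps concrete, here is a minimal configuration sketch; the broker address and the batch.size/linger.ms values are illustrative assumptions, not the values used in this benchmark:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        // Step 1: serializers for key and value.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Step 2: the default partitioner (murmur2 hash of the key) applies unless
        // ProducerConfig.PARTITIONER_CLASS_CONFIG names a custom implementation.
        // Step 3: compression defaults to "none"; later sections change this value.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");
        // Steps 4 and 5: records accumulate per partition; a batch is sent when
        // batch.size bytes are reached or linger.ms elapses, whichever comes first.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // illustrative (bytes)
        props.put(ProducerConfig.LINGER_MS_CONFIG, 100);    // illustrative (ms)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("stock-prices", "IBM", "IBM,120.50"));
        }
    }
}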

Compression plays a vital role in the performance of Kafka messaging.

Prerequisites

This benchmarking was done on the following hardware and software:

  1. kafka_2.11-1.0.0
  2. Java 8
  3. 4-core (Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz), 8 GB RAM, Ubuntu 18.04 system
  4. STS 3 running on 512 MB of memory

Benchmarking Process

1. Clone the following repository
git clone https://github.com/prasantmohanty/kafka-compression-benchmark

2. Run ZooKeeper

./zookeeper-server-start.sh ../config/zookeeper.properties

3. The server properties are available under src/main/resources. Run the three brokers with the following commands (each in a separate terminal)

./kafka-server-start.sh src/main/resources/server-0.properties
./kafka-server-start.sh src/main/resources/server-1.properties
./kafka-server-start.sh src/main/resources/server-2.properties

4. Create the topic

./kafka-topics.sh --create --zookeeper localhost:2181  --replication-factor 3 --partitions 3 --topic stock-prices --config  min.insync.replicas=2
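Optionally, verify the topic before moving on (this --describe check is an addition to the original steps):

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic stock-prices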

Compression with none

  • We modify the setupBatchingAndCompression method of the StockPriceKafkaProducer class to use no compression, as in the sketch below.
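Assuming the props object used in the cloned source, that change presumably looks like the following (a sketch mirroring the snippets in the later sections):

// Explicitly set no compression (this is also the producer default).
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");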
  • Run the SimpleStockPriceConsumer in STS and, right after this, run the StockPriceKafkaProducer. You will see metrics logs like the one below.
[Screenshot: compression metrics with none]

Compression with snappy

  • Stop the StockPriceKafkaProducer, change the compression type to snappy, and run the Java process again:
// Use Snappy compression for batch compression.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
  • Capture the metrics logs.
[Screenshot: compression metrics with snappy]

Compression with gzip

  • Stop the StockPriceKafkaProducer, change the compression type to gzip, and run the Java process again:
// Use GZIP compression for batch compression.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
  • Capture the metrics logs.
[Screenshot: compression metrics with gzip]

Compression with lz4

  • Stop the StockPriceKafkaProducer, change the compression type to lz4, and run the Java process again:
// Use LZ4 compression for batch compression.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
  • Capture the metrics logs.
[Screenshot: compression metrics with lz4]

Benchmarking of metrics

When we compare all the compression algorithms, we find the result below.

[Image: compression comparison]

For my hardware and Kafka version, I see a compression benefit of 3x with snappy and lz4; with gzip, however, the benefit is 4.5x. With snappy I see the incoming-byte rate, response-rate and request-size-max metrics increase. Note that I have not considered CPU utilization while compression is happening; CPU cost plays a major role in deciding which algorithm to use.

A new compression algorithm, Zstandard, has been added to Kafka, which I will cover in the next story.

Disclaimer

The Java source is adapted from the Cloudurable GitHub repository.
