Experiments in Kafka Compression

Ralph Brendler
project44 TechBlog

--

Our use of Kafka at Project44 has continued to expand over the past several years, to the point where a large portion of our product depends on it. This increased usage has come with an increase in the infrastructure required to run it, which has in turn caused us to look at ways of minimizing the associated storage, network, and CPU costs.

One of the most fruitful avenues we have pursued to reduce our infrastructure cost has been the use of compression in Kafka. I’ve used this technique occasionally in the past, and Project44 has some topics dealing with fairly large messages that stood to benefit from it especially.

The theory here is that we can compress Kafka payloads, thus trading some CPU load for a reduction in network and storage costs. The questions we need to answer are:

  • Will compression be able to keep up with our high-volume topics?
  • Will the increase in CPU outweigh the disk/bandwidth reduction?
  • What configuration is the most effective?

In order to answer these questions, we performed a series of experiments.

Configuring Kafka Compression

Kafka supports four compression codecs (GZIP, Snappy, LZ4, and ZSTD), offering lots of options for CPU/compression tradeoffs:

Kafka Compression Types

Kafka also supports specifying compression.type in several places:

  • At the topic level, where compression is done by the topic’s “leader” broker. This means there is no CPU impact on the producer, but only replication and disk storage are compressed; the original message from the producer still travels uncompressed.
  • At the producer level, where the client compresses the data before sending. This will obviously increase CPU on the client, but the initial payload, replicas, and disk will all be compressed.
  • At the broker level, which will cause all disk storage to be compressed, regardless of the topic/producer configuration.

The default for both broker and topic compression is producer, which means the data is stored exactly as it was (or was not) compressed by the producer.
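
To make this concrete, here is a minimal sketch of producer-level compression with the standard Java client; the broker address, topic name, and the choice of Snappy are placeholders rather than our actual settings:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CompressedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Producer-level compression: batches are compressed on the client before
            // they are sent, so the network hop, replication, and disk all benefit.
            // Topic- and broker-level compression use the same "compression.type"
            // setting, applied to the topic config or to server.properties instead.
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            }
        }
    }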

Experiment 1 — Establish Disk Space Baselines

Now that we know what our options are, the next step is to do some experiments to examine the tradeoffs. Our first experiment was to establish a baseline for broker disk storage in various configurations.

The test setup was a single broker running locally, using a topic with a single partition and replication factor of 1. The Kafka CLI tools were used for sending and receiving messages. This setup takes replication out of the picture and focuses on broker CPU and disk usage.
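
The experiments themselves used the stock CLI tools, but for reference, an equivalent topic setup with the Java AdminClient might look like the following (topic name and codec are illustrative):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateBaselineTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // One partition, replication factor 1 -- takes replication out of the picture.
                NewTopic topic = new NewTopic("compression-test", 1, (short) 1)
                        // Topic-level compression for the compressed test cases; the raw
                        // baseline simply omits this config (leaving the default, "producer").
                        .configs(Map.of(TopicConfig.COMPRESSION_TYPE_CONFIG, "snappy"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }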

Broker Storage Baselines

Note that test case 6 produced messages too large for Kafka to handle, rendering the results incomparable.

Takeaways from Experiment 1

Broker disk space:

  • Whenever compression is in play, we reap the benefit of reduced disk space utilization. This is true even when Producer Compression is on and there is no Compression on the Topic (see Test Cases #3 and #4).
  • Comparing apples to apples, the disk space savings are essentially the same whether we use Producer Compression, Topic Compression, or both, as long as the codec is the same (see Test Cases #2 vs. #3 vs. #5).
  • Note that we still reap the benefit of Broker disk space savings even when the Topic is NOT configured for compression: the Broker writes the compressed messages to disk exactly as it received them.

Broker CPU Usage:

  • When the Topic has a compression setting that doesn’t match the Producer’s, the Broker has to spend CPU on message conversion.
  • Examine Case #2 (Topic Compression only): since the Producer sends raw messages, the Broker spends the CPU to compress them.
  • Examine Cases #6 and #7: the Producer and Topic specify two different codecs. CPU cycles are wasted because the Producer compresses, then the Broker decompresses AND recompresses with the codec set on the Topic.

In general, our first experiment shows that the CPU increase on the brokers due to compression is not significant when dealing with our typical messages. This is not surprising, as our Kafka brokers are high-resource machines designed to handle high load. This result may not hold up with client-side compression, however, since our clients tend to run on much smaller machines.

After additional baseline experiments, we decided to do our further tests using SNAPPY compression, with a combination of topic and producer compression.

Experiment 2 — Java Clients

Our second experiment used the same basic setup, but instead of the Kafka CLI tools we created a simple Java-based producer to test our configuration. In this scenario we were able to use the Kafka JMX metrics to get a much better idea of what was going on internally.

Producer Metrics
Broker Metrics
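
A stripped-down version of such a test producer is sketched below. The topic name, payload, and metric filter are assumptions for illustration; the same counters the client registers in JMX are also available in-process via producer.metrics():

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CompressionTestProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

            // Stand-in for a large message; the real payloads were sizable JSON documents.
            String payload = "x".repeat(10_000);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100_000; i++) {
                    producer.send(new ProducerRecord<>("compression-test", Integer.toString(i), payload));
                }
                producer.flush();

                // compression-rate shows how well batches compress; outgoing-byte-rate shows
                // the (compressed) bytes actually written to the network.
                producer.metrics().forEach((name, metric) -> {
                    if (name.name().contains("compression-rate") || name.name().contains("outgoing-byte-rate")) {
                        System.out.println(name.group() + "/" + name.name() + " = " + metric.metricValue());
                    }
                });
            }
        }
    }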

Takeaways from Experiment 2

The broker metrics collected in experiment 2 track well with our original baselines, but the producer and consumer metrics give us much more insight into the details of the processing.

  • Producer compression only takes effect if the producer itself is configured for compression. If only topic compression is enabled, the producer still sends full-size batches, so we gain no benefit on the client side.
  • Consumers of compressed topics get a dramatic reduction in bandwidth: on average, the bytes fetched were just 16% of the raw topic size (see the consumer sketch below).
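
Notably, nothing has to change on the consumer side to get that benefit: the client library decompresses fetched batches transparently. A minimal consumer sketch (group id, topic name, and metric filter are illustrative):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class CompressionTestConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "compression-test-group");
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            // Note: no compression settings at all -- decompression is transparent to the consumer.

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singleton("compression-test"));
                for (int i = 0; i < 10; i++) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    System.out.println("Fetched " + records.count() + " records");
                }

                // incoming-byte-rate counts bytes read off the sockets, so with compressed
                // batches it reflects the reduced network traffic directly.
                consumer.metrics().forEach((name, metric) -> {
                    if (name.group().equals("consumer-metrics") && name.name().equals("incoming-byte-rate")) {
                        System.out.println(name.group() + "/" + name.name() + " = " + metric.metricValue());
                    }
                });
            }
        }
    }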

Experiment 3 — Load Test on Live Cluster

Given what we learned in our controlled experiments, we decided to move forward with “real world” testing on one of our QA stacks. Our final configuration was:

  • broker compression: producer
  • topic compression: SNAPPY
  • producer compression: SNAPPY
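
Applying that configuration to an already-existing topic is just a config change. As a sketch (the topic name here is hypothetical, and the same change can be made with the kafka-configs CLI tool):

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;
    import org.apache.kafka.common.config.TopicConfig;

    public class EnableTopicCompression {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (Admin admin = Admin.create(props)) {
                // Switch an existing topic (name is hypothetical) to snappy. Producers are
                // updated separately via their own compression.type setting, as sketched earlier.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "cv-shipment-updates");
                AlterConfigOp setSnappy = new AlterConfigOp(
                        new ConfigEntry(TopicConfig.COMPRESSION_TYPE_CONFIG, "snappy"),
                        AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(topic, List.of(setSnappy))).all().get();
            }
        }
    }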

We selected a handful of topics related to our Collaborative Visibility workflow and configured the topics and producers to use SNAPPY compression. Once everything was deployed, the results were immediate and dramatic:

Our bandwidth dropped by about 75% in a matter of minutes! Our APM indicated that latency was up very slightly (likely due to compression overhead) but was still quite reasonable and within our limits.

Once we were sure that the system was working properly, the final test was a stress test of the CV system. We stopped the services, rewound the Kafka offsets by 72 hours, and restarted everything, which had the overall effect of three days' worth of data showing up in a single instant.
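
A rewind like that can be done with the standard consumer-group tooling or programmatically. As an illustration only (not necessarily how we did it, and the group and topic names are made up), a programmatic reset might look like:

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class RewindConsumerGroup {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            long target = System.currentTimeMillis() - Duration.ofHours(72).toMillis();

            try (Admin admin = Admin.create(props);
                 KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {

                // Find the offsets that correspond to "72 hours ago" on every partition.
                Map<TopicPartition, Long> query = consumer.partitionsFor("cv-shipment-updates").stream()
                        .collect(Collectors.toMap(
                                p -> new TopicPartition(p.topic(), p.partition()),
                                p -> target));
                Map<TopicPartition, OffsetAndTimestamp> found = consumer.offsetsForTimes(query);

                Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
                found.forEach((tp, ts) -> {
                    if (ts != null) {
                        newOffsets.put(tp, new OffsetAndMetadata(ts.offset()));
                    }
                });

                // The consumer group must be stopped (as ours was) for this to take effect.
                admin.alterConsumerGroupOffsets("cv-consumer-group", newOffsets).all().get();
            }
        }
    }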

As you can see from the charts above, the system behaved perfectly. We processed more than 10K messages/min and cleared the 3-day backlog in around 2 hours. Our production CV setup averages about 7K messages per minute, so even our underpowered QA cluster was able to top that with compression enabled!

The Bottom Line

After all of our experiments, we decided to move ahead with enabling compression on all of our clusters. We started with a gradual rollout to some high-volume topics and, based on our success there, decided to do a global rollout to all topics. The results were immediate and dramatic:

  • 70% reduction in disk storage on our brokers
  • 60% reduction in network bandwidth in/out of Kafka
  • No change in expected performance

Our brokers have been running in this configuration for over a year now without any compression-related incident. We have been able to scale down several of our highest-volume clients, and plan to scale down some of our largest Kafka clusters thanks to the reduced disk and bandwidth needs.

Acknowledgements

A huge thanks goes out to our CV team for spearheading this effort, and particularly to my colleague Hereen Oh for her relentlessly data-driven approach to Kafka configuration.

--

Ralph Brendler
project44 TechBlog

Principal software engineer with a long history of startup work. Primary focus is currently on scalability and distributed computing.