Things nobody will tell you about setting up a Kafka cluster

  • Set vm.swappiness to the lowest value your server type allows, on both the Kafka brokers and the ZooKeeper servers.
    Swappiness is the Linux kernel's tendency to move RAM contents to swap. For a latency-critical application such as Kafka, it is very important that unnecessary swapping does not affect performance. You can go here to learn how to set it on Ubuntu.
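    A minimal sketch of checking the setting, assuming a Linux host with sysctl available. The target value of 1, the helper name swappiness_ok and the file name 99-kafka.conf are illustrative assumptions, not values from this post; pick the lowest value that suits your servers.

```shell
#!/bin/sh
# Succeeds when a swappiness value is at or below the target.
# $1 = value to check, $2 = target (assumed default: 1).
swappiness_ok() {
  [ "$1" -le "${2:-1}" ]
}

# Read the live value; /proc/sys/vm/swappiness only exists on Linux.
if [ -r /proc/sys/vm/swappiness ]; then
  current="$(cat /proc/sys/vm/swappiness)"
  if swappiness_ok "$current"; then
    echo "vm.swappiness=${current} (ok)"
  else
    echo "vm.swappiness=${current} (higher than the target)"
  fi
fi

# To lower it (needs root; run on every broker and ZooKeeper host):
#   sudo sysctl -w vm.swappiness=1      # takes effect now, lost on reboot
#   echo 'vm.swappiness=1' | sudo tee /etc/sysctl.d/99-kafka.conf   # persists
#   sudo sysctl --system                # reload all sysctl config files
```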
  • Increase the file descriptor limit to a very high value, as much as 100,000.
    Kafka keeps a very large number of files open when communicating with Kafka clients, and the default per-process limit is very low in most Linux distributions. Depending on how you start your Kafka broker process, you may need a different way to raise this limit. We suffered a Kafka broker failure because the limit we had set was not applied to our Kafka process. Check this thread to know why. It's best to make sure that the limit is actually applied to your process using the steps below:
    # Find the broker's process ID
    ps ax | grep <name of the Kafka process>
    # Check the limit actually applied to that process
    grep "Max open files" /proc/<process_id>/limits
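    The two steps above can be combined into one small script. This is a sketch that assumes the broker was started with the stock Kafka scripts, so that its command line contains the main class name; adjust the pattern if your process looks different. The helper name soft_nofile_limit is made up for this example.

```shell
#!/bin/sh
# Print the soft "Max open files" limit from a /proc/<pid>/limits file.
# The columns are: name (three words), soft limit, hard limit, units.
soft_nofile_limit() {
  awk '/^Max open files/ { print $4 }' "$1"
}

# Assumption: the broker's command line contains its main class name.
pid="$(pgrep -f 'kafka\.Kafka' 2>/dev/null | head -n 1)" || true

if [ -n "$pid" ] && [ -r "/proc/${pid}/limits" ]; then
  echo "broker pid ${pid}: soft open-file limit $(soft_nofile_limit "/proc/${pid}/limits")"
else
  echo "no running Kafka broker found"
fi
```

    If the printed value is still the distribution default rather than the value you configured, the limit was set somewhere your start-up method does not read.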
  • Give at least 32 GB of RAM to your Kafka brokers.
    Kafka brokers do not need much heap (5 GB at most), because Kafka uses heap memory very efficiently. Instead, Kafka relies heavily on the Linux kernel's page cache to keep data in RAM rather than flushing messages to disk immediately. So most of your RAM will typically be used by Kafka indirectly, through the kernel's page cache. To see how much RAM is currently used as page cache, check the buff/cache column in the output of the free -h command in your terminal.
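    The same figure can be read straight from /proc/meminfo, which is handy for monitoring scripts. A rough sketch, assuming a Linux host: recent versions of free report buff/cache as the sum of the Buffers, Cached and SReclaimable fields, and the function below (a name chosen for this example) adds up those three.

```shell
#!/bin/sh
# Sum Buffers, Cached and SReclaimable (all reported in kB) and print MiB.
# $1 = file in /proc/meminfo format (defaults to /proc/meminfo itself).
page_cache_mib() {
  awk '/^(Buffers|Cached|SReclaimable):/ { kb += $2 }
       END { printf "%d\n", kb / 1024 }' "${1:-/proc/meminfo}"
}

# Only meaningful on Linux, where /proc/meminfo exists.
if [ -r /proc/meminfo ]; then
  echo "page cache: $(page_cache_mib) MiB"
fi
```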
  • Use multiple data log directories mounted on multiple disks.
    Kafka lets you spread data across multiple directories through the log.dirs broker property. Each new partition is placed in the log directory that currently holds the fewest partitions, which spreads partitions roughly round-robin across the directories. Having multiple log directories increases parallelism and provides better throughput. Also, increase the value of the broker property num.recovery.threads.per.data.dir to at least the number of data directories for faster broker shutdown and startup.
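    As a sketch of what this looks like in server.properties, the function below prints the two relevant lines for a given set of mount points. log.dirs is the broker property that takes the comma-separated directory list; the /data/kafka-N paths and the function name are placeholders for this example, and the thread count simply follows the rule of thumb above.

```shell
#!/bin/sh
# Print server.properties lines for the given data directories.
# Runs in a subshell (note the parentheses) so the IFS change stays local.
kafka_log_dir_props() (
  IFS=,
  # "$*" joins all arguments with the first character of IFS, here a comma.
  printf 'log.dirs=%s\n' "$*"
  # Set the thread count to the number of directories, as suggested above.
  printf 'num.recovery.threads.per.data.dir=%s\n' "$#"
)

# One directory per physical disk (hypothetical mount points).
kafka_log_dir_props /data/kafka-1 /data/kafka-2 /data/kafka-3
# prints:
#   log.dirs=/data/kafka-1,/data/kafka-2,/data/kafka-3
#   num.recovery.threads.per.data.dir=3
```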
  • Do not share Kafka data drives with any other application or OS filesystem.
    A typical Kafka cluster on AWS would have brokers with multiple Throughput Optimized HDD (st1) volumes attached. These disks are optimized for large IO requests (at least 1 MB) and generally perform very poorly with a large number of small IO requests. Putting application logs or any part of the OS filesystem on them will hurt their performance. It's better to use an SSD for those.
  • Garbage Collection taking too much time may indicate other problems with the cluster. Keep it in check.
    I personally struggled a lot with GC taking too much time on our Kafka brokers, sometimes as much as 20–25%. This badly hurt the cluster's performance: producers were failing to flush messages and consumers were lagging drastically, even when the load was relatively low. Since the lag grew as GC time grew, I wrongly assumed GC was the cause and started experimenting with GC configuration without finding the root cause. I thought the default GC settings might not be tuned for a production environment and kept trying different configs without much benefit. The complete story of this struggle deserves its own post, but in short, the real problem was that our Kafka broker was spending too much time in disk IO, and the maximum throughput we were getting was just 400 KB/s on an st1 disk capable of 40 MB/s. Many IO requests were getting queued and stalling the application. All because I didn't understand how an st1 drive works.
  • Choose the number of topic partitions carefully.
    There are tradeoffs in choosing the number of partitions for your Kafka topics. Apart from the points discussed in the article by Confluent, keep in mind that topics that may be joined in a Kafka Streams application must be co-partitioned, which means they must have the same number of partitions for the join to work correctly. More on this here.
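    A quick way to catch a partition-count mismatch before deploying a Streams application is to compare the counts of the topics you plan to join. The sketch below assumes that kafka-topics.sh --describe prints a PartitionCount field in its summary line; the exact spacing may differ between Kafka versions, so the pattern accepts both forms. The function names, topic names and broker address are placeholders.

```shell
#!/bin/sh
# Extract the partition count from `kafka-topics.sh --describe` output on stdin.
# Accepts both "PartitionCount:3" and "PartitionCount: 3".
partition_count() {
  sed -n 's/.*PartitionCount:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeeds when both describe outputs report the same, non-empty count.
# $1 and $2 = files holding the describe output of the two topics.
same_partition_count() {
  a="$(partition_count < "$1")"
  b="$(partition_count < "$2")"
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (hypothetical topics and broker address):
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders    > orders.txt
#   kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic customers > customers.txt
#   same_partition_count orders.txt customers.txt || echo "partition counts differ"
```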
