Kafka @ Trendyol Tech. — Under the hood
At Trendyol we live by data: it is data that drives us, it is data that guides us, it is data that defines us.
In this series we would like to share our design choices around Apache Kafka, starting with the underlying technologies that are often overlooked but render the whole architecture a more robust system.
The intended audience is operators who want to optimize every bit transferred.
Notice: This is not a how-to tutorial; it should be interpreted as a guide, and we assume readers have basic Linux kernel knowledge.
Recipe for a busy “data in motion” platform
Kafka clusters at Trendyol are super busy, ingesting, replicating, serving, and streaming billions of events between microservices across data centers and the cloud on a daily basis.
Reducing Kafka to a “storage system” may not seem fair, although at its core it is an almost perfect distributed storage system: parallelism is achieved via partitions, the page cache is utilized for rapid access to fresh data, and with native replication the data is durable and highly available.
It is best to treat Kafka as a storage system rather than a traditional message queue when it comes to architectural decision-making.
Let’s take a look at our recipe for creating each Kafka cluster. We start by optimizing the OS kernel and carefully selecting configuration parameters for the task.
Main ingredients are:
- A modern Linux kernel
- Kafka
- Sophisticated configuration
Kernel
We may qualify Linux kernel version 4 and above as a modern kernel. Version 4.9 was released on 2016-12-11 and is projected to reach end-of-life around January 2023, so you would do well not to choose any kernel below that version.
Default values of kernel parameters are not tuned for every hardware/software combination or for a specific use case; they are set for off-the-shelf hardware and general-purpose usage. For systems like Kafka that generate heavy network traffic, these parameters can easily be fine-tuned, and one can benefit considerably from tuning.
Remember! There is no one-size-fits-all. Below you will find an example configuration, but you should do your own research; these values are not guaranteed to cover your specific needs.
net.core
Increasing the default read/write buffers for the Linux networking subsystem allows Kafka to perform more efficiently. Default values for these settings vary based on the amount of memory in a system, and since Kafka transfers a huge amount of data both between brokers and to clients, these values need to be increased.
net.core is documented here. You may find it interesting to read the ArchWiki details about changing kernel parameters at runtime.
# Receive Buffer
net.core.rmem_max=4194304
net.core.rmem_default=4194304

# Send Buffer
net.core.wmem_max=4194304
net.core.wmem_default=4194304

net.core.optmem_max=4194304
net.core.netdev_max_backlog=250000
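If you want to experiment with these values, one minimal way is to persist them under /etc/sysctl.d/ and reload; the drop-in file name below is our own choice, not a requirement.
# Hypothetical drop-in file; any name under /etc/sysctl.d/ works
sudo tee /etc/sysctl.d/99-kafka-net.conf > /dev/null <<'EOF'
net.core.rmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_max=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304
net.core.netdev_max_backlog=250000
EOF
sudo sysctl --system          # reload every sysctl configuration file
sysctl net.core.rmem_max      # verify the new value is active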
Notice that Linux has several different scheduler implementations for each specific hardware resource (I/O, network, CPU); new ones are being introduced and existing ones are improved in a timely fashion.
We replace the default packet scheduler (pfifo_fast) with fq, to be used in conjunction with the bbr congestion control algorithm.
# kernel > 3.12
net.core.default_qdisc=fq
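A quick sanity check, assuming an interface named eth0 (substitute your own), is to inspect which qdisc is actually attached:
tc qdisc show dev eth0                       # shows the queueing discipline in use
sudo sysctl -w net.core.default_qdisc=fq     # applies to qdiscs attached after this point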
net.ipv4
Kafka uses a binary protocol over TCP, at layer 4 of the OSI model, so the kernel parameters for IP can be tuned as well. net.ipv4 is documented here.
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_sack=1
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 65536 4194304"
net.ipv4.tcp_adv_win_scale=1
net.ipv4.tcp_max_orphans=60000
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_max_syn_backlog=10240
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_rfc1337=1
net.ipv4.tcp_fin_timeout=15
net.ipv4.tcp_max_tw_buckets=1440000
net.ipv4.tcp_tw_reuse=1

# kernel > 4.9
net.ipv4.tcp_congestion_control=bbr
tcp_timestamps > Disabling TCP timestamps is a trade-off: packets lose the timestamp information that helps validate the sequence numbers of out-of-order packets, but on the other hand fewer bytes are transferred with every network packet.
tcp_sack > Selective Acknowledgment introduces additional packets that let endpoints (network-connected devices) describe which packets they have received so far. This way only the missing packets are retransmitted when loss occurs.
tcp_<r|w>mem > Receive/transmit buffer sizes (ordered as minimum, default, and maximum).
tcp_adv_win_scale > Okay, this one is a little tricky; quoting from the tcp man page:
Count buffering overhead as bytes/2^tcp_adv_win_scale, if tcp_adv_win_scale is greater than 0; or bytes-bytes/2^(-tcp_adv_win_scale), if tcp_adv_win_scale is less than or equal to zero. The default value of 2 implies that the space used for the application buffer is one fourth that of the total.
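To put that in numbers with our own settings: with a tcp_rmem maximum of 4194304 bytes and tcp_adv_win_scale=1, the buffering overhead is 4194304 / 2^1 = 2097152 bytes, so half of the buffer is advertised as the TCP window and the other half is left for application buffering.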
tcp_max_orphans > Orphaned connections (not attached to any user file handle) are reset immediately once this limit is reached. We generously give it more room than needed, roughly ~4 GB, since each orphan can eat up to ~64 kB of unswappable memory. Our infrastructure is continuously scalability- and load-tested, and the Kafka clusters are no different; such tests can look like denial-of-service attacks, and increasing the limit means the cluster can survive them.
tcp_no_metrics_save > Some metrics, like routing information, are cached when connections close, in the hope that future connections can be sped up with this cache. With local connections we think the benefits are lost and keeping this cache has no use.
tcp_window_scaling > This is enabled by default, and modern OSes are configured to use this scaling factor. Details are defined in RFC 7323.
tcp_max_syn_backlog > The number of remembered connections that have not yet received an acknowledgment from the connecting client. There are multiple reasons why a connecting client might not send an ACK; it is advised to set a higher value on servers handling many connections.
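One way to see whether this backlog is overflowing in practice (the exact wording of the counter may vary between kernels) is:
netstat -s | grep -i 'SYNs to LISTEN'   # SYNs dropped due to a full listen backlog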
tcp_synack_retries > Quoting from frozentux.net
This setting tells the kernel how many times to retransmit the SYN,ACK reply to an SYN request. In other words, this tells the system how many times to try to establish a passive TCP connection that was started by another host.
tcp_rfc1337 > This setting protects against the TIME-WAIT assassination hazards and related TCP sequence problems defined in RFC 1337.
tcp_fin_timeout > Seconds to wait for a FIN packet before forcibly closing the connection.
tcp_max_tw_buckets > The maximum number of sockets in TIME_WAIT state allowed in the system; used for preventing simple denial-of-service attacks, which some of our tests resemble.
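To see how close the system is to that ceiling, you can count the sockets currently sitting in TIME_WAIT:
ss -tan state time-wait | tail -n +2 | wc -l   # sockets in TIME_WAIT right now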
tcp_tw_reuse > Enabling this setting allows the system to reuse sockets that were in TIME_WAIT state for new connections.
tcp_congestion_control >
We select BBR as the TCP congestion control algorithm since it is well-known tech within our team, already deployed and battle-tested at Trendyol’s CDN infrastructure, and we find it more convenient to have it within our tech stack. There are other alternatives available for TCP congestion control, and BBR already has a successor, BBRv2. We stick with the initial version simply because the new version solves issues that do not affect our current setup.
BBR requires kernel version 4.9+.
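Before enabling it, you can verify that your kernel actually offers bbr; tcp_bbr ships as a loadable module on most distributions.
sysctl net.ipv4.tcp_available_congestion_control   # list algorithms the kernel offers
sudo modprobe tcp_bbr                              # load the module if bbr is missing
sudo sysctl -w net.core.default_qdisc=fq           # the qdisc bbr pairs with
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr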
vm
The virtual memory subsystem of the Linux kernel “must” be tuned for any production-grade Kafka cluster.
vm.swappiness=1
vm.dirty_background_ratio=5
vm.dirty_ratio=80
vm.max_map_count=262144
swappiness > This setting tells the system how aggressively to use swap memory when needed. We tell it to swap as little as it can: swap usage decreases performance, but leaving minimal swapping possible keeps Kafka from being OOM-killed.
dirty_background_ratio and dirty_ratio > Kafka heavily relies on disk I/O performance, and these parameters control how often dirty pages are flushed to disk. vm.dirty_background_ratio (5%) is the point at which background writeback kicks in, while vm.dirty_ratio (80%) is the hard limit at which writing processes are blocked and forced to flush synchronously; a higher vm.dirty_ratio therefore results in less frequent forced flushes.
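You can watch the amount of dirty data waiting for writeback while the cluster is under load:
grep -E '^(Dirty|Writeback):' /proc/meminfo   # dirty pages and in-flight writeback, in kB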
max_map_count > The maximum number of memory map areas a process may have. Kafka memory-maps an index file for every log segment, so brokers hosting many partitions need plenty of them.
filesystem
We chose XFS over the other filesystems out there. Just set the noatime mount option when mounting the data volume, and do not share the volume with other processes; only Kafka should use it.
noatime disables storing the last “access time” of each file; this metadata is not used by Kafka, so the OS does not have to bother with bookkeeping for accessed data.
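A minimal /etc/fstab sketch; the device path and mount point below are placeholders for your own layout:
# /etc/fstab entry (hypothetical device and mount point)
/dev/sdb1  /var/lib/kafka  xfs  defaults,noatime  0 0
To apply without a reboot on an already-mounted volume:
sudo mount -o remount,noatime /var/lib/kafka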
Kafka also requires a large number of open files. By default, roughly 10% of your system’s memory determines the maximum limit for the number of open files.
fs.file-max=1000000
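You can compare current usage against the limit; note that the broker’s own per-process limit applies independently of fs.file-max.
sysctl fs.file-nr   # allocated, unused, and maximum file handles
ulimit -n           # per-process open-file limit for the current shell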
Conclusion
In this very first article we have explored possible kernel tuning parameters; remember, there is always room for improvement.
In the next article we will continue with Kafka configuration parameters. Until then we would like to hear from you: if you have other useful optimizations, feel free to comment or give feedback.