Choosing Right Partition Count & Replication Factor (Apache Kafka)

Ram Gopal Varma Alluri
Published in The Startup · 4 min read · Sep 12, 2020

A Small Introduction to Kafka!

So before we learn about Kafka, let's look at how companies start. At first, it's super simple: you have one source system and one target system, and they need to exchange data. That looks quite simple, right? But after a while you have many source systems and many target systems, and they all have to exchange data with one another, and things become really complicated. The problems with this point-to-point architecture are:

  • Each integration comes with difficulties around

Protocol — how the data is transported (TCP, HTTP, REST, FTP)
Data format — how the data is parsed (Binary, CSV, JSON, Avro)
Data schema & evolution — how the data is shaped and may change

  • Additionally, each time you integrate a source system with a target system, the extra connections put increased load on both systems

So how do we solve this? Well, this is where Apache Kafka comes in. Apache Kafka allows you to decouple your data streams and your systems: your source systems publish their data into Kafka, while your target systems read their data straight from Kafka. This decoupling is what is so good about Apache Kafka, and what it enables is really nice. What can we put in Kafka? Any data stream: website events, pricing data, financial transactions, user interactions, and many more.

Kafka Terminology

The basic architecture of Kafka is organized around a few key terms: topics, partitions, producers, consumers, and brokers. All Kafka messages are organized into topics. If you wish to send a message you send it to a specific topic, and if you wish to read a message you read it from a specific topic. A consumer pulls messages off of a Kafka topic, while producers push messages into a Kafka topic. Lastly, Kafka, as a distributed system, runs in a cluster. Each node in the cluster is called a Kafka broker. Kafka topics are divided into a number of partitions. Partitions allow you to parallelize a topic by splitting the data for a particular topic across multiple brokers. Each partition can be placed on a separate machine to allow multiple consumers to read from a topic in parallel.
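To make the role of partitions concrete, here is a minimal sketch of how keyed messages are routed to partitions. Kafka's default partitioner hashes the message key with murmur2; the CRC32 used below is only a stand-in for illustration, and `pick_partition` is a hypothetical helper, not part of any Kafka client API.

```python
import zlib

def pick_partition(key: str, num_partitions: int) -> int:
    # Keyed messages map deterministically to one partition, so a key's
    # messages stay ordered relative to each other. Kafka's real default
    # partitioner uses murmur2 hashing; CRC32 stands in here.
    return zlib.crc32(key.encode()) % num_partitions
```

Because the mapping is deterministic, all messages with the same key land in the same partition, which preserves per-key ordering while still letting different keys be consumed in parallel.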

Partitions Count, Replication Factor

Partition count and replication factor are the two most important parameters when creating a topic. They impact the performance and durability of the system overall.

It is best to get the parameters right the first time!

  • If the partition count or replication factor changes during a topic's lifecycle, it may lead to an unexpected performance decrease or break data integrity (for example, changing the partition count breaks key-based ordering, because keys no longer map to the same partitions)

Basic guidelines for choosing a partition count

Partitions per topic is the million-dollar question, and there is no single answer. Some rules of thumb:

  • Small cluster (fewer than 6 brokers): create two times the number of brokers (2 × N for N < 6). The reasoning is that if your cluster grows over time, say to 12 brokers, you will already have enough partitions to cover it.
  • Big cluster (more than 12 brokers): it is not necessary to go two times the number of brokers, though you still can; one times the number of brokers is usually enough.
  • Medium cluster (between 6 and 12 brokers): use your judgment. Maybe you want two times the number of brokers, maybe one time; it's up to you.
  • Also take into account the number of consumers you need to run in parallel in a group at peak throughput. If your consumers are very CPU-intensive and you need 20 of them at peak time, you need at least 20 partitions in your topic, regardless of how big or small your cluster is.
  • On the producer side, account for throughput as well. If you have super high throughput, or it is going to increase heavily over the next couple of years, create three times the number of brokers regardless of cluster size, so you can plan and adjust in the future.
  • Lastly, don't create a topic with hundreds of partitions unless you have a very good reason.
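These rules of thumb can be summed up in a small sketch. `suggest_partitions` is a hypothetical helper that encodes the guidelines above, not an official formula; real sizing should still be validated with load tests.

```python
def suggest_partitions(brokers: int, peak_consumers: int = 1,
                       high_throughput: bool = False) -> int:
    # Heuristics from the guidelines above, not an official Kafka formula.
    if high_throughput:
        base = brokers * 3      # plan ahead for heavy producer growth
    elif brokers < 6:
        base = brokers * 2      # small cluster: 2x brokers
    else:
        base = brokers          # larger cluster: 1x is usually enough
    # Never fewer partitions than consumers you must run in parallel.
    return max(base, peak_consumers)
```

For example, a 3-broker cluster starts at 6 partitions, but if you need 20 CPU-intensive consumers at peak, the consumer requirement wins and you need at least 20.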

Basic guidelines for choosing a replication factor

The replication factor should be at least 2, usually 3 (recommended), and at most 4. The higher the replication factor, the better the resilience of your system. If replication causes an issue, e.g. performance degrades, get a better broker: never compromise on your replication factor; always get better machines if there is a performance problem. Never set it to one (1) in production: if a broker goes down, your partition is offline.
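As a trivial sketch of that guidance, a pre-flight check might look like this. `replication_factor_ok` is a hypothetical helper, not a Kafka API.

```python
def replication_factor_ok(rf: int) -> bool:
    # Guidance from above: at least 2, usually 3, at most 4.
    # Never 1 in production: losing one broker takes the partition offline.
    return 2 <= rf <= 4
```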

Cluster guidelines

  • It is pretty much accepted that a broker should not hold more than 2000 to 4000 partitions (across all topics of that broker)
  • Additionally, a Kafka cluster should have a maximum of 20,000 partitions across all brokers.
  • The reason is that when brokers go down, Zookeeper needs to perform a lot of leader elections.
  • If you need more than 20,000 partitions in your cluster, follow the Netflix model and create more Kafka clusters
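The cluster limits above can be checked with a quick sketch. `check_cluster_limits` is a hypothetical helper; it assumes partition replicas spread evenly across brokers, which real clusters only approximate.

```python
def check_cluster_limits(brokers: int, total_partitions: int,
                         replication_factor: int) -> list:
    # Replicas per broker, assuming an even spread across the cluster.
    per_broker = total_partitions * replication_factor / brokers
    warnings = []
    if per_broker > 4000:
        warnings.append("each broker holds more than 4000 partition replicas")
    if total_partitions > 20000:
        warnings.append("over 20000 partitions; consider more Kafka clusters")
    return warnings
```

For instance, 5,000 partitions at replication factor 3 are fine on 10 brokers (1,500 replicas each) but overload a 3-broker cluster (5,000 replicas each).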

Conclusion

Overall, you don’t need a topic with 1000 partitions to achieve high throughput. Start at a reasonable number and test the performance. You don’t just guess and get it right: you guess, and then you test.
