Kafka — Let’s learn together

According to its official Apache page: “Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.”

A small glimpse, according to StackShare.io, of who is using Kafka

Messages are organized into “topics”. Producers push (publish) messages to a topic. Consumers pull messages: as a consumer, you subscribe to a topic to receive its messages. Kafka runs as a cluster, and each node in the cluster is referred to as a broker.

A topic can have multiple partitions, which are spread across multiple brokers. You can run consumers in parallel, each pulling from a different partition of the topic.
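The text above does not say how a message ends up in a particular partition. One common approach, assumed here, is to hash a message key so that all messages with the same key land in the same partition. The sketch below is a toy model in plain Python, not Kafka client code: `choose_partition` and `NUM_PARTITIONS` are hypothetical names, and CRC32 is only a stand-in for whatever hash a real partitioner uses.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (toy stand-in for a partitioner)."""
    # CRC32 is used only because it is stable across runs;
    # it is an assumption, not the hash Kafka itself uses.
    return zlib.crc32(key) % num_partitions


NUM_PARTITIONS = 3  # hypothetical partition count for one topic

# One ordered list of messages per partition
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [("alice", "login"), ("bob", "login"), ("alice", "click"), ("alice", "logout")]
for user, event in events:
    p = choose_partition(user.encode(), NUM_PARTITIONS)
    partitions[p].append((user, event))

for p, log in partitions.items():
    print(p, log)
```

Because every message for a given key goes to the same partition, the order of that key's messages is preserved within the partition, while different partitions can be read by different consumers in parallel.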

Each partition is essentially a log file that is written sequentially (append-only). You can configure how long the data is retained. Each broker hosts many partitions, which can be replicated to additional brokers.
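To make the idea of a sequential log with time-based retention concrete, here is a minimal in-memory sketch. The `PartitionLog` class and its method names are hypothetical, timestamps are passed in explicitly to keep the example deterministic, and expiring records one at a time is a simplification of how a real broker cleans up old data.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    retention_seconds: float
    base_offset: int = 0                          # offset of the oldest record still stored
    records: list = field(default_factory=list)   # (timestamp, value) pairs, oldest first

    def append(self, value, now):
        """Append to the end of the log and return the new record's offset."""
        offset = self.base_offset + len(self.records)
        self.records.append((now, value))
        return offset

    def enforce_retention(self, now):
        """Drop records older than the retention window, oldest first."""
        cutoff = now - self.retention_seconds
        while self.records and self.records[0][0] < cutoff:
            self.records.pop(0)
            self.base_offset += 1

    def read(self, offset):
        """Return the value at an offset, if it has not expired."""
        index = offset - self.base_offset
        if index < 0 or index >= len(self.records):
            raise IndexError(f"offset {offset} is not available")
        return self.records[index][1]


log = PartitionLog(retention_seconds=60)
log.append("a", now=0)    # offset 0
log.append("b", now=30)   # offset 1
log.append("c", now=90)   # offset 2
log.enforce_retention(now=100)  # anything written before t=40 expires
print(log.base_offset, log.read(2))  # prints: 2 c
```

Note that offsets keep increasing even after old records expire, so an offset always identifies the same message for as long as that message is retained.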

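Each partition has a leader, which is the replica that receives all writes. Consistency and availability can be tuned against each other, for example through the number of replicas and how many of them must hold a message before a write is acknowledged.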
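The trade-off between consistency and availability can be illustrated with a small simulation. This is a sketch under stated assumptions, not Kafka's replication protocol: the `Replica` and `Partition` classes and the `min_in_sync` parameter are hypothetical, and the leader pushes copies to followers here purely to keep the example short.

```python
class Replica:
    """One copy of a partition, hosted on a single broker."""

    def __init__(self, broker_id):
        self.broker_id = broker_id
        self.log = []
        self.online = True


class Partition:
    def __init__(self, replicas, min_in_sync):
        self.leader = replicas[0]        # all writes go through the leader
        self.followers = replicas[1:]
        self.min_in_sync = min_in_sync   # hypothetical knob: copies required per write

    def write(self, message):
        """Return True if the write was accepted, False if it was rejected."""
        if not self.leader.online:
            raise RuntimeError("leader is offline; a new leader would have to be chosen")
        in_sync = [self.leader] + [f for f in self.followers if f.online]
        if len(in_sync) < self.min_in_sync:
            # Too few copies available: refuse the write rather than risk losing it.
            return False
        for replica in in_sync:
            replica.log.append(message)
        return True


replicas = [Replica(1), Replica(2), Replica(3)]
partition = Partition(replicas, min_in_sync=2)

print(partition.write("m1"))  # True: three copies available
replicas[1].online = False
print(partition.write("m2"))  # True: two copies still available
replicas[2].online = False
print(partition.write("m3"))  # False: only the leader is left
```

Raising `min_in_sync` favours consistency (fewer ways to lose an acknowledged message), while lowering it favours availability (writes keep succeeding with fewer healthy brokers).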

When a consumer reads messages from Kafka, it uses the message offset to keep track of which messages it has already consumed. If it consumes the first 50 messages in a partition (offsets 0 through 49), then when new messages arrive it resumes at offset 50 and consumes only those unread messages.
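The offset bookkeeping described above can be sketched in a few lines. This is a toy model in which a Python list stands in for one partition; the `Consumer` class and its `poll` method are hypothetical and only mimic the idea of remembering the next offset to read.

```python
class Consumer:
    """Reads from a single partition and remembers where it left off."""

    def __init__(self, log):
        self.log = log          # a list standing in for one partition
        self.next_offset = 0    # offset of the next unread message

    def poll(self, max_records=100):
        """Return unread messages and advance the stored offset past them."""
        batch = self.log[self.next_offset:self.next_offset + max_records]
        self.next_offset += len(batch)
        return batch


log = [f"msg-{i}" for i in range(50)]   # offsets 0 through 49
consumer = Consumer(log)

first_batch = consumer.poll()
print(len(first_batch), consumer.next_offset)  # prints: 50 50

log.extend(["msg-50", "msg-51"])        # new messages arrive
print(consumer.poll())                  # prints: ['msg-50', 'msg-51']
```

Since offsets start at 0, the stored value after reading 50 messages is 50, which is exactly the offset of the first unread message.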

A consumer group is made up of multiple consumers. Each partition is assigned to exactly one consumer in the group, so the consumers divide the data of a topic among themselves and process it together as a group, which allows for horizontal scaling.
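One way to picture how a group divides the work is a simple round-robin assignment of partitions to consumers. The function below is a sketch with hypothetical names; the strategy a real cluster uses may differ, but the property it illustrates is the same: within a group, no partition is read by more than one consumer.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three consumers: two partitions each
print(assign_partitions(range(6), ["c1", "c2", "c3"]))
# prints: {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}

# More consumers than partitions: the extra consumer sits idle
print(assign_partitions(range(2), ["c1", "c2", "c3"]))
# prints: {'c1': [0], 'c2': [1], 'c3': []}
```

The second call shows why the partition count caps the useful size of a group: adding consumers beyond the number of partitions does not add throughput.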

In short, ZooKeeper makes sure all the moving pieces of a Kafka cluster work together. It helps synchronize the brokers, keeps track of broker state (through heartbeats) and replication, and manages the broker and topic registries.
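The heartbeat idea can be illustrated with a tiny liveness registry. This is not the ZooKeeper API, only a sketch of the concept: the `LivenessRegistry` class, its methods, and the `session_timeout` parameter are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
class LivenessRegistry:
    """Tracks which brokers have sent a heartbeat recently."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_seen = {}     # broker id -> time of the last heartbeat

    def heartbeat(self, broker_id, now):
        self.last_seen[broker_id] = now

    def live_brokers(self, now):
        """Brokers whose last heartbeat falls within the timeout window."""
        return sorted(
            broker_id
            for broker_id, seen in self.last_seen.items()
            if now - seen <= self.session_timeout
        )


registry = LivenessRegistry(session_timeout=5)
registry.heartbeat(1, now=0)
registry.heartbeat(2, now=0)
registry.heartbeat(1, now=10)           # broker 2 has gone quiet

print(registry.live_brokers(now=3))     # prints: [1, 2]
print(registry.live_brokers(now=12))    # prints: [1]
```

When a broker stops sending heartbeats, the rest of the cluster can react, for example by moving leadership of its partitions to another replica.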