With today’s technological advances, there has been huge growth in the digital data that enterprises create, and with it the need to analyze that data and draw insights from it. Within an enterprise, data is generated by multiple source systems and consumed by multiple systems across the organization.
This raises the challenge of maintaining such an infrastructure while ensuring the resiliency of the system and the reliability and consistency of the data. The solution that emerged to address this situation and simplify the architecture is known as Kafka.
What is Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation. It aims to provide a platform for handling real-time streaming data. It is based on a publisher/subscriber message-queue architecture, which makes it a key building block in company infrastructures for publishing, accessing, and processing data.
Apache Kafka Architecture
The following Figure describes the Kafka Architecture.
Communication between the Kafka cluster and its different components is based on the TCP protocol.
A stream of records is created by a Producer and written to one of the existing Topics in the Kafka cluster or to a newly created Topic. The stream of records in the Topic is then consumed by a Consumer or a Stream Processor.
The Stream Processor can not only consume the data in a specific Topic but also transform or enrich the stream of records into new records and write them back to the cluster, into a new Topic or the same one.
The changes happening to the records may be propagated to relational databases using Connectors.
Based on the documentation and the explanation above, we can see that the Apache Kafka architecture exposes five core APIs:
The Producer API allows an application to publish a stream of records (data) to one or more Kafka topics.
The Consumer API allows an application to subscribe to one or more topics and process data produced to them.
The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table, letting us integrate relational data into the Kafka cluster and automatically monitor and pull those changes into it.
The Admin API allows managing and inspecting topics, brokers, and other Kafka objects.
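To make the producer/consumer/stream-processor flow above concrete, here is a minimal in-memory sketch. All names (`MiniKafka`, `publish`, `poll`) are invented for illustration; this is not the real Kafka client API, just a toy model of the roles the APIs play.

```python
# Toy in-memory "cluster" sketching the producer/consumer/streams roles.
# Invented names for illustration only -- not the real Kafka client API.
from collections import defaultdict

class MiniKafka:
    def __init__(self):
        self.topics = defaultdict(list)   # topic name -> ordered list of records

    def publish(self, topic, record):     # what the Producer API does
        self.topics[topic].append(record)

    def poll(self, topic, offset):        # what the Consumer API does
        return self.topics[topic][offset:]

cluster = MiniKafka()
cluster.publish("clicks", {"user": "alice", "page": "/home"})
cluster.publish("clicks", {"user": "bob", "page": "/cart"})

# A "stream processor" consumes one topic, transforms the records,
# and writes the result back to the cluster (the Streams API role).
for record in cluster.poll("clicks", 0):
    cluster.publish("pages", record["page"])

print(cluster.poll("pages", 0))   # ['/home', '/cart']
```

In real Kafka the same shape applies, except the topics live on brokers and the clients talk to them over TCP.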
Any application can play any of the roles mentioned above, as the Kafka cluster is highly flexible.
Kafka cluster diagram
As described above, a Kafka cluster is just a set of Kafka brokers coupled with ZooKeeper and, optionally, a Schema Registry.
A topic is a logical channel to which producers publish messages and from which consumers receive messages. Topics in Kafka are always multi-subscriber, which means that a topic can have zero, one, or many consumers that subscribe to the data written to it.
Messages in Kafka are simple byte arrays (which can encode JSON, Avro, and so on). When published, a message can carry a key; the producer routes all messages with the same key to the same partition.
- A topic is identified by its name.
- Data in topics is kept for a specific retention period (a TTL, Time to Live) that can be configured according to our needs.
- A topic is divided into partitions, each containing a subset of the topic’s messages. The number of partitions can be specified when the topic is created.
- Each partition is ordered.
- Order is guaranteed only within a partition. (not across partitions)
- Each record within a partition is assigned an incremental id called the offset. Offsets form an immutable, increasing sequence starting from 0 in each partition, assigned to messages in the order they arrive, and they uniquely identify each record within the partition.
- Once data is written to a partition, it cannot be changed, i.e. it is immutable.
- We have multiple partitions primarily to increase throughput: the topic can be accessed in parallel. (With more than one partition, the data is split across the partitions and spread across multiple brokers.)
- Without a key, there is no guarantee about which partition a published message will be written to; the producer spreads keyless messages across partitions (round robin).
- Adding a key to a message ensures that all messages with the same key end up in the same partition.
- A topic can have a large number of partitions; Kafka imposes no hard limit, although each partition adds some overhead on the brokers.
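The key-to-partition routing and per-partition offsets described above can be sketched as follows. This is a simplification: the numbers are illustrative, and a CRC hash stands in for Kafka’s real murmur2 partitioner.

```python
# Sketch of producer partition selection and offset assignment.
# Simplified: a CRC32 hash stands in for Kafka's murmur2 partitioner.
import itertools
import zlib

NUM_PARTITIONS = 3
round_robin = itertools.cycle(range(NUM_PARTITIONS))

def choose_partition(key):
    if key is None:
        return next(round_robin)              # keyless: spread the load round robin
    return zlib.crc32(key.encode()) % NUM_PARTITIONS  # same key -> same partition

# The same key maps to the same partition every time.
assert choose_partition("user-42") == choose_partition("user-42")

# Each partition assigns its own incremental offsets, starting at 0.
partitions = [[] for _ in range(NUM_PARTITIONS)]
for key, value in [("a", 1), ("b", 2), ("a", 3)]:
    p = choose_partition(key)
    offset = len(partitions[p])               # next offset in that partition
    partitions[p].append((offset, key, value))
```

Note that ordering is only meaningful within one partition’s offset sequence, which is exactly the ordering guarantee described above.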
Kafka cluster is composed of multiple brokers. A Broker is just a server.
- Every broker is identified by an ID (broker.id).
- Once you connect to any one broker, you are connected to the whole Kafka cluster.
- Brokers hold the topics and their partitions.
- A producer is an application that writes data to topics.
- Producers know which broker and partition to write to. If you send messages without specifying a key, the messages are sent to the brokers in round robin, which ensures an equal load on every broker, i.e. load balancing.
- But if you send messages with a key, all the messages for that key will be stored in the same partition.
When data is sent to the brokers, the producer can be configured to wait for confirmation of messages via the acks setting:
acks=0: No acknowledgment; possible data loss.
acks=1: Only the leader sends an acknowledgment; limited data loss.
acks=all: The leader and all ISRs send an acknowledgment; no data loss, but slower processing.
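As a sketch, the trade-off above is controlled by a single setting in the producer configuration (shown here as a hypothetical `producer.properties` fragment; `acks` and `bootstrap.servers` are real Kafka producer config keys):

```properties
# Hypothetical producer.properties fragment
bootstrap.servers=localhost:9092

# Safest but slowest: wait for the leader and all in-sync replicas
acks=all
# acks=1   -> leader only: acknowledged once the leader has written it
# acks=0   -> fire-and-forget: no acknowledgment, possible data loss
```

The choice is a latency/durability trade-off: `acks=0` minimizes latency, `acks=all` maximizes durability.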
- A consumer is an application that reads data from topics. Data is read in order within each partition.
- Consumer groups have multiple consumers.
- Within a consumer group, each consumer reads from a set of partitions. For any given partition, only one consumer in the group will read from it.
- If you have more consumers than partitions within a consumer group, some consumers will be inactive.
- If you want to have a high number of consumers, you need a high number of partitions, so deciding on the number of partitions is important.
- Kafka stores the offsets every consumer group has read up to.
- As a consumer group reads data, the offsets are committed live to a Kafka topic named __consumer_offsets.
- When a consumer in a group has processed data received from Kafka, it should commit the offset to the above topic (this is done automatically by default).
- Consumer offsets are important because if a consumer goes down, it will know from which offset to resume reading once it is back up.
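The partition-sharing rule above can be sketched as a simple round-robin assignment (real Kafka offers several pluggable assignment strategies; `assign` here is an invented helper):

```python
# Sketch of partition assignment inside a consumer group, assuming a
# simple round-robin strategy. "assign" is an invented illustration.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: every partition has exactly one reader.
print(assign([0, 1, 2], ["c1", "c2"]))              # {'c1': [0, 2], 'c2': [1]}

# 4 consumers for 3 partitions: one consumer stays idle.
print(assign([0, 1, 2], ["c1", "c2", "c3", "c4"]))  # 'c4' gets no partition
```

This is why the number of partitions caps the useful size of a consumer group.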
Offsets can be committed in 3 ways, giving different delivery semantics:
At most once: offsets are committed as soon as the message is received. If processing goes wrong, the message is lost. (Not the preferred way.)
At least once: offsets are committed after the message is processed. If processing goes wrong, the message will be read again. (The preferred way.)
Exactly once: achievable only for Kafka-to-Kafka workflows.
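The difference between the first two semantics comes down to commit order, which this toy simulation illustrates (the `run` helper and its crash-on-first-attempt rule are invented for illustration):

```python
# Toy simulation of at-most-once vs at-least-once delivery. Processing
# the "bad" message crashes on the first attempt; the commit order
# decides whether that message is lost or retried.
def run(messages, commit_first):
    committed = 0
    processed = []
    failed_once = set()
    while committed < len(messages):
        offset = committed
        msg = messages[offset]
        if commit_first:
            committed += 1            # at-most-once: commit before processing
        if msg == "bad" and offset not in failed_once:
            failed_once.add(offset)   # simulated crash during processing
            continue
        processed.append(msg)
        if not commit_first:
            committed += 1            # at-least-once: commit after processing
    return processed

print(run(["a", "bad", "c"], commit_first=True))   # ['a', 'c'] -- message lost
print(run(["a", "bad", "c"], commit_first=False))  # ['a', 'bad', 'c'] -- retried
```

With at-least-once, a retry may also mean a message is processed twice, which is why consumers are usually expected to be idempotent.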
Kafka Replication Factor
Kafka is a highly scalable, fault-tolerant distributed system.
- Highly scalable: we can scale horizontally by increasing the number of partitions in a topic and distributing the workload.
- Fault tolerant: there is no single point of failure in Kafka; multiple copies of each message are stored across different brokers in the cluster.
A topic should have a replication factor > 1; a replication factor of 3 is recommended. A replication factor of 3 means there are 3 copies of the data on 3 different servers (brokers).
- Each partition has a leader and multiple ISRs (in-sync replicas), the redundant copies apart from the one on the leader.
- Only the leader can receive and serve data for its partition.
- If the leader goes down, one of the ISRs becomes the leader. When the original leader comes back, it rejoins as an ISR (and may later be restored as the leader).
Suppose we have 3 brokers and a topic with 3 partitions and a replication factor of 3; every partition would then have a copy on all 3 brokers, but only one broker can be the leader for a given partition.
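The failover behavior for a single partition can be sketched like this, assuming replicas on brokers 1–3 with broker 1 as the initial leader (the broker numbers and the `fail` helper are invented for illustration):

```python
# Sketch of leader failover for one partition with replication factor 3.
# Broker ids and the "fail" helper are illustrative inventions.
replicas = [1, 2, 3]                         # copies on three brokers
leader = replicas[0]
isr = [b for b in replicas if b != leader]   # in-sync followers

def fail(broker):
    global leader, isr
    if broker == leader:
        leader = isr.pop(0)                  # an ISR takes over as leader
    else:
        isr.remove(broker)                   # a follower drops out of the ISR

fail(1)                                      # the leader's broker goes down
print(leader, isr)                           # 2 [3] -- broker 2 is now the leader
```

As long as at least one in-sync replica survives, the partition keeps serving data, which is the fault tolerance described above.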
Kafka uses ZooKeeper to perform the following tasks:
Electing a controller - The controller is one of the brokers and is responsible for maintaining the leader/follower relationship for all the partitions. When a node shuts down, it is the controller that tells other replicas to become partition leaders to replace the partition leaders on the node that is going away. ZooKeeper is used to elect a controller, make sure there is only one, and elect a new one if it crashes.
Cluster membership - Which brokers are alive and part of the cluster.
Topic configuration - Which topics exist, how many partitions each has, where the replicas are, who the preferred leader is, and what configuration overrides are set for each topic.
Managing quotas - How much data each client is allowed to read and write.
Access control - Who is allowed to read and write to which topic (old high-level consumer): which consumer groups exist, who their members are, and the latest offset each group got from each partition.
Throughout this article I’ve been discussing the basic principles of how Kafka works and why it needs ZooKeeper. You can find the Kafka documentation here.