Simple Concept of Apache Kafka

Imam Muhajir
Published in Analytics Vidhya
7 min read · Sep 28, 2021

INTRODUCTION

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. Event streaming is the practice of capturing data in real time from event sources like databases, sensors, mobile devices, cloud services, and software applications.

The characteristics of Kafka are:
• It is a distributed and partitioned messaging system.
• It is highly fault-tolerant.
• It is highly scalable.
• It can process and send millions of messages per second to several receivers.

Kafka can be used for various purposes in an organization, such as:
• Messaging service: Kafka can be used to send and receive millions of messages in real time.
• Real-time stream processing: Kafka can process a continuous stream of information in real time and pass it to stream processing systems such as Storm.
• Log aggregation: Kafka can be used to collect physical log files from multiple systems and store them in a central location such as HDFS.
• Commit log service: Kafka can be used as an external commit log for distributed systems.
• Event sourcing: Kafka can be used to maintain a time-ordered sequence of events.
• Website activity tracking: Kafka can be used to process real-time website activity such as page views, searches, or other actions users take.

DIFFERENCES BEFORE AND AFTER USING APACHE KAFKA

Why don’t we just use the usual method? Let’s first look at how our applications communicate without Apache Kafka.

BEFORE

For example, suppose we have several applications. If we don’t use publish-subscribe, then when the ERP app requires ORDER data, the Order app has to send data to the ERP, then the Member app sends data to the ERP, and every data transaction requires a separate request. This method is not efficient as the number of apps grows: with, say, 100 apps, the data transaction diagram becomes far more complicated, and if one app crashes, the apps that depend on it break as well. Therefore, another method is needed.

AFTER

After adopting Apache Kafka, data is sent and received through a message broker, so if an error occurs in one of the apps, it does not break the existing flow. The process uses publish and subscribe: publish/subscribe transfers data in real time and automatically through the message broker. With this method, we do not need to request the required data from a particular application; the process happens automatically. Apache Kafka is a message broker application; there are several other message brokers, but the most popular one is Apache Kafka.

PUBLISHER AND SUBSCRIBER

Apache Kafka is an application that can be used to process publish and subscribe.

Publish = send data
Subscribe = receive data

By using publish/subscribe, all publisher applications send data to a topic (a topic is analogous to a table in a database), and all subscribers listen for that data (the term for data retrieval is listening, so every incoming message is immediately received by all subscribers). This process is managed by the message broker application.

A Topic is a category of messages in Kafka. The data in Kafka will be stored in a topic.

Kafka: Topic — Database: Table

Data in Apache Kafka cannot be changed. A Kafka topic does not support updates; conceptually, Kafka stores an append-only log of events.
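To make the publish side concrete, here is a minimal Producer API sketch in Java. The broker address localhost:9092 and the topic name "orders" are assumptions for the example, not taken from the article:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the hypothetical "orders" topic; the record is
            // appended to the topic's log and cannot be changed afterwards.
            producer.send(new ProducerRecord<>("orders", "order-1", "{\"item\":\"book\",\"qty\":2}"));
        }
    }
}
```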

KAFKA ARCHITECTURE

  • The Streams API allows transforming the stream of data from input topics to output topics.
  • The Connect API allows ingesting data from a source system into Kafka or pushing data from Kafka into a sink system.
  • The Producer API allows applications to send streams of data to topics in the Kafka cluster.
  • The Consumer API allows applications to read streams of data from topics in the Kafka cluster.

Applications can act as both producers and consumers at the same time.
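Mirroring the producer sketch above, here is a minimal Consumer API sketch, again assuming a local broker and the illustrative "orders" topic; the group.id value is a hypothetical name:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "erp-app");                 // hypothetical consumer group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // poll() waits up to the given timeout for new records on the topic
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```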

Kafka cannot manage its own cluster, so it needs ZooKeeper to manage it. All cluster management for Kafka is done in ZooKeeper, but the data itself is stored in Apache Kafka.

PARTITION

Topics are divided into partitions, which are the unit of parallelism in Kafka. Partitions can be distributed across the Kafka cluster.
When creating a topic, we can decide how many partitions we want. Why use partitions? Because, within a consumer group, one partition can only be consumed by one app.

1 partition — 1 app

When using Apache Kafka, it is recommended to make the number of partitions at least as large as the number of consuming apps (a topic-creation sketch follows the list below).

• Each Kafka server may handle one or more partitions.
• A partition can be replicated across several servers for fault tolerance.
• One server is marked as a leader for the partition and the others are marked as followers.
• The leader controls the reads and writes for the partition, whereas the followers replicate the data.
• If a leader fails, one of the followers automatically becomes the leader.
• ZooKeeper is used for the leader selection.
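As a sketch of how these settings come together, the example below creates a topic with 3 partitions and a replication factor of 2 using Kafka's AdminClient; Kafka itself then decides which server leads each partition and where the follower replicas live. The topic name and broker address are the same illustrative ones as before:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 2 for fault tolerance;
            // the brokers assign the leader and followers for each partition themselves.
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}
```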

Within a partition, records are numbered sequentially (for example, 0 to 12); this sequence is called a log. The data keeps growing and builds up a history: for example, we can append a value of “1” at offset 0 and then another “1” at offset 1. Duplication has no effect on this method.

For example, with 2 consumers, consumer A is consuming the 9th record (offset 9) and consumer B is consuming the 11th record (offset 11). The offset functions as a marker of how much data has been delivered. For example, when consumer A consumes records 1, 2, 3 and then the app dies, after a restart the app does not need to consume data from the beginning again and can immediately continue consuming 4, 5, 6, etc. Partitions and offsets are set by Kafka automatically, so there is no need to worry about them.
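A sketch of this resume-from-offset behavior, building on the consumer above: with auto-commit disabled, the app explicitly commits the offsets it has processed, and after a restart it continues from the last committed offset instead of from the beginning (same illustrative names as before):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResumableSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "erp-app");                 // hypothetical consumer group
        props.put("enable.auto.commit", "false");         // we commit offsets manually below
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("processing offset %d: %s%n", record.offset(), record.value());
                }
                // Record our progress; if the app dies and restarts, consumption
                // resumes right after the last committed offset, not from offset 0.
                consumer.commitSync();
            }
        }
    }
}
```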

REPLICATION

For example, suppose there are 3 Kafka servers, where a partition is symbolized by P and its replica by R. Partition creation and replication are managed by Apache Kafka. When placing partitions and replicas, one server must never hold both a partition and the replica of that same partition.
So this is impossible:
S = P<A>, R<A>
A valid placement, as an example:
Server 1 = P1
Server 2 = P2, R1
Server 3 = R2

CONSUMER GROUP

In Apache Kafka there is what is called a consumer group. A consumer group is a collection of app instances that share the work of consuming a topic: within one consumer group, a partition can only be accessed by one member.

When there are 2 similar applications and each consumes from both partitions independently, there will be duplicate data in the apps. Therefore we use a consumer group to solve this problem.

Within one consumer group, each partition can be accessed by only one member, and no more. Therefore, when you create a topic, the number of partitions must be at least the number of apps in one consumer group.

Number of Partitions >= Number of Apps

A setup where the number of apps in a group is greater than the number of partitions is a wrong example: the extra apps are left with no partition to consume.
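In code, a consumer group is formed simply by giving every instance of an app the same group.id. The sketch below (same illustrative names as before) prints which partitions one instance was assigned; starting a second copy of the program triggers a rebalance that splits the partitions between the two instances, and any instance beyond the partition count sits idle:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupMember {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "erp-app");                 // every instance of this app shares this id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            consumer.poll(Duration.ofSeconds(5)); // the first poll triggers partition assignment
            // With the 3-partition topic from earlier, one running instance gets all
            // 3 partitions; two instances might split them 2 and 1; a 4th gets none.
            System.out.println("assigned partitions: " + consumer.assignment());
        }
    }
}
```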

RETENTION POLICY

We know that in Kafka the data cannot be changed, so automatic deletion is needed because the data keeps growing over time. There are three retention settings for wiping data in Apache Kafka (a configuration sketch follows the list):

  1. Log Retention Time
    Log retention time deletes data based on age; for example, when data is 7 days old it is deleted automatically (in Kafka the default log retention time is 7 days).
  2. Log Retention Bytes
    Log retention bytes deletes data based on size. For example, once a partition reaches 1 gigabyte, the oldest data is deleted.
  3. Offset Retention Time
    Offset retention time is how long consumer offset data is stored in Kafka. Offset data is what Kafka keeps so that an app can resume where it left off even after being inactive. If an app stays inactive past this period, its stored offsets are deleted (the default offset retention time is 7 days).
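These policies are ordinary configuration. As an illustrative sketch, the example below creates the hypothetical orders topic with a 7-day retention.ms and a 1 GB retention.bytes, which are real topic-level configs; offset retention is a broker-side setting (offsets.retention.minutes) and is not set per topic:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        Map<String, String> configs = new HashMap<>();
        configs.put("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)); // delete data older than 7 days
        configs.put("retention.bytes", String.valueOf(1024L * 1024 * 1024));   // or once a partition exceeds 1 GB

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic orders = new NewTopic("orders", 3, (short) 2).configs(configs);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}
```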

Thank you for reading this article. Hopefully it is useful and helpful in your work with data. Don’t forget to follow and give a lot of claps. See you next time!

Source:

https://kafka.apache.org/
https://www.youtube.com/watch?v=SArQUV0CE2I&list=PL-CtdCApEFH8dJMuQGojbjUdLEty8mqYF
https://lms.simplilearn.com/courses/2810/Big-Data-Hadoop-and-Spark-Developer/syllabus
