101 on Kafka (beginners)

Prabhu Rajendran · Published in Everything at Once · Mar 23, 2022
  1. What is Kafka & why do we need Kafka?

Before diving into Kafka, let's see how Kafka and streaming platforms came into the picture:

* Companies initially had a single "source system & target system" (a kind of monolithic architecture). Later they grew to multiple source systems (S1, S2, S3...) and multiple target systems (T1, T2, T3...):
1. S1 can be a user service
2. S2 can be an order service
3. T1 can be a notification service
4. T2 can be an order delivery service
5. T3 can be an order tracking service ...
* If S1 has to talk to T1 and T3, T1 has to use S2 and S3, and so on, this requires many point-to-point integrations and quickly becomes complicated.
* Each integration comes with a different protocol (HTTP, TCP, REST, JDBC, ...) and a different data format (JSON, binary, CSV, ...), which increases the load on the connected systems.
* [Problem & Solution] This is where Kafka comes in: we "bring decoupling to the system". How does it apply now?
* Source and target services become independent, and Kafka is placed in the middle between them.
* What happens now: source services are responsible for sending data to Kafka (producing), and target services are responsible for receiving data from Kafka (consuming).
* Kafka was created at LinkedIn; it is now open source and maintained by companies such as Confluent, IBM, and Cloudera.
* [Features] Distributed, fault tolerant, resilient architecture, horizontal scalability, high performance (latency under 10 ms). Used by LinkedIn (spam prevention, collecting user interactions), Airbnb, Uber (real-time pricing), Netflix (recommendations), Walmart, and more; we will see why and how below.
* Kafka is highly available, resilient to node failures, and supports automatic recovery, which is a key reason to choose it.
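The decoupling described above can be sketched with a tiny in-memory publish-subscribe broker. This is plain Python, not a real Kafka client; the `TinyBroker` class and the "orders" topic name are made up for illustration. The point is that the producer and the consumers never reference each other, only the topic.

```python
from collections import defaultdict

# Minimal in-memory sketch of the decoupling idea (not real Kafka):
# source services publish to a named topic, target services subscribe,
# and neither side knows about the other.
class TinyBroker:
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # deliver the message to every service subscribed to this topic
        for callback in self.subscribers[topic]:
            callback(message)

broker = TinyBroker()
received = []

# T1 (notification service) and T3 (order tracking) consume "orders"
broker.subscribe("orders", lambda m: received.append(("notify", m)))
broker.subscribe("orders", lambda m: received.append(("track", m)))

# S2 (order service) produces to "orders" without knowing any consumer
broker.publish("orders", {"order_id": 42})
```

Adding a new target service is just one more `subscribe` call; the order service's code does not change, which is exactly the decoupling benefit described above.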

History of Apache Kafka

Previously, LinkedIn faced the problem of low-latency ingestion of huge volumes of data from the website into a lambda architecture that could process real-time events. As a solution, Apache Kafka was developed in 2010, since no existing system could deal with this drawback.

Technologies for batch processing did exist, but their deployment details were exposed to downstream users, and when it came to real-time processing they were not suitable. Then, in 2011, Kafka was made public.

2. Use Cases of Kafka (Real Time)

- Activity tracking
- Messaging system (asynchronous data transfer from one system to another): point-to-point and publish-subscribe messaging
- Metrics gathering (logs)
- Decoupling systems
- Stream processing
- Microservices pub-sub (the S1, S2, T1...T3 example above)
- Integration with big data technologies (Spark, Flink, Hadoop...)

3. What Kafka's Core & Architecture Offer

    1. Kafka Broker -
a node in the Kafka cluster, used to persist and replicate the data (by default data is retained for a week, and this is configurable).
2. Kafka Producer API -
permits an application to publish (push) a stream of records to one or more topics (a sequence of messages is called a data stream).
3. Kafka Consumer API -
lets an application subscribe to one or more topics, pull messages from them, and process the events.
4. Kafka Streams API -
lets an application act as a stream processor: it consumes from one or more topics, transforms the input streams into output streams, and produces the output to one or more topics.

5. Kafka Connect API - allows building and running reusable producers and consumers (connectors) that connect Kafka topics to existing applications and data systems.
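The Streams idea above (consume, transform, produce) can be shown as a toy sketch where plain Python lists stand in for Kafka topics; a real application would use a Kafka client library and real topics instead.

```python
# Toy illustration of the Streams pattern: read each record from an
# input "topic", apply a transformation, and write the result to an
# output "topic". Lists stand in for real Kafka topics here.
input_topic = ["hello", "kafka", "streams"]
output_topic = []

def process(record):
    # the transformation step: uppercase each record
    return record.upper()

for record in input_topic:                 # consume
    output_topic.append(process(record))   # transform + produce
```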

4. Why Should we use Apache Kafka ?

As we all know, big data involves an enormous volume of data, and it brings two main challenges: one is collecting the large volume of data, the other is analysing the collected data. To overcome those challenges we need a messaging system, and this is where Apache Kafka has proved its utility. Its benefits include:

  • Tracking web activities by storing/sending the events for real-time processes.
  • Alerting and reporting the operational metrics.
  • Transforming data into the standard format.
  • Continuous processing of streaming data to the topics.

5. Kafka Components :-

1. Topic :- (once written, records can't be changed - immutable)
- a particular stream of data (the relational database analogue is a table)
- identified by name; you can have any number of topics
- holds data in any format (JSON, CSV, ...)
- topics can't be queried directly: producers send data in, and consumers read data out
- a topic is split into partitions, and can have many of them
- within each partition, every message gets an incremental, ordered id called an offset

2. Kafka Producer - publishes messages to topics
3. Kafka Consumer - subscribes to topics and reads messages from them
4. Kafka Broker - manages the storage of messages in topics (persists and replicates them); a group of brokers is called a cluster
5. Kafka Zookeeper - provides brokers with metadata about the processes running in the system, performs health checks, and helps with broker leader election

6. Alternatives to Kafka :-

1. Flume -
- a special-purpose tool for specific applications
- does not replicate events

2. RabbitMQ -
- offers comparatively fewer supporting features
- throughput is roughly 80-90K messages per second per connection (it varies with the client), whereas Kafka can reach about 1 million messages per second

These are the basic things to know about Kafka.

Refer here for more detail: https://www.conduktor.io/kafka/what-is-apache-kafka

Thanks for reading!

Resources :

  1. https://www.cloudkarafka.com/blog/part1-kafka-for-beginners-what-is-apache-kafka.html
  2. https://www.confluent.io/blog/kafka-fastest-messaging-system/
