Essential parts for learning Apache Kafka quickly

Zuleny Estaba · Published in Analytics Vidhya · Apr 21, 2020

In this article, I will briefly and simply describe the main concepts you need in order to understand what Kafka is and how it works.

Kafka is written in Java and Scala. Its development began at LinkedIn in 2009, and the project was donated to the Apache Foundation in 2011; in October 2012 it graduated from the Apache Incubator.

Kafka’s goal is to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Below, I describe the parts that are essential to Kafka’s operation:
ZooKeeper

ZooKeeper is a service that provides high-performance coordination for distributed applications. Its role in Kafka is to let the brokers interact with one another: it takes care of load balancing, the exchange of metadata between brokers, and the configuration of producers and consumers.
It detects when a new broker joins the cluster, when a broker dies, and when a new topic is added, and it tracks the health of the partitions. In short, it centralizes all of the configuration management.

Brokers

The brokers are Kafka’s servers. They receive messages from producers, assign each message an offset, commit the messages to storage on disk, and deliver them to consumers on request.

Offset

The offset is a sequential identifier that marks each message’s position within a partition and its order of consumption. It uniquely identifies each message inside its partition and lets us track the consumer’s position in the topic, that is, which messages have already been consumed.
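To make this concrete, here is a minimal Java sketch, assuming a local broker at localhost:9092 and a hypothetical orders topic, that pins a consumer to one partition, seeks to a specific offset, and prints the offset carried by each record:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition)); // manual assignment, no consumer group
            consumer.seek(partition, 42L);       // reposition the consumer to offset 42

            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                // record.offset() is the unique, ordered position within the partition.
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```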

Topics and messages

Messages in Kafka are organized into topics, and each topic is broken down into partitions.
Messages are appended to a partition and read in order from beginning to end.
Partitions are also Kafka’s way of providing redundancy and scalability. Each partition can be hosted on a different server, which means that a topic can be scaled horizontally across multiple servers to provide performance far beyond the capacity of a single server.
In short, the topic is the category where messages are stored; each topic contains partitions, and each partition maintains its offsets.
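As a sketch of how topics and partitions come together in practice, the snippet below uses the Java AdminClient to create a hypothetical orders topic; the broker address, topic name, and partition/replication counts are illustrative assumptions:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions let the topic spread across brokers for throughput;
            // replication factor 3 keeps copies of each partition on three
            // brokers for redundancy.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```

With six partitions replicated three times, the topic’s load can be spread across brokers while each partition survives the loss of replicas.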

Kafka has many different APIs that make it easier to use and configure:

Diagram of Kafka’s core APIs: https://kafka.apache.org/
  • The Producer API: allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API: allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API: allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams (see the sketch after this list).
  • The Connect API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database could capture every change to a table.
  • The Admin API: allows you to manage and inspect topics, brokers and other Kafka objects.
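Here is the Streams API sketch referenced in the list above. It is only a minimal illustration, with assumed topic names (orders, orders-uppercase) and a trivial uppercase transformation:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("orders");
        // Consume "orders", transform each value, produce to "orders-uppercase".
        input.mapValues(value -> value.toUpperCase()).to("orders-uppercase");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close)); // clean shutdown
    }
}
```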

Producer

Producers publish data to the topics of their choice. The producer is responsible for choosing the partition to which each record is written. This can be done in a round-robin fashion to distribute the load, or by applying a semantic function to the record’s key.
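A minimal producer sketch in Java (the broker address and topic name are assumptions) showing both strategies: a keyed record, which the default partitioner assigns by hashing the key, and an unkeyed record, which is spread across partitions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed record: the default partitioner hashes "customer-42", so
            // every record with this key lands in the same partition (ordering).
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));

            // Unkeyed record: the producer spreads these across partitions.
            producer.send(new ProducerRecord<>("orders", null, "anonymous event"));
        }
    }
}
```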

Consumer and consumer group

Consumers are organized into consumer groups: each record published to a topic is delivered to one consumer within each consumer group subscribed to that topic. Consumers can be separate processes or run on different machines.
If all consumers belong to the same group, the partitions (and therefore the load) are shared among them. If the consumers all belong to different groups, then each record is delivered to all of the consumers.
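The sketch below shows a consumer joining a hypothetical order-processors group. Running several copies of this program with the same group.id splits the topic’s partitions among them; giving each copy a different group.id delivers every record to every copy:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerGroupExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "order-processors");        // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                // Each poll returns records only from the partitions
                // assigned to this member of the group.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```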

Example of the interaction between producers, consumers, topics and partitions (source: https://kafka.apache.org/)

KSQL

ksqlDB is an event streaming database purpose-built to help developers create stream processing applications on top of Apache Kafka.
KSQL works as an open-source, distributed, real-time streaming engine over Apache Kafka that lets you express stream processing in SQL, avoiding application code. KSQL offers a large set of processing operations such as aggregations, joins, and sessions, among others.
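For illustration only, here is roughly what that looks like in KSQL; the stream, its columns, and the underlying topic are hypothetical. The first statement declares a stream over an existing topic, and the second defines a continuous aggregation that updates as new events arrive:

```sql
-- Hypothetical: expose an existing "orders" topic as a KSQL stream.
CREATE STREAM orders (customer VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');

-- A continuous query: the table is re-computed as new events arrive.
CREATE TABLE orders_per_customer AS
  SELECT customer, COUNT(*) AS order_count
  FROM orders
  GROUP BY customer
  EMIT CHANGES;
```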

I hope this article was helpful.

