Apache Kafka Series [Part 1]: Introduction to Apache Kafka
The development landscape has shifted thanks to microservices. They increase developer agility by removing dependencies like shared database tiers. However, the distributed applications your developers are creating still need some way to integrate and share data. One of the most popular integration options is the synchronous method, which uses application programming interfaces (APIs) to share data directly between services.
A second integration option is the asynchronous method, which replicates data in an intermediate store. This is where Apache Kafka comes in: it streams data from other development teams into that store so it can be shared across many teams and applications.
Apache Kafka is an open-source, distributed publish-subscribe messaging platform purpose-built to handle real-time streaming data. It supports distributed streaming, pipelining, and replay of data feeds for fast, scalable operations.
In other words, it simultaneously moves vast volumes of data — not just from one point to another, but also from that point to anywhere else you need it.
It began as a LinkedIn internal system designed to handle trillions of messages a day, but it has since evolved into an open source data streaming solution with applications for a number of business purposes.
When to use Apache Kafka?
Because it was designed specifically for real-time log streaming, Apache Kafka is appropriate for applications that require:
- Reliable data exchange between separate components
- The ability to partition messaging workloads as application requirements change
- Real-time streaming for data processing
- Native support for data/message replay
How does it work?
Kafka combines the benefits of two messaging models, queuing and publish-subscribe, to give users the best of both. With queuing, data processing can be split among numerous consumer instances, making it highly scalable; traditional queues, however, are not multi-subscriber. The publish-subscribe model is multi-subscriber, but it cannot be used to distribute work across numerous worker processes because each message is sent to every subscriber. To combine these two approaches, Kafka employs a partitioned log architecture.
A log is an ordered collection of records, divided into segments, or partitions, that can be assigned to distinct subscribers. This means that multiple subscribers to the same topic can each be assigned different partitions, allowing for greater scalability.
What are the main components of Apache Kafka?
- Topics: In publish/subscribe messaging, a topic is a fairly universal concept. In Apache Kafka and other messaging platforms, a topic is an addressable abstraction representing a specific stream of data (a series of records/messages): it is the layer through which an application expresses interest in a given stream, and it can be published to and subscribed to.
- Partitions: Topics in Apache Kafka can be split into partitions, each of which is an ordered queue. Records appended to a partition form a sequential commit log. Each record/message in the Kafka system is given a sequential ID called an offset, which identifies the message or record within a particular partition.
- Producers: The concept of a producer in Apache Kafka is similar to that in most messaging systems. A producer of data (records/messages) specifies the topic (data stream) on which a particular record/message should be published. Because partitions are used to increase scalability, a producer can also choose which partition a given record/message is published to. Producers are not required to specify a partition, and when they do not, load balancing between topic partitions can be performed in a round-robin fashion.
- Consumers: Consumers are the entities that process records/messages in Kafka. They can be set up to work independently on their own tasks or collaboratively with other consumers on a particular workload (load balancing). A consumer manages its processing based on the consumer group it belongs to: consumers are grouped by a consumer group name, and a group can be spread within a single process, across many processes, or even across multiple systems. Consumers with the same consumer group name load-balance record/message consumption across the group, while consumers with unique consumer group names each receive every record/message published to the topic/partition they subscribe to. A minimal client sketch of these roles follows this list.
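To make these roles concrete, here is a minimal sketch using Kafka's official Java client (the kafka-clients library). It assumes a broker on localhost:9092 and a topic named MyFirstTopic, the same setup we create later in this article; treat it as an illustration of the concepts rather than production-ready code.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaComponentsSketch {
    public static void main(String[] args) {
        // Producer: publishes records to a topic. If a key is given, it determines
        // the partition; otherwise partitions are chosen by the default partitioner.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // Topic "MyFirstTopic" is assumed to exist (we create it later in this article).
            producer.send(new ProducerRecord<>("MyFirstTopic", "some-key", "hello kafka"));
        }

        // Consumer: joins a consumer group. Partitions of the topic are divided among
        // consumers sharing the same group.id (queueing semantics), while consumers in
        // different groups each receive every record (publish-subscribe semantics).
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "my-consumer-group");
        consumerProps.put("auto.offset.reset", "earliest"); // read from the start if no committed offset
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("MyFirstTopic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                // The offset identifies the record's position within its partition.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}

Running a second copy of this consumer with the same group.id would split the topic's partitions between the two instances (queueing semantics), while giving it a different group.id would deliver every record to both (publish-subscribe semantics).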
Starting with Apache Kafka on your local machine
They say the best way to learn something is by doing it. In this section, we'll see how to spin up a single-node Kafka cluster on our local machine and how to produce and consume messages via the Kafka CLI.
Starting up the cluster
- Start by downloading the latest Apache Kafka binaries from their official download page.
- Extract it somewhere. (For Windows users: if you later get a “The input line is too long” error, extract it into the C: drive or a folder directly inside it, not into a deeply nested directory.)
- Go to the config directory in the extracted folder and edit server.properties. Change the listeners property, which is commented out by default, to the following and leave the rest as it is.
listeners=PLAINTEXT://localhost:9092
- Now open up a terminal in the bin/windows/ directory on Windows, or the bin/ directory on Linux or Mac.
Note: I'll be using the Windows commands that run via the .bat files. If you're using Linux or Mac, run the corresponding .sh scripts from the bin directory instead and adjust the relative paths in the arguments accordingly.
- Run the following command in the windows/ directory to start ZooKeeper first.
>zookeeper-server-start.bat ..\..\config\zookeeper.properties
- Then start your Kafka broker using the following command.
>kafka-server-start.bat ..\..\config\server.properties
If everything goes fine, you'll see output indicating that a new controller has been recorded, along with your broker connection string, as shown below.
INFO [broker-0-to-controller-send-thread]: Recorded new controller, from now on will use broker localhost:9092 (id: 0 rack: null) (kafka.server.BrokerToControllerRequestThread)
That's it! Your broker is now running on localhost:9092 and ZooKeeper on localhost:2181, which are the defaults for both. You can change these in their respective properties files.
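For reference, these defaults come from the two properties files you passed to the start scripts. The relevant entries typically look like the following (your copies may differ slightly depending on the Kafka version):
# config/zookeeper.properties
clientPort=2181
# config/server.properties
listeners=PLAINTEXT://localhost:9092
zookeeper.connect=localhost:2181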
Creating a topic and producing/consuming from it
- Open up another terminal in the same directory.
- We'll use the kafka-topics script to create a topic with the following command. The topic will have a replication factor of 1 (the replication factor prevents data loss when a broker fails in a multi-node cluster) and a single partition to keep consuming simple.
>kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic MyFirstTopic
You'll see a “Created topic MyFirstTopic.” message if the topic is created successfully. In your Kafka broker console, you can also see that a new log has been created for partition MyFirstTopic-0 along with its properties.
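If you want to verify the topic's configuration, you can also describe it, which prints the partition count, replication factor, and partition assignments (this assumes the same Kafka version, which still accepts the --zookeeper flag):
>kafka-topics.bat --describe --zookeeper localhost:2181 --topic MyFirstTopic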
- Now that we have our topic, we'll produce some messages to it using the following command, which uses the kafka-console-producer script.
>kafka-console-producer.bat --broker-list localhost:9092 --producer.config ..\..\config\producer.properties --topic MyFirstTopic
If you see “>” with a blinking cursor, you're now at the producer prompt. Type any message and press Enter, and that message will be published to your topic.
- Once the message is produced, you can consume it using the kafka-console-consumer script with the following command.
>kafka-console-consumer.bat --bootstrap-server localhost:9092 --consumer.config ..\..\config\consumer.properties --topic MyFirstTopic --from-beginning
The --from-beginning flag is optional and only needed if the consumer wants to read every message from the beginning of the topic.
That’s it! You’ll now see the messages that you produced.
If you want to see messages processed in real time, open separate producer and consumer terminals; that way you can watch messages being consumed as soon as you produce them.
Conclusion
Now you have a basic understanding of what Apache Kafka is, why we need it, and how to get started with it.
In the next part of this series, we'll look at the security options Apache Kafka offers for production use, since the cluster we created in this article is in no way ready for production-grade applications. We'll also look at how to decide when to use which security option and why.
Until then, Adios and Good luck!