A Quick Introduction to Kafka and AEM Integration
Introduction
Apache Kafka is an open-source distributed event streaming platform used by many companies around the world; it enables high-performance data pipelines, streaming analytics, and data integration for mission-critical applications.
Companies usually start with simple point-to-point integrations, something like this:
After a while, the requirements change, and the companies need to integrate more sources and more targets, which becomes more and more complicated over time.
The above architecture has the following problems:
- If you have 4 source systems and 6 target systems, you need to write 24 integrations.
- Each integration comes with difficulties around protocols (TCP, HTTP, REST, FTP, JDBC, etc.), data formats (binary, CSV, JSON, etc.), and data schemas (how the data is shaped and how it may change).
- etc.
One solution is to use Apache Kafka, something like this:
Why use Apache Kafka:
- Distributed, resilient architecture, fault-tolerant.
- Horizontal scalability: Can scale to 100s of brokers, can scale to millions of messages per second.
- High performance (latency of less than 10 ms) and real-time processing.
- Used by many companies, including 35% of the Fortune 500: Airbnb, Netflix, LinkedIn, Uber, Walmart, etc.
Use Cases:
- Messaging System
- Activity tracking
- Application Logs
- Stream processing
- De-coupling of system integrations
- Integration with Big Data.
Quick start
Topics, partitions, and offsets
Topics, a particular stream of data
- Similar to a table in a database.
- You can have as many topics as you want, depending on your requirements.
- A topic is identified by its name.
Partitions, topics are split into partitions.
- Each partition is ordered
- Each message within a partition gets an incremental id, called offset
Offset, a specific message in each partition
- Each offset only has a meaning for a specific partition, for example, offset 3 in partition 0 doesn’t represent the same data as offset 3 in partition 1.
- Order is guaranteed only within a partition (not across partitions).
- Data is kept only for a limited time (one week by default).
- Once the data is written to a partition, it can’t be changed.
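To make topics, partitions, and offsets a bit more concrete, here is a minimal sketch (not part of the original setup steps) that creates a topic with three partitions using the AdminClient from the Kafka Java client library; the topic name demo-topic and the address localhost:9092 are assumptions for a local single-broker installation.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a local broker started as described in the Installation section below.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic "demo-topic" with 3 partitions and replication factor 1 (single-broker setup).
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```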
Brokers
A Kafka cluster is composed of multiple brokers (servers). Let’s see some features:
- Each broker is identified with its ID (integer).
- Each broker contains certain topic partitions.
- If we connect to any broker, we’ll be connected to the cluster.
Producers
- Producers write data to topics (remember, topics are made of partitions).
- Producers automatically know which broker and partition to write to; we don’t have to specify that (as sketched after this list), which removes a lot of the burden from us: we just connect to Kafka.
- In case of broker failures, producers automatically recover; we don’t have to implement that feature, it’s done for us.
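The following is a minimal producer sketch that illustrates the point above: we only give the client a bootstrap server and a topic name, and Kafka decides which broker and partition receive the message. The topic demo-topic and the address localhost:9092 are assumed values for a local setup.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // We only provide bootstrap servers; Kafka resolves the right broker and partition for us.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key always land in the same partition.
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello from the producer"));
            producer.flush();
        }
    }
}
```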
Consumers
- Consumers read data from a topic; they automatically know which broker to read from, it’s already handled for us.
- In case of broker failures, consumers know how to recover; we don’t have to think about this.
- Data is read in order within each partition, which means offset 3 is read after offset 2, never before.
- Consumers can also read from multiple partitions, but order is not guaranteed across partitions; partition one and partition two are read in parallel.
Consumer groups
- Consumers read data in consumer groups
- Each consumer within a group reads from exclusive partitions.
- If you have more consumers than partitions, some consumers will be inactive.
- Consumers automatically use a GroupCoordinator and a ConsumerCoordinator to assign each consumer to a partition; this mechanism is already implemented in Kafka (a small example is sketched after this list).
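As a rough illustration of consumers and consumer groups, the sketch below subscribes to a topic with a group.id; every consumer started with the same group id will share the topic’s partitions among themselves. Again, demo-topic, demo-group, and localhost:9092 are assumed values.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group.id split the topic's partitions among themselves.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets are per partition: the same offset number in two partitions points to unrelated data.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```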
Zookeeper
- Zookeeper manages brokers (keeps a list of them)
- Zookeeper helps us perform leader elections for partitions: when a broker goes down, a new leader is elected for its partitions, and Zookeeper helps with that.
- Zookeeper sends notifications to Kafka in case of changes, for example: a new topic, a new broker, a broker dies, a topic is deleted, etc.
- Kafka can’t work without Zookeeper, so we first start Zookeeper and then start Kafka.
Installation
Step 1: Download Apache Kafka
- Download the latest version from the Apache Kafka downloads page.
- Extract it
Step 2: Start the Kafka environment
Kafka has different scripts for Unix-based and Windows platforms. On Windows platforms use bin\windows\ instead of bin/ and change the script extension to .bat.
Start the ZooKeeper server: we need to start ZooKeeper first. Use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance. From the bin/ directory, run:
./zookeeper-server-start.sh ../config/zookeeper.properties
Start the Kafka server: with the ZooKeeper server started, we can start the Kafka server with the command:
./kafka-server-start.sh ../config/server.properties
With the previous steps executed, we have a Kafka server running. In the next section we will see the main components needed to create a consumer and deploy it on AEM.
AEM Demo Consumer
In some cases, AEM becomes the tool that handles the digital content (documents, images, videos, etc.) for companies with an online business presence. However, these companies often need to keep using older systems and applications to manage this content and sync it in real time with the new platforms, and this is where Apache Kafka is a great option.
We will create a small integration between Kafka and AEM where a component acts as the consumer, reading messages from Kafka. Before we start coding, we need to think about the Kafka configuration, for example:
- Broker list: comma-separated Kafka cluster servers, in the format hostname:port
- Topic list
- Enable/disable (true/false) the consumer component
- Enable auto-commit (true/false): the consumer offset will be periodically committed in the background.
- And other important configurations
Let’s start by writing the configuration. After creating the AEM project following the standard project setup, we need to create a configuration class to include all the Kafka settings; we can create an interface named KafkaConfigProperties.
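A minimal sketch of what such an interface could look like using OSGi metatype annotations; the property names and default values below are assumptions for illustration, not the article’s original code.

```java
import org.osgi.service.metatype.annotations.AttributeDefinition;
import org.osgi.service.metatype.annotations.ObjectClassDefinition;

// Sketch of an OSGi metatype configuration for the Kafka consumer; names and defaults are illustrative.
@ObjectClassDefinition(name = "AEM Kafka Consumer Configuration",
        description = "Connection and consumer settings for the Kafka integration")
public @interface KafkaConfigProperties {

    @AttributeDefinition(name = "Bootstrap servers",
            description = "Comma-separated Kafka cluster servers in hostname:port format")
    String bootstrap_servers() default "localhost:9092";

    @AttributeDefinition(name = "Topics", description = "Comma-separated list of topics to consume")
    String topics() default "aem-demo-topic";

    @AttributeDefinition(name = "Group id", description = "Consumer group id used by this AEM instance")
    String group_id() default "aem-consumer-group";

    @AttributeDefinition(name = "Enabled", description = "Enable or disable the consumer component")
    boolean enabled() default false;

    @AttributeDefinition(name = "Enable auto commit",
            description = "Periodically commit the consumer offset in the background")
    boolean enable_auto_commit() default true;
}
```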
The next step is to create a class that reads the properties from the configuration class (KafkaConfigProperties) and makes them accessible to the consumer component; let’s create a class named KafkaConfig.
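A possible sketch of KafkaConfig as an OSGi Declarative Services component that reads KafkaConfigProperties on activation; the method and field names are illustrative assumptions.

```java
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Modified;
import org.osgi.service.metatype.annotations.Designate;

// Sketch of a service that exposes the OSGi configuration values to other components.
@Component(service = KafkaConfig.class)
@Designate(ocd = KafkaConfigProperties.class)
public class KafkaConfig {

    private String bootstrapServers;
    private String topics;
    private String groupId;
    private boolean enabled;
    private boolean enableAutoCommit;

    @Activate
    @Modified
    protected void activate(KafkaConfigProperties config) {
        this.bootstrapServers = config.bootstrap_servers();
        this.topics = config.topics();
        this.groupId = config.group_id();
        this.enabled = config.enabled();
        this.enableAutoCommit = config.enable_auto_commit();
    }

    public String getBootstrapServers() { return bootstrapServers; }
    public String getTopics() { return topics; }
    public String getGroupId() { return groupId; }
    public boolean isEnabled() { return enabled; }
    public boolean isEnableAutoCommit() { return enableAutoCommit; }
}
```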
The next step is to create the consumer job. In this case we will use the Java thread API, so we can create a class named AemKafkaConsumerJob that implements Runnable; let’s do it.
The first part of this class initializes all the required properties and also the KafkaConsumer object from the Kafka API:
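A simplified sketch of what AemKafkaConsumerJob might look like; it assumes the KafkaConfig service shown above and uses a simple flag to stop the poll loop. The message handling inside the loop is only a placeholder.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch of the consumer job: it builds a KafkaConsumer from the KafkaConfig values
// and polls for messages until it is asked to stop.
public class AemKafkaConsumerJob implements Runnable {

    private final KafkaConfig config;
    private final AtomicBoolean running = new AtomicBoolean(true);

    public AemKafkaConsumerJob(KafkaConfig config) {
        this.config = config;
    }

    @Override
    public void run() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, config.getBootstrapServers());
        props.put(ConsumerConfig.GROUP_ID_CONFIG, config.getGroupId());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, String.valueOf(config.isEnableAutoCommit()));
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList(config.getTopics().split(",")));
            while (running.get()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real project this is where the message would be mapped to AEM content.
                    System.out.printf("Received topic=%s partition=%d offset=%d value=%s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }

    public void stop() {
        running.set(false);
    }
}
```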
The final step is to create a client class to start our thread; we can call it AemKafkaConsumerImpl, so let’s code it.
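A possible sketch of AemKafkaConsumerImpl as an OSGi component that starts the job in a dedicated thread on activation and stops it on deactivation; the exact annotations and names are assumptions for illustration.

```java
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;
import org.osgi.service.component.annotations.Reference;

// Sketch of the client component: when the bundle starts (and the configuration enables it),
// it launches the consumer job in its own thread and stops it cleanly on deactivation.
@Component(immediate = true, service = AemKafkaConsumerImpl.class)
public class AemKafkaConsumerImpl {

    @Reference
    private KafkaConfig kafkaConfig;

    private AemKafkaConsumerJob job;
    private Thread consumerThread;

    @Activate
    protected void activate() {
        if (kafkaConfig.isEnabled()) {
            job = new AemKafkaConsumerJob(kafkaConfig);
            consumerThread = new Thread(job, "aem-kafka-consumer");
            consumerThread.start();
        }
    }

    @Deactivate
    protected void deactivate() throws InterruptedException {
        if (job != null) {
            job.stop();
            consumerThread.join(5000);
        }
    }
}
```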
With the above steps completed, we can deploy this project on AEM and use a tool to send messages to the Kafka cluster and receive them in the AEM Kafka consumer component; for this, I recommend Kafka Tool.
Conclusion
Integration between systems is a common concern in modern enterprise architectures, so we need to take into account all the options the industry offers, and Apache Kafka is a good candidate to solve the integration problem. This article is only an overview of Apache Kafka; you can find more details in the official documentation.