A Quick Introduction to Kafka and AEM Integration
Introduction
Apache Kafka is an open-source distributed event streaming platform used by many companies around the world; it enables high-performance data pipelines, streaming analytics, and data integration for mission-critical applications.
Companies usually start with simple point-to-point integrations, something like this:
After a while, the requirements change, and the companies need to integrate more sources and more targets, which becomes more and more complicated over time.
The above architecture has the following problems:
- If you have 4 source systems and 6 target systems, you need to write 24 integrations.
- Each integration comes with difficulties around protocols (TCP, HTTP, REST, FTP, JDBC, etc.), data formats (binary, CSV, JSON, etc.), and data schemas (how the data is shaped and how it may change).
- etc.
One solution is to use Apache Kafka, something like this:
Why use Apache Kafka:
- Distributed, resilient architecture, fault-tolerant.
- Horizontal scalability: Can scale to 100s of brokers, can scale to millions of messages per second.
- High performance (latency of less than 10 ms) and real-time processing.
- Used by many companies, including 35% of the Fortune 500: Airbnb, Netflix, LinkedIn, Uber, Walmart, etc.
Use Cases:
- Messaging System
- Activity tracking
- Application Logs
- Stream processing
- De-coupling of system integrations
- Integration with Big Data.
Quick start
Topics, partitions, and offsets
Topics, a particular stream of data
- Similar to a table in a database.
- You can have as many topics as you want, depending on your requirements.
- A topic is identified by its name.
Partitions, topics are split into partitions.
- Each partition is ordered
- Each message within a partition gets an incremental id, called offset
Offset, a specific message in each partition
- Each offset only has a meaning for a specific partition, for example, offset 3 in partition 0 doesn’t represent the same data as offset 3 in partition 1.
- Order is guaranteed only within a partition (not across partitions).
- Data is kept only for a limited time (one week by default).
- Once the data is written to a partition, it can’t be changed.
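To make topics, partitions, and offsets a bit more concrete, here is a minimal sketch (not part of the original setup steps) that creates a topic with three partitions using the AdminClient from the Kafka Java client library; the topic name demo-topic and the address localhost:9092 are assumptions for a local single-broker installation.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a local broker started as described in the Installation section below.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Topic "demo-topic" with 3 partitions and replication factor 1 (single-broker setup).
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```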
Brokers
A Kafka cluster is composed of multiple brokers (servers). Let’s see some features:
- Each broker is identified with its ID (integer).
- Each broker contains certain topic partitions.
- If we connect to any broker, we’ll be connected to the cluster.
Producers
- Producers write data to topics (remember, topics are made of partitions).
- Producers automatically know which broker and partition to write to; we don’t have to specify that (as sketched after this list), which removes a lot of the burden from us: we just connect to Kafka.
- In case of broker failures, producers automatically recover; we don’t have to implement that feature, it’s done for us.
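The following is a minimal producer sketch that illustrates the point above: we only give the client a bootstrap server and a topic name, and Kafka decides which broker and partition receive the message. The topic demo-topic and the address localhost:9092 are assumed values for a local setup.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // We only provide bootstrap servers; Kafka resolves the right broker and partition for us.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Messages with the same key always land in the same partition.
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello from the producer"));
            producer.flush();
        }
    }
}
```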
Consumers
- Consumers read data from a topic; they automatically know which broker to read from, it’s already handled for us.
- In case of broker failures, consumers know how to recover; we don’t have to think about this.
- Data is read in order within each partition, which means offset 3 is read after offset 2, never before.
- Consumers can also read from multiple partitions, but order is not guaranteed across partitions; partition one and partition two are read in parallel.
Consumer groups
- Consumers read data in consumer groups
- Each consumer within a group reads from exclusive partitions.
- If you have more consumers than partitions, some consumers will be inactive.
- Consumers automatically use a GroupCoordinator and a ConsumerCoordinator to assign each consumer to a partition; this mechanism is already implemented in Kafka (a small example is sketched after this list).
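As a rough illustration of consumers and consumer groups, the sketch below subscribes to a topic with a group.id; every consumer started with the same group id will share the topic’s partitions among themselves. Again, demo-topic, demo-group, and localhost:9092 are assumed values.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group.id split the topic's partitions among themselves.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("demo-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets are per partition: the same offset number in two partitions points to unrelated data.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```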
Zookeeper
- Zookeeper manages brokers (keeps a list of them)
- Zookeeper helps us perform leader elections for partitions: when a broker goes down, a new leader is elected for its partitions, and Zookeeper helps with that.
- Zookeeper sends notifications to Kafka in case of changes, for example: a new topic, a new broker, a broker dies, a topic is deleted, etc.
- Kafka can’t work without Zookeeper, so we first start Zookeeper and then start Kafka.
Installation
Step 1: Download Apache Kafka
- Download the latest version from the Apache Kafka downloads page.
- Extract it
Step 2: Start the Kafka environment
Kafka has different scripts for Unix-based and Windows platforms. On Windows platforms use bin\windows\ instead of bin/ and change the script extension to .bat.
Start the ZooKeeper server: we need to start ZooKeeper first. Use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance. From the bin/ directory, run:
./zookeeper-server-start.sh ../config/zookeeper.properties
Start the Kafka server: with the ZooKeeper server started, we can start the Kafka server with the command:
./kafka-server-start.sh ../config/server.properties
With the previous steps executed, we have a Kafka server running. In the next section we will see the main components needed to create a consumer and deploy it on AEM.
AEM Demo Consumer
In some cases, AEM becomes the tool that handles the digital content (documents, images, videos, etc.) for companies with an online business presence. However, these companies often need to keep using older systems and applications to manage this content and sync it in real time with the new platforms, and this is where Apache Kafka is a great option.
We will create a small integration between Kafka and AEM where a component acts as the consumer, reading messages from Kafka. Before we start coding, we need to think about the Kafka configuration, for example:
- Broker list: comma-separated Kafka cluster servers, in the format hostname:port
- Topic list
- Enable/disable (true/false) the consumer component
- Enable auto-commit (true/false): the consumer offset will be periodically committed in the background.
- And other important configurations
Let’s start by writing the configuration. After creating the AEM project following the standard project setup, we need to create a configuration class to include all the Kafka settings; we can create an interface named KafkaConfigProperties.
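A minimal sketch of what such an interface could look like using OSGi metatype annotations; the property names and default values below are assumptions for illustration, not the article’s original code.

```java
import org.osgi.service.metatype.annotations.AttributeDefinition;
import org.osgi.service.metatype.annotations.ObjectClassDefinition;

// Sketch of an OSGi metatype configuration for the Kafka consumer; names and defaults are illustrative.
@ObjectClassDefinition(name = "AEM Kafka Consumer Configuration",
        description = "Connection and consumer settings for the Kafka integration")
public @interface KafkaConfigProperties {

    @AttributeDefinition(name = "Bootstrap servers",
            description = "Comma-separated Kafka cluster servers in hostname:port format")
    String bootstrap_servers() default "localhost:9092";

    @AttributeDefinition(name = "Topics", description = "Comma-separated list of topics to consume")
    String topics() default "aem-demo-topic";

    @AttributeDefinition(name = "Group id", description = "Consumer group id used by this AEM instance")
    String group_id() default "aem-consumer-group";

    @AttributeDefinition(name = "Enabled", description = "Enable or disable the consumer component")
    boolean enabled() default false;

    @AttributeDefinition(name = "Enable auto commit",
            description = "Periodically commit the consumer offset in the background")
    boolean enable_auto_commit() default true;
}
```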
The next step is to create a class that reads the properties from the configuration class (KafkaConfigProperties) and makes them accessible to the consumer component; let’s create a class named KafkaConfig.
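A possible sketch of KafkaConfig as an OSGi Declarative Services component that reads KafkaConfigProperties on activation; the method and field names are illustrative assumptions.

```java
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Modified;
import org.osgi.service.metatype.annotations.Designate;

// Sketch of a service that exposes the OSGi configuration values to other components.
@Component(service = KafkaConfig.class)
@Designate(ocd = KafkaConfigProperties.class)
public class KafkaConfig {

    private String bootstrapServers;
    private String topics;
    private String groupId;
    private boolean enabled;
    private boolean enableAutoCommit;

    @Activate
    @Modified
    protected void activate(KafkaConfigProperties config) {
        this.bootstrapServers = config.bootstrap_servers();
        this.topics = config.topics();
        this.groupId = config.group_id();
        this.enabled = config.enabled();
        this.enableAutoCommit = config.enable_auto_commit();
    }

    public String getBootstrapServers() { return bootstrapServers; }
    public String getTopics() { return topics; }
    public String getGroupId() { return groupId; }
    public boolean isEnabled() { return enabled; }
    public boolean isEnableAutoCommit() { return enableAutoCommit; }
}
```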
The next step is to create the consumer job. In this case we will use the Java thread API, so we can create a class named AemKafkaConsumerJob that implements Runnable; let’s do it.
The first part of this class initializes all the required properties and also the KafkaConsumer object from the Kafka API:
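A simplified sketch of what AemKafkaConsumerJob might look like; it assumes the KafkaConfig service shown above and uses a simple flag to stop the poll loop. The message handling inside the loop is only a placeholder.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch of the consumer job: it builds a KafkaConsumer from the KafkaConfig values
// and polls for messages until it is asked to stop.
public class AemKafkaConsumerJob implements Runnable {

    private final KafkaConfig config;
    private final AtomicBoolean running = new AtomicBoolean(true);

    public AemKafkaConsumerJob(KafkaConfig config) {
        this.config = config;
    }

    @Override
    public void run() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, config.getBootstrapServers());
        props.put(ConsumerConfig.GROUP_ID_CONFIG, config.getGroupId());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, String.valueOf(config.isEnableAutoCommit()));
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList(config.getTopics().split(",")));
            while (running.get()) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real project this is where the message would be mapped to AEM content.
                    System.out.printf("Received topic=%s partition=%d offset=%d value=%s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }

    public void stop() {
        running.set(false);
    }
}
```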
The final step is to create a client class to start our thread; we can call it AemKafkaConsumerImpl, so let’s code it.
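A possible sketch of AemKafkaConsumerImpl as an OSGi component that starts the job in a dedicated thread on activation and stops it on deactivation; the exact annotations and names are assumptions for illustration.

```java
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;
import org.osgi.service.component.annotations.Reference;

// Sketch of the client component: when the bundle starts (and the configuration enables it),
// it launches the consumer job in its own thread and stops it cleanly on deactivation.
@Component(immediate = true, service = AemKafkaConsumerImpl.class)
public class AemKafkaConsumerImpl {

    @Reference
    private KafkaConfig kafkaConfig;

    private AemKafkaConsumerJob job;
    private Thread consumerThread;

    @Activate
    protected void activate() {
        if (kafkaConfig.isEnabled()) {
            job = new AemKafkaConsumerJob(kafkaConfig);
            consumerThread = new Thread(job, "aem-kafka-consumer");
            consumerThread.start();
        }
    }

    @Deactivate
    protected void deactivate() throws InterruptedException {
        if (job != null) {
            job.stop();
            consumerThread.join(5000);
        }
    }
}
```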
With the above steps completed, we can deploy this project on AEM and use a tool to send messages to the Kafka cluster and receive them in the AEM Kafka consumer component; for this, I recommend Kafka Tool.
Conclusion
Integration between systems is a common concern in modern enterprise architectures, so we need to take into account all the options the industry offers, and Apache Kafka is a good candidate to solve the integration problem. This article is only an overview of Apache Kafka; you can find more details in the official documentation.