Kafka Connect : Quick Start
In this article, we’ll look at Kafka Connect’s components, why it is needed, and how it helps integrate Kafka with different systems.
Note: new to Kafka? Check “Apache Kafka: Quick start”.
What is Kafka Connect?
Kafka Connect is an open-source component of Apache Kafka that provides a scalable and reliable way to transfer data between Kafka and other data systems such as databases, filesystems, key-value stores, and search indexes. It uses Connectors to move large data sets into and out of Kafka.
Kafka Connect uses Producers and Consumers as building blocks and provides higher-level functionality on top of them.
Why do we need Kafka Connect?
Apache Kafka is widely used in event-driven microservice architectures, CDC (Change Data Capture) pipelines, and log aggregation.
- Source Connector:
It’s responsible for pulling data from the source system and publishing it to the Kafka cluster. Source connectors internally use the Kafka Producer API to achieve this.
- Sink Connector:
It’s responsible for consuming data from the Kafka cluster and syncing it to target systems. Sink connectors internally use the Kafka Consumer API to achieve this.
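As a concrete illustration, the FileStreamSource and FileStreamSink connectors that ship with Apache Kafka can act as a minimal source/sink pair. The file paths and topic name below are placeholders chosen for this example:

```properties
# file-source.properties — reads lines from a file and publishes them to a topic
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-test
```

```properties
# file-sink.properties — consumes the topic and appends each record to a file
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=/tmp/output.txt
topics=connect-test
```

Note the asymmetry: a source connector writes to a single `topic`, while a sink connector can subscribe to one or more `topics`.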
Kafka Connect use cases
- Streaming pipelines from source to target systems.
- Writing data from Kafka into application data stores.
- Moving data from legacy applications to new systems via Kafka.
Kafka Connect Components:
Connectors define where data should be copied to and from. Each connector instance is a logical job that is responsible for managing the movement of data between Kafka and an external system.
Confluent offers pre-built connectors that help integrate external systems with Kafka. We can also write a new connector from scratch as per our requirements.
Tasks play a major role in the data model for Kafka Connect. Each connector instance has a set of tasks that actually copy the data. There are two types of tasks: source tasks and sink tasks.
A source task contains the code to get data from the source system and uses a Kafka producer to push that data to a Kafka topic.
A sink task uses a Kafka consumer to poll data from a Kafka topic and contains the code to put that data into the sink system.
Kafka Connect provides built-in support for scalable, parallel data copying by breaking a single job into many tasks with minimal configuration.
Transforms play an important role through Single Message Transforms (SMTs). Connectors can be configured with transformations to make simple modifications to individual messages, and multiple transformations can be chained in a connector configuration.
A transform is a function that accepts one record as input and returns the modified record. Kafka Connect ships with a set of simple, common transforms, and we can also write our own by implementing the Transformation interface.
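To sketch how chaining works, the snippet below uses two transforms that ship with Kafka Connect, InsertField and ReplaceField; the transform aliases (`addSource`, `renameField`) and field names are placeholders for this example:

```properties
# Chain two SMTs on a connector: first add a static field, then rename it
transforms=addSource,renameField
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=file-source
transforms.renameField.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.renameField.renames=data_source:origin
```

The transforms are applied in the order they appear in the `transforms` list, each one receiving the record produced by the previous one.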
Converters are responsible for serializing and deserializing the data.
- When Kafka Connect acts as a source, the converter serializes the data received from the connector (or transform) and pushes the serialized data to the Kafka cluster.
- When Kafka Connect acts as a sink, the converter deserializes the data read from the Kafka cluster and sends it to the transform (or connector).
- String: org.apache.kafka.connect.storage.StringConverter
- JSON: org.apache.kafka.connect.json.JsonConverter
- JSON Schema: io.confluent.connect.json.JsonSchemaConverter
- Avro: io.confluent.connect.avro.AvroConverter
- Protobuf: io.confluent.connect.protobuf.ProtobufConverter
- ByteArray: org.apache.kafka.connect.converters.ByteArrayConverter
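Converters are typically set as worker-level defaults and can be overridden per connector. A minimal sketch using two of the converters listed above:

```properties
# Worker-level converter defaults (can be overridden in a connector's config)
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Embed the schema in each JSON message; set to false for plain JSON payloads
value.converter.schemas.enable=true
```

Note that the Confluent converters (JSON Schema, Avro, Protobuf) additionally require a `schema.registry.url` setting, since they store schemas in Schema Registry rather than in each message.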
Connectors and tasks are logical units of work and must be scheduled to execute in a process. Kafka Connect calls these processes workers, and there are two modes for running them: standalone mode and distributed mode. We can choose the mode as per our requirements.
- Standalone workers:
As the name indicates, standalone mode is suitable for developing and testing Kafka Connect on a local machine. It also suits lightweight, single-agent environments (for example, sending server logs to Kafka).
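A standalone worker is started with the `connect-standalone.sh` script from the Kafka distribution, passing the worker configuration followed by one or more connector property files (the file names below are placeholders):

```shell
# Start a standalone worker with the given worker and connector configs
bin/connect-standalone.sh config/connect-standalone.properties \
    config/connect-file-source.properties
```

In standalone mode, connector configuration and offsets are kept on the local machine, which is exactly why this mode does not scale beyond a single process.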
- Distributed workers:
Distributed mode brings the following benefits:
- It is recommended for production environments because of its high availability and scalability: we can add or remove nodes as requirements evolve.
- Kafka Connect is more fault tolerant: if a node dies or leaves the cluster for some reason, Kafka Connect automatically redistributes that node’s workload to the other nodes in the cluster. If a new node joins, the work is rebalanced across the cluster again.
- Distributed mode runs Connect workers on multiple nodes, which together form a Connect cluster. Kafka Connect also distributes running connectors across this cluster.
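In distributed mode, connectors are not configured with local property files; instead they are submitted to the cluster through the Connect REST API. A sketch, assuming a worker is listening on the default port 8083 and reusing the example file-source settings from earlier:

```shell
# Submit a connector to a distributed Connect cluster via its REST API
curl -X POST -H "Content-Type: application/json" \
     http://localhost:8083/connectors \
     -d '{
           "name": "local-file-source",
           "config": {
             "connector.class": "FileStreamSource",
             "tasks.max": "1",
             "file": "/tmp/input.txt",
             "topic": "connect-test"
           }
         }'
```

The cluster stores this configuration in internal Kafka topics, which is how any surviving worker can take over the connector if the node that accepted the request goes down.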
In this article, we went through what Kafka Connect brings us and how it helps integrate external systems with Kafka using connectors.
We also saw the various Kafka Connect components and what each component is responsible for while reading and writing data from and to a Kafka cluster.