Kafka Connect Overview

Lucky Kurhe
Walmart Global Tech Blog
3 min read · Feb 5, 2020


Overview

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.

  • A common framework for Kafka connectors
  • Distributed and standalone modes
  • REST interface for better operations
  • Automatic offset management
  • Distributed and scalable by default
  • Streaming/batch integration

ref: https://kafka.apache.org/documentation/

Kafka Connect is used to move large collections of data to and from a Kafka cluster.

Kafka Connect provides a framework that simplifies writing connectors for two specific tasks: collecting large volumes of data from Kafka and sending it to a target system (a sink connector), or sending events collected from a source system to Kafka (a source connector).

A connector written for Kafka is either a source or a sink connector.

  • A source connector helps with sending large sets of data from a source system to Kafka (i.e. it produces events to Kafka)
  • A sink connector helps with fetching large sets of data from Kafka to be sent to a target system (i.e. it consumes events from Kafka)

Writing these connectors is very simple with Kafka Connect.

When working with Kafka clusters, Kafka Connect takes care of most of the heavy lifting: establishing connections, handling cluster-specific configuration, and managing rebalances.

So developers can focus on the tasks specific to their project's needs, and Kafka Connect handles the rest.

Implementation

In this section, let's take a look at the sink connector in detail (a source connector can be built in a similar way).

SinkConnector

  • Your connector class extends SinkConnector. It is the entry point and is responsible for spinning up the required number of tasks by providing each task with its configuration; a minimal sketch follows this list.
  • More about this configuration is covered later in this post.
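Here is a minimal sketch of a SinkConnector, assuming a hypothetical HttpSinkConnector that forwards records to an HTTP endpoint (the class name, the http.endpoint property, and the paired HttpSinkTask are illustrative, not from the original post):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;

// Hypothetical example: a sink connector that forwards records to an HTTP endpoint.
public class HttpSinkConnector extends SinkConnector {

    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        // Called once when the connector starts; keep the supplied config.
        this.configProps = new HashMap<>(props);
    }

    @Override
    public Class<? extends Task> taskClass() {
        // Tells the framework which SinkTask implementation to instantiate.
        return HttpSinkTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Spin up to maxTasks tasks, each receiving a copy of the configuration.
        List<Map<String, String>> taskConfigs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            taskConfigs.add(new HashMap<>(configProps));
        }
        return taskConfigs;
    }

    @Override
    public void stop() {
        // Nothing to clean up in this sketch.
    }

    @Override
    public ConfigDef config() {
        // Declare the connector's configuration so Connect can validate it.
        return new ConfigDef()
                .define("http.endpoint", ConfigDef.Type.STRING,
                        ConfigDef.Importance.HIGH, "Target HTTP endpoint");
    }

    @Override
    public String version() {
        return "1.0";
    }
}
```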

SinkTask

  • Your sink task class extends SinkTask; it contains the main logic to be applied to the records received from the Kafka cluster, as sketched below.
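A matching sketch of the SinkTask, continuing the hypothetical HttpSinkConnector example above (the delivery logic here is a placeholder):

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical example: the task that pairs with HttpSinkConnector above.
public class HttpSinkTask extends SinkTask {

    private String endpoint;

    @Override
    public void start(Map<String, String> props) {
        // Receives the per-task configuration built in taskConfigs().
        this.endpoint = props.get("http.endpoint");
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Main logic: called repeatedly with batches of records consumed from Kafka.
        for (SinkRecord record : records) {
            // Replace this with a real delivery to the target system.
            System.out.printf("Sending %s to %s%n", record.value(), endpoint);
        }
    }

    @Override
    public void stop() {
        // Release any resources (HTTP clients, buffers, etc.) here.
    }

    @Override
    public String version() {
        return "1.0";
    }
}
```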

SinkRecord

  • Gives details for each record fetched from the Kafka cluster, including the value, topic, Kafka offset, Kafka partition, etc. (see the snippet below)
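For illustration, a small helper showing the accessors a SinkRecord exposes (the describe method itself is hypothetical, e.g. something you might call inside put()):

```java
import org.apache.kafka.connect.sink.SinkRecord;

// Illustrative accessor usage for a record consumed from Kafka.
static void describe(SinkRecord record) {
    String topic = record.topic();              // source topic
    Integer partition = record.kafkaPartition(); // partition the record came from
    long offset = record.kafkaOffset();          // offset within that partition
    Object key = record.key();                   // record key (may be null)
    Object value = record.value();               // record value / payload
    System.out.printf("%s-%d@%d: %s -> %s%n", topic, partition, offset, key, value);
}
```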

connect-distributed properties

  • Note: when running at production scale, Kafka Connect runs in distributed mode, and internal storage topics keep track of each connector's configuration, offsets, and status.
  • These topics need to be created explicitly on the Kafka cluster to which the sink connector will connect; the relevant properties are sketched below.
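A minimal sketch of the relevant entries in connect-distributed.properties (the broker addresses, group id, and topic names are placeholders; adjust them for your cluster):

```properties
bootstrap.servers=broker1:9092,broker2:9092
group.id=my-connect-cluster

# Internal storage topics that track connector config, offsets, and status.
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# Create these topics up front on the target cluster; for production,
# use a replication factor of 3 and keep the config topic at 1 partition.
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
```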

Prod Deployment

  • Once the sink connector is built, deploying it on a distributed system needs a proper deployment script that helps with operations and maintenance tasks.
  • An Ansible-based script to automate the deployment suits this best.
  • Ansible can provide options to run Kafka Connect on multiple ports, specify the Kafka version, and choose between a full deployment or specific operational tasks, etc. (connectors themselves are registered through the REST interface, as sketched below)
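Once the workers are up, connectors are managed through Kafka Connect's REST interface (port 8083 by default). A minimal sketch, reusing the hypothetical HttpSinkConnector from above (the connector name, topic, and endpoint are illustrative):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Registers a connector by POSTing its config to the Connect REST API.
public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector config as JSON; names and classes follow the sketch above.
        String body = """
                {
                  "name": "http-sink",
                  "config": {
                    "connector.class": "HttpSinkConnector",
                    "tasks.max": "3",
                    "topics": "events",
                    "http.endpoint": "http://target-system/ingest"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```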

Performance and Scale

  • Performance and scale are very important factors when running in production.
  • Here are a few stats from our testing; even higher numbers can be achieved with the help of horizontal scaling.
  • Test stats: 1 million events/sec, 1.5 GB of data processed/sec

My opinion

Given the features it provides, its native support in Kafka, the ease of writing a connector, and its scale and performance, make sure to consider Kafka Connect as one of the options whenever there is a need to move large collections of data to or from a Kafka cluster.

Kafka Connect Quickstart

The next post will help with a quick setup to get started.

Hope you enjoyed this article and learned something new. If yes, remember to hit the clap button :-)
