Kafka Connect Overview

Lucky Kurhe
Walmart Global Tech Blog
3 min read · Feb 5, 2020


Overview

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems.

  • A common framework for Kafka connectors
  • Distributed and standalone modes
  • REST interface for better operations
  • Automatic offset management
  • Distributed and scalable by default
  • Streaming/batch integration

ref: https://kafka.apache.org/documentation/

Kafka Connect is used to move large collections of data to and from a Kafka cluster.

Kafka Connect provides a framework that simplifies writing connectors for two specific tasks: collecting large volumes of data from Kafka and sending it to a target system (a sink connector), or sending events collected from a source system to Kafka (a source connector).

A connector written for Kafka is either a source or a sink connector.

  • A source connector helps with sending large sets of data from a source system to Kafka (i.e. it produces events to Kafka)
  • A sink connector helps with fetching large sets of data from Kafka to be sent to a target system (i.e. it consumes events from Kafka)

Writing these connectors is very simple with Kafka Connect.

When working with Kafka clusters, Kafka Connect takes care of most of the heavy lifting: establishing connections, handling cluster-specific configuration, and managing rebalances.

So developers can focus on the tasks specific to their project's needs, and Kafka Connect handles the rest.

Implementation

In this section, let's take a look at the sink connector in detail (a source connector can be built in a similar way).

SinkConnector

  • Your connector class extends SinkConnector. It is the entry point and is responsible for spinning up the required number of tasks by providing each task with its configuration; a minimal sketch follows this list.
  • More about this configuration is covered later in this post.
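Here is a minimal sketch of a SinkConnector, assuming a hypothetical HttpSinkConnector that forwards records to an HTTP endpoint (the class name, the http.endpoint property, and the paired HttpSinkTask are illustrative, not from the original post):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;

// Hypothetical example: a sink connector that forwards records to an HTTP endpoint.
public class HttpSinkConnector extends SinkConnector {

    private Map<String, String> configProps;

    @Override
    public void start(Map<String, String> props) {
        // Called once when the connector starts; keep the supplied config.
        this.configProps = new HashMap<>(props);
    }

    @Override
    public Class<? extends Task> taskClass() {
        // Tells the framework which SinkTask implementation to instantiate.
        return HttpSinkTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Spin up to maxTasks tasks, each receiving a copy of the configuration.
        List<Map<String, String>> taskConfigs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            taskConfigs.add(new HashMap<>(configProps));
        }
        return taskConfigs;
    }

    @Override
    public void stop() {
        // Nothing to clean up in this sketch.
    }

    @Override
    public ConfigDef config() {
        // Declare the connector's configuration so Connect can validate it.
        return new ConfigDef()
                .define("http.endpoint", ConfigDef.Type.STRING,
                        ConfigDef.Importance.HIGH, "Target HTTP endpoint");
    }

    @Override
    public String version() {
        return "1.0";
    }
}
```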

SinkTask

  • Your sink task class extends SinkTask; it contains the main logic to be applied to the records received from the Kafka cluster, as sketched below.
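A matching sketch of the SinkTask, continuing the hypothetical HttpSinkConnector example above (the delivery logic here is a placeholder):

```java
import java.util.Collection;
import java.util.Map;

import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// Hypothetical example: the task that pairs with HttpSinkConnector above.
public class HttpSinkTask extends SinkTask {

    private String endpoint;

    @Override
    public void start(Map<String, String> props) {
        // Receives the per-task configuration built in taskConfigs().
        this.endpoint = props.get("http.endpoint");
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // Main logic: called repeatedly with batches of records consumed from Kafka.
        for (SinkRecord record : records) {
            // Replace this with a real delivery to the target system.
            System.out.printf("Sending %s to %s%n", record.value(), endpoint);
        }
    }

    @Override
    public void stop() {
        // Release any resources (HTTP clients, buffers, etc.) here.
    }

    @Override
    public String version() {
        return "1.0";
    }
}
```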

SinkRecord

  • Gives details for each record fetched from the Kafka cluster, including the value, topic, Kafka offset, Kafka partition, etc. (see the snippet below)
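For illustration, a small helper showing the accessors a SinkRecord exposes (the describe method itself is hypothetical, e.g. something you might call inside put()):

```java
import org.apache.kafka.connect.sink.SinkRecord;

// Illustrative accessor usage for a record consumed from Kafka.
static void describe(SinkRecord record) {
    String topic = record.topic();              // source topic
    Integer partition = record.kafkaPartition(); // partition the record came from
    long offset = record.kafkaOffset();          // offset within that partition
    Object key = record.key();                   // record key (may be null)
    Object value = record.value();               // record value / payload
    System.out.printf("%s-%d@%d: %s -> %s%n", topic, partition, offset, key, value);
}
```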

connect-distributed properties

  • Note: when running at production scale, Kafka Connect runs in distributed mode, and internal storage topics keep track of each connector's configuration, offsets, and status.
  • These topics need to be created explicitly on the Kafka cluster to which the sink connector will connect; the relevant properties are sketched below.
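A minimal sketch of the relevant entries in connect-distributed.properties (the broker addresses, group id, and topic names are placeholders; adjust them for your cluster):

```properties
bootstrap.servers=broker1:9092,broker2:9092
group.id=my-connect-cluster

# Internal storage topics that track connector config, offsets, and status.
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status

# Create these topics up front on the target cluster; for production,
# use a replication factor of 3 and keep the config topic at 1 partition.
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
```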

Prod Deployment

  • Once the sink connector is built, deploying it on a distributed system needs a proper deployment script that helps with operations and maintenance tasks.
  • An Ansible-based script to automate the deployment suits this best.
  • Ansible can provide options to run Kafka Connect on multiple ports, specify the Kafka version, and choose between a full deployment or specific operational tasks, etc. (connectors themselves are registered through the REST interface, as sketched below)
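Once the workers are up, connectors are managed through Kafka Connect's REST interface (port 8083 by default). A minimal sketch, reusing the hypothetical HttpSinkConnector from above (the connector name, topic, and endpoint are illustrative):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Registers a connector by POSTing its config to the Connect REST API.
public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Connector config as JSON; names and classes follow the sketch above.
        String body = """
                {
                  "name": "http-sink",
                  "config": {
                    "connector.class": "HttpSinkConnector",
                    "tasks.max": "3",
                    "topics": "events",
                    "http.endpoint": "http://target-system/ingest"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```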

Performance and Scale

  • Performance and scale are very important factors when running in production.
  • Here are a few stats from our testing; even higher numbers can be achieved with the help of horizontal scaling.
  • Test stats: 1 million events/sec, 1.5 GB of data processed/sec

My opinion

Given the features it provides, its native support in Kafka, the ease of writing a connector, and its scale and performance, make sure to consider Kafka Connect as one of the options whenever there is a need to move large collections of data to or from a Kafka cluster.

Kafka Connect Quickstart

The next post will help with a quick setup to get started.

Hope you enjoyed this article and learned something new. If yes, remember to hit the clap button :-)
