Part 1 — What is Kafka? Getting Started with Kafka Ecosystem
TL;DR: The goal of this post is to share high-level knowledge about what Kafka is, its use cases, core APIs, components, some of the available tools, and best practices/gotchas. This is the first post in a series of blog posts about Kafka.
What is Kafka?
Apache Kafka is a distributed streaming platform that is effective and reliable at handling massive amounts of incoming data from various sources and routing it to numerous destinations.
Apache Kafka® is a distributed streaming platform
History of Kafka
Kafka Use Cases
- Event Sourcing
- Messaging
- Log Aggregation
- Stream Processing
- Pub/Sub
- Activity Tracking
- Modern Extract, Transform, and Load (ETL)
- Change Data Capture (CDC)
- Big Data/Analytics
- Fraud/Anomaly Detection
- Many more
Benefits
- Low Latency
- High Throughput
- Fault-Tolerant
- Scalability
- High Concurrency
- By Default Persistent
- Immutable
- Reliable and durable
- Distributed
Limitations
- Messages cannot be tweaked in place once written
- No support for Wildcard Topic Selection
Kafka has five core APIs
- Producer API — The Producer API allows applications to send streams of data to topics in the Kafka cluster.
- Consumer API — The Consumer API allows applications to read streams of data from topics in the Kafka cluster. (A minimal producer/consumer sketch in Java follows this list.)
- Streams API — The Streams API allows transforming streams of data from input topics to output topics. It is well suited to implementing real-time analytics systems. However, certain stream operations have performance pitfalls you have to be careful about. I will cover more examples of the Streams API, and some of its gotchas, in a future article dedicated to streams.
- Connector API — The Connect API allows implementing connectors that continually pull data from a source system into Kafka or push data from Kafka into a sink system. Many prebuilt connectors are available for different source and sink systems, and the framework also supports implementing custom connectors.
- Admin API — The Admin API supports managing and inspecting topics, brokers, ACLs, and other Kafka objects.
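To make the Producer and Consumer APIs concrete, here is a minimal sketch using the plain Java clients. It assumes a broker at localhost:9092 and a pre-created topic named demo-topic; the topic, key, and group names are placeholders, not part of any real setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class QuickStart {
    public static void main(String[] args) {
        // Produce a single record to the (assumed) topic "demo-topic".
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }

        // Consume records from the same topic as part of a placeholder consumer group.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```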
Supported Message Formats
Apache AVRO
Avro provides:
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
- Avro relies on schemas.
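As a rough illustration of that schema-driven approach, here is a hedged sketch that builds a record against an inline schema with the plain Avro Java API. The "User" schema and its fields are invented for this example.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
    public static void main(String[] args) {
        // Illustrative schema; a real project would manage schemas centrally.
        String schemaJson = "{"
                + "\"type\": \"record\", \"name\": \"User\","
                + "\"fields\": ["
                + "  {\"name\": \"name\", \"type\": \"string\"},"
                + "  {\"name\": \"age\", \"type\": \"int\"}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);
        System.out.println(user); // {"name": "Alice", "age": 30}
    }
}
```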
JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.
PROTOBUF
Protocol Buffers is a high-performance, compact binary wire format invented by Google, which uses it internally so its network services can communicate with each other at very high speed.
As of Confluent Platform 5.5 (Apache Kafka 2.5.x), schemas are now supported for the JSON and Protobuf formats as well. Prior versions only supported schema integration for the Avro format.
Ecosystem/Components Overview
Apache ZooKeeper (Open Source)
Apache ZooKeeper is a centralized service for maintaining metadata about a cluster of distributed nodes, such as configuration information, health status, and group membership.
Confluent is in motion to eliminate Apache ZooKeeper as the metadata store in the Kafka ecosystem by adding a self-managed metadata quorum (KIP-500).
Apache Kafka aka Broker (Available as Open Source & Confluent License)
The broker is the heart of the Kafka ecosystem: a Kafka server that manages the persistence and replication of message data (i.e., the commit log).
Kafka Connect (Open Source)
Kafka has a built-in framework called Kafka Connect for writing sources and sinks that either continuously ingest data from external systems into Kafka or continuously export data from Kafka into external systems. The connectors themselves for different applications or data systems are federated and maintained separately from the main codebase. You can find a list of available connectors at the Confluent Hub.
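For a feel of what a connector definition looks like, here is an illustrative standalone configuration for the FileStreamSource connector that ships with Kafka; the file path and topic name are placeholders.

```properties
# file-source.properties — illustrative standalone source connector config
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-file-demo
```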
Kafka REST Proxy (Open Source)
REST Proxy provides a RESTful interface to a Kafka cluster, making it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.
Currently supported — metadata, producers, consumers, data formats, etc.
Limitation — Supports only SSL certificate-based security, with no support for extending ACLs. The REST Proxy also adds extra processing that affects performance.
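As an illustration, producing a JSON record through the REST Proxy v2 API looks roughly like this; the topic name and host are placeholders, and the exact content type depends on the data format you use.

```bash
# Produce one JSON record to the placeholder topic "rest-demo" (REST Proxy default port 8082)
curl -X POST \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"value":{"greeting":"hello"}}]}' \
  http://localhost:8082/topics/rest-demo
```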
Kafka Schema Registry (Open Source)
Confluent Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving your Avro®, JSON Schema, and Protobuf schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows the evolution of schemas according to the configured compatibility settings and expanded support for these schema types. It provides serializers that plug into Apache Kafka® clients that handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats. Schema Registry lives outside of and separately from your Kafka brokers.
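To show where the registry fits in, here is a hedged sketch of producer properties wired up to Schema Registry via Confluent's Avro serializer (from the kafka-avro-serializer artifact); the broker and registry URLs are placeholders.

```java
import java.util.Properties;

public class RegistryProducerConfig {
    static Properties avroProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder broker
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry URL
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The Avro serializer registers/looks up schemas in the registry for each record it writes.
        return props;
    }
}
```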
ksqlDB Server (Open Source)
ksqlDB is an event streaming database purpose-built to help developers create stream processing applications on top of Apache Kafka®.
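For a flavour of what that looks like, here is a hedged example; the stream, topic, and column names are invented for illustration.

```sql
-- Declare a stream over an existing topic, then derive a filtered stream from it.
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE STREAM pageviews_filtered AS
  SELECT * FROM pageviews WHERE userid = 'user_42';
```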
Control Center (Confluent License)
Confluent Control Center is a web-based tool for managing and monitoring Apache Kafka®. Control Center facilitates building and monitoring production data pipelines and streaming applications.
More options bring more confusion when selecting the correct API. Be mindful when selecting APIs/components for your use case.
e.g.: To produce data, you can build custom producers, use Kafka Connect or the Kafka REST Proxy, or use tools like Fluent Bit to bring data into Kafka.
Tools
Kafkacat (Open Source)
One of the most commonly used command line interfaces (CLIs) to interact with Kafka as both a producer and a consumer. A very mature Kafka tool, with good documentation and lots of examples.
Rating: 4.5 (Has room for improvement in terms of filtering data points and querying certain data.)
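For example, producing and consuming from the command line looks like the following; the broker address and topic name are placeholders.

```bash
# Produce lines from stdin to a placeholder topic
echo "hello" | kafkacat -b localhost:9092 -t demo-topic -P

# Consume the topic from the beginning and exit once the end of the partition is reached
kafkacat -b localhost:9092 -t demo-topic -C -o beginning -e
```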
Kafdrop (Open Source)
Kafdrop is a web UI for viewing Kafka topics and browsing consumer groups. The tool displays information such as brokers, topics, partitions, and consumers, and lets you view messages. The interface is quite buggy and not a great experience compared to some of the command-line tools.
Rating: 2 (A good option for folks who like to interact with a web interface, but not a robust system for assessing your Kafka data.)
Lenses.io (Paid)
Lenses.io provides a product with a mature user interface, and it comes as a paid option only. AWS partnered with Lenses.io as a strategic move to advance its MSK solution, launched at re:Invent 2018.
Rating: 3.5 (If you can afford the paid version and are looking for a web-based user interface.)
Operatr (Paid)
Operatr provides a fully-featured web interface with Monitoring for Kafka. It is only available as a Paid option. It provides really deep monitoring features to support the Kafka ecosystem.
Rating: 4.5 (Again, if you can afford the paid version and are looking for a web-based user interface.)
Zoe (Open Source)
One of the newest Kafka clients. It seems very slick in terms of the features it brings. However, it is very slow, and its documentation can barely be considered solid. I am impressed with the features, so I am definitely looking forward to seeing this CLI mature. Zoe shines in Kubernetes environments since it can leverage Kubernetes scaling for consumption and execution.
Rating: 3.5 (Currently rating it conservatively since it is less than a couple of weeks past its first release and definitely needs performance and documentation improvements.)
Kukulcan — Scala/Java/REPL (Open Source)
Another new kid on the block: a Scala/Java REPL for Kafka. Kukulcan provides an API and different sorts of REPLs to interact with your streams or administer your Apache Kafka deployment. It depends on at least JDK 11 and SBT.
Rating: N/A (I have not had the opportunity to explore it yet, so I cannot rate it at this point.)
Best Practices
Avoid default configuration for Kafka components
- Disable topic auto-creation
- Override the default topic partition count and replication factor
- Provision appropriate memory so consumers read from the page cache rather than from disk
- Use data compression (see the example overrides after this list)
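For example, a minimal set of server.properties overrides might look like this; the values are placeholders and should be tuned for your workload.

```properties
# server.properties — illustrative overrides of risky defaults
# Do not create topics implicitly when clients reference them
auto.create.topics.enable=false
# Default partition count and replication factor for new topics
num.partitions=6
default.replication.factor=3
# Require two in-sync replicas for acks=all writes
min.insync.replicas=2
```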
Avoid default configuration for Producers
- acknowledgment, aka acks — how many in-sync replicas need to acknowledge the message when it is published to the topic. Affects performance.
- retries — the number of retries if publishing a message fails.
- max.in.flight.requests.per.connection — the maximum number of unacknowledged requests the producer will have in flight on a single connection.
- Understand your data distribution to select the appropriate key for topic partitioning. (A producer tuning sketch follows this list.)
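Here is a hedged sketch of producer properties that override those defaults; the broker address and the specific values are placeholders.

```java
import java.util.Properties;

public class ProducerTuning {
    static Properties tunedProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // placeholder broker
        props.put("acks", "all");                                 // wait for all in-sync replicas
        props.put("retries", "5");                                // retry transient send failures
        props.put("max.in.flight.requests.per.connection", "1"); // preserve ordering across retries
        props.put("compression.type", "lz4");                     // trade CPU for less network/disk IO
        return props;
    }
}
```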
Avoid default configuration for Consumers
- Tune polling behavior (how often you call poll() and how many records each poll returns)
- Design consumers for high throughput. For throughput, tune fetch.min.bytes and max.poll.records (batch.size and buffer.memory are the producer-side counterparts). A consumer tuning sketch follows this list.
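A hedged sketch of consumer overrides aimed at throughput; the broker address, group name, and values are placeholders.

```java
import java.util.Properties;

public class ConsumerTuning {
    static Properties tunedConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");              // placeholder consumer group
        props.put("fetch.min.bytes", "65536");            // let the broker accumulate larger batches per fetch
        props.put("fetch.max.wait.ms", "500");            // cap the extra latency introduced by waiting
        props.put("max.poll.records", "1000");            // records handed back by each poll()
        return props;
    }
}
```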
General
- Review the fundamentals and internals of APIs you are planning to leverage.
- Make sure to perform a load test before you take your Kafka architecture to production.
- The majority of projects run into data leak issues with Kafka, so make sure you have a robust monitoring/logging system. Use JMX exporters to get easy access to Kafka metrics.
- Use ACLs to secure your topics
- Enable compression — compression uses more CPU but reduces the amount of IO.
Glossary/Terminology
- Broker — Servers where records are stored. Multiple brokers can be used to form a cluster.
- Topics — A topic is a category to which data records or messages are published. Consumers subscribe to topics in order to read the data written to them.
- Partition — Defines a non-overlapping subset of records within a topic.
- Producers — Producers publish messages to Kafka topics.
- Consumers — Consumers read messages from Kafka topics by subscribing to topic partitions.
- Offset — A unique sequential number assigned to each record within a topic partition.
- Record/Message — A record contains a key, a value, a timestamp, and a list of headers.
- Consumer Group — A consumer group is a group of related consumers that perform a task. Each consumer group tracks its own offsets per partition.
- Acks — Producers can choose to receive acknowledgments for the data writes to the partition using the “acks” setting.
- Replicas — Replication is the process to offer fail-over capability for a topic.
- Source Connector — Allows importing data from a source system into Kafka
- Sink Connector — Allows exporting data from Kafka to a sink (target) system