Part 1 — What is Kafka? Getting Started with Kafka Ecosystem
TL;DR: The goal of this post is to share high-level knowledge about what Kafka is, its use cases, core APIs, components, some of the available tools, and best practices/gotchas. This is the first post in a series of blog posts about Kafka.
What is Kafka?
Apache Kafka is a distributed streaming platform that is effective and reliable at handling massive amounts of incoming data from various sources and routing it to numerous destinations.
Apache Kafka® is a distributed streaming platform
History of Kafka
Kafka Use Cases
- Event Sourcing
- Messaging
- Log Aggregation
- Stream Processing
- Pub/Sub
- Activity Tracking
- Modern Extract, Transform, and Load (ETL)
- Change Data Capture (CDC)
- Big Data/Analytics
- Fraud/Anomaly Detection
- Many more
Benefits
- Low Latency
- High Throughput
- Fault-Tolerant
- Scalability
- High Concurrency
- By Default Persistent
- Immutable
- Reliable and durable
- Distributed
Limitations
- Messages cannot be tweaked in place once written
- No support for Wildcard Topic Selection
Kafka has five core APIs
- Producer API — The Producer API allows applications to send streams of data to topics in the Kafka cluster.
- Consumer API — The Consumer API allows applications to read streams of data from topics in the Kafka cluster. (A minimal producer/consumer sketch in Java follows this list.)
- Streams API — The Streams API allows transforming streams of data from input topics to output topics. It is well suited to implementing real-time analytics systems. However, certain stream operations have performance pitfalls you have to be careful about. I will cover more examples of the Streams API, and some of its gotchas, in a future article dedicated to streams.
- Connector API — The Connect API allows implementing connectors that continually pull data from a source system into Kafka or push data from Kafka into a sink system. Many prebuilt connectors are available for different source and sink systems, and the framework also supports implementing custom connectors.
- Admin API — The Admin API supports managing and inspecting topics, brokers, ACLs, and other Kafka objects.
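To make the Producer and Consumer APIs concrete, here is a minimal sketch using the plain Java clients. It assumes a broker at localhost:9092 and a pre-created topic named demo-topic; the topic, key, and group names are placeholders, not part of any real setup.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class QuickStart {
    public static void main(String[] args) {
        // Produce a single record to the (assumed) topic "demo-topic".
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("demo-topic", "key-1", "hello kafka"));
        }

        // Consume records from the same topic as part of a placeholder consumer group.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("demo-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```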
Supported Message Formats
Apache AVRO
Avro provides:
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
- Avro relies on schemas.
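As a rough illustration of that schema-driven approach, here is a hedged sketch that builds a record against an inline schema with the plain Avro Java API. The "User" schema and its fields are invented for this example.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class AvroSketch {
    public static void main(String[] args) {
        // Illustrative schema; a real project would manage schemas centrally.
        String schemaJson = "{"
                + "\"type\": \"record\", \"name\": \"User\","
                + "\"fields\": ["
                + "  {\"name\": \"name\", \"type\": \"string\"},"
                + "  {\"name\": \"age\", \"type\": \"int\"}"
                + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);
        System.out.println(user); // {"name": "Alice", "age": 30}
    }
}
```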
JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate.
PROTOBUF
Protocol Buffers is a high-performance, compact binary wire format invented by Google, which uses it internally so its network services can communicate with each other at very high speed.
As of Confluent Platform 5.5 (Apache Kafka 2.5.x), schemas are now supported for the JSON and Protobuf formats as well. Prior versions only supported schema integration for the Avro format.
Ecosystem/Components Overview
Apache ZooKeeper (Open Source)
Apache ZooKeeper is a centralized service for maintaining metadata about a cluster of distributed nodes, such as configuration information, health status, and group membership.
Confluent is in motion to eliminate Apache ZooKeeper as the metadata store in the Kafka ecosystem by adding a self-managed metadata quorum (KIP-500).
Apache Kafka aka Broker (Available as Open Source & Confluent License)
The broker is the heart of the Kafka ecosystem: a Kafka server that manages the persistence and replication of message data (i.e., the commit log).
Kafka Connect (Open Source)
Kafka has a built-in framework called Kafka Connect for writing sources and sinks that either continuously ingest data from external systems into Kafka or continuously export data from Kafka into external systems. The connectors themselves for different applications or data systems are federated and maintained separately from the main codebase. You can find a list of available connectors at the Confluent Hub.
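For a feel of what a connector definition looks like, here is an illustrative standalone configuration for the FileStreamSource connector that ships with Kafka; the file path and topic name are placeholders.

```properties
# file-source.properties — illustrative standalone source connector config
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/tmp/input.txt
topic=connect-file-demo
```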
Kafka REST Proxy (Open Source)
REST Proxy provides a RESTful interface to a Kafka cluster, making it easy to produce and consume messages, view the state of the cluster, and perform administrative actions without using the native Kafka protocol or clients.
Currently supported — metadata, producers, consumers, data formats, etc.
Limitation — Supports only SSL certificate-based security, with no support for extending ACLs. The REST Proxy also adds extra processing that affects performance.
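As an illustration, producing a JSON record through the REST Proxy v2 API looks roughly like this; the topic name and host are placeholders, and the exact content type depends on the data format you use.

```bash
# Produce one JSON record to the placeholder topic "rest-demo" (REST Proxy default port 8082)
curl -X POST \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  --data '{"records":[{"value":{"greeting":"hello"}}]}' \
  http://localhost:8082/topics/rest-demo
```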
Kafka Schema Registry (Open Source)
Confluent Schema Registry provides a serving layer for your metadata. It provides a RESTful interface for storing and retrieving your Avro®, JSON Schema, and Protobuf schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, and allows the evolution of schemas according to the configured compatibility settings and expanded support for these schema types. It provides serializers that plug into Apache Kafka® clients that handle schema storage and retrieval for Kafka messages that are sent in any of the supported formats. Schema Registry lives outside of and separately from your Kafka brokers.
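To show where the registry fits in, here is a hedged sketch of producer properties wired up to Schema Registry via Confluent's Avro serializer (from the kafka-avro-serializer artifact); the broker and registry URLs are placeholders.

```java
import java.util.Properties;

public class RegistryProducerConfig {
    static Properties avroProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // placeholder broker
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder registry URL
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The Avro serializer registers/looks up schemas in the registry for each record it writes.
        return props;
    }
}
```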
ksqlDB Server (Open Source)
ksqlDB is an event streaming database purpose-built to help developers create stream processing applications on top of Apache Kafka®.
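For a flavour of what that looks like, here is a hedged example; the stream, topic, and column names are invented for illustration.

```sql
-- Declare a stream over an existing topic, then derive a filtered stream from it.
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

CREATE STREAM pageviews_filtered AS
  SELECT * FROM pageviews WHERE userid = 'user_42';
```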
Control Center (Confluent License)
Confluent Control Center is a web-based tool for managing and monitoring Apache Kafka®. Control Center facilitates building and monitoring production data pipelines and streaming applications.
More options bring more confusion when selecting the correct API. Be mindful when selecting APIs/components for your use case.
e.g.: To produce data, you can build custom producers, use Kafka Connect or the Kafka REST Proxy, or use tools like Fluent Bit to bring data into Kafka.
Tools
Kafkacat (Open Source)
One of the most commonly used command line interfaces (CLIs) to interact with Kafka as both a producer and a consumer. A very mature Kafka tool, with good documentation and lots of examples.
Rating: 4.5 (Has room for improvement in terms of filtering data points and querying certain data.)
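For example, producing and consuming from the command line looks like the following; the broker address and topic name are placeholders.

```bash
# Produce lines from stdin to a placeholder topic
echo "hello" | kafkacat -b localhost:9092 -t demo-topic -P

# Consume the topic from the beginning and exit once the end of the partition is reached
kafkacat -b localhost:9092 -t demo-topic -C -o beginning -e
```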
Kafdrop (Open Source)
Kafdrop is a web UI for viewing Kafka topics and browsing consumer groups. The tool displays information such as brokers, topics, partitions, and consumers, and lets you view messages. The interface is quite buggy and not a great experience compared to some of the command-line tools.
Rating: 2 (A good option for folks who like to interact with a web interface, but not a robust system for assessing your Kafka data.)
Lenses.io (Paid)
Lenses.io provides a product with a mature user interface, and it comes as a paid option only. AWS partnered with Lenses.io as a strategic move to advance its MSK solution, launched at re:Invent 2018.
Rating: 3.5 (If you can afford the paid version and are looking for a web-based user interface.)
Operatr (Paid)
Operatr provides a fully-featured web interface with Monitoring for Kafka. It is only available as a Paid option. It provides really deep monitoring features to support the Kafka ecosystem.
Rating: 4.5 (Again, if you can afford the paid version and are looking for a web-based user interface.)
Zoe (Open Source)
One of the newest Kafka clients. It seems very slick in terms of the features it brings. However, it is very slow, and its documentation can barely be considered solid. I am impressed with the features, so I am definitely looking forward to seeing this CLI mature. Zoe shines in Kubernetes environments since it can leverage Kubernetes scaling for consumption and execution.
Rating: 3.5 (Currently rating it conservatively since it is less than a couple of weeks past its first release and definitely needs performance and documentation improvements.)
Kukulcan — Scala/Java/REPL (Open Source)
Another new kid on the block: a Scala/Java REPL for Kafka. Kukulcan provides an API and different sorts of REPLs to interact with your streams or administer your Apache Kafka deployment. It depends on at least JDK 11 and SBT.
Rating: N/A (I have not had the opportunity to explore it yet, so I cannot rate it at this point.)
Best Practices
Avoid default configuration for Kafka components
- Disable topic auto-creation
- Override the default topic partition count and replication factor
- Provision appropriate memory so consumers read from the page cache rather than from disk
- Use data compression (see the example overrides after this list)
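For example, a minimal set of server.properties overrides might look like this; the values are placeholders and should be tuned for your workload.

```properties
# server.properties — illustrative overrides of risky defaults
# Do not create topics implicitly when clients reference them
auto.create.topics.enable=false
# Default partition count and replication factor for new topics
num.partitions=6
default.replication.factor=3
# Require two in-sync replicas for acks=all writes
min.insync.replicas=2
```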
Avoid default configuration for Producers
- acknowledgment, aka acks — how many in-sync replicas need to acknowledge the message when it is published to the topic. Affects performance.
- retries — the number of retries if publishing a message fails.
- max.in.flight.requests.per.connection — the maximum number of unacknowledged requests the producer will have in flight on a single connection.
- Understand your data distribution to select the appropriate key for topic partitioning. (A producer tuning sketch follows this list.)
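Here is a hedged sketch of producer properties that override those defaults; the broker address and the specific values are placeholders.

```java
import java.util.Properties;

public class ProducerTuning {
    static Properties tunedProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // placeholder broker
        props.put("acks", "all");                                 // wait for all in-sync replicas
        props.put("retries", "5");                                // retry transient send failures
        props.put("max.in.flight.requests.per.connection", "1"); // preserve ordering across retries
        props.put("compression.type", "lz4");                     // trade CPU for less network/disk IO
        return props;
    }
}
```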
Avoid default configuration for Consumers
- Tune polling behavior (how often you call poll() and how many records each poll returns)
- Design consumers for high throughput. For throughput, tune fetch.min.bytes and max.poll.records (batch.size and buffer.memory are the producer-side counterparts). A consumer tuning sketch follows this list.
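A hedged sketch of consumer overrides aimed at throughput; the broker address, group name, and values are placeholders.

```java
import java.util.Properties;

public class ConsumerTuning {
    static Properties tunedConsumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "demo-group");              // placeholder consumer group
        props.put("fetch.min.bytes", "65536");            // let the broker accumulate larger batches per fetch
        props.put("fetch.max.wait.ms", "500");            // cap the extra latency introduced by waiting
        props.put("max.poll.records", "1000");            // records handed back by each poll()
        return props;
    }
}
```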
General
- Review the fundamentals and internals of APIs you are planning to leverage.
- Make sure to perform a load test before you take your Kafka architecture to production.
- The majority of projects run into data leak issues with Kafka, so make sure you have a robust monitoring/logging system. Use JMX exporters to get easy access to Kafka metrics.
- Use ACLs to secure your topics
- Enable compression — compression uses more CPU but reduces the amount of IO.
Glossary/Terminology
- Broker — Servers where records are stored. Multiple brokers can be used to form a cluster.
- Topics — A topic is a category to which data records or messages are published. Consumers subscribe to topics in order to read the data written to them.
- Partition — Defines a non-overlapping subset of records within a topic.
- Producers — Producers publish messages to Kafka topics.
- Consumers — Consumers read messages from Kafka topics by subscribing to topic partitions.
- Offset — A unique sequential number assigned to each record within a topic partition.
- Record/Message — A record contains a key, a value, a timestamp, and a list of headers.
- Consumer Group — A consumer group is a group of related consumers that perform a task. Each consumer group tracks its own offsets per partition.
- Acks — Producers can choose to receive acknowledgments for the data writes to the partition using the “acks” setting.
- Replicas — Replication is the process to offer fail-over capability for a topic.
- Source Connector — Allows importing data from a source system into Kafka
- Sink Connector — Allows exporting data from Kafka to a sink (target) system