Kafka & Kafka Providers

sunday ayandele
Blue Harvest Tech Blog
17 min read · Jul 6, 2022


Kafka is one of the top solutions on the market for data streaming, used by thousands of companies, and its popularity keeps growing thanks to the efficiency, availability, and flexibility this messaging technology offers. It is therefore a technology well worth learning for IT professionals. This article will give you a better understanding of the benefits, key concepts, components, and architecture of Kafka, as well as an overview of Kafka deployment options and vendors.

Kafka is an open-source framework under the Apache 2.0 license. Apache Kafka® is a high-throughput, distributed, fault-tolerant, and enterprise-ready event-streaming platform. It combines messaging, storage, processing, and integration of high volumes of data at scale, in real-time.

Some common use cases of Apache Kafka are:

● Tracking web activities by storing/sending the events for real-time processes.

● Alerting and reporting the operational metrics.

● Transforming data into the standard format.

● Continuous processing of streaming data to the topics.

Kafka is different from traditional message queues (like RabbitMQ) in that:

Kafka retains messages for a configurable period after they are consumed (the default is 7 days), while RabbitMQ removes a message as soon as the consumer’s acknowledgment is received.

RabbitMQ pushes messages to consumers and keeps track of their load, deciding how many messages each consumer should have in processing (this behavior is configurable). Kafka instead has consumers fetch (pull) messages. Kafka is also designed to scale horizontally, by adding more nodes, while traditional message queues expect to scale vertically, by adding more power to the same machine.
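The pull model and retention behavior described above can be sketched with a toy in-memory log (an illustration only, not the Kafka client API):

```python
# Toy in-memory sketch of Kafka's pull model: the log retains records
# after they are read, and each consumer tracks its own offset.
# Illustration only -- this is not the Kafka client API.

class PartitionLog:
    def __init__(self):
        self.records = []                    # retained regardless of consumption

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1         # offset of the new record

class Consumer:
    def __init__(self, log):
        self.log, self.offset = log, 0

    def poll(self, max_records=10):
        batch = self.log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)            # the consumer, not the broker, tracks position
        return batch

log = PartitionLog()
for i in range(5):
    log.append(f"event-{i}")

fast, slow = Consumer(log), Consumer(log)
print(fast.poll())       # the fast consumer pulls all five records
print(slow.poll(2))      # an independent consumer pulls at its own pace
print(len(log.records))  # records are still retained after consumption: 5
```

Because the broker never deletes a record on acknowledgment, any number of consumers can read the same data at their own pace.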

Kafka Key Concepts

Some of the key concepts:

>Producer: Applications that publish a stream of records to one or more Kafka topics.

>Consumer: Applications that read data from Kafka topics.

>Kafka Topic: It is a stream of data and is composed of individual records.

>Kafka Brokers: Store data provided by producers and handle all requests from clients.

>Kafka Partition: The main concurrency mechanism in Kafka.

>Kafka Cluster: Consists of one or more servers (Kafka brokers) running Kafka.

>Kafka Zookeeper: Manages the brokers, topics, and users in the cluster. Starting with Kafka v2.8.0 (released in April 2021), Kafka can be run without ZooKeeper. This sans-ZooKeeper mode is formally named Kafka Raft Metadata mode, but the developers shortened it to KRaft mode and pronounce it like the word “craft”.
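To make the partition concept concrete: a keyed record is mapped to a partition by hashing its key. Kafka's default partitioner uses a murmur2 hash; the sketch below substitutes `zlib.crc32` purely for illustration:

```python
# Sketch of how a keyed record maps to a partition. Kafka's default
# partitioner hashes the key with murmur2; zlib.crc32 stands in here
# purely for illustration.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# The same key always lands on the same partition, which is what
# preserves per-key ordering within a topic.
assert partition_for(b"user-42", 6) == partition_for(b"user-42", 6)
print(partition_for(b"user-42", 6))
```

This is also why adding partitions to an existing topic can change which partition a given key maps to.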


Kafka Component Overview

The main Kafka components are topics, producers, consumers, consumer groups, clusters, brokers, partitions, replicas, leaders, and followers.

The diagram above offers a simplified look at the interrelations between these components. Note the following when it comes to brokers, replicas, and partitions:

● Kafka clusters may include one or more brokers.

● Kafka brokers are able to host multiple partitions.

● Topics are able to include one or more partitions.

● Brokers are able to host either one or zero replicas for each partition.

● Each partition includes one leader replica, and zero or greater follower replicas.

● Each of a partition’s replicas has to be on a different broker.

● Each partition replica has to fit completely on a broker and cannot be split onto more than one broker.

● Each broker can be the leader for zero or more topic/partition pairs.
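The placement constraints above can be illustrated with a toy assignment function (this is not Kafka's actual assignment algorithm, just a sketch that honors the rules listed):

```python
# Illustrative sketch (not Kafka's real algorithm) of the constraints
# listed above: each partition gets one leader plus follower replicas,
# and every replica of a partition sits on a different broker.

def assign_replicas(num_partitions, brokers, replication_factor):
    assert replication_factor <= len(brokers)  # replicas need distinct brokers
    assignment = {}
    for p in range(num_partitions):
        leader = brokers[p % len(brokers)]     # spread leaders round-robin
        followers = [brokers[(p + i) % len(brokers)]
                     for i in range(1, replication_factor)]
        assignment[p] = [leader] + followers   # first entry is the leader
    return assignment

plan = assign_replicas(num_partitions=3, brokers=[101, 102, 103],
                       replication_factor=2)
print(plan)   # {0: [101, 102], 1: [102, 103], 2: [103, 101]}
for replicas in plan.values():
    assert len(set(replicas)) == len(replicas)  # distinct brokers per partition
```

Note how leadership is spread across brokers, so no single broker carries all the write traffic.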

Now let’s look at a few examples of how producers, topics, and consumers relate to one another:

Here we see a simple example of a producer sending a message to a topic, and a consumer that is subscribed to that topic reading the message.

The below diagram demonstrates how producers can send messages to singular topics:

Consumers can subscribe to multiple topics at once and receive messages from them in a single poll (Consumer 3 in the diagram shows an example of this).

Now let’s look at a producer that is sending messages to multiple topics at once, in an asynchronous manner:

Technically, a producer sends each message to a single topic. By sending messages asynchronously, however, a producer can in effect deliver messages to multiple topics as needed.

Kafka API Architecture

Apache Kafka offers four main APIs:

Producer API:

The Kafka Producer API enables an application to publish a stream of records to one or more Kafka topics.
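A producer is typically configured with a handful of settings before publishing. The fragment below is an illustrative configuration using standard Apache Kafka client properties (the broker addresses are placeholders):

```properties
# Illustrative producer configuration (broker addresses are placeholders)
bootstrap.servers=broker1:9092,broker2:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
# Wait for all in-sync replicas to acknowledge before confirming a send
acks=all
retries=3
# Small batching delay to trade a little latency for throughput
linger.ms=5
```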

Consumer API:

The Kafka Consumer API enables an application to subscribe to one or more Kafka topics. It also makes it possible for the application to process streams of records that are produced on those topics.
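The matching consumer side is also configuration-driven. Again an illustrative fragment with standard Kafka client properties (broker address and group name are placeholders):

```properties
# Illustrative consumer configuration (broker address and group.id are placeholders)
bootstrap.servers=broker1:9092
# Consumers sharing a group.id split the topic's partitions between them
group.id=analytics-service
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
# Where to start reading when no committed offset exists yet
auto.offset.reset=earliest
# Commit offsets explicitly after processing instead of on a timer
enable.auto.commit=false
```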

Streams API:

The Kafka Streams API allows an application to process data in Kafka using a stream processing paradigm. With this API, an application can consume input streams from one or more topics, process them with stream operations, and produce output streams and send them to one or more topics. In this way, the Streams API makes it possible to transform input streams into output streams.
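The consume-transform-produce cycle the Streams API implements can be sketched in plain Python over in-memory lists (the real Streams API is a Java library with stateful operators, windowing, and exactly-once semantics; topic names here are invented):

```python
# Word-count style consume-transform-produce loop, sketched over
# in-memory lists. Illustration of the idea only, not the Streams API.
from collections import Counter

input_topic = ["the quick fox", "the lazy dog"]

# "Transform": flat-map each record into words, then aggregate counts.
counts = Counter(word for line in input_topic for word in line.split())

# "Produce": emit each aggregate as a keyed record to an output topic.
output_topic = sorted(counts.items())
print(output_topic)
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

In the real API the input and output would both be Kafka topics, and the running counts would live in a fault-tolerant state store.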

Connect API:

The Kafka Connect API connects applications or data systems to Kafka topics. It provides a framework for building and running reusable producers and consumers (connectors) that link Kafka topics to external systems. For instance, a connector could capture all updates to a database and ensure those changes are made available within a Kafka topic.
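Connectors are configured declaratively rather than coded. The snippet below follows the FileStreamSource example from the Apache Kafka quickstart (the connector name, file path, and topic are placeholders):

```json
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/test.txt",
    "topic": "connect-test"
  }
}
```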

Kafka Deployment Options and Vendors

Kafka can be deployed in the cloud (private, public, or hybrid) or on-premise. The different Kafka vendors can be grouped into the following:

(1) Open-Source: Apache Kafka itself, available from the Apache website under the Apache 2.0 license.

(2) Self-managed: Confluent Platform, Microsoft Azure HDInsight, Instaclustr, Cloudera, Red Hat.

(3) Fully-Managed Cloud Offering: Confluent Cloud, Amazon MSK, Keen

(4) Managed Kafka-as-a-Service: Aiven, Instaclustr Managed Kafka*, Cloudkarafka*

Managed Kafka-as-a-Service platforms handle deploying, maintaining, and updating Apache Kafka infrastructure, allowing teams to focus their IT resources on product development. Data infrastructure is configurable via a UI and is compatible with additional solutions for data storage like Apache.

Some popular Kafka vendors are listed below:

- Amazon MSK

Amazon Managed Streaming for Apache Kafka (MSK) runs open-source versions of Apache Kafka on AWS with high availability and security. The features are:

● Fully compatible

● No Servers to manage

● Highly available & highly secure

● Scalable

● Deeply Integrated

● Configurable

For provisioned clusters, you specify and then scale cluster capacity to meet your needs.

You pay an hourly rate for Apache Kafka broker instance usage (billed at one-second resolution), with varying fees depending on the size of the broker instance and active brokers in your Amazon MSK clusters.

You also pay for the amount of storage you provision in your cluster. This is calculated by adding up the GB provisioned per hour and dividing by the total number of hours in the month, resulting in a “GB-months” value.
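As a hypothetical example of that GB-months arithmetic (the figures below are made up and are not actual AWS rates or billing values):

```python
# Hypothetical example of the "GB-months" storage calculation described
# above: sum the GB provisioned in each hour, divide by hours in the month.
hours_in_month = 730                 # common billing approximation for a month
# 100 GB provisioned for half the month, 200 GB after scaling up:
gb_hours = 100 * 365 + 200 * 365
gb_months = gb_hours / hours_in_month
print(round(gb_months, 1))           # 150.0 GB-months
```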

You are not charged for data transfer between brokers or between Apache ZooKeeper nodes and brokers. You will pay standard AWS data transfer charges for data transferred in and out of Amazon MSK clusters.

With MSK Serverless, you don’t need to specify or scale cluster capacity.

You pay an hourly rate for your serverless clusters and an hourly rate for each partition that you create. Additionally, you pay per GB of data that your producers write to and your consumers read from the topics in your cluster. You can also retain data for up to 1 day. Amazon MSK charges you only for the storage you consume

For MSK Connect, you pay an hourly rate for connector usage (billed at one-second resolution), with varying fees depending on the number of workers you use for your connector and the size of each worker, measured in a number of MSK Connect Units (MCUs). Each MCU provides 1 vCPU of compute and 4 GB of memory.

Note: Interesting side note about the commercial support and SLAs of AWS’s Kafka offering: Kafka is excluded from MSK support! Quote from the MSK SLAs: “The Service Commitment DOES NOT APPLY to any unavailability, suspension, or termination… caused by the underlying Apache Kafka or Apache ZooKeeper engine software that leads to request failures…”

The diagram demonstrates the interaction between the following components:

•Broker nodes — When creating an Amazon MSK cluster, you specify how many broker nodes you want Amazon MSK to create in each Availability Zone. In the example cluster shown in this diagram, there’s one broker per Availability Zone. Each Availability Zone has its own virtual private cloud (VPC) subnet.

•ZooKeeper nodes — Amazon MSK also creates the Apache ZooKeeper nodes for you. Apache ZooKeeper is an open-source server that enables highly reliable distributed coordination.

•Producers, consumers, and topic creators — Amazon MSK lets you use Apache Kafka data-plane operations to create topics and to produce and consume data.

•Cluster Operations You can use the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the APIs in the SDK to perform control-plane operations. For example, you can create or delete an Amazon MSK cluster, list all the clusters in an account, view the properties of a cluster, and update the number and type of brokers in a cluster.

Amazon MSK detects and automatically recovers from the most common failure scenarios for clusters so that your producer and consumer applications can continue their write and read operations with minimal impact. When Amazon MSK detects a broker failure, it mitigates the failure or replaces the unhealthy or unreachable broker with a new one. In addition, where possible, it reuses the storage from the older broker to reduce the data that Apache Kafka needs to replicate. Your availability impact is limited to the time required for Amazon MSK to complete the detection and recovery. After recovery, your producer and consumer apps can continue to communicate with the same broker IP addresses that they used before the failure.

- Apache Kafka in Azure HDInsight

The following are specific characteristics of Kafka on HDInsight:

● Provides a 99.9% Service Level Agreement (SLA) on Kafka uptime.

● It uses Azure Managed Disks as the backing store for Kafka. Managed Disks can provide up to 16 TB of storage per Kafka broker.

● Provides tools that rebalance Kafka partitions and replicas across update domains (UDs) and fault domains (FDs).

● HDInsight Kafka does not support downward scaling (decreasing the number of brokers within a cluster), but upward scaling can be performed.

● Azure Monitor logs can be used to monitor Kafka on HDInsight.

Apache ZooKeeper manages the state of the Kafka cluster. Zookeeper is built for concurrent, resilient, and low-latency transactions.

Kafka stores records (data) in topics. Records are produced by producers and consumed by consumers. Producers send records to Kafka brokers. Each worker node in your HDInsight cluster is a Kafka broker.

Topics partition records across brokers. When consuming records, you can use up to one consumer per partition to achieve parallel processing of the data.
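That one-consumer-per-partition rule can be sketched as a simple assignment (illustrative only; Kafka's built-in assignors are more sophisticated):

```python
# Sketch of spreading partitions across a consumer group: at most one
# consumer reads a given partition, so parallelism is capped by the
# partition count. Illustrative only, not Kafka's actual assignors.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign(list(range(4)), ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}

# With more consumers than partitions, the extra consumers sit idle:
print(assign(list(range(2)), ["c1", "c2", "c3"]))
# {'c1': [0], 'c2': [1], 'c3': []}
```

This is why the partition count of a topic is effectively its maximum consumption parallelism.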

Replication is employed to duplicate partitions across nodes, protecting against node (broker) outages. A partition denoted with an (L) in the diagram is the leader for the given partition. Producer traffic is routed to the leader of each partition, using the state managed by ZooKeeper.

- CloudKarafka

CloudKarafka is a fully managed Kafka cluster available on AWS and Google Cloud.

Some of the features are:

I. FULLY MANAGED APACHE KAFKA CLUSTERS

CloudKarafka automates every part of the setup, running, and scaling of Apache Kafka. Just click a button and you’ll have a fully managed Kafka cluster up and running within two minutes.

II. EASY MONITORING & CUSTOM ALARMS

Our control panel offers various tools and integrations for monitoring, metrics, and alarms. It’s super easy to set up custom alarms via email or push notifications to external services.

III. OUTSTANDING SUPPORT

We provide 24/7 support to thousands of customers. We’ve been providing the service for years and have excellent operation experience from many customers.

IV. SCALING & UPGRADING

You can scale your cluster without downtime when using CloudKarafka. The same goes for upgrading your server to a new version of Scala, Java, or Apache Kafka — just click a button and relax.

V. KAFKA REST PROXY

The Kafka REST Proxy gives you the opportunity to produce and consume messages over a simple REST API, which makes it easy to view the state of the cluster and perform administrative actions without using native Kafka clients.

VI. MANAGED ZOOKEEPER

Zookeeper is a top-level software developed by Apache that acts as a centralized service and it keeps track of the status of your Kafka cluster nodes. It also keeps track of Kafka topics, partitions, etc. All our plans include managed Zookeeper clusters.

VII. KAFKA CONNECT

With Kafka Connect, you’re able to integrate your Kafka cluster easily with other systems, and stream data in a scalable and secure manner.

VIII. SCHEMA REGISTRY

Via the Schema Registry, you’re able to control and follow all event types of your Apache Kafka message schemas.

IX. SERVICE INTEGRATIONS

Integrate your Kafka cluster alarms, logs, and metrics with services such as PagerDuty, VictorOps, or OpsGenie.

- Instaclustr Managed Kafka

Instaclustr provides a production-ready Kafka cluster with the click of a button and is backed by our industry-leading SLAs and expert support team. Some of the features are:

● 100% open-source Apache Kafka

● Flexible Hosting Options; AWS, Azure, GCP, DigitalOcean, IBM Cloud, On-Prem

● Terraform-based provisioning

● Prometheus monitoring API

● Source-code-level expertise in Kafka and adjacent technologies

● Up to 99.999% availability SLA

- Instaclustr Kafka Platform

Instaclustr for Apache Kafka®:

Enables robust, scalable stream processing and event-driven architectures for enterprise companies and startups.

Some of the benefits are: Scalability, Fault tolerance, Zero downtime migrations, Optimized configuration, Simple provisioning, Automated health checks, and 100% Open Source.

Instaclustr for Kafka® Connect:

Provides a fully managed service for Kafka® Connect — SOC 2 certified and hosted in the cloud or on-prem.

Kafka Connect can be rapidly deployed and scaled so data can be pushed to and pulled from Kafka without the need to write any code.

Instaclustr for Apache ZooKeeper®

This is a fully managed service for Apache ZooKeeper™ — SOC 2 certified and hosted on AWS.

It makes it easy to coordinate and manage distributed applications.

- Aiven–Kafka as a Service

Aiven for Apache Kafka is a fully managed streaming platform, deployable on AWS, Google Cloud, Microsoft Azure, DigitalOcean, and UpCloud. The key features are:

● Integration with your workflow (e.g., SAML Authentication, Datadog, etc.)

● Kafka Connect

● Kafka REST

● Schema Registry

● Kafka MirrorMaker

● Terraform support

● Kubernetes support

● Multi-AZ placement

● Virtual Private Cloud (VPC) peering

● Management Dashboard with ACLs

● Flexible Authentication Methods.

- Aiven Products for Kafka

Aiven for Apache Kafka

● With features such as Kafka Connect as a Service,

● Schema Registry,

● REST,

● Access Control Lists (ACLs), etc.

Aiven for Apache Kafka MirrorMaker 2

● Allows organizations to confidently integrate replication workflows into their production environments.

● Integrate with your workflow

● Multi-AZ placement

● Virtual Private Cloud (VPC) peering

● Terraform support

Aiven for Apache Kafka Connect

● With an Apache Kafka Connect connector, you can source data from an existing technology into a topic or sink data from a topic to a target technology by defining the endpoints.

● Over 20 open-source Kafka connectors

- Keen

Keen is a complete event streaming platform (an all-in-one event streaming and analytics solution that offers users access to pre-configured big data infrastructure as a service) aimed at small and medium businesses and enterprises.

The features are:

● Built on Apache Kafka®, easily collect event data from anywhere, add rich attributes, and send it to wherever you need it.

● Data enrichment, persistent storage, real-time analytics, and embeddable data visualizations are included as part of the platform.

● Collect real-time data from anywhere with Keen Streams API

● Securely store data with multi-layer AES encryption

● Run queries to analyze your stored data with Keen Compute API

● Deploy white-labeled, in-app dashboards with our Data Viz Library

- Confluent

Confluent is a Leading Apache Kafka Vendor. The key facts about Confluent Kafka are:

● Focus on event streaming; original creators of Kafka.

● The main contributor to the Apache Kafka project with 80% of Kafka commits.

● Always the latest Kafka version and full support.

● Rich Kafka ecosystem (connectors, governance, security, etc.)

● Hybrid architectures.

● Partnership and 1st party integration into cloud providers (AWS, GCP, Azure) — e.g., you can use your cloud provider credits and account to consume Confluent Cloud.

- Confluent Deployment Options

(1) Confluent Cloud (hybrid, multi-region, and multi-cloud) and (2) Confluent Platform.

Confluent Cloud

Confluent Cloud is a fully managed, cloud-native service for connecting and processing all of your data, everywhere it’s needed. It is a resilient, scalable streaming data service based on Apache Kafka®, delivered as a fully managed service.

Confluent Cloud has a web interface and local command-line interface. You can manage cluster resources, settings, and billing with the web interface. You can use Confluent CLI to create and manage Kafka topics.

Some of the features of Confluent Cloud are:

(i) Massive scale without the ops overhead

● Self-service provisioning with no complex cluster sizing

● Serverless scaling between 0–100 MBps

● On-demand, programmatic expand & shrink for GBps+ use cases

● Zero-downtime Kafka upgrades & bug fixes

● Pay only for what you actually use

(ii) Build for hybrid and multi-cloud

● Build a persistent bridge from on-premises to cloud with a hybrid Kafka service

● Stream across public clouds for multi-cloud data pipelines

(iii) Simplify planning with no-limit storage

● Infinite data storage and retention within Kafka topics (AWS, Azure, & Google Cloud)

(iv) Do more with data in motion

● Integrate all your systems quickly with 120+ connectors

● Process data in real-time with fully-managed ksqlDB

● Discover, understand and trust your event streams with a fully managed governance suite for data in motion

(v) Reliably scale mission-critical apps

● Guaranteed 99.99% uptime SLA

● Scale to GBps+ with dedicated capacity

● Multi-AZ Replication

● Topic-level and cluster-level health metrics

● Seamless integration with monitoring tools such as Datadog and Prometheus

(vi) Run with enterprise-grade security & compliance

● At-rest & in-transit data encryption with BYOK options (AWS & Google Cloud)

● SAML/SSO for user authentication

● Kafka ACLs and Cluster Role-Based Access Control (RBAC) for authorization

● Private networking via VPC/VNet peering, AWS PrivateLink, Azure Private Link, and AWS Transit Gateway (Dedicated clusters)

● Activity monitoring with platform-wide audit logs

(vii) Scripting and workflow automation enabled at scale

● One CLI for managing all of Confluent, across clouds and on-premises

● REST APIs for custom, programmatic management of clusters, service accounts, API keys, topics, connectors, and more

● Terraform provider (preview) for repeatable, scalable infrastructure-as-code

Confluent Platform

Confluent Platform is a full-scale data streaming platform that enables you to easily access, store and manage data as continuous, real-time streams. Built by the original creators of Apache Kafka®, Confluent expands the benefits of Kafka with enterprise-grade features while removing the burden of Kafka management or monitoring. Some of the features are:

● Unrestricted developer productivity

● Efficient operations at scale

● Production-stage prerequisites

● Freedom of choice

● Enterprise-grade Security

● Available with open-source, community, and commercial features

Each release of Confluent Platform includes the latest release of Kafka and additional tools and services that make it easier to build and manage an Event Streaming Platform. Confluent Platform delivers both community and commercially licensed features that complement and enhance your Kafka deployment.

Confluent Platform is available in two flavors:

(i) Confluent Open Source: This is 100% open-source. In addition to the components included with Apache Kafka, Confluent open source includes services and tools that are frequently used with Kafka. This makes Confluent Open Source the best way to get started with setting up a Kafka-based streaming platform. Confluent Open Source includes clients for C, C++, Python, and Go programming languages; connectors for JDBC, ElasticSearch, and HDFS; Schema Registry for managing metadata for Kafka topics; and REST Proxy for integrating with web applications.

(ii) Confluent Enterprise: This takes it to the next level by addressing the requirements of modern enterprise streaming applications. It includes Confluent Control Center for end-to-end monitoring of event streams, MDC Replication for managing multi-datacenter deployments, and Automatic Data Balancing for optimizing resource utilization and easy scalability of Kafka clusters. Collectively, the components of Confluent Enterprise give your team a simple and clear path towards establishing a consistent yet flexible approach for building an enterprise-wide streaming platform for a wide array of use cases.

Hybrid Cloud & Multi-Cloud

Confluent’s solution for hybrid and multi-cloud architectures helps organizations accelerate development velocity, realize the benefits of the cloud faster, and deliver a new class of real-time applications that power rich, frontend customer experiences and efficient, backend operations. All of an organization’s datastores, applications, and systems can now operate on a singular and consistent real-time view of an organization’s data, eliminating the need for periodic batch jobs that create inconsistent copies of data in different places at different times.

The hybrid and multi-cloud use cases are:

Data warehouse modernization

Leverage modern cloud data warehouse services using Confluent as a persistent, real-time, bi-directional bridge between existing and new data warehouse systems. Deliver connected, real-time analytics with Confluent.

App modernization

Confluent modernizes any system with an event-driven architecture. The second an event happens, services update data in real-time for seamless integration, data consistency, and scalable, responsive microservices orchestration.

Disaster recovery

Deploying a disaster recovery strategy with Confluent can increase the availability and reliability of your mission-critical applications by minimizing data loss and downtime during unexpected disasters, like public cloud provider outages.

Mainframe augmentation

Leverage Confluent’s data in motion platform to unlock your mainframe data for real-time insights without incurring the complexity and expense that come with sending ongoing queries to mainframe databases.

References

1. Developing a Deeper Understanding of Apache Kafka Architecture https://insidebigdata.com/2018/04/12/developing-deeper-understanding-apache-kafka-architecture/.

Paul Brebner 2018

2. Apache Kafka for beginners — What is Apache Kafka? https://www.cloudkarafka.com/blog/part1-kafka-for-beginners-what-is-apache-kafka.html

Lovisa Johansson 2022

3. Apache Kafka Architecture: A Complete Guide https://www.instaclustr.com/blog/apache-kafka-architecture/

Micheal Carter 2020

4. Keen

- Event Streaming and Analytics

https://keen.io/resources/event-streaming-and-analytics-developer-guide/

- Event Streaming

https://keen.io/platform/stream/

5. Aiven

- Aiven for Apache Kafka

https://aiven.io/kafka

- List of available Apache Kafka Connect connectors https://developer.aiven.io/docs/products/kafka/kafka-connect/concepts/list-of-connector-plugins

6. Amazon

Managed Streaming for Apache Kafka (MSK) https://aws.amazon.com/msk/

7. What is Apache Kafka in Azure HDInsight

https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-introduction#apache-kafka-on-hdinsight-architecture

Microsoft Azure 2022

8. Quickstart: Create Apache Kafka cluster in Azure HDInsight using Azure portal

https://docs.microsoft.com/en-us/azure/hdinsight/kafka/apache-kafka-get-started

Microsoft Azure 2022

9. Apache Kafka REST Proxy API

https://docs.microsoft.com/en-us/rest/api/hdinsight-kafka-rest-proxy/

Microsoft Azure 2022

10. CloudKarafka

- Managed Apache Kafka Clusters

https://www.cloudkarafka.com/

- Documentation Getting started

https://www.cloudkarafka.com/docs/index.html

11. Instaclustr

- Instaclustr for Apache Kafka https://www.instaclustr.com/platform/managed-apache-kafka/

- Instaclustr for Apache Connect https://www.instaclustr.com/platform/managed-kafka-connect/

- Instaclustr for Apache ZooKeeper https://www.instaclustr.com/platform/managed-apache-zookeeper/

- Fully Hosted and Managed Service for Kafka Connect https://info.instaclustr.com/rs/620-JHM-287/images/Instaclustr_Kafka_Connect%20Managed_Service_Datasheet.pdf?_ga=2.97561055.1307877855.1640594711-1968161066.1639708854

- Fully Hosted and Managed Service for Apache ZooKeeper https://info.instaclustr.com/rs/620-JHM-287/images/Instaclustr_ZooKeeper_Managed_Service_Datasheet.pdf?_ga=2.31559038.1307877855.1640594711-1968161066.1639708854

12. Confluent

- https://www.confluent.io/use-case/hybrid-and-multicloud/

- What is a Confluent Platform? https://docs.confluent.io/platform/current/platform.html

- Cloud-native service for Apache Kafka https://www.confluent.io/confluent-cloud/

- Quick Start for Confluent Cloud

https://docs.confluent.io/cloud/current/get-started/index.html
