Kappa Architecture Design: Real-Time Analytics in Open Source Environments

Giulio Talarico
Data Reply IT | DataTech
12 min read · Apr 24, 2024

Big Data Gets a Speed Upgrade: Why Real-Time Matters

In today’s data-driven world, the ability to harness and analyze data in real time has become a critical factor for businesses and organizations across many domains. In fact, it is projected that by 2025 real-time data alone will account for a quarter of the 163 ZB global datasphere, which encompasses all data generated, collected, and duplicated on our planet¹.

The Big Data revolution was built on the promise of harnessing massive datasets. But the business world craves not just volume, but also velocity. Fresh data is key to identifying trends, spotting issues, and making decisions faster. This is especially true in fast-paced fields like finance, healthcare, and telecom, where real-time insights can have a major impact.

The traditional approach of batch processing just doesn’t cut it anymore: data engineers are increasingly turning to velocity-driven architectures that analyze data as it streams in, instead of waiting for the next batch to be collected, processed, and archived in a data lake. Real-time analysis allows companies to react quickly, isolate problems before they snowball, and make data-driven decisions on the fly.

Enter the Kappa Architecture: Streamlining Real-Time Processing

So, how do we actually implement this velocity-driven approach? Here’s where the Kappa architecture² comes in. Introduced in 2014 by Jay Kreps, co-creator of Apache Kafka and CEO of Confluent, the Kappa architecture offers a compelling solution for real-time data processing.

Its beauty lies in its simplicity and flexibility: it unifies data processing into a single framework. Not only does it process low-latency data streams from sources like IoT devices, social media, and application usage logs, it also provides data retention capabilities that allow the ingested data to be reprocessed whenever the input or the transformation logic changes. Unlike the Lambda architecture³, which requires a separate batch layer for historical recomputations, the Kappa architecture handles everything in a single stream. This eliminates the need for complex, multi-system setups and saves you the headache of developing and maintaining the transformation logic twice, once for the batch layer and once for the speed layer.

Seeing is believing: sometimes, a picture is worth a thousand words. Let’s compare the visual representations of Kappa and Lambda architectures to grasp the core difference between them.

Lambda architecture
Kappa architecture

One Processing Layer to Rule Them All

The Kappa architecture simplifies the design by ditching the batch layer altogether and designating the speed layer as the only processing actor in the pipeline. This design choice is motivated first of all by technological advances in streaming platforms: the spread of Apache Kafka and of managed streaming services on Google Cloud Platform (Dataflow) and Amazon Web Services (Kinesis) has made such infrastructure not only very popular, but also very versatile and powerful. Streaming engines aren’t just about real-time processing anymore: they have evolved into frameworks that can deliver results with low latency and also recompute historical data on the fly. This “replay” functionality eliminates the need for a separate batch layer.
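To make “replay” concrete, here is a minimal sketch using Kafka’s Java consumer (broker address, topic name, and partition are placeholders): rewinding a consumer to the earliest retained offset pushes the entire history back through the very same streaming code path.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplaySketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("events", 0); // hypothetical topic
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition)); // rewind to the oldest retained record
            // From here, poll() re-delivers the whole history through the same streaming logic.
        }
    }
}
```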

Kappa in the Real World: Open Source Solutions

As with any architectural approach, the real-world performance of the Kappa architecture can vary significantly depending on its implementation and deployment. Open-source solutions provide flexibility and customization options but may require extensive tuning and operational expertise. On the other hand, managed cloud services offer ease of deployment and scalability but often come with cost considerations and potential limitations in customization.

This guide will walk you through the design process for your Kappa architecture, considering open-source solutions that boast not only significant communities but also years of production experience at some of the leading real-time data companies in the world, such as Meta, LinkedIn, and Yahoo.

First, let’s explore the core architectural components. These components can be logically divided into three key layers:

  • Ingestion Layer: This layer focuses on data acquisition. It handles the collection of data streams from various sources and feeds them into the processing pipeline.
  • Processing Layer: This is the heart of the system. Here, the incoming data streams are transformed and analyzed in real-time.
  • Serving Layer: Once processed, the data is made readily available for querying and analysis by applications and users through the serving layer.

Ingestion Layer

Being the entry point to the Kappa architecture, the ingestion layer is a critical component that takes significant time to set up and fine-tune correctly. The most suitable technologies for implementing it depend heavily on both the functional and non-functional characteristics of the data sources to be ingested. Here’s a breakdown of common data sources that feed the ingestion layer:

  • Intrinsically Streaming Sources: In many use cases, data streams flow directly from devices like IoT sensors, smart meters, and social media platforms using dedicated clients.
  • Adapted Data Sources: For some use cases, the data source, like a traditional database, is not inherently a streaming source. In these cases, Change Data Capture (CDC) systems like Debezium can turn conventional databases into streaming sources (see the sketch after this list).
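As a concrete illustration, the following sketch runs Debezium’s embedded engine against a hypothetical PostgreSQL database (all connection settings are placeholders, and property names follow Debezium 2.x). In a Kappa pipeline you would more typically deploy Debezium as a Kafka Connect source connector, but the embedded engine shows the idea in a few lines: every committed row change becomes a JSON event.

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CdcSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "pg-cdc-engine");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        // Where the engine tracks how far it has read the database log
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");
        // Placeholder connection settings for a hypothetical "inventory" database
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "postgres");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "inventory");
        props.setProperty("topic.prefix", "inventory");

        // Each committed INSERT/UPDATE/DELETE arrives here as a JSON change event
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> System.out.println(event.value()))
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // runs until the engine is closed
    }
}
```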

Once you’ve identified your data sources, the next critical step is selecting an appropriate streaming platform, which, thanks to retention and distributed durable storage, extends the reach of a real-time processing system to historical data as well.

While streaming platforms share some similarities with message queue systems (producers publish information, consumers subscribe and process it), they have fundamental architectural differences:

  • Message Delivery: Streaming platforms use a pull-based approach. Consumers actively interact with the platform, tracking the last message they read and leveraging the same log-based structures used for data retention. In contrast, message queue systems operate with a push-based approach: consumers passively wait for messages in destination queues, and messages are deleted once read.
  • Scalability and Availability: Streaming platforms excel in horizontal scalability and availability. This is achieved through techniques like stream partitioning and synchronization protocols. However, they often lack the strict message ordering and delivery guarantees that are a hallmark of message queue systems.

Among the open source solutions, it is worth mentioning:

  • Apache Kafka: Built with a log-centric approach, Kafka acts as a distributed commit log, efficiently handling high-throughput data streams across horizontally scaled servers. Kafka’s architecture revolves around producers, which push messages into the system, and consumers, which actively pull them in. To ensure consumers don’t miss messages (or read duplicates), they maintain offsets, which act like bookmarks tracking their progress within a specific data stream. To handle large volumes and provide fault tolerance, consumers can be grouped, with each member responsible for a specific subset of the data, effectively distributing the workload. Kafka organizes messages logically using topics and partitions. Topics act as categories, grouping related messages together. For true scalability, topics are further subdivided into partitions, individual append-only logs stored on separate brokers (servers) in the Kafka cluster. This partitioning scheme is the secret sauce behind Kafka’s horizontal scaling: as your data volume grows, you simply add more brokers to handle the increased load (a minimal producer/consumer sketch follows the note below).
Kafka cluster architecture, source [4]

NOTE: Starting from Kafka 3.3, brokers can communicate and synchronize with each other using the KRaft consensus protocol. This eliminates the need for Apache ZooKeeper, a separate service previously used for managing cluster metadata and electing a leader broker.
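The snippet below is a minimal sketch of the producer/consumer model with the Kafka Java client (broker address, topic, key, and payload are placeholders). Note how the consumer pulls records, and how auto.offset.reset=earliest lets a brand-new consumer group start from the oldest retained message. Since Redpanda, described next, speaks the Kafka protocol, the same code runs unchanged against a Redpanda cluster.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaRoundTrip {
    private static final String TOPIC = "sensor-readings"; // hypothetical topic

    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // Records with the same key hash to the same partition, preserving per-key order
            producer.send(new ProducerRecord<>(TOPIC, "sensor-42", "{\"temp\": 21.5}"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "kappa-demo"); // group members split the partitions
        consumerProps.put("auto.offset.reset", "earliest"); // new groups start from the oldest record
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of(TOPIC));
            // Pull-based delivery: the consumer asks the broker for the next records
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}
```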

  • Redpanda: Built on the Apache Kafka protocol with the design goals of higher performance and simpler deployment, Redpanda is a streaming platform for mission-critical use cases. Even though it is a drop-in replacement for Kafka, it presents some structural differences. First of all, Redpanda is packaged as a single binary, removing dependencies on any external tool or library, including the Java Virtual Machine; implementing the core engine from scratch in C++ instead of relying on the JVM proved to be one of the key factors behind 10x faster tail latencies with up to 3x fewer nodes⁵. In particular, a thread-per-core, shared-nothing design lets Redpanda exploit the full potential of modern many-core hardware: relying on the Seastar framework, a Redpanda broker can poll data directly from the NIC, process the request, and communicate with other threads through structured message passing (SMP), avoiding the overhead of context switches and concurrency-control mechanisms.
High level request flow architecture, source [6]
  • Apache Pulsar: Unlike Kafka, Pulsar separates compute and storage for maximum efficiency. Brokers handle message flow, processing, and coordination, while a separate system, Apache BookKeeper, takes care of persistent storage using distributed write-ahead logs (WALs) called ledgers. This separation offers several advantages: highly efficient writes with guaranteed consistency, high availability through replication, and optimized I/O for faster performance. Additionally, Pulsar excels at multi-tenancy, allowing you to run isolated logical environments within a single cluster, using namespaces for policy and resource management. It is worth highlighting that Pulsar relies on Apache ZooKeeper for cluster coordination, which adds infrastructural complexity that must be managed (see the sketch after this list).
Pulsar architecture, source [7]
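For comparison, here is a minimal sketch with the Pulsar Java client (service URL, tenant, namespace, and topic are placeholders). The fully qualified topic name makes Pulsar’s multi-tenancy visible: every topic lives in a namespace within a tenant.

```java
import java.nio.charset.StandardCharsets;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;

public class PulsarRoundTrip {
    public static void main(String[] args) throws Exception {
        try (PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // placeholder broker URL
                .build()) {

            // Fully qualified topic: tenant "public", namespace "default"
            String topic = "persistent://public/default/sensor-readings";

            try (Producer<byte[]> producer = client.newProducer().topic(topic).create()) {
                producer.send("{\"temp\": 21.5}".getBytes(StandardCharsets.UTF_8));
            }

            try (Consumer<byte[]> consumer = client.newConsumer()
                    .topic(topic)
                    .subscriptionName("kappa-demo")
                    .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                    .subscribe()) {
                Message<byte[]> msg = consumer.receive(); // blocks until a message arrives
                System.out.println(new String(msg.getData(), StandardCharsets.UTF_8));
                consumer.acknowledge(msg); // the broker tracks acknowledgements per subscription
            }
        }
    }
}
```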

Processing Layer

The processing layer executes the business logic that transforms the events ingested and persisted in the streaming platform, so that clients can eventually query them to extract enriched information. Modern stream processing engines support a wide spectrum of transformations, from simple single-record manipulations akin to the basic MapReduce functions map and filter, to more complex operations that maintain intermediate state, such as computing quantiles or counting distinct values.

Stream processing engines can be divided into two main groups:

  • Native streaming engines, such as Apache Flink, operate by applying transformations to incoming data points as soon as they arrive.
  • Micro-batching engines, such as Spark Structured Streaming, approximate streaming by accumulating small batches of data and processing each one with a conventional batch approach (see the sketch below).
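To make the distinction tangible, here is a minimal Spark Structured Streaming sketch in Java (broker, topic, and trigger interval are placeholders): the trigger makes the micro-batching explicit, since each firing runs a conventional batch job over the records that arrived since the previous one.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.Trigger;

public class MicroBatchSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("micro-batch-sketch")
                .master("local[*]") // local mode, just for demonstration
                .getOrCreate();

        // Read a Kafka topic as an unbounded table (placeholder broker and topic)
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "sensor-readings")
                .load();

        // Every 5 seconds, Spark processes the newly arrived records as one
        // small batch: this is the "micro-batch" in micro-batching
        events.selectExpr("CAST(value AS STRING) AS value")
                .writeStream()
                .format("console")
                .trigger(Trigger.ProcessingTime("5 seconds"))
                .start()
                .awaitTermination();
    }
}
```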

A further categorization can be made based on the programming model:

  • Declarative engines offer a higher-level paradigm, where the developer only chains a sequence of functions and the processing engine is responsible for building a directed acyclic graph (DAG) of operators and applying all the optimizations;
  • Compositional engines instead require a lower-level approach, where the developer explicitly defines the DAG.

As the most relevant representatives of the aforementioned categories, it is worth mentioning:

  • Apache Flink: Being a declarative engine, Flink automatically composes user-defined operators into an efficient dataflow graph. It guarantees exactly-once semantics, ensuring each event affects the results only once, thanks to a checkpointing mechanism that leverages durable storage for fault tolerance. Flink also excels at stateful computations: it provides both in-memory and on-disk state backends, letting you choose the right option for your state size and processing needs. Finally, Flink supports batch processing alongside streaming workflows (historically through the DataSet API, now superseded by the batch execution mode of the DataStream API), offering a unified platform for both paradigms (see the job sketch after this list).
  • Apache Samza: From an operational standpoint, Samza resembles Kafka Streams, offering a lightweight deployment option as a client library. This allows stream processing to be embedded in existing applications without a dedicated cluster. Samza falls into the category of compositional engines: developers must explicitly define the processing topology, specifying input and output streams for each task, an approach that can be more error-prone than Flink’s declarative style. Architecturally, Samza is agnostic to the input and output systems and requires a separate partition/offset system to manage data flow between tasks, whereas Flink handles this internally.
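The sketch below is a minimal Flink job in Java under assumed names (broker address, topic, and consumer group are placeholders; it uses the KafkaSource connector from flink-connector-kafka). It chains a stateless map with a stateful keyed windowed count, and starting from the earliest offset is exactly the Kappa-style replay discussed above.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KappaProcessingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // periodic checkpoints back exactly-once recovery

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")             // placeholder broker
                .setTopics("sensor-readings")                      // hypothetical topic
                .setGroupId("kappa-flink-job")
                .setStartingOffsets(OffsetsInitializer.earliest()) // replay full retained history
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        events
                .map(line -> Tuple2.of(line, 1))                   // stateless per-record transform
                .returns(Types.TUPLE(Types.STRING, Types.INT))     // lambdas lose generic type info
                .keyBy(t -> t.f0)                                  // stateful: one counter per key
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .sum(1)
                .print();                                          // a real job would sink to Kafka/Druid

        env.execute("kappa-processing-job");
    }
}
```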

Serving Layer

Once the data streams have been ingested into a streaming platform and processed by a stream processing engine, the output data must be made available for answering user queries: this is the role of the serving layer.

There are many database technologies that can be used to implement the serving layer, but only a few of them are able to satisfy the following constraints:

  • support for high-frequency write and read operations
  • support for OLAP workloads, including complex joins and aggregations
  • sub-second query response times

Real-time databases are specifically built to tick all of these boxes. Among them, the most prominent examples are:

  • Apache Druid: To achieve low query response times, Druid uses column-oriented storage with dedicated indexes and partitioning schemes for fast filtering. Furthermore, through Apache DataSketches, Druid can provide approximate count-distinct, ranking, and histogram and quantile computations, further improving query response times when perfect accuracy is not crucial. Architecturally, dedicated processes in Druid connect to the source data, parse it, partition it, and create Druid’s optimized, OLAP-ready data format, called segments. Each of these processes takes up a slot in a Java process called the MiddleManager, which can be scaled both horizontally and vertically, so that many processes can run at once, including several consuming the same data source in parallel. Real-time query responsiveness derives from the local caching of freshly ingested data in the MiddleManager, which the Broker asks to execute the actual query. For durability, all ingested data is written to deep storage, typically a distributed object store like S3, and another component, the Historical, is responsible for periodically pulling data from the deep store and caching it locally. Once data is ingested into Druid, it can be queried with a JSON-based native query engine or with Druid SQL (see the sketch after this list).
Druid architecture, source [8]
  • Apache Pinot: From an architectural point of view, Pinot is very similar to Druid: both ingest data in the form of segments. However, whereas Druid implements segment management in its internal components, Pinot relies on Apache Helix, which manages partitioning, replication, and query load optimization based on where data is stored in the cluster. Real-time data is likewise cached locally within Servers, whose role is to execute the queries dispatched by the Brokers and to periodically flush segments to deep storage. It is worth highlighting that, thanks to its widespread industry adoption and extensive community contributions, Druid supports a wider range of connectors, both for real-time ingestion and for deep storage, than Pinot, which can only ingest streaming data from Kafka, Pulsar, and Kinesis.
Pinot architecture, source [9]
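To give a feel for the serving layer, here is a minimal sketch that queries Druid’s SQL endpoint over HTTP with Java’s built-in client. The datasource name and columns are hypothetical, and the Router address is the default from a local deployment; APPROX_COUNT_DISTINCT is an approximate aggregation backed by the DataSketches integration mentioned above. Pinot exposes an analogous SQL endpoint on its Brokers.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DruidSqlQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical datasource "page_views" with columns "country" and "user_id"
        String sql = "SELECT country, APPROX_COUNT_DISTINCT(user_id) AS users "
                + "FROM page_views "
                + "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR "
                + "GROUP BY country ORDER BY users DESC LIMIT 10";
        String body = "{\"query\": \"" + sql + "\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8888/druid/v2/sql")) // default Router port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON array of result rows
    }
}
```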

Conclusions

Ever feel like your data pipeline is stuck in slow motion? Batch processing might have been the MVP, but in today’s data-driven world, real-time insights have become a business-critical requirement. That’s where Kappa architecture comes in, offering a way to ditch the batch layer and supercharge your data flow.

This guide has been your one-stop shop for understanding the Kappa architecture. The high-level architectural analysis, from ingestion to serving layer, led the discussion towards the most popular open-source solutions for implementing the Kappa architecture from scratch. By the end of it, you should have a good grasp of whether Kappa is the secret weapon your real-time data processing needs.

[1] D. Reinsel, J. Gantz, and J. Rydning, “Data Age 2025: The Evolution of Data to Life-Critical,” IDC White Paper, Mar. 2017. [Online]. Available: https://www.seagate.com/files/www-content/our-story/trends/files/Seagate-WP-DataAge2025-March-2017.pdf

[2] J. Kreps, “Questioning the Lambda Architecture,” Jul. 2014. [Online]. Available: https://www.oreilly.com/radar/questioning-the-lambda-architecture/

[3] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. Shelter Island, NY: Manning, 2015.

[4] N. Narkhede, G. Shapira, and T. Palino, Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale, 1st ed. O’Reilly Media, Inc., 2017.

[5] “Redpanda vs. Kafka: A performance comparison (2022 update).” [Online]. Available: https://redpanda.com/blog/redpanda-vs-kafka-performance-benchmark

[6] Redpanda, “Thread-per-core buffer management for a modern Kafka-API storage system.” [Online]. Available: https://redpanda.com/blog/tpc-buffers

[7] “Apache Pulsar.” [Online]. Available: https://pulsar.apache.org/

[8] Imply, “Apache Druid Adoption — Remember! Druid’s Distributed!” Sep. 2022. [Online]. Available: https://www.youtube.com/watch?v=2Ft-0CFkcgE

[9] “Apache Pinot.” [Online]. Available: https://pinot.apache.org/

Giulio Talarico
Data Reply IT | DataTech

Data Engineering during the day, breaking GNU/Linux machines at night