Lambda and Kappa Architecture

A Data Engineering Perspective on Modern Data Processing

Andy Sawyer
6 min read · May 31, 2024

Data Engineers and Architects are constantly seeking ways to build scalable, fault-tolerant, and efficient data processing systems. Two prominent architectures that have emerged to address these challenges are Lambda and Kappa Architecture. In this article, we will explore these architectures, their key components, and how they relate to the traditional Data Warehouse.

High-Level Comparison

Lambda Architecture

Lambda Architecture aims to balance latency, throughput, and fault tolerance by leveraging both batch and stream processing. It allows for handling large volumes of historical data through the batch layer, while also providing real-time insights and updates through the speed layer. The serving layer brings the results together, enabling efficient querying and analysis.

Batch Layer

The batch layer is responsible for processing the entire dataset in a distributed manner, which can take hours depending on the volume of data. It leverages technologies such as Apache Spark or other distributed processing frameworks to perform compute-intensive tasks on large datasets.

This layer aims to provide a comprehensive and accurate view of the data by processing all the historical data at once. The output of the batch layer is typically stored in cloud storage systems like Amazon S3.

The batch layer is suitable for complex analytics, machine learning model training, and generating large-scale aggregations or reports.
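As a minimal sketch of the batch layer's job, the snippet below recomputes an aggregation over the full historical dataset on every run. Plain Python stands in for a distributed framework such as Spark, and the `(user_id, amount)` event shape is an illustrative assumption:

```python
from collections import defaultdict

def batch_view(events):
    """Recompute the batch view from the entire historical dataset.

    Each event is a (user_id, amount) pair; the view is total spend
    per user. In a real system this would be a distributed job
    (e.g. Spark) writing its output to object storage such as S3.
    """
    totals = defaultdict(float)
    for user_id, amount in events:
        totals[user_id] += amount
    return dict(totals)

# The full historical dataset is reprocessed from scratch each batch run.
history = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]
print(batch_view(history))  # {'alice': 12.5, 'bob': 5.0}
```

The defining property is that the view is rebuilt from all history, which makes it accurate but slow, hence the need for a speed layer to cover the gap between runs.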

Speed Layer

The speed layer is designed to handle real-time data processing and provide low-latency updates to the serving layer. It processes incoming data streams in real-time or near-real-time, allowing for quick updates and insights.

Technologies commonly used in the speed layer include Apache Spark Streaming and Kafka Streams. The speed layer performs lightweight computations and transformations on the incoming data, such as filtering, aggregations, or simple business logic.

The results from the speed layer are typically stored in a fast, mutable database or cache, such as Apache Cassandra or Redis, to enable quick access and updates.
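In contrast to the batch layer's full recomputation, the speed layer folds each event into its view as it arrives. A rough sketch, with an in-memory dict standing in for a fast mutable store such as Redis or Cassandra, and a method call standing in for consuming from a stream:

```python
class SpeedLayer:
    """Incrementally folds each arriving event into a real-time view.

    The dict stands in for a fast mutable store (e.g. Redis or
    Cassandra); a real deployment would consume events from a
    stream rather than receive them via a method call.
    """
    def __init__(self):
        self.realtime_view = {}

    def on_event(self, user_id, amount):
        # Lightweight per-event computation: a running aggregation.
        self.realtime_view[user_id] = self.realtime_view.get(user_id, 0.0) + amount

speed = SpeedLayer()
speed.on_event("alice", 3.0)
speed.on_event("alice", 1.0)
print(speed.realtime_view)  # {'alice': 4.0}
```

Because it only touches the state affected by each event, the speed layer stays low-latency, at the cost of being less comprehensive than the batch view.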

Serving Layer

The serving layer combines the batch view (from the batch layer) and the real-time view (from the speed layer) to provide a unified and consistent view of the data. It serves as the interface for querying and analysing the processed data.

The serving layer often uses a scalable and low-latency database or storage system, such as Apache Cassandra or Elasticsearch, to store the merged data.

Queries and applications can access the serving layer to retrieve the most up-to-date and comprehensive data, taking into account both the batch and real-time processing results.
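The merge itself can be simple. Assuming the speed layer holds only events that arrived after the last batch run (an assumption the pipeline must enforce), a query can just combine the two views:

```python
def serve_query(user_id, batch_view, realtime_view):
    """Merge the batch view with real-time deltas at query time.

    Assumes the speed layer only holds events that arrived after the
    last batch run, so the two views can simply be summed.
    """
    return batch_view.get(user_id, 0.0) + realtime_view.get(user_id, 0.0)

batch_view = {"alice": 12.5}      # output of the last batch run
realtime_view = {"alice": 4.0}    # events since that run
print(serve_query("alice", batch_view, realtime_view))  # 16.5
```

Keeping this merge correct as batch runs complete and real-time state is discarded is one source of the synchronisation complexity discussed below.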

Challenges

It’s worth noting that the Lambda Architecture has some challenges and considerations:

  • Maintaining and synchronising two separate processing pipelines (batch and speed) can be complex and require additional effort.
  • There may be discrepancies between the batch and speed layer results due to the different processing approaches and latencies.
  • The architecture can be resource-intensive, as it requires running both batch and stream processing systems simultaneously.

Despite these challenges, the Lambda Architecture provides a powerful framework for building scalable and fault-tolerant data processing systems that can handle both historical and real-time data. It has been widely adopted and has inspired the development of alternative architectures, such as the Kappa Architecture, which aims to simplify the design by using a single stream processing engine for both batch and real-time processing.

Kappa Architecture

Kappa Architecture simplifies the data processing pipeline by eliminating the need for a separate batch layer. It relies solely on stream processing to handle both historical and real-time data. The immutable log serves as the single source of truth, capturing all data events, while the materialised views provide fast access to the processed data.

Stream Processing

In the Kappa Architecture, all data are treated as a continuous stream of events, regardless of whether they are historical or real-time data. Stream processing technologies, such as Kafka Streams or Apache Spark Streaming, are used to process and analyse the data in real-time.

The stream processing engine consumes data from the immutable log, performs the necessary computations, and updates the materialised views. The processing logic can include filtering, aggregations, joins, and complex event processing.

Stream processing enables low-latency processing and real-time insights, as data is processed as soon as it arrives.

Immutable Log

The immutable log serves as the single source of truth in the Kappa Architecture. It is an append-only, distributed log that stores all the data events in their original form.

Technologies like Apache Kafka or Amazon Kinesis can be used as the immutable log. The log provides a persistent and ordered record of all data events, allowing for data replay and reprocessing if needed.

The immutable nature of the log ensures data integrity and enables fault tolerance, as data can be reprocessed from any point in time.
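A toy stand-in for a log such as Kafka or Kinesis makes the append-only and replay properties concrete: writes only ever add entries, and reprocessing is just reading again from an earlier offset.

```python
class ImmutableLog:
    """Append-only, ordered event log (a toy stand-in for Kafka/Kinesis)."""
    def __init__(self):
        self._events = []

    def append(self, event):
        """Events are only ever added; existing entries never change."""
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def read(self, from_offset=0):
        """Replay events in order from any offset."""
        return list(self._events[from_offset:])

log = ImmutableLog()
log.append({"user": "alice", "amount": 10.0})
offset = log.append({"user": "bob", "amount": 5.0})
# Reprocessing from scratch is just a read from offset 0.
assert log.read(0)[0]["user"] == "alice"
assert log.read(offset) == [{"user": "bob", "amount": 5.0}]
```

Real logs add partitioning, retention policies, and durable storage, but the recovery story is the same: rebuild any downstream state by replaying from a known offset.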

Materialised Views

Materialised views are the result of processing the data from the immutable log using the stream processing engine. They represent the computed state or aggregations of the data and are stored in a scalable and low-latency database or storage system.

Technologies like Apache Cassandra, Apache HBase, or Elasticsearch can be used to store and serve the materialised views. The materialised views are continuously updated in real-time as new data arrives and is processed by the stream processing engine.

Queries and applications can access the materialised views to retrieve the most up-to-date and processed data.
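A minimal sketch of a materialised view as a log consumer: the same `apply` path handles a historical replay and live events alike, which is the unification Kappa promises. The dict stands in for a serving store such as Cassandra or Elasticsearch.

```python
class MaterialisedView:
    """Keeps an aggregation up to date by consuming events from a log.

    The dict stands in for a serving store (e.g. Cassandra or
    Elasticsearch); `apply` would be driven by the stream processor.
    """
    def __init__(self):
        self.state = {}

    def apply(self, event):
        user = event["user"]
        self.state[user] = self.state.get(user, 0.0) + event["amount"]

# The same code path handles historical replay and live events alike.
view = MaterialisedView()
for event in [{"user": "alice", "amount": 10.0},
              {"user": "alice", "amount": 2.5}]:
    view.apply(event)
print(view.state)  # {'alice': 12.5}
```

To change the processing logic, you deploy a new consumer, replay the log from the start into a fresh view, and cut queries over once it catches up.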

Benefits of the Kappa Architecture

There are a number of benefits that the Kappa Architecture brings over the Lambda Architecture:

  • Simplified architecture: By using a single stream processing engine, the Kappa Architecture reduces complexity and eliminates the need to maintain separate batch and speed layers.
  • Real-time processing: All data are processed in real-time, enabling low-latency insights and faster decision-making.
  • Unified data processing: The same processing logic is applied to both historical and real-time data, ensuring consistency and reducing discrepancies.
  • Fault tolerance: The immutable log allows for data replay and reprocessing, providing fault tolerance and enabling the system to recover from failures.

Challenges

However, there are also some considerations when adopting the Kappa Architecture:

  • Increased storage requirements: As all data are stored in the immutable log, the storage requirements can be higher compared to the Lambda Architecture.
  • Processing complexity: The stream processing engine needs to handle both real-time and historical data processing, which can be complex and require careful design and optimisation.
  • Technology selection: Choosing the right technologies for the immutable log, stream processing, and materialised views is crucial to ensure scalability, performance, and compatibility.

Overall, the Kappa Architecture provides a streamlined approach to data processing, focusing on real-time processing and simplifying the architecture. It has gained popularity in scenarios where real-time insights and simplified data processing pipelines are prioritised.

Relation to Data Warehouse

Both Lambda and Kappa Architecture can be seen in the context of the traditional Data Warehouse. While Data Warehouses have historically been designed for batch processing and storing historical data, Lambda and Kappa Architectures incorporate real-time processing capabilities.

Lambda Architecture can be used in conjunction with a Data Warehouse, where the batch layer feeds data into the warehouse for long-term storage and analysis, while the speed layer handles real-time processing and serves low-latency results.

Kappa Architecture, on the other hand, can potentially replace the need for a separate Data Warehouse by storing all the data in the immutable log and deriving materialised views for analysis. However, it may still be beneficial to integrate Kappa Architecture with a Data Warehouse for certain use cases, such as long-term data retention and complex analytical queries.

Conclusion

Lambda and Kappa Architectures offer powerful approaches to data processing, addressing the challenges of handling both batch and real-time data. While Lambda Architecture provides a hybrid approach, Kappa Architecture streamlines the process by focusing solely on stream processing. The choice between the two depends on the specific requirements of the data processing system and the trade-offs between complexity and performance.

As data engineers, understanding these architectures and their relation to the Data Warehouse is important in designing and implementing efficient and scalable data processing solutions. By leveraging the strengths of each approach, you can unlock the potential of data and drive valuable insights for the business.


Andy Sawyer

Bringing software engineering best practices and a product driven mindset to the world of data. Find me at https://www.linkedin.com/in/andrewdsawyer/