Data Processing Architectures

Seckin Dinc
Mar 5, 2024


Photo by Dan Freeman on Unsplash

As different use cases pop up every single day, data leaders evaluate their architecture designs accordingly. Some stick with the "modern data stack" terminology to update their resumes, some stick with open-source solutions to rebel against the modern data stack evangelists, and some just try to solve things.

Whichever team you belong to, there is one common problem: data processing. Over time, data processing patterns and tools keep changing and evolving. In this article, I will introduce the Lambda and Kappa Data Processing Architectures as the foundation of data processing architecture decisions.

Lambda Architecture

Lambda Architecture is a design pattern used in big data systems to handle both real-time and batch processing of data. It was introduced by Nathan Marz to address the challenges of processing large volumes of data with low latency.

Image courtesy https://learn.microsoft.com/en-us/azure/architecture/databases/guide/big-data-architectures

Before we dive into the details of the architecture, let’s take a look into the use cases to better understand why and where we need Lambda Architecture.

What are the common use cases?

Lambda Architecture is suitable for various use cases where there is a need to process large volumes of data in both real-time and batch modes. Some common use cases include:

  • Fraud Detection: Detecting fraudulent activities such as credit card fraud or identity theft requires analyzing large volumes of data in real-time to identify suspicious patterns and behaviours. Lambda Architecture allows organizations to process streaming data in real-time for immediate detection while also analyzing historical data to improve fraud detection algorithms.
  • Internet of Things (IoT) Data Processing: IoT devices generate a vast amount of data that needs to be processed and analyzed in real-time to derive insights and take appropriate actions. Lambda Architecture can handle the real-time processing of IoT data streams while also performing batch processing for long-term analysis and optimization.
  • Recommendation Systems: Personalized recommendation systems, used in e-commerce, media streaming, and social networking platforms, rely on real-time user interactions as well as historical data to generate accurate recommendations. Lambda Architecture facilitates the processing of both real-time user interactions and batch processing of historical data to continuously improve recommendation algorithms.

What is the structure of Lambda Architecture?

Lambda architecture is composed of 3 layers:

Batch Layer

The batch layer is responsible for processing historical data in large batches and storing the results in a centralized data store, such as a data warehouse or a distributed file system. We usually store the incoming data in views, materialised views, or tables that are optimised, indexed, and ready to be consumed.

The batch layer stores the data in immutable and append-only forms. This helps organisations to preserve their historical data and access it whenever it is needed.
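
To make this more concrete, here is a minimal batch-layer sketch in PySpark. The dataset path, output path, and the user_id/event_time columns are assumptions for illustration; the point is that the view is recomputed from the full, append-only history:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch_layer").getOrCreate()

# Read the immutable, append-only master dataset (hypothetical path and schema).
events = spark.read.parquet("s3://datalake/events/")

# Recompute the batch view from the full history: daily event counts per user.
batch_view = (
    events
    .groupBy("user_id", F.to_date("event_time").alias("event_date"))
    .agg(F.count("*").alias("event_count"))
)

# Overwrite the precomputed batch view that the serving layer will read.
batch_view.write.mode("overwrite").parquet("s3://warehouse/batch_view/daily_counts/")
```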

Speed Layer (Stream Layer)

The batch layer has inherent latency. In most cases the batch data is updated only once or twice a day. For many downstream use cases this is sufficient, but for some the latency becomes an issue. In those cases, we need to feed the data in a streaming fashion to minimise the data gap.

The speed layer handles real-time data processing. It processes the incoming data streams in near real-time and produces incremental updates. These updates are then merged with the results from the batch layer to provide a unified view of the data. The job of the speed layer is to narrow the gap between when the data is created and when it’s available for querying.
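
As an illustration, the sketch below uses the kafka-python client to maintain an incremental real-time view. The topic name, broker address, and event schema are assumptions, and the view is kept in memory purely for demonstration:

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer  # pip install kafka-python

# Consume the same event stream that the batch layer persists (hypothetical topic/broker).
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Incremental real-time view: event counts per user since the last batch run.
realtime_view = defaultdict(int)

for message in consumer:
    event = message.value
    realtime_view[event["user_id"]] += 1
    # In a production system this view would be flushed to a low-latency store
    # (e.g. Redis or Cassandra) so the serving layer can merge it with the batch view.
```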

Serving Layer

The serving layer is the access point for querying the data. It combines the results from both the batch and speed layers and provides a consistent view of the data. The serving layer receives the batch views from the batch layer on a predefined schedule, and it also receives the near real-time views streaming in from the speed layer.
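
A toy sketch of the merge the serving layer performs could look like this; the dictionaries stand in for whatever stores actually hold the batch and real-time views:

```python
def query_event_count(user_id: str, batch_view: dict, realtime_view: dict) -> int:
    """Merge the precomputed batch view with the incremental real-time view."""
    return batch_view.get(user_id, 0) + realtime_view.get(user_id, 0)

# Example: the batch view covers history up to the last scheduled run,
# the speed layer covers everything ingested since then.
batch_view = {"user_42": 1_250}
realtime_view = {"user_42": 17}
print(query_event_count("user_42", batch_view, realtime_view))  # -> 1267
```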

Kappa Architecture

Kappa Architecture is a data processing architecture that is designed to provide a scalable, fault-tolerant, and flexible system for processing large amounts of data in real time. It was developed as an alternative to Lambda architecture. It simplifies the design of big data systems by eliminating the batch layer, thus offering a more streamlined approach for processing real-time data.

Image courtesy https://learn.microsoft.com/en-us/azure/architecture/databases/guide/big-data-architectures

What are the common use cases?

Kappa Architecture is suitable for various use cases where there is a need to process large volumes of data in real time. Some common use cases include:

  • Real-time Monitoring and Alerting: Kappa Architecture is ideal for monitoring systems and applications in real-time, such as network traffic, server performance, or application logs. It allows organizations to detect anomalies, performance issues, or security breaches as they happen and trigger immediate alerts or actions.
  • Clickstream Analysis: Websites and mobile applications generate vast amounts of clickstream data that need to be processed and analyzed in real-time to understand user behavior, optimize user experience, and deliver personalized content or recommendations. Kappa Architecture enables organizations to process clickstream data streams in real-time and derive actionable insights without the need for batch processing.
  • Supply Chain Optimization: Kappa Architecture can be applied to optimize supply chain operations by processing data streams from various sources, such as inventory systems, logistics networks, and sales channels, in real-time. It allows organizations to monitor supply chain performance, identify bottlenecks, predict demand, and optimize inventory levels in real-time.

What is the structure of Kappa Architecture?

Kappa architecture is composed of 2 layers:

Data Ingestion Layer

This layer is responsible for collecting and ingesting data from various sources in real-time. Data is continuously streamed into the system without the need for batching or precomputing. Technologies like Apache Kafka or similar distributed messaging systems are commonly used for data ingestion in Kappa Architecture.
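
As a minimal illustration of the ingestion layer, the following sketch publishes an event to Kafka with the kafka-python client; the broker address, topic name, and event fields are assumptions:

```python
import json
import time
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker and topic; every event is appended to an immutable log.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {"user_id": "user_42", "action": "click", "ts": time.time()}
producer.send("events", value=event)
producer.flush()  # block until the event is durably written to the log
```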

Stream Processing Layer

In Kappa Architecture, the stream processing layer is the heart of the system. It handles both real-time data processing and historical data replay. Stream processing frameworks like Apache Flink, Apache Samza, or Apache Storm are utilized to process data streams in real-time. These frameworks provide the necessary capabilities to perform complex transformations, analytics, and computations on the incoming data streams.
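
The defining idea is that reprocessing historical data is simply replaying the same log from the earliest offset. The sketch below illustrates that with a plain kafka-python consumer rather than a full framework; the topic, broker, and event schema are again assumptions:

```python
import json
from collections import defaultdict
from kafka import KafkaConsumer  # pip install kafka-python

# Start from the earliest offset: in Kappa, reprocessing history is a replay
# of the same log that serves real-time traffic (hypothetical topic/broker).
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

counts = defaultdict(int)

for message in consumer:
    event = message.value
    counts[(event["user_id"], event["action"])] += 1
    # A framework such as Flink or Samza would additionally manage state,
    # checkpoints, and windowing; this loop only illustrates the replay idea.
```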

Conclusion

Both Kappa and Lambda Architectures provide solutions for processing large volumes of data in real-time and batch modes, each with its own strengths and use cases.

Lambda Architecture, with its batch, speed, and serving layers, offers a robust framework for handling complex data processing requirements, including historical analysis and batch computations.

On the other hand, Kappa Architecture simplifies the design by eliminating the batch layer, focusing solely on real-time stream processing. This streamlined approach reduces latency, simplifies maintenance, and offers a unified processing model for both real-time and historical data.

The choice between Kappa and Lambda Architectures depends on the specific needs of the use case, balancing factors such as latency requirements, data complexity, and system complexity.

Thanks a lot for reading 🙏

If you are interested in Data Engineering, don’t forget to check out my new article series.
