Lambda Architecture: A Big Data processing framework

Abhinav Vinci
4 min read · Dec 25, 2023


What is Lambda Architecture?

  • It is a hybrid approach to processing big data that supports both batch and stream processing.
  • The key idea behind Lambda architecture is to split data processing into two paths: a batch layer and a streaming (speed) layer.
https://www.interviewbit.com/blog/lambda-architecture/

Why use Lambda Architecture?

Lambda architecture provides a way to handle both real-time and batch processing in a single architecture. Traditionally, big data processing has been done with a batch system, where data is processed in large batches at regular intervals. Used alone, though, batch processing and real-time processing each have limits.

Batch processing is slow: Batch processing is well suited to tasks that do not need real-time results, such as running periodic reports or updating databases. In those cases the delay between data collection and processing is not critical, and the work can be scheduled during off-peak hours. For anything that needs fresh results, that delay is too long.

Challenges of real-time-only processing:

  • Limited Historical Analysis: Real-time-only pipelines may not perform well for historical analysis or complex computations on large datasets.
  • Scalability Concerns: Handling large volumes of data in real-time can pose scalability challenges.

Versatile and flexible: By separating data processing into two layers, it provides the accuracy and completeness of batch processing along with low-latency processing of real-time data.

  • It can support different types of queries and analyses by using different query engines for the serving layer.
  • It can support various data sources and formats by using different tools and frameworks for each layer.
  • It accommodates both batch and real-time processing, which gives flexibility across use cases.

Better data integrity: The batch layer periodically recomputes its views from the complete dataset, so it can correct errors or inconsistencies that occur in the speed layer.
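As a minimal sketch of that correction, assume the speed layer counts events naively as they arrive, so a redelivered event is counted twice, while the batch layer recomputes from the full log and deduplicates by an `event_id` field. The event shape and function names are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical event log; event "e2" was delivered twice by the source.
events = [
    {"event_id": "e1", "user": "alice"},
    {"event_id": "e2", "user": "bob"},
    {"event_id": "e2", "user": "bob"},   # duplicate delivery
    {"event_id": "e3", "user": "alice"},
]

def speed_count(stream):
    """Speed layer: count each event as it arrives, with no deduplication."""
    view = Counter()
    for event in stream:
        view[event["user"]] += 1
    return view

def batch_count(log):
    """Batch layer: recompute from the full log, keeping one record per event_id."""
    unique = {event["event_id"]: event for event in log}
    return Counter(event["user"] for event in unique.values())

speed_view = speed_count(events)   # bob is over-counted here
batch_view = batch_count(events)   # the recomputed view supersedes the speed view
```

Once the batch run finishes, its view replaces the speed layer's result for the same time range, so the over-count only lives until the next recomputation.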

Lambda Architecture Overview:

  • Ingestion: Data is ingested into the system from various sources in real time.
  • Batch Processing: The Batch Layer processes historical data, generating batch views.
  • Real-time Processing: The Speed Layer processes the real-time data stream, producing up-to-date views.
  • Serving Layer: The results from both the Batch Layer and the Speed Layer are stored in the Serving Layer.
  • Querying: Queries from users or applications are handled by the Unified View Layer, which merges results from the Serving Layer to provide a unified and consistent view.
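The flow above can be sketched in a few lines of plain Python. This is a toy, in-memory model under stated assumptions: the `LambdaPipeline` class, its method names, and the page-view events are all hypothetical, and a real system would use separate distributed components for each step.

```python
from collections import Counter

class LambdaPipeline:
    """Toy in-memory pipeline; every name here is illustrative."""

    def __init__(self):
        self.master_log = []         # append-only record of all ingested events
        self.batch_view = Counter()  # precomputed by the batch layer
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, event):
        """Ingestion: store the raw event and update the real-time view."""
        self.master_log.append(event)
        self.speed_view[event["page"]] += 1

    def run_batch(self, cutoff):
        """Batch layer: recompute from the log up to cutoff; newer events stay in the speed view."""
        self.batch_view = Counter(e["page"] for e in self.master_log if e["ts"] <= cutoff)
        self.speed_view = Counter(e["page"] for e in self.master_log if e["ts"] > cutoff)

    def query(self, page):
        """Querying: merge the batch and real-time results."""
        return self.batch_view[page] + self.speed_view[page]

pipeline = LambdaPipeline()
pipeline.ingest({"ts": 1, "page": "/home"})
pipeline.ingest({"ts": 2, "page": "/home"})
pipeline.run_batch(cutoff=2)                  # batch view now covers ts <= 2
pipeline.ingest({"ts": 3, "page": "/home"})   # only the speed view sees this event
total = pipeline.query("/home")
```

The query never reads the raw log. It only adds two precomputed numbers, which is what keeps reads fast even when the log is large.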

When to Use Lambda Architecture?

Use Lambda Architecture if you need historical analysis and can accept slightly higher latency.

  • Opt for a real-time-only pipeline if low latency is a critical requirement.

Lambda Architecture is more complex due to managing multiple layers. Choose it if the benefits of both batch and real-time processing outweigh the complexity.

Use cases of Lambda Architecture:

  • Fraud Detection: Analyzing historical transaction data (Batch Layer) and detecting anomalies in real-time transactions (Speed Layer).
  • IoT Data Processing: Aggregating and analyzing historical sensor data (Batch Layer) while processing real-time data from IoT devices (Speed Layer).
  • Customer Analytics: Analyzing historical customer data for insights (Batch Layer) while providing real-time recommendations based on user behavior (Speed Layer).
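To make the fraud detection case concrete, here is a rough sketch in which the batch layer precomputes a spending profile per customer and the speed layer checks each live transaction against it. The three-standard-deviation threshold, the customer ID, and the amounts are assumptions for illustration, not a recommended detection rule.

```python
from statistics import mean, stdev

# Hypothetical historical transaction amounts per customer (batch layer input).
history = {"cust-1": [20.0, 25.0, 22.0, 18.0, 24.0]}

def build_profiles(history):
    """Batch layer: precompute mean and standard deviation per customer."""
    return {cust: (mean(amounts), stdev(amounts)) for cust, amounts in history.items()}

def is_anomalous(profiles, customer, amount, threshold=3.0):
    """Speed layer: flag a live transaction far from the customer's usual range."""
    avg, sd = profiles[customer]
    return abs(amount - avg) > threshold * sd

profiles = build_profiles(history)
normal = is_anomalous(profiles, "cust-1", 23.0)       # close to the usual amounts
suspicious = is_anomalous(profiles, "cust-1", 500.0)  # far outside the usual range
```

The expensive part (scanning all history) happens offline, while the per-transaction check is a dictionary lookup and one comparison.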

Lambda Architecture Components — Overview

1. Batch Layer: The Batch Layer is responsible for handling the historical data and generating batch views or precomputed results.

  • Processing Engine: Apache Hadoop MapReduce or Apache Spark is commonly used for processing large volumes of data.
  • Data Storage: The results are generally stored in a distributed file system.
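A batch view is essentially a function applied to the whole dataset. The sketch below imitates the map and reduce phases in plain Python rather than using Hadoop or Spark APIs; the sales records and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical master dataset: raw records that are only ever appended to.
master_dataset = [
    {"day": "2023-12-24", "amount": 10.0},
    {"day": "2023-12-24", "amount": 5.5},
    {"day": "2023-12-25", "amount": 7.0},
]

def map_phase(records):
    """Emit (key, value) pairs, as a MapReduce mapper would."""
    for record in records:
        yield record["day"], record["amount"]

def reduce_phase(pairs):
    """Group by key and sum the values, as a reducer would."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The batch view is recomputed from scratch on every run.
batch_view = reduce_phase(map_phase(master_dataset))
```

Because the view is rebuilt from the raw records each time, a bug in the aggregation logic can be fixed and the view regenerated without any data migration.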

2. Serving Layer: The Serving Layer is responsible for indexing and serving the batch views generated by the Batch Layer to provide low-latency access to query results.

  • Database: A scalable, distributed database is used to store the precomputed batch views. Technologies like Apache HBase or Apache Cassandra are often employed.
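As a loose illustration of what the serving layer does, the sketch below keeps the current batch view in a dictionary for fast key lookups and swaps in a new view when a batch run finishes. `ServingLayer` and its methods are hypothetical names; a store such as HBase or Cassandra would play this role in practice.

```python
class ServingLayer:
    """Holds the current batch view in a dict for constant-time key lookups."""

    def __init__(self):
        self._view = {}

    def load(self, new_batch_view):
        """Swap in a freshly computed batch view in one step."""
        self._view = dict(new_batch_view)

    def get(self, key, default=0):
        """Serve a precomputed result without touching the raw data."""
        return self._view.get(key, default)

serving = ServingLayer()
serving.load({"/home": 120, "/pricing": 45})
before = serving.get("/home")
serving.load({"/home": 150, "/pricing": 52})   # next batch run replaces the view
after = serving.get("/home")
missing = serving.get("/unknown")              # falls back to the default
```

Reads only ever see a complete view, either the old one or the new one, never a half-written mix.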

3. Speed Layer: The Speed Layer handles the real-time data processing and provides up-to-date views or results.

  • Processing Engine: Stream processing frameworks like Apache Flink, Apache Storm, or Apache Spark Streaming are commonly used to process real-time data streams.
  • Data Storage: The Speed Layer may utilize in-memory databases or key-value stores to maintain the latest state of the data.
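Maintaining the "latest state" can be sketched with an ordinary dictionary standing in for the in-memory store. The `SpeedLayer` class and the sensor events are hypothetical, and the rule for late events (ignore anything older than the stored reading) is one possible choice, not something the frameworks above mandate.

```python
class SpeedLayer:
    """Keeps the latest reading per sensor in an in-memory dict."""

    def __init__(self):
        self.latest = {}

    def process(self, event):
        """Update state incrementally; ignore events older than what is stored."""
        current = self.latest.get(event["sensor"])
        if current is None or event["ts"] >= current["ts"]:
            self.latest[event["sensor"]] = event

speed = SpeedLayer()
for event in [
    {"sensor": "s1", "ts": 1, "temp": 20.5},
    {"sensor": "s1", "ts": 3, "temp": 22.0},
    {"sensor": "s1", "ts": 2, "temp": 21.0},   # arrives late and is ignored
]:
    speed.process(event)

latest_temp = speed.latest["s1"]["temp"]
```

Each event is handled once, in constant time, which is the property that lets this layer keep up with a live stream.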

Unified View Layer: The Unified View Layer merges the results from the Batch Layer and the Speed Layer to provide a comprehensive and consistent view of the data.
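Merging works cleanly only when both layers store results in a form that can be combined. For an average, for example, averaging the two averages gives the wrong answer, so this sketch assumes each layer stores a `(total, count)` pair per key instead. The function names and sensor data are hypothetical.

```python
def merge_views(batch_view, speed_view):
    """Combine per-key (total, count) pairs from both layers."""
    merged = dict(batch_view)
    for key, (total, count) in speed_view.items():
        base_total, base_count = merged.get(key, (0.0, 0))
        merged[key] = (base_total + total, base_count + count)
    return merged

def average(merged, key):
    """Derive the final metric only after the pairs have been merged."""
    total, count = merged[key]
    return total / count

batch_view = {"s1": (400.0, 20)}                  # 20 historical readings
speed_view = {"s1": (50.0, 2), "s2": (30.0, 1)}   # readings since the last batch run
merged = merge_views(batch_view, speed_view)
avg_s1 = average(merged, "s1")
```

Keys that appear in only one layer, such as `s2` here, pass through unchanged, so a sensor that came online after the last batch run is still visible to queries.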

Conclusion: Lambda Architecture, with its ability to handle both batch and real-time processing, remains a valuable approach in dealing with diverse and dynamic data processing requirements.

Coming Soon: Lambda Architecture Part 2. We will discuss:

  • A detailed overview of the Unified View Layer, Serving Layer, and Streaming Layer
  • More about the Query Coordinator: a layer that manages the coordination and merging of results from both the Batch and Speed layers
  • Cons of Lambda Architecture
