Lambda or Kappa

Ravikiran durbha
Business Intelligence and Analytics
5 min read · Mar 15, 2022

Evolution of Data Processing Architectures

Before we explore blockchain in BI, as implied in my previous post (https://medium.com/business-intelligence-and-analytics/blockchain-bitcoin-and-business-intelligence-part-i-5efec01e1858), let us first look at how data processing has evolved over time.

As the proliferation of the internet and APIs inundated us with data, it changed the landscape of Business Intelligence and analytics in its wake. Developers and architects had to contend with challenges stemming from the velocity, variety and volume of data (often referred to as the 3Vs) in order to build scalable systems. In general, there are two ways to address these challenges: you can add more resources (memory and/or CPU) to your server (vertical scaling), or you can simply add more servers (horizontal scaling).

Vertical scaling hits a limit at some point (either it becomes financially impractical or there is simply no way to add more CPU or memory to the server), leaving you with only horizontal scaling. Enter "Infrastructure as a Service", or the cloud, where you can scale horizontally with ease. This started a new way of processing data, as I explored here: https://medium.com/business-intelligence-and-analytics/analytics-engineering-3c300d226571.

With horizontal scaling, though, you have to contend with the CAP theorem (also known as Brewer's conjecture), originally proposed by Dr. Eric A. Brewer (professor emeritus at the University of California, Berkeley).

Figure 1: CAP Theorem (Source: thinkmicroservices.com)

The CAP theorem simply states that any distributed data system (which is what you end up with after horizontal scaling) can guarantee only two of the following three properties:

  1. Consistency — After a successful write, all future reads will take it into account.
  2. Availability — Applications can read/write to the system at any time.
  3. Partition Tolerance — The system should continue to operate when a network partition occurs (i.e., when communication fails between any two nodes in the system).

The implication of the theorem is fairly obvious when you consider your options after a network partition occurs. Imagine two nodes in a system are unable to talk to each other — you are then left with one of two options:

Keep the system available by continuing to accept writes, and reconcile between the nodes once the partition heals.

Or

Keep the system consistent by disallowing all writes (and hence making it unavailable) until the partition heals.

Clearly, you cannot do both, and when you consider that no distributed system is safe from network partitioning, you quickly realize that you are always choosing one of the two options above. Usually, availability trumps consistency (Option 1) and you end up with "always available but eventually consistent" systems.

If you pick Option 2, your system is simply unavailable whenever there is a network partition; there is not much else to do. On the other hand, if you pick Option 1, there are a few things to think about. For example, there are different types of writes: updates and inserts. Reconciliation (after a network partition heals) is far more complex with updates than with inserts, as the sketch below illustrates. This is one of the reasons why data lake architectures do away with updates and allow only inserts.
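To make the contrast concrete, here is a minimal, hypothetical sketch in plain Python (not tied to any particular data lake or database product; the record names and fields are purely illustrative). Reconciling two insert-only logs after a partition heals is just a union of the logs; reconciling conflicting updates requires a conflict-resolution rule such as last-write-wins, which discards data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    key: str     # record identifier
    value: str   # payload
    ts: float    # write timestamp

def reconcile_inserts(log_a, log_b):
    """Insert-only reconciliation: the merged state is simply the union
    of both append-only logs; nothing is lost."""
    return set(log_a) | set(log_b)

def reconcile_updates(state_a, state_b):
    """Update reconciliation: any key written on both sides needs a winner.
    Here we use last-write-wins on the timestamp, which silently discards
    the losing write."""
    merged = dict(state_a)
    for key, event in state_b.items():
        if key not in merged or event.ts > merged[key].ts:
            merged[key] = event
    return merged

# During a partition, node A and node B each kept accepting writes.
log_a = [Event("order-1", "created", 1.0), Event("order-2", "created", 2.0)]
log_b = [Event("order-3", "created", 3.0)]
print(reconcile_inserts(log_a, log_b))      # union of both logs

state_a = {"order-1": Event("order-1", "shipped", 5.0)}
state_b = {"order-1": Event("order-1", "cancelled", 6.0)}
print(reconcile_updates(state_a, state_b))  # last write wins; "shipped" is lost
```

With the insert-only version nothing is lost; with last-write-wins, one side's update is silently discarded. That is exactly the kind of complexity an append-only data lake avoids.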

The genesis of the Lambda architecture, then, is directly linked to dealing with the CAP theorem, as Nathan Marz, its originator, discusses in his blog post: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html. He doesn't call it the Lambda architecture there, but that is what it came to be known as.

The basic premise is that if the data that lands in the lake is immutable (no updates, only inserts, which lets us deal with the CAP theorem more gracefully, as described above), then every query is just a function over that data with time as an input parameter. The challenge, though, is performance: computing every query on demand will not complete in a reasonable amount of time, especially as the volume of data grows. We can alleviate the issue by pre-computing results up to a point in time and exposing them as views using batch technologies like Hadoop (batch views), while serving data beyond that point in time on demand using streaming technologies like Storm.

Figure 2: Lambda Architecture (Source: Datanami.com)
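To make the split between the batch view and the real-time view concrete, here is a toy sketch in Python (an illustration of the idea only, not Marz's implementation, and not tied to Hadoop or Storm; the page-view example and all names are made up). A query merges a view pre-computed up to a cut-off time with a view over the events that arrived after it:

```python
from collections import Counter

def batch_view(master_dataset, cutoff):
    # In a real Lambda system this view is pre-computed periodically by the
    # batch layer; here we simply recompute it for illustration.
    return Counter(e["page"] for e in master_dataset if e["ts"] <= cutoff)

def realtime_view(recent_events, cutoff):
    # The speed layer covers only the events the last batch run has not seen.
    return Counter(e["page"] for e in recent_events if e["ts"] > cutoff)

def query(master_dataset, recent_events, cutoff):
    # query(all data) = merge(batch view, real-time view)
    return batch_view(master_dataset, cutoff) + realtime_view(recent_events, cutoff)

events = [
    {"page": "/home", "ts": 90},
    {"page": "/home", "ts": 120},
    {"page": "/pricing", "ts": 150},
]
# The last batch run covered everything up to ts=100; the speed layer serves the rest.
print(query(events, events, cutoff=100))  # Counter({'/home': 2, '/pricing': 1})
```

In a real system the batch view would be materialized ahead of time and only the small real-time slice computed on demand, which is what keeps query latency reasonable as the master dataset grows.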

For data platforms in most organizations, the picture above is too simplistic, as the platform also needs to integrate data across multiple source systems to provide a single source of truth. For some time, people tried to build this directly on Hadoop as well, but the amount of engineering effort needed to do it effectively is very steep. Companies like Snowflake then emerged to fill this gap by abstracting that effort away and providing the data warehouse as a service in the cloud. A more realistic pipeline, then, looks like the one in Figure 3.

Figure 3: An implementation of the Lambda Architecture (Source: Qlik.com)

Organizations that are more mature (in terms of being data-driven) will see more benefit from implementing the streaming layer end-to-end, all the way to the visualization. In most organizations, it is typical to see streaming data arrive from some sources and, after some analytics, end up in the data warehouse to further enrich the batch analysis.

In the above setup, though, there is duplication of transformation logic (data typically needs to be transformed before being persisted in a warehouse), since there are two paths into the warehouse. Some organizations may do away with batch input from sources and ingest all data using streaming technologies (like Kafka), as shown in Figure 4 below (a minimal sketch of this single-path idea follows the figure). This removes the duplication, since there is only one path into the warehouse, but it adds complexity to the implementation: everything from transaction consistency to historization has to be coded, which is easier to implement in a batch layer. This type of architecture (called the Kappa architecture) is more beneficial if the main use of the platform is real-time analytics and data science or advanced analytics.

Figure 4: An implementation of the Kappa Architecture (Source: Qlik.com)
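Here is a minimal sketch of the Kappa idea (deliberately library-free; in practice the append-only log would be Kafka and the processor a streaming engine, and all names below are illustrative). There is a single transformation path: every record, historical or new, flows through the same transform, and reprocessing is just replaying the log from the start:

```python
def transform(record):
    # The single transformation path, applied identically to historical
    # replays and to newly arriving records.
    return {"user": record["user"].lower(), "amount_cents": record["amount_cents"]}

def process(log, sink, from_offset=0):
    # Consume the append-only log in order and write results to the sink.
    # Reprocessing (e.g. after the transform changes) is just replaying the
    # log from offset 0 into a fresh sink; there is no separate batch path.
    for record in log[from_offset:]:
        sink.append(transform(record))

event_log = [
    {"user": "Alice", "amount_cents": 1000},
    {"user": "BOB", "amount_cents": 350},
]
materialized = []
process(event_log, materialized)
print(materialized)
```

Changing the transformation logic means deploying the new code and replaying the log into a fresh sink; there is no second code path to keep in sync, which is exactly the duplication the Kappa architecture removes.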

There are many patterns and tools used today for data processing (and ingestion) in the cloud, but mapping these first principles to the capabilities needed helps organizations keep it simple.
