The Lambda Architecture, simplified

Addressing complexity in a decades-old architecture

“Everything should be made as simple as possible, but not simpler.” — Albert Einstein

I loathe complexity. I didn’t always, but as I get older I seem to tolerate it less and less. Whether it’s because I keep banging up against my brain’s ability to overcome the magic number seven, or because I see the beauty in Occam’s Razor and what it produces, reducing complexity has long been one of my main missions in life. It didn’t hurt that this was drilled into me on a daily basis during the first decade of my professional career, as I developed and maintained a sophisticated software system in which complexity was avoided at all costs.

It’s primarily because of my aversion to complexity that I’ve always been uncomfortable with the Lambda Architecture. For those unfamiliar with it, the Lambda Architecture arose from a blog post authored by Nathan Marz back in 2011. To ridiculously over-simplify Lambda, the idea is to split complex data systems into a “real-time” component and a “batch” component. Data flows into both components at an extremely high rate.

Why flow all of the data into both components? The “Catch-22” with the Lambda Architecture is that the “batch” component can’t make the data immediately available for queries (it must perform some pre-processing first) and the “real-time” component is not efficiently queryable, at least for certain types of analytical (i.e., long-running, complex) queries. As a result, if the application needs to query all of the data, queries must be run against both systems, with the results aggregated on the application side.
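
To make that application-side burden concrete, here is a minimal Python sketch, assuming the query is a simple per-key event count. The names `batch_view`, `speed_view`, and `merge_views` are hypothetical and don't come from any Lambda framework; they only illustrate that the application has to know about both components and reconcile their answers itself.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Combine pre-computed batch counts with counts from the real-time side.

    Both arguments are plain dicts mapping a key to an event count.
    """
    merged = Counter(batch_view)   # start from the (stale but complete) batch result
    merged.update(speed_view)      # add the counts for events the batch side hasn't seen yet
    return dict(merged)


# Hypothetical results returned by the two components for the same query.
batch_view = {"sensor-a": 1000, "sensor-b": 750}  # produced by the last batch run
speed_view = {"sensor-a": 12, "sensor-c": 3}      # events that arrived since then

total = merge_views(batch_view, speed_view)
print(total)
```

Even in this toy form, the application has to issue two queries and understand where the boundary between "batch" and "real-time" data lies. Anything more involved than a sum (distinct counts, joins, percentiles) makes the merge step considerably harder.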

Lambda Architecture as proposed by Nathan Marz

The reason I’m so uncomfortable with the Lambda Architecture isn’t only its complexity, its maintenance of two copies of the data, and its unrealistic expectations of application developers (isn’t the point of a data system to abstract complexity away from the application, not push the complexity up to the application?). The main reason for my discomfort with Lambda is that it fills me with a sense of déjà vu.

There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope. We give them a turn and they make new and curious combinations. — Mark Twain

The problem faced in the Lambda Architecture is not new; it’s been a thorn in the side of large data systems for decades. Consider the interplay between traditional operational data stores and data warehouses. In these “systems”, data is first collected in one or more operational data stores. These operational data stores are generally ill-suited to analytical queries for a number of reasons:

  1. Their QoS requirements (or line-of-business ownership) prohibit analytical queries from co-existing on the same hardware
  2. The data is typically in a schema or data format (row organized) which isn’t well suited to analytical queries
  3. The analytics data must often be aggregated from multiple operational data stores for a full view of the enterprise

The end result is two distinct classes of data store, handling data at different speeds, with some processing/transformation occurring in the “batch” component: essentially, a Lambda Architecture.
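
The transformation step is easier to picture with a small example. The Python sketch below, using made-up order data and a hypothetical `to_columns` helper, pivots row-organized records (the shape an operational store typically holds) into a column-organized layout (the shape analytical engines tend to prefer). It illustrates the idea only and isn't how any particular warehouse implements it.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every row has the same set of fields.
    """
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


# Row-organized: each record is stored together, which suits single-record reads and writes.
rows = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "globex", "amount": 80.0},
    {"order_id": 3, "customer": "acme", "amount": 50.0},
]

# Column-organized: each field is stored together, which suits scans and aggregates.
columns = to_columns(rows)

# The row-organized aggregate has to visit every whole record...
total_from_rows = sum(row["amount"] for row in rows)
# ...while the column-organized one reads only the single column it needs.
total_from_columns = sum(columns["amount"])
```

Both totals are the same; the difference is how much data has to be touched to compute them, which is what matters once the table has billions of rows and dozens of columns.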

Traditional Architecture — Operational Data Stores continually feeding Data Warehouse through Message Queue and Continuous Data Ingest process

I’d venture to guess that such systems are in place in at least 40 of the FORTUNE 50 corporations.

Those who cannot remember the past are condemned to repeat it. — George Santayana

Why do I bring this up? Curiously enough, right around the time that Lambda emerged (and long before it was widely adopted), the traditional operational data store + data warehouse architecture was being disrupted by Hybrid Transactional/Analytical Processing (HTAP) technology. The idea behind HTAP is to use a single system to handle both transactional and analytical workloads. To make things perform (on both the “real-time” and “batch” sides of the house), these systems are typically in-memory (or are in-memory optimized), employ multiple data formats, and perform some sort of data transformation. In the end, however, they appear as single systems from an application perspective.

The best way to predict the future is to invent it — Alan Kay

So where does this leave us with respect to the Lambda Architecture? Can we not try to replace its complexity with an HTAP solution as well? I think the industry is already moving in this direction, as evidenced by Db2 Event Store.

Db2 Event Store is capable of ingesting over a million data points per second per node, and stores its data in an open, analytics-friendly format: Apache Parquet. Additionally, it’s tightly integrated with Apache Spark to provide both SQL-based query support and machine learning capabilities. Because it’s a single system, though, it’s simple to set up, and applications don’t require special logic to query ALL of the data.

To hide the complexity of Lambda, Db2 Event Store quickly lands data on locally attached SSDs (or NVMe, where available) and replicates it to remote nodes for high availability (much like Cassandra). At this point, all ingested data is available for queries, although not in its most efficient form. To make the data more efficient to query over the long term, at some later point (typically a second or two after ingest) the data is reformatted into Apache Parquet and indexed by a background thread, and is then pushed to a configurable shared storage layer (GlusterFS, NFS, S3, IBM Cloud Object Storage). Once the data lands on the shared storage layer, since it’s written in Apache Parquet format, it becomes available to any remote runtime engine capable of reading Apache Parquet data.
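
The flow described above can be sketched as a toy, in-memory model. To be clear, this is not the Db2 Event Store API or implementation: `ToyEventStore` and its methods are hypothetical names, plain Python lists stand in for local SSD and shared storage, a dict of lists stands in for Parquet, and replication and indexing are left out entirely. The only point is to show one store answering queries over both freshly ingested and already consolidated data.

```python
class ToyEventStore:
    """Illustrative single-store model: fast landing zone plus consolidated columns."""

    def __init__(self):
        self.local_log = []       # stand-in for rows landed on local SSD (row-organized)
        self.shared_columns = {}  # stand-in for columnar files on shared storage

    def ingest(self, event):
        # Land the event quickly; it is queryable right away, just not in its best form.
        self.local_log.append(dict(event))

    def consolidate(self):
        # Stand-in for the background step: reformat landed rows into columns,
        # then clear the landing zone. Assumes all events share the same fields.
        for event in self.local_log:
            for field, value in event.items():
                self.shared_columns.setdefault(field, []).append(value)
        self.local_log = []

    def values(self, column):
        # A single query path that transparently covers both representations.
        consolidated = self.shared_columns.get(column, [])
        recent = [event[column] for event in self.local_log if column in event]
        return consolidated + recent

    def total(self, column):
        return sum(self.values(column))


store = ToyEventStore()
store.ingest({"device": "a", "reading": 5})
store.ingest({"device": "b", "reading": 7})
before = store.total("reading")   # answered entirely from the landing zone

store.consolidate()               # in a real system this would run in the background
store.ingest({"device": "a", "reading": 3})
after = store.total("reading")    # answered from consolidated plus newly landed data
```

The application calls `total` the same way before and after consolidation; where each row currently lives is the store's problem, not the caller's.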

Db2 Event Store supports common streaming sources, persists data quickly to local storage, and then asynchronously enriches the data in batch through indexing and additional metadata. All data is stored in Apache Parquet format on shared storage and can therefore be queried through the Db2 Event Store engine, or directly via any Apache Parquet-compatible query engine.

While some might argue that the Db2 Event Store architecture is very close to the Lambda architecture, a critical distinction is that the Db2 Event Store engine obviates the need to write applications against two components. Instead, applications which require both real-time and batch data can query a single data store. Additionally, applications which can live with a small delay (again, only a few seconds) can query the Apache Parquet data directly from shared storage, thus allowing for the separation of resources between ingest and query processing, while still maintaining a single copy of the data.

If you’re struggling with Lambda and want to cut through the complexity, or are about to start out on a fast data journey and want a simple, full-stack solution, you can learn more about Db2 Event Store here. The easiest way to get your feet wet is to download the free developer edition, which runs on a laptop/desktop and is capable of high-speed ingest (speed dependent on the machine’s I/O performance) and real-time querying. Alternatively, if you’ve got questions about Db2 Event Store, or Lambda solutions in general, please reach out.