Unbundling the Modern Streaming Stack
Why is the modern streaming stack replacing the classic streaming architecture? What is it composed of, and what value does it bring?
Lately, I’ve heard a lot of buzz around “unbundling” on data Twitter. The focus has been primarily on data engineering, spurring many conversations and leaving many open-ended questions.
I’ve been living in the data space for a while, focusing on event streaming and data analytics. So I decided to “unbundle” the modern real-time analytics space as I understand it. I hope this helps data professionals, from novices to experts, quickly grasp the real-time analytics landscape.
TL;DR
This post is long and might take some time to digest. So let me outline the key questions it answers.
- Why do we use a streaming stack? What purposes does it serve?
- What is the classic streaming stack?
- What is the modern streaming stack?
- How to navigate the modern streaming stack?
- What’s new in this movement?
What is a streaming stack?
Before we dive deep, let’s do some jargon-busting to ensure that we are on the same page.
A streaming stack is the set of tools and processes you use to derive insights from unbounded data. In other words, you can use a streaming stack for streaming analytics — processing streaming data to uncover patterns and metrics.
Streaming data: events and event streams
An event can be considered the foundational block of any streaming data architecture. To understand streaming analytics, one must get familiar with event-first thinking.
Event-first thinking treats every state change in a system as a discrete event — each with a unique ID, timestamp, and payload describing what has happened. Event producers capture state changes and publish them as events, allowing interested consumers to process them and gain actionable insights.
Event streams
An event stream is an unbounded sequence of related events, ordered by event time. For example, a sensor periodically emits temperature readings into the readings stream, keyed with a unique ID, partitioned by sensor ID, and timestamped with reading time.
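For illustration, a single reading event might look like the following Python sketch. The field names here are hypothetical; in practice, the event would be serialized as JSON, Avro, or Protobuf before being published.

```python
import time
import uuid

# A hypothetical sensor reading event: a unique ID, an event timestamp,
# and a payload describing what happened.
reading_event = {
    "id": str(uuid.uuid4()),          # unique event ID
    "event_time": time.time(),        # when the reading was taken
    "payload": {
        "sensor_id": "sensor-42",     # also used as the partition key
        "temperature": 21.4,
    },
}
```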
Streaming analytics
We can capture and process an event stream to derive some meaningful information. That is called streaming analytics.
For example:
- The average of sensor readings over the last five seconds indicates whether the temperature is rising or not (see the sketch below).
- The changes in the GPS coordinates of a vehicle indicate whether it is moving or not.
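To make the first example concrete, here’s a dependency-free Python sketch of a tumbling five-second average. The readings() generator is a hypothetical stand-in for a real event source.

```python
import time
from collections import defaultdict

def readings():
    """Simulate an unbounded stream of (timestamp, temperature) events."""
    while True:
        yield (time.time(), 20.0)  # replace with a real event source
        time.sleep(1)

windows = defaultdict(list)
for ts, temp in readings():
    window_start = int(ts) - (int(ts) % 5)  # bucket events into 5-second windows
    windows[window_start].append(temp)
    # Emit the average for any window that has already closed.
    for start in [w for w in windows if w + 5 <= ts]:
        values = windows.pop(start)
        print(f"window [{start}, {start + 5}): avg={sum(values) / len(values):.2f}")
```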
Streaming analytics is different from batch analytics, where the data set is finite and always complete. Streaming analytics has to deal with constantly flowing, unbounded streams of data. Therefore, the tools and technologies used in streaming analytics form an ecosystem different from that of batch processing.
What is the “classic” streaming stack?
Streaming analytics is not a new thing. It has been around for quite some time in the form of Complex Event Processing (CEP) and Event Stream Processing. Although most of the work was academic, a few vendors like Progress Apama, StreamBase, and Esper tried to take stream analytics use cases to production.
But things started to pick up speed with the rise of big data. New stream processing technologies like Apache Storm emerged and were tried and tested in production at Internet-scale companies like Twitter. At the time, Twitter used Storm to power its trending topics.
In my opinion, the classic streaming architectures were heavily influenced by Lambda architecture. Stream processing systems fit into the speed layer, focusing more on fast analytics, whereas the batch layer consists of batch analytics systems that are slow but deliver accurate results.
Why didn’t the classic streaming stack take off?
The classic streaming stack had a good mix of stream processing technologies, from Storm, Samza, and Spark Streaming to Flink. They all had their glory days, but somehow these marvelous technologies failed to reach the great majority of software developers.
In my opinion, the following are the primary reasons the classic streaming stack never reached the average developer.
- Overly complicated technology: Many stream processing frameworks require you to master specialized skills such as distributed systems and performance engineering.
- Limited to the JVM: Almost all notable stream processors were built on JVM languages like Java and Scala, requiring you to learn JVM specifics like performance tuning and debugging out-of-memory issues.
- Higher infrastructure footprint: Stream processors demand substantial CPU and memory. Long-term storage was also needed for backfilling and historical analysis.
What is the modern streaming stack?
As more organizations became data-driven and realized the value of real-time analytics, they came to expect a streaming stack that is affordable, scalable, and maintainable over time.
The modern streaming stack (MSS) was born to address that. It delivers the same outcome as the classic streaming stack, but with a few subtle differences.
MSS is influenced by Kappa architecture
The classic streaming stack was based on Lambda architecture, which requires you to maintain two different distributed systems to deliver the same result. Two systems meant maintaining two codebases, separate engineering teams, and separate support teams. That was overkill for small and medium enterprises that wanted to adopt real-time analytics.
Jay Kreps proposed the Kappa architecture in 2014 to address the shortcomings of Lambda architecture. Kappa architecture enables real-time and batch processing with a single technology stack and looks similar to this.
Kappa architecture maintains only one codebase that performs stream processing on event streams. Unlike in the Lambda architecture, you reprocess only when your processing code changes and you need to recompute your results. And, of course, the recomputation job is just an improved version of the same code, running on the same framework and taking the same input data.
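To make reprocessing concrete, here’s a minimal sketch, assuming a Kafka topic named events and the confluent-kafka Python client. Using a fresh consumer group with auto.offset.reset=earliest replays the entire log through the new version of the processing code.

```python
from confluent_kafka import Consumer

def process(value: bytes) -> None:
    """Stand-in for the new version of your processing logic."""
    print(value)

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "reprocessing-v2",          # a fresh group id triggers a full replay
    "auto.offset.reset": "earliest",        # start from the beginning of the log
})
consumer.subscribe(["events"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    process(msg.value())
```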
MSS is built with modern, cloud-native technologies
Most MSS components run on cloud-native infrastructure such as Kubernetes and Docker, or come as serverless solutions. That dramatically reduces the hardware footprint and makes real-time analytics affordable for the masses.
MSS goes beyond the JVM and addresses other languages
Streaming technologies are now going beyond the JVM and embracing other languages built for specific purposes such as performance and data science.
Developer friendliness and rich tooling
Improved event tracing and debugging, query editors, and interactive documentation of MSS components make it simpler for the average developer to build and run real-time applications in hours instead of days.
Unbundling the modern streaming stack
The MSS carries over critical components from Kappa architecture. But in my opinion, it shouldn’t necessarily stop there. It can go beyond that.
Let’s get ready to unbundle it.
The MSS comprises several layers that work together to ingest, process, and analyze event streams in real time. Each layer features an amalgamation of open-source, commercial, SaaS, and on-prem software components, fostering a rich ecosystem.
The high-level architecture of MSS looks like this.
Event streaming platform
An event streaming platform (ESP) is the centerpiece of the MSS, providing scalable and fault-tolerant storage for incoming events until downstream applications process them. Most ESPs use a distributed, append-only log structure to write incoming events sequentially. This log is partitioned and replicated across multiple nodes for reliability.
Apache Kafka has been the most prominent open-source ESP to date. But there are other open-source alternatives, including Apache Pulsar and Redpanda. Redpanda is a C++ rewrite of Kafka with 100% Kafka API compatibility, aiming to deliver better performance and developer efficiency than Kafka.
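To make the ESP’s role concrete, here’s a minimal producer sketch, assuming a local Kafka broker and the confluent-kafka Python client; the topic and field names are placeholders.

```python
import json
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder broker

event = {
    "id": "evt-001",
    "timestamp": time.time(),
    "payload": {"sensor_id": "sensor-42", "temperature": 21.4},
}

# Keying by sensor ID sends all readings from one sensor to the same partition,
# preserving their order within the stream.
producer.produce("readings", key="sensor-42", value=json.dumps(event))
producer.flush()  # block until the broker acknowledges the write
```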
The following platforms provide managed Kafka as a service, taking away the management overhead from developers.
- Confluent Cloud
- AWS MSK
- Aiven
- Azure Event Hubs for Kafka
- Upstash Kafka (provides serverless Kafka)
Event streaming is different from enterprise messaging for several reasons, but several messaging platforms provide Kafka connectors to bridge the streaming and messaging worlds.
Apart from that, cloud vendors like AWS and Microsoft also offer serverless ESP solutions with usage-based pricing. Amazon Kinesis and Azure Event Hubs are two examples.
Real-time data ingestion layer
This layer is responsible for event-enabling the operational systems: detecting state changes and publishing them as discrete events. This “event ingestion” happens in real time; events are detected and produced as they happen, with no batching involved.
The event ingestion layer extracts events from various operational systems and delivers them to the event streaming platform for durable storage. The ingestion can be done in several ways.
Change Data Capture (CDC): CDC captures the changes made to OLTP databases as events in real-time. For example, CDC can detect row-level changes made to database tables and formulate them as events with the change applied as the payload. Debezium is a popular open-source CDC platform based on Kafka Connect. It provides a set of connectors to many databases to stream change events into Kafka.
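As an illustration, here’s a sketch of registering a Debezium Postgres connector through the Kafka Connect REST API. All connection details are placeholders, and the exact set of required properties varies by Debezium version.

```python
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "localhost",       # placeholder database details
        "database.port": "5432",
        "database.user": "postgres",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",                 # change events land in topics named <prefix>.<schema>.<table>
        "table.include.list": "public.orders",  # only stream changes from this table
    },
}

# Kafka Connect exposes a REST API (port 8083 by default) for managing connectors.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```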
ESP SDKs: Event streaming platforms provide language-specific SDKs and REST APIs for event production. For example, Kinesis provides a Python library that lets developers produce events from their applications to Kinesis.
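For instance, producing an event to Kinesis with boto3 might look like this sketch; the stream name, region, and payload are placeholders.

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

kinesis.put_record(
    StreamName="readings",  # placeholder stream name
    Data=json.dumps({"sensor_id": "sensor-42", "temperature": 21.4}),
    PartitionKey="sensor-42",  # records with the same key land on the same shard
)
```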
Event trackers: Event trackers provide SDKs that developers use to instrument their applications to emit well-structured events when state changes happen. Once collected, these events are routed to an event streaming platform by the tracking platform. Segment, Snowplow, and Amplitude are some leading examples.
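A tracking call with the module-level API of Segment’s analytics-python package might look like the following sketch; the write key, user ID, and event properties are placeholders.

```python
import analytics  # Segment's analytics-python package

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

# Emit a well-structured event describing a state change in the application.
analytics.track(
    user_id="user-123",
    event="Order Completed",
    properties={"order_id": "ord-789", "total": 42.50},
)
analytics.flush()  # deliver queued events before the process exits
```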
Stream processing layer
The stream processing layer consumes the events that land in the ESP, processing them in real time to derive actionable insights. This layer consists of stream processors: distributed frameworks that allow developers to write and deploy stream processing applications.
In the context of analytics, stream processing applications are primarily used for data engineering tasks (streaming ETL) and streaming analytics. Some example use cases include:
- Stateful aggregations like count, sum, average, etc. E.g., what’s the total value of orders received so far?
- Window operations. E.g., what’s the count of events received during the last five seconds?
- Joining streams together to produce consolidated streams.
- Maintaining local materialized views and joining them with streams for further enrichment. E.g., using one as a lookup table.
- Transforming a stream and writing it back to the ESP. E.g., converting a string to uppercase and writing it to another Kafka topic so that another application can process it (see the sketch after this list).
- Repartitioning or re-keying a stream.
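Here’s a minimal sketch of the uppercase transformation above, assuming the confluent-kafka Python client and placeholder topic names. A production job would use a stream processor with checkpointing and delivery guarantees rather than raw clients.

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "uppercase-transformer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["raw-strings"])  # placeholder input topic
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    # Transform the event and write it back to the ESP for downstream consumers.
    producer.produce("uppercased-strings", value=msg.value().decode().upper())
    producer.poll(0)  # serve delivery callbacks and keep the queue moving
```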
The first-generation stream processors were limited to developers on the JVM. Later, portable runtimes were introduced so that applications written in other languages could be deployed on the same stream processor. For example, Apache Flink, Beam, and Spark Streaming support languages like Python and SQL.
Modern stream processors are targeting wider developer adoption by improving the features that matter most to developers. For example, Materialize is wire-compatible with Postgres, letting you run streaming SQL queries as if it were a typical Postgres instance. Bytewax is another stream processor, which allows you to run Python applications on top of it.
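Because Materialize speaks the Postgres wire protocol, a standard Postgres driver can query it. Below is a sketch using psycopg2; the connection details and the order_totals view are placeholders.

```python
import psycopg2

# Connect as if Materialize were an ordinary Postgres instance
# (6875 is Materialize's default port).
conn = psycopg2.connect(
    host="localhost", port=6875,
    user="materialize", dbname="materialize",
)
conn.autocommit = True

with conn.cursor() as cur:
    # order_totals is a hypothetical materialized view, kept up to date
    # incrementally as new events arrive.
    cur.execute("SELECT * FROM order_totals")
    for row in cur.fetchall():
        print(row)
```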
Serving layer
The serving layer is the primary access point to consume real-time analytics produced in the streaming stack. Typical consumers of this layer are:
- BI dashboards and reporting infrastructure.
- External user-facing analytics applications.
- Data products and APIs, headless BI platforms.
- Data scientists and data analysts for ad-hoc data analysis, running SQL queries on aggregated data.
- Machine learning pipelines (real-time personalization, ad tech), etc.
The serving layer is exposed to both internal and external audiences, including human users and software clients, all looking for fast and accurate analytics at scale.
Therefore, the serving layer must:
- Serve queries with sub-second latency to provide a better user experience.
- Support a throughput of hundreds of thousands of queries per second to serve an Internet-scale user base.
- Ensure data freshness — serve analytics from data ingested a few seconds ago.
- Run complex OLAP queries, supporting joins, aggregations, and filtering on large data sets.
A real-time OLAP database is ideal for the serving layer as it meets all the requirements stated above. NoSQL databases and data warehouses are alternatives, but they fall short on latency, throughput, and data freshness.
Typically, real-time OLAP databases ingest events directly from the ESP and make them available for querying. Alternatively, the stream processing layer can write its output to a Kafka topic to be consumed by the serving layer afterward.
Several open-source real-time OLAP databases exist today, along with the vendors providing managed services.
- Apache Pinot: Managed service offered by StarTree
- Apache Druid: Managed service offered by Imply
- ClickHouse
- Rockset: Available as a SaaS offering
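As an illustration of the request path, here’s a minimal sketch of querying Apache Pinot through its broker’s SQL REST endpoint; the broker address, table, and column names are placeholders.

```python
import requests

# Pinot brokers accept SQL over HTTP; 8099 is the default broker port.
resp = requests.post(
    "http://localhost:8099/query/sql",
    json={"sql": "SELECT sensorId, AVG(temperature) FROM readings "
                 "GROUP BY sensorId LIMIT 10"},  # placeholder table and columns
)
resp.raise_for_status()
print(resp.json())
```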
Data products and APIs
The Data Mesh paradigm, coined by Zhamak Dehghani, introduces a decentralized approach to collecting, processing, consuming, and governing data within an organization. The data mesh concept brings product-led thinking to enterprise data management, enabling us to treat analytics as a data product.
My perspective is that we can productize streaming analytics as well. We can package real-time analytics as event-driven APIs, document and catalog them, and have them available for consumption.
Well-documented APIs can be used to disseminate the real-time analytics stored in the ESP or the serving layer. For example, we can expose metrics inside the serving layer as RESTful APIs, allowing request-response style consumption. Processed analytics inside the ESP can be exposed as asynchronous APIs, allowing event-driven consumption; for example, we can stream metrics from a Kafka topic over WebSockets or Server-Sent Events (SSE).
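Here’s a sketch of the SSE approach, bridging a Kafka topic to HTTP clients with Flask and confluent-kafka; the topic, broker, and endpoint names are placeholders.

```python
from flask import Flask, Response
from confluent_kafka import Consumer

app = Flask(__name__)

@app.route("/metrics/stream")
def stream_metrics():
    def events():
        consumer = Consumer({
            "bootstrap.servers": "localhost:9092",  # placeholder broker
            "group.id": "sse-bridge",
            "auto.offset.reset": "latest",  # only push new metrics
        })
        consumer.subscribe(["metrics"])  # placeholder topic
        try:
            while True:
                msg = consumer.poll(timeout=1.0)
                if msg is None or msg.error():
                    continue
                yield f"data: {msg.value().decode()}\n\n"  # SSE frame format
        finally:
            consumer.close()
    return Response(events(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(port=8000, threaded=True)
```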
These APIs can then be treated as data products, where concepts such as versioning, discovery and documentation, access control, and billing come into play.
Tiered storage
The streaming stack needs to decouple compute and storage at some point. Raw events ingested into the ESP and the processed analytics in the serving layer must be flushed out to cheap storage as they age.
The classic streaming stack used the data lake for that. Now, the modern streaming stack uses tiered storage to provide infinite, elastic, and economical storage for streaming data.
Redpanda supports a form of tiered storage through its shadow indexing feature. Kafka is also working on KIP-405 to add tiered storage. StarTree recently announced tiered storage for Apache Pinot, allowing computed metrics to be flushed out to remote storage.
Metadata management, schema evolution, and data governance
Well, there are certainly many moving parts in the modern streaming stack, and they must stay consistent and work in harmony as the stack evolves. Therefore, we need a central mechanism to govern critical aspects of the stack, such as event schema evolution, metric cataloging, and data API management.
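For schema evolution specifically, a schema registry is the usual central mechanism. Below is a sketch of registering an Avro schema with Confluent Schema Registry via the confluent-kafka client; the subject name and schema are placeholders.

```python
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({"url": "http://localhost:8081"})  # placeholder URL

reading_schema = """
{
  "type": "record",
  "name": "Reading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "event_time", "type": "long"}
  ]
}
"""

# Registering under the "<topic>-value" subject lets the registry enforce
# compatibility rules when the schema evolves.
schema_id = client.register_schema("readings-value", Schema(reading_schema, schema_type="AVRO"))
print(f"registered schema id: {schema_id}")
```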
What’s coming next
As I stated above, we can build a modern streaming stack by combining components from different vendors, technologies, and platforms. Although that flexibility is great for most, some organizations prefer everything built and managed under one roof.
There are specific reasons behind that, including an immature data culture and a shortage of skills. That is where streaming data platform as a service comes into the picture. Meroxa and Decodable are two companies trying to realize this concept, making real-time analytics accessible and affordable for the masses.
References
Questioning the Lambda Architecture — Jay Kreps
Kappa Architecture is Mainstream Replacing Lambda — Kai Waehner
Data Mesh Principles and Logical Architecture — Zhamak Dehghani