Comparing Stateful Stream Processing and Streaming Databases
How do these two technologies work, how do they differ, and when is the right time to use each?
Choosing between a stateful stream processor and a streaming database has long been a debated question. I combed the Internet for a clear rationale, came up short, and decided to write this post to share my experience and knowledge with you.
After reading this, you should know how these technologies work internally, their differences, and when to use them.
Overview
A stream processing application is a DAG (Directed Acyclic Graph) where each node is a processing step. You build the DAG by writing individual processing functions that operate on the data flowing through them. These functions can be stateless operations, like transformations or filtering, or stateful operations, like aggregations that remember results across executions. Stateful stream processing is used to derive insights from streaming data.
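To make this concrete, here is a minimal sketch of such a DAG written with Kafka Streams (the topic and store names are made up for illustration). It chains two stateless steps with one stateful step:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ClickTopology {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> clicks =
            builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()));

        clicks
            .filter((user, page) -> page != null)        // stateless: drop malformed events
            .mapValues(page -> page.toLowerCase())       // stateless: normalize the value
            .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
            .count(Materialized.as("clicks-per-user"))   // stateful: remembers a count per key
            .toStream()
            .to("clicks-per-user-output", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}
```

The stateless steps forget each event as soon as it passes through; the count step keeps a per-key state store that survives across events.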
As their name suggests, streaming databases can ingest data from streaming data sources and make it available for querying immediately. They extend stateful stream processing with features from the database world, such as columnar file formats, indexing, materialized views, and scatter-gather query execution. Streaming databases come in two variations: incrementally updated materialized views and real-time OLAP databases.
Differentiating criteria
Let’s explore their differences across a few criteria.
1. Streaming data ingestion and state persistence
Both technologies are equally capable of ingesting data from streaming data sources like Kafka, Pulsar, Redpanda, Kinesis, etc., producing analytics while the data is still fresh. They also have solid watermarking strategies to deal with late-arriving data.
But they are quite different when it comes to persisting the state.
Stream processors partition the state and materialize it on local disk for performance. This local state is periodically replicated to a remote “state backend” for fault tolerance, a process called checkpointing in most stateful streaming implementations.
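As an illustration, here is roughly how this looks in Apache Flink, one widely used stateful stream processor; the bucket URL is a placeholder, and the exact class names differ between Flink versions:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep operator state on local disk (RocksDB) for fast reads and writes.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());

        // Snapshot that local state to durable remote storage every 30 seconds.
        env.enableCheckpointing(30_000);
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");

        // ... define sources, stateful operators, and sinks here, then call env.execute(...)
    }
}
```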
On the other hand, streaming databases follow a similar approach to many databases. They first write the ingested data into disk-backed “segments”: column-oriented files optimized for OLAP queries. Segments are replicated across the cluster for scalability and fault tolerance.
2. Querying the state
Since the state is partitioned across multiple instances, a single stream processor node only holds a subset of the entire state. You must contact multiple nodes to run an interactive query against the full state. In that case, how do you know which nodes to contact?
Fortunately, many stream processors provide endpoints to run interactive queries against the state, such as state stores in Kafka Streams. However, these endpoints are not as scalable as advertised. Alternatively, stream processors can write the aggregated state to a read-optimized store, such as a key-value database, to offload the query complexity.
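For example, this is roughly what an interactive query against a Kafka Streams state store looks like (the store and key names are made up). Note that it only sees the partitions held by the local instance, which is exactly the limitation mentioned above:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class InteractiveQueryExample {
    // Query the locally held slice of the "clicks-per-user" store.
    public static Long localClickCount(KafkaStreams streams, String userId) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
            StoreQueryParameters.fromNameAndType(
                "clicks-per-user", QueryableStoreTypes.keyValueStore()));

        // Returns null if this instance does not host the partition for userId;
        // answering for any key requires locating the right instance first,
        // e.g. via streams.queryMetadataForKey(...).
        return store.get(userId);
    }
}
```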
When it comes to querying the state, streaming databases behave similarly to regular OLAP databases. They leverage query planners, indexes, and smart query pruning techniques to improve query throughput and reduce latency.
Once a streaming database receives a query, the query broker scatters it across the nodes hosting the relevant segments. Each node executes the query locally against its segments. The broker then gathers the results, stitches them together, and returns them to the caller.
3. Placement of state manipulation logic
Stream processors require you to know the state manipulation logic beforehand and bake it into the data processing flow. For example, to calculate the running total of events, you must first write the logic as a stream processing job, compile it, package it, and deploy it across all instances of the stream processor.
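Here is a minimal sketch of that “bake it in beforehand” style, again using Kafka Streams with made-up topic names. The running-total logic is fixed at build time; changing it means recompiling and redeploying the job:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class RunningTotalJob {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("order-amounts", Consumed.with(Serdes.String(), Serdes.Long()))
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            // The aggregation logic is compiled into the job itself.
            .aggregate(
                () -> 0L,                                // initial total per key
                (key, amount, total) -> total + amount,  // running total
                Materialized.with(Serdes.String(), Serdes.Long()))
            .toStream()
            .to("running-totals", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```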
Conversely, streaming databases have an ad-hoc and human-centric approach toward state manipulation. They offload the state manipulation logic to the consumer application, which could be a human, a dashboard, an API, or a data-driven application. Ultimately, consumers decide what to do with the state rather than deciding it beforehand.
4. Implementation of state manipulation logic
Stream processors give you many interfaces to implement the logic to access and manipulate the state. As a developer, you write the dataflow using a programming language of your choice, such as Java, Scala, Python, Go, etc. The stream processor then takes that code, converts it into an optimized DAG, and distributes it across the cluster. You can also use SQL to specify the data flow. However, programming languages give you more control and flexibility than SQL.
SQL is the primary interface to access the state in most streaming databases. SQL provides a universal, declarative, and concise approach for consumers to access the state. Also, it enables smooth integration between streaming databases and downstream consumers like BI and reporting tools.
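For instance, assuming the streaming database exposes a PostgreSQL-compatible endpoint (several of them do) and maintains a continuously updated view named page_view_counts, a consumer could query it over plain JDBC like this; the connection URL, credentials, and view name are all placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DashboardQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; requires the Postgres JDBC driver on the classpath.
        String url = "jdbc:postgresql://localhost:4566/dev";

        try (Connection conn = DriverManager.getConnection(url, "dashboard", "");
             Statement stmt = conn.createStatement();
             // Ad-hoc query over a continuously updated view; the consumer decides what to ask.
             ResultSet rs = stmt.executeQuery(
                 "SELECT page, view_count FROM page_view_counts ORDER BY view_count DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.printf("%s -> %d%n", rs.getString("page"), rs.getLong("view_count"));
            }
        }
    }
}
```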
5. Placement in the streaming ETL pipeline
With stream processors, you develop the processing logic as a DAG, which has a source (where the data flow originates) and a sink (where the data flow terminates). Each node in the DAG represents a processing function such as map (transformations), filter, aggregate, reduce, etc.
When this DAG executes, the data entering the stream processor can end up in a different location, perhaps on a different machine or runtime.
So, stream processors can perform streaming ETL operations that pre-process the data before writing it to disk or to another system. For example, we can join streams to enrich them, filter out unwanted data, and perform windowed aggregations before writing the final output.
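A rough sketch of such a streaming ETL step with Kafka Streams (topic names and the window size are made up, and the windowing API varies slightly across versions): filter out unwanted events, compute a 5-minute windowed total, and write the cleaned result downstream.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class StreamingEtl {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("raw-purchases", Consumed.with(Serdes.String(), Serdes.Long()))
            .filter((user, amountCents) -> amountCents != null && amountCents > 0) // drop unwanted data
            .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))      // windowed aggregation
            .reduce(Long::sum)
            // Flatten the windowed key into a plain string so downstream consumers can read it easily.
            .toStream((windowedKey, total) -> windowedKey.key() + "@" + windowedKey.window().start())
            .to("purchase-totals-5m", Produced.with(Serdes.String(), Serdes.Long()));

        return builder;
    }
}
```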
Streaming databases are not excellent at pre-processing data. While they can still perform lookup joins with small static tables and some lightweight transformations during data ingestion, they are not designed for complex data massaging.
Therefore, stream processors are placed somewhere in the middle of a streaming ETL pipeline, while streaming databases are placed at the serving side (termination) of the pipeline.
6. Ideal use cases
Once you deploy the state manipulation logic, stream processors continue execution without further human intervention. Therefore, stream processors are ideal for use cases where you need to make quick decisions with minimal human error. Anomaly detection, real-time personalization, monitoring, and alerting are some examples where decision-making can be automated and offloaded to a machine.
Streaming databases cater to use cases where fast and fresh data is needed but in an ad-hoc manner (pull-based approach). Therefore, they are good for human decision-makers who query the state via an application, dashboard, or API. User-facing analytics, data-driven applications, and BI are some good use cases for streaming databases.
Conclusion
Lines have started to blur between stateful stream processors and streaming databases. Both are trying to solve the same problem in different ways. Choosing either should be driven by use cases, skill gaps, and how fast you want to hit the market.
So, let me give you a few pointers to aid in your selection criteria.
Use stream processors when
Stateful stream processors are good when you know exactly how to manipulate the state ahead of time. They are ideal for use cases designed for machines, where low-latency decision-making is a top priority with less human intervention. Also, if your streaming data needs further cleaning and massaging before ultimate decision-making, you can use a stream processor as a streaming ETL pipeline.
If you need fast access to the materialized state, you can write the state to a read-optimized database and run queries.
Use streaming databases when
Streaming databases are ideal for use cases where you can’t predict the data access patterns ahead of time. They first capture the incoming streams and let you query them on demand with random access patterns. Streaming databases are also good when you don’t need heavy transformations on the incoming data, and your pipeline terminates at the serving layer.
Why not both?
In my opinion, they are not competitors; they complement each other, like pairing wine with good cheese.
If you have a complex set of use cases, you can run the ones requiring ETL, windowing, and anomaly detection on the stream processor, and use a streaming database to collect the enriched, cleansed output from it and expose that output to the outside world to be queried in a scalable manner.
All-in-one?
Having both of them in a single deployment might sound like a pipe dream.
Materialize, RisingWave, and DeltaStream are emerging technologies in this space trying to bring stream processing and streaming databases together as a self-serve platform.
I still haven’t had time to work closely with these technologies. Maybe I will in the future and write another blog post.