Comparing Stateful Stream Processing and Streaming Databases

How do these two technologies work? How do they differ, and when is the right time to use each?

It has always been a difficult choice!

Overview

A stream processing application is a DAG (Directed Acyclic Graph) in which each node is a processing step. You build the DAG by writing individual processing functions that operate on the data as it flows through them. These functions can be stateless operations, such as transformations or filtering, or stateful operations, such as aggregations that remember the result of the previous execution. Stateful stream processing is used to derive insights from streaming data. A streaming database, by contrast, first captures the incoming streams into continuously updated internal state and then lets you query that state on demand.
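
To make the DAG idea concrete, here is a minimal sketch using Apache Flink's DataStream API, one popular stateful stream processor; the events and field positions are illustrative, not taken from any particular production job.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RunningTotalJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(
                Tuple2.of("user-1", 5L),
                Tuple2.of("user-2", 3L),
                Tuple2.of("user-1", 7L))
            // Stateless step: drop non-positive amounts.
            .filter(event -> event.f1 > 0)
            // Stateful step: partition by key and keep a running total per key.
            .keyBy(event -> event.f0)
            .sum(1)
            .print();

        env.execute("running-total");
    }
}

Each chained call becomes a node in the DAG; the keyed sum is the stateful node, since it must remember the previous total for every key.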

Differentiating criteria

Let’s explore their differences across a few criteria.

1. Streaming data ingestion and state persistence

Both technologies are equally capable of ingesting data from streaming data sources like Kafka, Pulsar, Redpanda, and Kinesis, producing analytics while the data is still fresh. Both also have solid watermarking strategies to deal with late-arriving data.
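
As a sketch of what a watermarking strategy looks like in Flink, the following tolerates events that arrive up to five seconds out of order; the event type and the five-second bound are illustrative assumptions.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class Watermarks {
    // Illustrative event type carrying its own event-time field.
    public static class MyEvent {
        public long timestampMillis;
    }

    public static WatermarkStrategy<MyEvent> boundedLateness() {
        return WatermarkStrategy
                // Accept events up to five seconds out of order.
                .<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                // Tell Flink which field carries the event time.
                .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis);
    }
}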

[Figure: How checkpointing works in Flink]
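
The figure above refers to Flink's checkpointing mechanism: the job periodically snapshots all operator state to durable storage so that, after a failure, it can restart from the last consistent snapshot. A minimal sketch of enabling it follows; the interval and storage path are assumptions.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot all operator state every 10 seconds with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Durable location for the snapshots; any shared filesystem works.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

        // A trivial pipeline so the job has something to run.
        env.fromElements(1, 2, 3).print();
        env.execute("checkpointed-job");
    }
}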

2. Querying the state

Since the state is partitioned across multiple instances, a single stream processor node holds only a subset of the entire state. To run an interactive query against the full state, you must contact multiple nodes. In that case, how do you know which nodes to contact? Streaming databases sidestep this problem with a dedicated query layer: a broker that knows how the data is distributed across servers fans the query out and merges the partial results.

[Figure: Partitioned state of stateful stream processors]
[Figure: How segments are distributed across Pinot servers]
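
With a streaming database like Apache Pinot, for example, a client never addresses individual servers; it sends SQL to a broker endpoint. Here is a hedged sketch using only the JDK's HTTP client, assuming a local broker on Pinot's default port and an illustrative events table:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PinotQuery {
    public static void main(String[] args) throws Exception {
        // The broker fans this out to the servers holding the relevant segments.
        String body = "{\"sql\": \"SELECT userId, SUM(amount) FROM events GROUP BY userId\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8099/query/sql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}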

3. Placement of state manipulation logic

Stream processors require you to know the state manipulation logic beforehand and bake it into the data processing flow. For example, to calculate the running total of events, you must first write the logic as a stream processing job, compile it, package it, and deploy it across all instances of the stream processor.

Streaming databases, by contrast, offer ad-hoc state manipulation that suits both humans and applications: the query, and therefore the state it maintains, can be defined and redefined at runtime.
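
As a hedged sketch of that ad-hoc style, here is the same running total declared at runtime against a streaming database that speaks the Postgres wire protocol (Materialize is assumed here; the connection details and schema are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AdHocView {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:6875/materialize", "materialize", "");
             Statement stmt = conn.createStatement()) {

            // The state manipulation logic is declared at query time, not baked
            // into a compiled job; the database keeps the view incrementally
            // up to date as new events arrive.
            stmt.execute("CREATE MATERIALIZED VIEW running_totals AS "
                    + "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id");

            try (ResultSet rs = stmt.executeQuery("SELECT * FROM running_totals")) {
                while (rs.next()) {
                    System.out.println(rs.getString("user_id") + " -> " + rs.getLong("total"));
                }
            }
        }
    }
}

No redeployment is needed: dropping and recreating the view changes the logic on the fly.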

4. Implementation of state manipulation logic

Stream processors give you several interfaces for implementing the logic that accesses and manipulates the state. As a developer, you write the dataflow in a programming language of your choice, such as Java, Scala, Python, or Go. The stream processor then takes that code, converts it into an optimized DAG, and distributes it across the cluster. You can also specify the dataflow in SQL, although programming languages give you more control and flexibility than SQL.
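
For comparison, here is a sketch of the earlier running total expressed through Flink's SQL interface; the datagen source is a stand-in for a real connector such as Kafka, and the schema is illustrative.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlRunningTotal {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A generated source table; a real job would point this at Kafka.
        tEnv.executeSql("CREATE TABLE events (user_id STRING, amount BIGINT) "
                + "WITH ('connector' = 'datagen')");

        // The planner compiles this SQL into the same kind of optimized DAG
        // you would otherwise build by hand with the DataStream API.
        tEnv.executeSql("SELECT user_id, SUM(amount) AS total "
                + "FROM events GROUP BY user_id").print();
    }
}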

5. Placement at the streaming ETL pipeline

With stream processors, you develop the processing logic as a DAG with a source (where the data flow originates) and a sink (where the data flow terminates). Each node in the DAG represents a processing function such as map (transformation), filter, aggregate, or reduce, as sketched below.
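
A hedged sketch of such a streaming ETL pipeline with Flink's Kafka connector; the broker address and topic names are assumptions:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StreamingEtlJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: where the data flow originates.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("raw-events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Sink: where the data flow terminates.
        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("clean-events")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
                .map(String::trim)                  // transform
                .filter(line -> !line.isEmpty())    // filter
                .sinkTo(sink);

        env.execute("streaming-etl");
    }
}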

6. Ideal use cases

Once you deploy the state manipulation logic, stream processors continue execution without further human intervention. Therefore, stream processors are ideal for use cases where you need to make quick decisions with minimal human error. Anomaly detection, real-time personalization, monitoring, and alerting are some examples where decision-making can be automated and offloaded to a machine.
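
As a hedged sketch of that kind of automated decision-making, the following Flink function raises an alert as soon as a key's running total crosses a threshold; the threshold, key, and event shape are illustrative assumptions.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Applied after keyBy(userId) on a stream of amounts: emits an alert the
// moment a user's cumulative amount exceeds the threshold, with no human
// in the loop.
public class ThresholdAlert extends KeyedProcessFunction<String, Long, String> {
    private static final long THRESHOLD = 10_000L;
    private transient ValueState<Long> totalState;

    @Override
    public void open(Configuration parameters) {
        totalState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("total", Long.class));
    }

    @Override
    public void processElement(Long amount, Context ctx, Collector<String> out)
            throws Exception {
        Long previous = totalState.value();
        long total = (previous == null ? 0L : previous) + amount;
        totalState.update(total);
        if (total > THRESHOLD) {
            out.collect("ALERT: " + ctx.getCurrentKey() + " exceeded " + THRESHOLD);
        }
    }
}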

Conclusion

Lines have started to blur between stateful stream processors and streaming databases. Both are trying to solve the same problem in different ways. Choosing either should be driven by your use cases, skill gaps, and how fast you want to hit the market.

Use stream processors when

Stateful stream processors are a good fit when you know exactly how the state must be manipulated ahead of time. They are ideal for use cases designed for machines, where low-latency decision-making with little human intervention is the top priority. Also, if your streaming data needs further cleaning and massaging before the final decision is made, you can use a stream processor as a streaming ETL pipeline.

Use streaming databases when

Streaming databases are ideal for use cases where you can’t predict the data access patterns ahead of time. They first capture the incoming streams and allow you to query on demand with random access patterns. Streaming databases are also good when you don’t need heavy transformations on the incoming data, and your pipeline terminates at the serving layer.

Why not both?

A common pattern is to combine them: a stream processor handles the cleaning and heavy transformations, and its output feeds a streaming database that acts as the serving layer.

[Figure: Combined architecture]

All-in-one?

For now, though, having both capabilities in a single, all-in-one deployment remains a pipe dream.
