Batch or Stream?

Benjamin Djidi
3 min read · May 16, 2022


Walking down the data pipeline — Image courtesy of Unsplash

Batch or stream is one of the hottest debates of the moment for data practitioners. So, through all the vendor talk and industry habits, which is it?

The truth is, it really depends. Choosing between batch and stream pipelines, or a mix of both, boils down to your specific needs and capabilities: team and operational readiness, pull/push consumers, tooling, SLAs… Counterintuitively, the last thing it's about is probably the one drawing the most ink: "real-time". Most of all, it's about implementation and requirements: bad streaming will consistently underperform good batch, and vice versa.

Operations

There's extreme operational efficiency to be found in streaming services. Think batch minus the scheduling and the truncated windows. Due to the continuous nature of streaming operations, there's better handling of late events and downtime (which lasts only for the duration of the failure, as opposed to blocking the next run). Replays are achieved by resetting the cursor/offset on the source jobs, and downstream updates will simply follow suit. In batch, you'd likely either run a dedicated backfill DAG or trigger a manual run using job-level parameters.
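As a minimal in-memory sketch (all names hypothetical, no real broker involved), a streaming replay really is just rewinding a cursor, with downstream state following suit:

```python
class StreamConsumer:
    """Toy consumer over an append-only log; a replay is just an offset reset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.processed = []

    def poll(self):
        # Process everything past the current offset.
        while self.offset < len(self.log):
            self.processed.append(self.log[self.offset].upper())
            self.offset += 1

    def replay_from(self, offset):
        # Replaying means resetting the cursor; downstream simply reprocesses.
        self.offset = offset
        self.processed = self.processed[:offset]
        self.poll()


log = ["a", "b", "c"]
consumer = StreamConsumer(log)
consumer.poll()

log[1] = "b-fixed"          # a late correction lands in the source log
consumer.replay_from(1)     # rewind the cursor; no dedicated backfill DAG
print(consumer.processed)   # ['A', 'B-FIXED', 'C']
```

The batch equivalent of that last line would typically be a parameterized backfill run re-materializing the affected window.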

System complexity

Streaming systems are still complex. Frameworks and vendors are busy packaging things into scalable, easily deployable templates, but nothing beats the ease of spinning up a data warehouse on a cloud vendor's premises (or simply dumping data into elastic file storage). Once that's there, the batch world has you working with a mere orchestrator. In the streaming world, assuming you're building it yourself, you'll need to provision and maintain far more components (brokers, state management, processing engines and the like).

Toolbox

Batch has a far more mature toolbox. Decades and generations of tools have been built to live on indexed storage. There's a saying that the world runs on Excel; look one level lower and the world runs on queries. From anomaly detection to orchestration to libraries, streaming is still lagging behind.

Pull / push

This one is primarily consumer-driven, and since the first consumers are usually dashboards, it tends to start with pull. Keep in mind that it's a lot easier to enable pulling from a continuous system (just drop the processed data in a serving layer for consumers to fetch from) than it is to enable push from a batch service (where you typically end up building a parallel streaming engine anyway). This is why businesses with mature data strategies are slowly moving away from Hadoop-based sources of truth and towards controlling operations ahead of storage writes. A good way to know where you stand on that journey is to ask, "Can you tell me when X happens?" If the answer looks anything like "How often do you need the answer?", there's work ahead.
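A rough sketch of why pull comes cheap on top of a continuous system (names and shapes here are made up): the processor already materializes results into a serving layer, so push is one extra hook rather than a parallel engine.

```python
class StreamProcessor:
    """Toy continuous processor: serves pull consumers and pushes to subscribers."""

    def __init__(self):
        self.serving_layer = {}   # consumers pull from here (e.g. a dashboard query)
        self.subscribers = []     # consumers that asked to be pushed to

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def process(self, event):
        # Maintain a materialized view: pull support comes essentially for free.
        key = event["user"]
        self.serving_layer[key] = self.serving_layer.get(key, 0) + event["amount"]
        # Push support is one extra loop over subscribers.
        for callback in self.subscribers:
            callback(event)


alerts = []
processor = StreamProcessor()
processor.subscribe(lambda e: alerts.append(e) if e["amount"] > 100 else None)

processor.process({"user": "ada", "amount": 50})
processor.process({"user": "ada", "amount": 150})

print(processor.serving_layer["ada"])  # 200, available to any pull consumer
print(len(alerts))                     # 1, pushed the moment it happened
```

Answering "tell me when X happens" from a batch service, by contrast, means either polling the warehouse or bolting on exactly this kind of event loop.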

Resilience

Data contracts are often built into the streaming world, and that's amazing. These system-driven agreements prevent issues from arising on the producer side, sparing substantial reactive fire-fighting on the consumer side. In batch, the main mechanism consists of data quality checks and anomaly detection after modelling, so it becomes hard to prevent failures. That's not to say these aren't necessary, but they are reactive methods.
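A minimal illustration of the difference (the contract and field names are invented for the example): a producer-side contract rejects a bad event before it enters the pipeline, while a batch-style quality check can only flag rows that already landed in storage.

```python
CONTRACT = {"user": str, "amount": int}  # hypothetical schema agreement


def produce(event, topic):
    # Contract enforced at the producer: bad data never enters the pipeline.
    for field, ftype in CONTRACT.items():
        if not isinstance(event.get(field), ftype):
            raise ValueError(f"contract violation on field {field!r}")
    topic.append(event)


def batch_quality_check(table):
    # Reactive: the bad rows are already in storage; we can only flag them.
    return [row for row in table if not isinstance(row.get("amount"), int)]


topic = []
produce({"user": "ada", "amount": 10}, topic)
try:
    produce({"user": "bob", "amount": "ten"}, topic)  # rejected up front
except ValueError:
    pass

# The same bad record in a batch table is only discovered after loading.
table = [{"user": "ada", "amount": 10}, {"user": "bob", "amount": "ten"}]
bad_rows = batch_quality_check(table)
```

In practice the contract would live in a schema registry rather than a dict, but the asymmetry is the same: prevention at write time versus detection after the fact.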

SLAs

Not too much to say here; it depends on the type of operations you're doing. To retrieve the result of complex computations, pre-computed storage is safer. It also depends on the consumer's ability to pull data on request. For streaming, latency is just processing time plus network time. It's easier to achieve constrained SLAs on continuous infrastructure, even when you rely on complex relations (and loads of history), especially if you maintain a serving layer from which consumers fetch only simple aggregations or metric-level results.
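As a back-of-the-envelope model (all numbers illustrative), the freshness gap follows directly: a streamed result costs processing plus network time, while a batch result can in the worst case wait out a full schedule interval plus the run itself.

```python
def stream_latency(processing_s, network_s):
    # Freshness of a streamed result: work is done on arrival.
    return processing_s + network_s


def batch_worst_case_latency(schedule_interval_s, run_duration_s):
    # Worst case: an event lands just after a run starts and waits a full interval.
    return schedule_interval_s + run_duration_s


stream = stream_latency(processing_s=0.5, network_s=0.1)
batch = batch_worst_case_latency(schedule_interval_s=3600, run_duration_s=300)
print(stream)  # sub-second
print(batch)   # over an hour for a typical hourly job
```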

From a more strategic perspective, streaming data is easier to operationalize: most systems aren't dashboards and can't continuously poll storage to figure out whether a specific action or defect has happened; the information needs to come to them. As such, data modelled via streaming is highly reusable and greatly reduces the operational burden on every consuming step.

What did I miss?

Popsink is running pilots: we deploy a turn-key streaming platform so you can get started without any ops. Reach out if you're looking to experiment! Otherwise, I'm always happy to discuss your data challenges.
