Streaming is a myth — Reactive vs Rigid is the real distinction
TL;DR: Streaming architectures are a special case of reactive architectures (and not actually a myth). The distinction between reactive and rigid distributed architectures matters much more than streaming vs batch, which are two implementation styles. In this article I’m going to talk about reactive batch systems, and when you might use them instead of streaming. Streaming has a lot of awesome use cases and properties; I’m just not going to talk about them in this article.
Why you care about reactive vs rigid, not batch vs streaming
In short: you want resilient (or “antifragile”) systems, rather than fragile systems or compliance with a specific buzzword.
Anyone who has built multiple distributed systems (especially in the data space) knows that:
- batch processing is a really easy way to get started;
- there are lots of tools for batch processing; and
- synchronising jobs by scheduling one to run when another is supposed to have completed is a pain, and will quickly fail.
Streaming architectures are one fashionable response to this — and a good response — but adoption has been limited, in no small part because streaming is seen as more complex or difficult.
In fact, streaming systems are more difficult in the specific sense that there is not a widely diffused body of knowledge and tooling for developing and testing them. You also need monitoring and tooling to ensure that the streaming processes are always running (which could be as simple as a process supervisor like monit or supervisor, or even a cron job).
What is a “reactive” system anyway?
There’s a manifesto laying out the meaning of the term “reactive”. In short, it means a system that is able to handle changes in load gracefully, be at least somewhat failure tolerant, and rely on asynchronous message passing.
What’s batch got to do with it?
Specifically, what has batch processing got to do with reactive systems? There are two aspects to this:
- How can batch systems even be reactive?
- Why would I still want to use batch in this brave new world?
How can batch systems even be reactive?
First, let’s answer how streaming systems are reactive: they buffer data, and spread processing over time. Even without elastic provision of computing resources, as long as system load doesn’t exceed a critical level, all data will be processed eventually, within the time guarantees of the system. Put another way: streaming systems process the data that’s available when it’s available.
Another aspect of streaming systems that makes them simple is that they usually eliminate any form of centralised control plane to ensure sequencing of jobs.
There is absolutely no reason that batch systems can’t also process the data that’s available, when it’s available. On each iteration, a batch job can check what data is available, use some form of explicit or implicit locking to ensure that another invocation doesn’t run on that data, and process away. With this style of processing, there is also no need for a centralised control system to ensure that jobs are running in sequence: their clocks can drift, and jobs can run “too long” for their allocated slots without the system as a whole falling behind.
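To make that concrete, here is a minimal sketch of one such invocation in Python. It assumes the data arrives as `*.data` files in a local directory, and the `.lock` and `.done` marker files are conventions invented for this illustration, not part of any particular tool. Note that a hard crash would leave a stale lock behind, which a real job would need to expire somehow.

```python
import os
from pathlib import Path
from typing import Callable, List


def try_claim(item: Path) -> bool:
    """Atomically create '<item>.lock'; return False if another run holds it."""
    lock_path = item.with_name(item.name + ".lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # invocation can win the claim (on a local filesystem).
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def run_once(inbox: Path, process: Callable[[Path], None]) -> List[str]:
    """One batch invocation: process whatever is available and unclaimed."""
    handled = []
    for item in sorted(inbox.glob("*.data")):
        if not try_claim(item):
            continue  # another invocation is working on this item
        try:
            done_marker = item.with_name(item.name + ".done")
            # Check the marker only after claiming, so two runs cannot both
            # decide the same item is still pending.
            if not done_marker.exists():
                process(item)
                done_marker.touch()
                handled.append(item.name)
        finally:
            # Release the claim; if process() raised, a later run retries.
            item.with_name(item.name + ".lock").unlink()
    return handled
```

Because each run only looks at what is present and unclaimed, it doesn’t matter whether cron fires it late, early, or twice at once.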
Tell me about some real reactive batch systems
The data processing pipeline we built out at Handy was reactive by virtue of the following properties:
- Each invocation of the stream-to-Hive transfer job used the timestamps of the objects (“files”) in S3 to decide which of them it was supposed to process, and would not process later objects.
- Each invocation checked whether prior invocations had completed, and would pick up the work of any that had failed. If that meant it ran longer than its slot, the next job would simply fail, and then catch up on a later invocation. (This was by design, because we didn’t have output locking or guaranteed non-interference when the jobs were writing to Hive directories.)
In fact, there’s no reason to build this as a batch system now: streaming solutions have come a long way in the last 4 years.
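The two properties above can be sketched roughly as follows. This is not the code we ran at Handy: the function names, the fixed-length slots, and the in-memory list of `(key, last_modified)` pairs are all assumptions made for illustration, standing in for an S3 listing and a record of completed runs.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set, Tuple


def slot_start(ts: datetime, slot: timedelta) -> datetime:
    """Round a (naive, UTC) timestamp down to the start of its slot."""
    epoch = datetime(1970, 1, 1)
    return epoch + ((ts - epoch) // slot) * slot


def plan_work(
    objects: Iterable[Tuple[str, datetime]],
    completed: Set[datetime],
    now: datetime,
    slot: timedelta,
) -> Dict[datetime, List[str]]:
    """Group unprocessed object keys by slot, oldest slot first."""
    cutoff = slot_start(now, slot)  # the current slot belongs to a later run
    plan: Dict[datetime, List[str]] = {}
    for key, modified in objects:
        if modified >= cutoff:
            continue  # a "later object": leave it for the next invocation
        start = slot_start(modified, slot)
        if start in completed:
            continue  # a prior invocation finished this slot
        # Includes slots whose earlier invocation failed or never ran.
        plan.setdefault(start, []).append(key)
    return dict(sorted(plan.items()))
```

An invocation would work through the plan in order and record each slot as completed only after its output is fully written, so a failure part-way through just leaves that slot for the next run to pick up.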
Tell me about another reactive batch system
This is a job to transfer data from a real streaming solution (Kafka Connect writing to time-partitioned HDFS directories) to an environment where daily data sets are assumed to be complete, not partial (i.e. it’s a time-synchronized batching environment).
This job checks that Kafka Connect has not output any new data for two of its flush periods (rotate.interval.ms), and if that condition is met, copies the data to its target location. A filesystem lock is used to ensure that only one instance is operating on one dataset.
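A rough sketch of that check, using the local filesystem as a stand-in for HDFS. The function names, the `.lock` and `.tmp` suffixes, and the use of file modification times to detect “no new data” are assumptions for illustration; the real job may well detect quiescence differently.

```python
import os
import shutil
import time
from pathlib import Path
from typing import Optional


def is_quiescent(directory: Path, rotate_interval_ms: int,
                 now: Optional[float] = None) -> bool:
    """True if no file has been written for two rotate intervals."""
    mtimes = [p.stat().st_mtime for p in directory.rglob("*") if p.is_file()]
    if not mtimes:
        return False  # nothing written yet, so nothing to publish
    now = time.time() if now is None else now
    return now - max(mtimes) >= 2 * rotate_interval_ms / 1000.0


def publish_if_complete(source: Path, target: Path, rotate_interval_ms: int,
                        now: Optional[float] = None) -> bool:
    """Copy source to target once it looks complete; return True if copied."""
    if target.exists() or not is_quiescent(source, rotate_interval_ms, now):
        return False
    lock_path = source.with_name(source.name + ".lock")
    try:
        # Exclusive create: only one instance may work on this dataset.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    try:
        if target.exists():
            return False  # another instance published while we waited
        staging = target.with_name(target.name + ".tmp")
        shutil.rmtree(staging, ignore_errors=True)  # clear a failed attempt
        shutil.copytree(source, staging)
        staging.rename(target)  # readers never see a half-copied dataset
        return True
    finally:
        lock_path.unlink()
```

Run from cron every few minutes, this publishes each partition exactly once, whenever it happens to go quiet, rather than at a fixed time when it is merely assumed to be complete.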
Reactive batch systems: what are they good for?
My second example is a great use of a reactive batch system: to bridge the gap between reactive land (including specifically streaming land) and systems that assume a batch-and-clock based view of the world.
Another good (or goodish) reason to use a reactive batch system is when you can’t get the operational support to install the infrastructure needed for a streaming or non-batch reactive solution.
Related to that, when you want to quickly patch in some reactivity before developing a streaming integration, this can be a good solution. If you develop your code in such a way that it does not assume a stream- or batch-oriented view of the world (except at the edge of the system), you may even be able to reuse the core logic in a streaming context.
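As a small illustration of keeping the stream-or-batch assumption at the edge, here is a sketch in which the core logic is a plain function over one record. The record shape and the names `transform`, `run_batch` and `run_stream` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def transform(record: Record) -> Record:
    """Core logic: knows nothing about files, topics, or schedules."""
    amount_cents = int(record["amount_cents"])
    return {**record, "amount_dollars": amount_cents / 100}


def run_batch(records: Iterable[Record]) -> List[Record]:
    """Batch edge: transform everything that is available right now."""
    return [transform(r) for r in records]


def run_stream(source: Iterator[Record],
               emit: Callable[[Record], None]) -> None:
    """Streaming edge: transform each record as it arrives."""
    for record in source:
        emit(transform(record))
```

Only the two thin `run_*` wrappers would need to change when moving from a cron-driven job to a consumer loop.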
- Check out Kafka Connect from Confluent or Landoop
- Follow my blog for more data engineering and management content
- Read about Airflow, one way of implementing batch reactive (and non-reactive) systems
- Email me at email@example.com about how I can help with your project or company.