Streaming Is Different than Batch

Dean Wampler
Published in 97 Things
2 min read · Jun 7, 2019

Many organizations run a range of sophisticated analytics using tools like Hadoop, SAS, and data warehouses. These batch processes share a drawback: a potentially long delay between the arrival of new data and the extraction of useful information from it, which can be a competitive disadvantage. So, why not move these analytics to streaming services? Why not process new data immediately and minimize delays?

Several challenges make this transition hard. First, different tools may be required: databases may have suboptimal read and write performance for continuous streams. Instead, use a log-oriented system, like Apache Kafka or Apache Pulsar, as your data backplane to connect streaming services, sources, and sinks.
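The core abstraction behind Kafka and Pulsar is an append-only log that decouples producers from consumers. Here is a toy sketch of that idea, not the Kafka API, with illustrative names throughout:

```python
class Log:
    """Toy append-only log, sketching the abstraction behind Kafka/Pulsar.

    Producers append records; each consumer tracks its own offset, so a
    slow consumer never blocks a fast producer or another consumer.
    """

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, consumer, max_records=10):
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = Log()
for event in ["click", "purchase", "click"]:
    log.append(event)

print(log.poll("analytics", max_records=2))  # ['click', 'purchase']
print(log.poll("analytics"))                 # ['click']
print(log.poll("audit"))                     # independent offset: all three events
```

Because every consumer reads at its own pace from the same durable sequence, the log serves sources, streaming services, and sinks simultaneously, which is what makes it a good backplane.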

Familiar analysis concepts also require rethinking. For example, what does a SQL GROUP BY query mean in a streaming context, where data keeps arriving and never stops? While SQL might seem a poor fit for streaming, it has emerged as a very popular tool for streaming logic once you add windowing, for example over event time. A GROUP BY over windows makes sense because each window is finite. Windowing also enables large batch jobs to be decomposed into the small, fast increments necessary for streaming.
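A tumbling event-time window turns one unbounded GROUP BY into many small, finite ones. A minimal sketch of the idea, with an illustrative window size and record shape (real engines like Flink or Kafka Streams add triggers, late-data handling, and state management):

```python
from collections import defaultdict

def tumbling_count(events, window_secs=60):
    """Count events per (window, key): roughly the streaming analogue of
    SELECT key, COUNT(*) ... GROUP BY key, computed over finite
    event-time windows instead of an unbounded table."""
    counts = defaultdict(int)
    for event_time, key in events:
        # Assign each event to the tumbling window containing its event time.
        window_start = (event_time // window_secs) * window_secs
        counts[(window_start, key)] += 1  # each window's group is finite
    return dict(counts)


# (event_time_seconds, user) pairs, as they would arrive on a stream
events = [(5, "a"), (30, "b"), (61, "a"), (65, "a"), (130, "b")]
print(tumbling_count(events))
# {(0, 'a'): 1, (0, 'b'): 1, (60, 'a'): 2, (120, 'b'): 1}
```

Each window's result can be emitted as soon as the window closes, which is exactly the decomposition into small, fast increments described above.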

Now you’re ready to deploy, but streaming services are harder to keep healthy than batch jobs. For comparison, when you run a batch job to analyze everything that happened yesterday, it might run for a few hours. If it fails, you fix it and rerun it before the next day’s batch job needs to run. During its run, it might encounter hardware or network outages, or data spikes, but the chances are low. In contrast, a streaming service might run for weeks or months, so it is much more likely to encounter significant traffic growth and spikes, and to suffer hardware and network outages. Reliable, scalable operation is both harder and more important, because you now expect this service to always be available: new data never stops arriving.

The solution is to adopt the tools pioneered by the microservice community to keep long-running systems healthy, resilient against failures, and easy to scale dynamically. In fact, now that we’re processing data in small increments, many streaming services should be implemented as microservices using stream-oriented libraries, like Akka Streams and Kafka Streams, rather than always reaching for the “big” streaming tools, like Apache Spark and Apache Flink.

To summarize:

  • Understand the volume per unit time of your data streams.
  • Understand your latency budget.
  • Use Apache Kafka, Apache Pulsar, or similar as your data backplane.
  • Use Apache Spark or Apache Flink when you have large volumes, complex analytics (e.g., SQL), or both.
  • Use microservices with stream-processing libraries, like Akka Streams or Kafka Streams, when you have lower latency requirements and lower volumes.
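The first two bullets amount to simple arithmetic. A back-of-envelope sketch with entirely made-up numbers (yours will differ):

```python
# Illustrative sizing numbers -- replace with measurements from your streams.
events_per_sec = 50_000    # peak stream volume
bytes_per_event = 500      # average serialized record size
latency_budget_ms = 200    # end-to-end target from event arrival to result

# Sustained throughput the data backplane and services must handle.
throughput_mb_per_sec = events_per_sec * bytes_per_event / 1_000_000
print(f"{throughput_mb_per_sec:.1f} MB/s sustained")  # 25.0 MB/s

# A tight latency budget leaves room for only a few network hops and
# in-memory operations per event -- a hint that a lightweight
# stream-processing microservice may fit better than a heavyweight engine.
```

Knowing both numbers up front is what lets you choose sensibly between the heavyweight engines and the lightweight library-based microservices in the last two bullets.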


Streaming data and ML expert at Lightbend. Also on Twitter, @deanwampler