When Not to Do Batch? An Introduction to Stream Processing vs. Batch Processing
Building A Realtime App
A real-time app ingests data as it arrives in data streams and responds quickly, within seconds or even milliseconds. For example, a real-time app might listen to weather data and generate alerts, watch your stock portfolio and generate sell alerts, or let you know when your usual route has traffic.
How can we implement a real-time app? Can database technology or MapReduce (e.g., Hadoop or Spark) support it?
The answer is yes, at least for some use cases. We can direct the data to a database and query it. However, we need more: we need the system itself to trigger the alert. Relational databases have a concept called triggers, which fire an action when certain conditions are met. This can handle our use case.
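As a minimal sketch of the trigger idea, the following uses Python's built-in sqlite3 module with an in-memory database. The table names (readings, alerts), the trigger name, and the 100.0 threshold are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; table names and the threshold are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (sensor_id TEXT, temperature REAL);
CREATE TABLE alerts (sensor_id TEXT, temperature REAL);

-- The database itself records an alert whenever a hot reading is inserted.
CREATE TRIGGER high_temperature
AFTER INSERT ON readings
WHEN NEW.temperature > 100.0
BEGIN
    INSERT INTO alerts VALUES (NEW.sensor_id, NEW.temperature);
END;
""")

conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("s1", 72.5), ("s1", 104.2), ("s2", 99.9)],
)

# Only the reading above the threshold produced an alert row.
alerts = conn.execute("SELECT sensor_id, temperature FROM alerts").fetchall()
print(alerts)
```

The application never polls for the condition here; the insert itself causes the alert row to appear.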
However, if the data is large, running even simple queries gets complicated. At best, you need to index the data. At worst, you have to partition it. Soon you will have to build star schemas and data cubes. That is complex and expensive work. For such a data warehouse to work, you need to decide the queries a priori, and it is very hard to add a new query later.
Map-Reduce
MapReduce and Hadoop were built to solve part of this problem. Hadoop lets you write snippets of code, compose them using the MapReduce model, and run them on hundreds of computers. Compared to the world of data warehousing, this was much simpler. (Then came Spark, which is the same ideas done better.) MapReduce sidestepped the complications of indexes and scale by providing a framework that lets us easily walk through the data and answer the questions.
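To make the model concrete, here is a single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code; the "sensor_id,temperature" input format and the function names are assumptions for illustration. A real framework would run the same steps across many machines.

```python
from collections import defaultdict
from functools import reduce

def map_reading(line):
    """Map step: turn one raw line into a (key, value) pair."""
    sensor_id, temperature = line.split(",")
    return sensor_id, float(temperature)

def shuffle(pairs):
    """Shuffle step: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_max(values):
    """Reduce step: collapse the values for one key into a single result."""
    return reduce(max, values)

# Hypothetical input: one "sensor_id,temperature" record per line.
lines = ["s1,20.5", "s2,19.0", "s1,23.1", "s2,18.2"]

grouped = shuffle(map(map_reading, lines))
max_by_sensor = {key: reduce_max(values) for key, values in grouped.items()}
print(max_by_sensor)
```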
Yet writing code by hand was tedious. Hive (and later Spark SQL) was built to solve that problem. Now we can write the logic as SQL-like queries. To implement simple use cases, we no longer have to reimplement complicated things like SQL joins. At the same time, if we have to do something complicated, like calculate an arcane math function, we can write a UDF (User Defined Function) and call it as a SQL extension. It gives us the best of both worlds.
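The UDF idea can be sketched without a cluster. The example below uses Python's built-in sqlite3 module as a stand-in for Hive or Spark SQL, which have their own UDF registration mechanisms. The function to_fahrenheit and the readings table are made up for illustration.

```python
import sqlite3

def to_fahrenheit(celsius):
    """Custom logic written in a general-purpose language."""
    return celsius * 9.0 / 5.0 + 32.0

conn = sqlite3.connect(":memory:")

# Register the Python function so that SQL queries can call it by name.
conn.create_function("to_fahrenheit", 1, to_fahrenheit)

conn.execute("CREATE TABLE readings (sensor_id TEXT, celsius REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("s1", 0.0), ("s1", 100.0)],
)

# The query stays declarative; the custom function is called like a built-in.
rows = conn.execute(
    "SELECT sensor_id, to_fahrenheit(celsius) FROM readings ORDER BY celsius"
).fetchall()
print(rows)
```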
Real-time App with Map-Reduce
Let’s try to implement a real-time app using Hadoop. To understand the scenario, consider a temperature sensor. As long as the sensor continues to work, we will keep getting new readings, so the data will never stop.
We cannot wait for the data to finish, as that will never happen. Maybe we should run the analysis periodically instead (e.g., every hour). We can run Spark every hour and process the last hour of data.
What if, every hour, we need an analysis of the last 24 hours? Should we reprocess the last 24 hours of data every hour? Maybe we can calculate the hourly results, store them, and use them to calculate the 24-hour results. It will work, but we will have to write code to do it.
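The bookkeeping described above might look like the following sketch. It keeps one small summary per hour and combines the latest 24 summaries instead of rereading raw data. The names HourlySummary, summarize_hour, and combine are hypothetical, the readings are synthetic, and each hour is assumed to contain at least one reading.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class HourlySummary:
    count: int
    total: float
    maximum: float

def summarize_hour(readings):
    """Run once per hourly batch; assumes the batch is non-empty."""
    return HourlySummary(len(readings), sum(readings), max(readings))

def combine(summaries):
    """Derive 24-hour results from stored hourly summaries only."""
    count = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    return {"mean": total / count, "max": max(s.maximum for s in summaries)}

# Keeps at most the latest 24 summaries; older ones are dropped automatically.
last_24_hours = deque(maxlen=24)

for hour in range(30):
    readings = [20.0 + hour, 22.0 + hour]  # synthetic readings for this hour
    last_24_hours.append(summarize_hour(readings))
    daily = combine(last_24_hours)

print(daily)
```

Count, sum, and maximum combine cleanly across hours. Results such as a median do not, which is one reason this approach becomes harder as requirements grow.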
Our problems have just begun. Let us list a few requirements that complicate the problem.
- What if the temperature sensor is placed inside a nuclear plant and our code raises alarms? Raising an alarm after an hour has elapsed may not be the best way to handle it. Can we get alerts within 1 second?
- What if you want the readings calculated at the hour boundary, but it takes a few seconds for data to arrive at the storage? Now you cannot start the job at the boundary; you need to watch the disk and trigger the job once the data for the hour has arrived.
- Well, you can run Hadoop fast. But will the job finish within 1 second? Can we write the data to disk, read it, process it, produce the results, and recombine them with the other 23 hours of data in one second? Now things start to get tight.
You start to feel this friction because you are not using the right tool for the job. You are using a flat screwdriver on an Allen screw.
Stream Processing
The right tool for this kind of problem is called “Stream Processing”. Here “stream” refers to the data stream: a sequence of data that continues to arrive. Stream processing can watch the data as it comes in, process it, and respond within milliseconds.
The following are reasons to move beyond batch processing (Hadoop/Spark), our comfort zone, and consider stream processing.
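The core loop can be sketched with plain Python generators. This is an illustration of the programming model rather than a real stream processor; sensor_stream, alerts, and the threshold are assumptions, and a finite list stands in for a source that would normally never end.

```python
def sensor_stream(values):
    """Stand-in for an unbounded source such as a socket or message queue."""
    for value in values:
        yield {"temperature": value}

def alerts(stream, threshold):
    """React to each event as it arrives, without storing the stream."""
    for event in stream:
        if event["temperature"] > threshold:
            yield f"ALERT: temperature {event['temperature']} exceeds {threshold}"

raised = list(alerts(sensor_stream([70.1, 71.3, 105.8, 69.9]), threshold=100.0))
print(raised)
```

Each event is examined once, at arrival time, so the delay between a reading and its alert does not depend on any batch schedule.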
- Some data naturally comes as a never-ending stream of events. To do batch processing, you need to store it, cut it off at some point, and process the data. Then you have to do the next batch and worry about aggregating across multiple batches. In contrast, streaming handles never-ending data streams gracefully and naturally. You can have conditions, look at multiple levels of focus (we will discuss this when we get to windows), and easily look at data from multiple streams simultaneously.
- With streaming, you can respond to events faster. You can produce a result within milliseconds of receiving an event (update). With batch this often takes minutes.
- Stream processing naturally fits time series data and detecting patterns over time. For example, if you are trying to detect the length of a web session in a never-ending stream (an example of detecting a sequence), it is very hard to do with batches, as some sessions will fall across two batches. Stream processing can handle this easily. If you take a step back and consider it, most continuous data series are time series data. For example, almost all IoT data is time series data. Hence, it makes sense to use a programming model that fits naturally.
- Batch lets the data build up and tries to process it all at once, while stream processing handles data as it comes in and hence spreads the processing over time. As a result, stream processing can work with a lot less hardware than batch processing.
- Sometimes the data is so large that it is not even possible to store it. Stream processing lets you handle large fire-hose-style data and retain only the useful bits.
- Finally, there is a lot of streaming data available (e.g., customer transactions, activities, website visits), and it will grow faster with IoT use cases (all kinds of sensors). Streaming is a much more natural model for thinking about and programming those use cases.
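The web session example from the list above can be sketched as follows. The function keeps a little state per user and reports a session's length once a gap of inactivity is seen, so no session is ever split by a batch boundary. The name session_lengths, the 30-minute gap, and the click data are assumptions. The sketch closes a session only when that user's next event arrives, whereas a real engine would also close idle sessions using timers.

```python
def session_lengths(events, gap_seconds=1800):
    """Yield (user_id, session_length) each time a session closes.

    `events` is an iterable of (user_id, timestamp) pairs ordered by timestamp.
    """
    open_sessions = {}  # user_id -> (session_start, last_seen)
    for user_id, ts in events:
        if user_id in open_sessions:
            start, last_seen = open_sessions[user_id]
            if ts - last_seen > gap_seconds:
                # Inactivity gap exceeded: close the old session, open a new one.
                yield user_id, last_seen - start
                open_sessions[user_id] = (ts, ts)
            else:
                open_sessions[user_id] = (start, ts)
        else:
            open_sessions[user_id] = (ts, ts)

# Hypothetical click stream: (user, seconds since some starting point).
clicks = [("alice", 0), ("bob", 100), ("alice", 600), ("alice", 5000), ("bob", 5100)]
closed = list(session_lengths(clicks))
print(closed)
```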
Stream Processing is not a Panacea
However, stream processing is not a tool for all use cases. One good rule of thumb is that if the processing needs multiple passes through the full data or needs random access (think of a graph data set), then it is tricky with streaming. One big use case missing from streaming is training machine learning models. On the other hand, if the processing can be done with a single pass over the data or has temporal locality (processing tends to access recent data), then it is a good fit for streaming.
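As one example of a computation that fits the single-pass rule, the sketch below maintains a running mean and variance using Welford's online algorithm. Each reading is seen once and then discarded. The class name RunningStats and the sample readings are made up for illustration.

```python
class RunningStats:
    """Single-pass mean and population variance (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # constant memory, no need to keep past readings

print(stats.mean, stats.variance)
```

By contrast, an algorithm that must revisit all earlier data on every step would need the whole stream stored, which is where batch processing remains the better fit.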
I hope this was useful. You can find more details about stream processing in “Gentle Introduction to Stream Processing”. We write real-time apps using Streaming SQL, and you can find out more in “Stream Processing 101: From SQL to Streaming SQL in 10 Minutes”.