Big Data Battle: Batch Processing vs Stream Processing
Are you trying to understand Big Data and Data Analytics, but confused by batch data processing and stream data processing? If so, this blog is for you!
Today, developers are analyzing terabytes and petabytes of data in the Hadoop ecosystem, and many projects are helping to speed up this innovation. All of these projects rely on one of two processing models:
- Batch Processing
- Stream Processing
What is Batch Processing?
Batch processing is where blocks of data that have already been stored over a period of time are processed together. For example, consider processing all the transactions performed by a major financial firm in a week. This data can contain millions of records per day, stored as a file or a set of records. That file undergoes processing at the end of the day for the various analyses the firm wants to run. Naturally, processing a file of that size takes a large amount of time. That is what batch processing is :)
Hadoop MapReduce is the best-known framework for processing data in batches. The following figure gives you a detailed explanation of how Hadoop processes data using MapReduce.
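To make the MapReduce idea concrete, here is a minimal, single-machine sketch of the map and reduce phases in plain Python. The transaction records and account names are hypothetical; a real Hadoop job would distribute these phases across a cluster, but the shape of the computation is the same: map emits key/value pairs, then reduce aggregates the values grouped by key.

```python
from collections import defaultdict

# Hypothetical daily transaction records: (account_id, amount)
transactions = [
    ("acct-1", 120.0), ("acct-2", 75.5),
    ("acct-1", 30.0), ("acct-3", 210.0),
]

def map_phase(records):
    # Map: emit one (key, value) pair per input record.
    for account, amount in records:
        yield account, amount

def reduce_phase(pairs):
    # Shuffle: group values by key; Reduce: aggregate each group.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

totals = reduce_phase(map_phase(transactions))
print(totals)  # {'acct-1': 150.0, 'acct-2': 75.5, 'acct-3': 210.0}
```

Because the reduce step only runs once every record has been mapped, this style of job naturally operates on a complete, stored dataset — exactly the batch scenario described above.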
Batch processing works well in situations where you don’t need real-time analytics results, and when it is more important to process large volumes of information than it is to get fast analytics results.
What is Stream Processing?
Stream processing is the golden key if you want analytics results in real time. By building data streams, you can feed data into analytics tools as soon as it is generated and get near-instant results using platforms like Spark Streaming. Apache Storm is another stream processing framework.
Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.
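As a rough illustration of the fraud-detection idea, here is a toy sketch of stream processing in Python. The detector, its threshold, and the transaction amounts are all made up for illustration: each incoming record is compared against a rolling window of recent amounts and flagged the moment it looks anomalous, rather than waiting for an end-of-day batch job.

```python
from collections import deque

class SlidingWindowDetector:
    """Toy detector: flags a transaction when it is far above the
    rolling average of recent transactions (a stand-in for a real
    fraud model)."""

    def __init__(self, window_size=5, threshold=3.0):
        self.window = deque(maxlen=window_size)  # most recent amounts only
        self.threshold = threshold

    def process(self, amount):
        # Decide as soon as the record arrives, using only the window.
        verdict = "OK"
        if len(self.window) == self.window.maxlen:
            avg = sum(self.window) / len(self.window)
            if amount > avg * self.threshold:
                verdict = "FLAG"
        self.window.append(amount)
        return verdict

detector = SlidingWindowDetector()
stream = [20, 25, 22, 19, 24, 500, 21]
results = [detector.process(x) for x in stream]
print(results)  # ['OK', 'OK', 'OK', 'OK', 'OK', 'FLAG', 'OK']
```

The key property is that a verdict is produced per record, while it is still in memory — which is what lets a stream-processing system stop a fraudulent transaction before it completes.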
The following figure gives you a detailed explanation of how Spark processes data in real time.
The reason stream processing is so fast is that it analyzes data in memory, before it ever hits disk.
For your additional information, WSO2 has introduced the WSO2 Fraud Detection Solution. It is built on the WSO2 Data Analytics Platform, which comprises both batch analytics and real-time analytics (stream processing).
Difference Between Batch Processing and Stream Processing
Now that you have a basic understanding of what batch processing and stream processing are, let's dive into the batch vs. stream debate.
Batch processing operates over all or most of the data in a dataset, while stream processing operates over a rolling window of data or only the most recent record. So batch processing handles a large batch of data at once, while stream processing handles individual records or micro-batches of a few records.
In terms of performance, the latency of batch processing is typically minutes to hours, while the latency of stream processing is typically seconds or milliseconds.
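The scope difference above can be sketched in a few lines: the same aggregation can be computed once over a whole stored dataset (batch) or incrementally as each record arrives (stream). The running-total example is illustrative only, but it shows why stream results are available per record while batch results only exist after the whole dataset has been processed.

```python
# Batch style: one answer, available only after all records are stored.
def batch_total(records):
    return sum(records)

# Stream style: a running answer, updated as each record arrives.
def stream_totals(records):
    total = 0.0
    for record in records:
        total += record
        yield total  # a result is ready after every record

data = [10.0, 20.0, 5.0]
print(batch_total(data))          # 35.0 — one result for the whole batch
print(list(stream_totals(data)))  # [10.0, 30.0, 35.0] — one result per record
```

Both styles converge on the same final value; the difference is when intermediate answers become available, which is exactly the latency gap between the two models.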
At the end of the day, a solid developer will want to understand both workflows. It all comes down to the use case and how either workflow helps meet the business objective.