Are you trying to understand Big Data and data analytics, but confused by batch data processing and stream data processing? If so, this blog is for you!
Today, developers analyze terabytes and petabytes of data in the Hadoop ecosystem. Many projects aim to speed up this work, and all of them rely on two approaches:
- Batch Processing
- Stream Processing
What is Batch Processing?
Batch processing means processing blocks of data that have already been stored over a period of time. For example, processing all the transactions that a major financial firm has performed in a week. This data contains millions of records per day, which can be stored as a file, as records, and so on. The file is processed at the end of the day for the various analyses the firm wants to run. Naturally, a file that large takes a long time to process. That is what batch processing is :)
Hadoop MapReduce is one of the most widely used frameworks for processing data in batches. The following figure explains in detail how Hadoop processes data using MapReduce.
Batch processing works well in situations where you don’t need real-time analytics results, and when it is more important to process large volumes of data to get more detailed insights than it is to get fast analytics results.
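To make the batch idea concrete, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps applied to a stored batch of transactions. It is not Hadoop's API; the `TRANSACTIONS` data, the account IDs, and every function name are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical stored batch: (account_id, amount) records collected over a day.
TRANSACTIONS = [
    ("acct-1", 120.0),
    ("acct-2", 35.5),
    ("acct-1", 80.0),
    ("acct-3", 10.0),
    ("acct-2", 64.5),
]


def map_phase(records):
    """Emit one (key, value) pair per input record."""
    for account_id, amount in records:
        yield account_id, amount


def shuffle_phase(pairs):
    """Group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def run_batch(records):
    """Run the whole batch in one go, after all the data has been stored."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    print(run_batch(TRANSACTIONS))
```

Notice that nothing is reported until the whole batch has been read. In a real cluster the same three steps would be spread across many machines, but the shape of the job stays the same.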
What is Stream Processing?
Stream processing is a golden key if you want analytics results in real time. It lets you process data as it arrives and detect conditions within a short time of receiving the data. You can feed data into analytics tools as soon as it is generated and get analytics results almost instantly. There are multiple open source stream processing platforms, such as Apache Kafka, Apache Flink, Apache Storm, Apache Samza, etc.
I would recommend WSO2 Stream Processor (WSO2 SP), the open source stream processing platform which I have helped build. WSO2 SP can ingest data from Kafka, HTTP requests, and message brokers. You can query data streams using a “Streaming SQL” language. With just two commodity servers it can provide high availability and handle 100K+ TPS throughput. It can scale up to millions of TPS on top of Kafka. Furthermore, the Business Rules Manager of WSO2 SP allows you to define templates and generate business rules from them for different scenarios with common requirements.
Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.
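As a rough illustration of that fraud detection idea, here is a small Python sketch that checks each transaction the moment it arrives. The rule it uses (flag an amount more than five times the account's running average) is only an assumption chosen to keep the example short, not how WSO2 or any real product detects fraud. The `FraudDetector` class, its parameters, and the sample data are all hypothetical.

```python
from collections import defaultdict


class FraudDetector:
    """Keeps a running average per account and flags unusually large amounts."""

    def __init__(self, factor=5.0, min_history=3):
        self.factor = factor            # assumed threshold multiplier
        self.min_history = min_history  # transactions needed before judging
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def process(self, account_id, amount):
        """Return True if this transaction looks anomalous and should be blocked."""
        count = self._count[account_id]
        suspicious = False
        if count >= self.min_history:
            average = self._total[account_id] / count
            suspicious = amount > self.factor * average
        if not suspicious:
            # Only accepted transactions update the running average.
            self._count[account_id] += 1
            self._total[account_id] += amount
        return suspicious


def incoming_transactions():
    """Stand-in for a live feed; a real system would read from a broker."""
    yield from [
        ("acct-1", 20.0),
        ("acct-1", 25.0),
        ("acct-1", 15.0),
        ("acct-1", 500.0),
        ("acct-1", 22.0),
    ]


if __name__ == "__main__":
    detector = FraudDetector()
    for account_id, amount in incoming_transactions():
        verdict = "BLOCK" if detector.process(account_id, amount) else "ok"
        print(account_id, amount, verdict)
```

The decision is made per record, before the next one is even read, which is what lets a stream processor stop a transaction instead of reporting it the next day.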
The following figure explains in detail how Spark processes data in real time.
Stream processing is so fast because it analyzes the data before it hits disk.
For your additional information, WSO2 has introduced the WSO2 Fraud Detection Solution. It is built on the WSO2 Data Analytics Platform, which comprises both batch analytics and real-time analytics (stream processing).
Difference Between Batch Processing and Stream Processing
Now you have a basic understanding of what batch processing and stream processing are. Let’s dive into the debate around batch vs. stream.
Batch processing operates over all or most of the data, whereas stream processing operates over a rolling window of data or only the most recent record. So batch processing handles a large batch of data, while stream processing handles individual records or micro-batches of a few records.
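The difference in scope can be sketched with a simple average. In the Python below, the batch version looks at every value at once, while the streaming version only ever sees the last few values in a rolling window. The function names, the window size of 3, and the sample readings are assumptions for illustration.

```python
from collections import deque


def batch_average(values):
    """Batch style: one answer, computed over the complete data set."""
    values = list(values)
    return sum(values) / len(values)


def rolling_averages(stream, window_size=3):
    """Stream style: a fresh answer per record, over the most recent records only."""
    window = deque(maxlen=window_size)  # old values fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


if __name__ == "__main__":
    readings = [10, 20, 30, 40, 50]
    print(batch_average(readings))           # 30.0
    print(list(rolling_averages(readings)))  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

The batch function needs all the readings before it can answer, while the rolling version produces a result after every single reading and never holds more than `window_size` values in memory.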
In terms of performance, the latency of batch processing ranges from minutes to hours, while the latency of stream processing is measured in seconds or milliseconds.
At the end of the day, a solid developer will want to understand both workflows. It’s all going to come down to the use case and how each workflow will help meet the business objective.