BATCH PROCESSING - STREAM PROCESSING

Murat Sivri
İstanbul Data Science Academy
Sep 7, 2022

The topic we are going to talk about, data processing, is one of the important topics in the data field and covers batch and stream processing. First, we will look at what data processing is, and then we will go into the details of batch and stream processing.

What is Data Processing?

Data Processing is a technique for manipulating information: it transforms raw, unstructured data into content that is both meaningful and machine-readable, typically using automated methods. Raw data is the source material that is processed to produce useful results.

For businesses to develop better business strategies and gain a competitive advantage, Data Processing is critical. Employees throughout the organization can understand and use the data if it is converted into a readable format such as graphs, charts, and documents.

Based on the data source and the steps taken by the processing unit to generate an output, there are various types of Data Processing.

In this article, we will cover batch and stream processing topics.

Let’s start with stream processing.

What Is Stream Processing?

Stream processing is a big data technology that focuses on the real-time processing of continuous streams of data in motion.

A stream processing framework simplifies parallel hardware and software by restricting the parallel computation that can be performed: pipelined kernel functions are applied to each element in the data stream, and on-chip memory reuse minimizes the loss in bandwidth. Stream processing tools and technologies come in a variety of forms: distributed publish-subscribe messaging systems such as Kafka, distributed real-time computation systems such as Storm, and streaming dataflow engines such as Flink.

Stream processing often entails multiple tasks on the incoming series of data (the “data stream”), which can be performed serially, in parallel, or both. This workflow is referred to as a stream processing pipeline, which includes the generation of the stream data, the processing of the data, and the delivery of the data to a final location.

Actions that stream processing takes on data include aggregations (e.g., calculations such as sum, mean, standard deviation), analytics (e.g., predicting a future event based on patterns in the data), transformations (e.g., changing a number into a date format), enrichment (e.g., combining the data point with other data sources to create more context and meaning), and ingestion (e.g., inserting the data into a database).
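
As a concrete illustration, here is a minimal sketch of such a pipeline in plain Python. The event fields and the enrichment lookup table are invented for the example; a real pipeline would pull from a live source and write to a real sink.

    CUSTOMER_REGION = {"c1": "EU", "c2": "US"}   # hypothetical enrichment source

    def source():
        """Stage 1: generate the data stream (stand-in for a live feed)."""
        for event in [{"customer": "c1", "amount_cents": 1250},
                      {"customer": "c2", "amount_cents": 400}]:
            yield event

    def transform(events):
        """Transformation: change cents into a decimal amount."""
        for e in events:
            cents = e.pop("amount_cents")
            yield {**e, "amount": cents / 100}

    def enrich(events):
        """Enrichment: combine each event with another data source for context."""
        for e in events:
            yield {**e, "region": CUSTOMER_REGION.get(e["customer"], "unknown")}

    def ingest(events):
        """Stage 3: deliver each result to its final location."""
        for e in events:
            print("writing to database:", e)     # stand-in for a real sink

    ingest(enrich(transform(source())))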

How Does Stream Processing Work?

In order to incorporate event stream processing capabilities into an application, programmers either code the process from scratch or use an event stream processor. With first generation data stream processing engines, such as Apache Spark and Apache Storm, users are required to write code that follows this pattern: events are placed in a message broker topic, the application consumes events from that topic as a data stream, and the results are published back to the broker.
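
A minimal sketch of that pattern, assuming the kafka-python client and a local broker; the topic names and event fields are invented:

    import json
    from kafka import KafkaConsumer, KafkaProducer

    # Consume events from a broker topic as a data stream (all names invented).
    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for message in consumer:                 # one iteration per incoming event
        event = message.value
        result = {"order_id": event["order_id"],
                  "total": event["qty"] * event["unit_price"]}
        producer.send("results", result)     # publish results back to the broker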

A streaming data processing architecture will automatically collect the data, deliver it to each actor, ensure they run in the correct order, collect the results, scale for higher volumes, and handle failures. The user is able to write the logic for each actor, wire the actors up, and hook up the edges to the data source(s).

You can either send events directly to the stream processing system or send them via a broker. The streaming part of the app can then be written using “Streaming SQL,” which provides operators such as windows, patterns, and joins directly in the language, enabling users to query data without needing to write code. Finally, the stream processor is configured to act on the results, either by publishing events to a broker topic that the application listens to, or by invoking a service when the stream processor triggers.
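
As an illustration of the Streaming SQL style, here is a minimal sketch using PySpark Structured Streaming. The built-in rate source stands in for a real event stream, and the ten-second window is an arbitrary choice:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-sql-sketch").getOrCreate()

    # The built-in 'rate' source emits (timestamp, value) rows continuously,
    # so the example runs without any external event feed.
    events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    events.createOrReplaceTempView("events")

    # A windowed aggregation expressed directly in SQL; no custom operator code.
    windowed = spark.sql("""
        SELECT window(timestamp, '10 seconds') AS time_window,
               AVG(value) AS avg_value,
               COUNT(*)   AS event_count
        FROM events
        GROUP BY window(timestamp, '10 seconds')
    """)

    query = (windowed.writeStream
             .outputMode("update")   # emit updated window results as they change
             .format("console")
             .start())
    query.awaitTermination()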

Why Use Stream Processing?

Big Data demonstrated the usefulness of insights obtained from Data Processing, but not all of these insights are made equal. Some insights are most valuable just after they occur, and their value fades rapidly with time. Stream Processing makes such scenarios possible by providing insights faster, frequently within milliseconds to seconds of the trigger.

What Is Batch Processing?

Batch processing is when a computer processes a number of tasks that it has collected in a group. It is designed to be a completely automated process, without human intervention, and is also called workload automation (WLA) or job scheduling.

Batch processing is a method of running high-volume, repetitive data jobs when computing resources are available, with little or no user interaction. Users collect and store data, then process it during an event known as a “batch window.” This improves efficiency by setting processing priorities and completing data jobs at a time that makes the most sense.
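
A minimal sketch of such a job in Python; the file names, record fields, and schedule are all hypothetical:

    import csv
    from collections import defaultdict
    from datetime import date

    def run_nightly_batch(input_path="transactions_today.csv"):
        """Settle the day's accumulated transactions in one automated run."""
        totals = defaultdict(float)           # running total per customer
        with open(input_path, newline="") as f:
            for row in csv.DictReader(f):     # the whole batch is known and finite
                totals[row["customer_id"]] += float(row["amount"])
        # One summary is written at the end; nothing is emitted mid-run.
        with open(f"summary_{date.today()}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["customer_id", "total"])
            writer.writerows(sorted(totals.items()))

    if __name__ == "__main__":
        run_nightly_batch()   # typically fired by a scheduler in the batch window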

How Does Batch Processing Work?

The punched card systems introduced in the late 19th century brought a revolution in the way businesses worked. Over time, batch processing evolved to become what it is today: a fully automated process in which even data entry professionals are not required.

Today, batch processing operations take place without any sort of user interaction and meet several needs of businesses in different industries.

Another distinguishing feature of modern batch processing is exception-based alerts. These notifications let the supervising professionals know if there is an issue in the process. Since these alerts indicate issues, managers don’t have to constantly keep an eye on the batches.

The batch processing system determines these exceptions based on monitors and dependencies (a minimal monitor sketch follows the list):

  • Monitors: These identify abnormalities in a data batch. For instance, if a certain job is taking too long to finish, it would cause a delay in the pipeline since the subsequent job is unable to begin. The monitor identifies this delay and alerts the manager that there’s an exception in the system.
  • Dependencies: These are trigger events that start a batch process. For example, when a customer places an order on your website, the batch process is triggered to forward their request. The dependency is responsible for putting the batch process in motion.
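
Here is a minimal monitor sketch in Python; the job names, expected durations, and alert channel are all invented for illustration:

    import time

    # Expected run times per job (hypothetical values, in seconds).
    EXPECTED_SECONDS = {"load_orders": 300, "build_reports": 900}

    def alert(message):
        """Stand-in for a real email/pager integration."""
        print(f"[EXCEPTION ALERT] {message}")

    def run_with_monitor(job_name, job_fn):
        """Run one batch job and raise an exception alert if it overruns."""
        start = time.monotonic()
        job_fn()
        elapsed = time.monotonic() - start
        if elapsed > EXPECTED_SECONDS.get(job_name, float("inf")):
            alert(f"{job_name} took {elapsed:.0f}s, longer than expected; "
                  f"downstream jobs will be delayed")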

Why Use Batch Processing?

Batch processing began in the earliest days of computing, when batches of punch cards carrying computer programming instructions would be processed in one run. The batch would run until it was completed or an error occurred, whereupon it would stop and manual intervention would be required.

This method was used when computer resources were limited and lacked today’s enormous processing power. Running these batches at the end of the day meant valuable computer resources weren’t tied up and allowed the machine to process bulk data at top speed.

Batch processing has changed quite a bit over the years. Now, batch data isn’t just an “end-of-the-day” or overnight process. It doesn’t need an internet connection to process, and it can run asynchronously. Basically, these batches can run in the background at any time that’s suitable without interrupting vital processes.

But even so, with today’s massive computing power and cloud computing, there are still very good reasons that batch processing is used.

Batch Processing vs. Stream Processing

Definition

Batch Processing refers to the processing of large amounts of data in a single batch over a set period.

Credit card transactions, bill generation, input and output processing in the operating system, and so on are all examples of Batch Processing.

Stream Processing is the processing of a continuous stream of data as it is generated.

Data streaming, radar systems, customer service systems, and bank ATMs are examples of Stream Processing. These systems require immediate processing to function properly.

Purpose

Batch Processing is frequently used when dealing with large amounts of data and/or when data sources are legacy systems that cannot deliver data in streams.

Mainframe data is a good example of data that is processed in batches by default. Accessing and integrating mainframe data into modern analytics environments takes time, making streaming unfeasible in most cases. Batch Processing is useful when you don’t need real-time analytics and it’s more important to process large volumes of data than to get quick analytics results (though data streams can include “big” data as well; Batch Processing isn’t a requirement for working with large amounts of data).

If you want real-time analytics, you’ll need to use Stream Processing. Using platforms such as Spark Streaming, you can feed data into analytics tools as soon as it is generated by creating data streams.

Tasks like fraud detection benefit from Stream Processing. You can detect anomalies that indicate fraud in real-time and stop fraudulent transactions before they are completed if you stream-process transaction data.
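
A minimal sketch of that idea, with invented fields and an arbitrary three-standard-deviation threshold: keep running statistics per card and screen each transaction against that card's history before it completes.

    import math

    class RunningStats:
        """Per-card statistics maintained incrementally (Welford's algorithm)."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        def stdev(self):
            return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    stats = {}

    def screen(transaction):
        """Called once per incoming event, before the transaction completes."""
        s = stats.setdefault(transaction["card_id"], RunningStats())
        flagged = (s.n > 10 and s.stdev() > 0 and
                   abs(transaction["amount"] - s.mean) > 3 * s.stdev())
        s.update(transaction["amount"])
        return "BLOCK" if flagged else "ALLOW"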

Use Cases

Batch Processing use cases include:

  • Payroll: You can run multiple payroll runs at the same time using payroll batch processing. As a result, you can process payroll for multiple groups of employees on different pay cycles at the same time.
  • Billing: The practice of processing multiple authorized transactions at once is known as batch payment processing. A merchant may perform one batch processing per day, in which all authorization codes from its customers’ credit cards are sent to each of their banks for approval.
  • Orders from Customers: Batch processing eliminates the need to manually process each order, saving you time, allowing faster shipping, and improving customer satisfaction.

Stream Processing use cases include:

  • Fraud Detection: Streaming transaction data can detect anomalies that signal fraud in real-time, allowing you to stop fraudulent transactions before they happen. Fraudulent transactions can also be detected and stopped in the middle of the transaction by inspecting, correlating, and analyzing the data, which can happen in a variety of industries.
  • Log Monitoring: Log monitoring relies on real-time distributed processing of log streams. Stream processing is a technique for querying and processing continuous data streams, and it can perform stream and transaction analyses, extract data from existing streams, and create new streams for new use cases.
  • Analyzing Customer Behavior: Stream processing on streaming data is very useful in the online advertising industry. It is used in social networks to track user behavior, clicks, and interests, and then serve ads to each user based on this information. It promotes advertisements that may be of interest to the users. As a result, stream processing aids advertising campaigns by real-time processing of user clicks and interests and displaying sponsored content.
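
As a toy illustration of the last use case, here is a click-stream tracker (user IDs, categories, and events are invented) that updates each user's interest profile as click events arrive, the kind of state an ad server would consult when choosing what to display:

    from collections import Counter, defaultdict

    interest_counts = defaultdict(Counter)   # user -> category -> click count

    def on_click(event):
        """Update a user's interest profile as each click event arrives."""
        interest_counts[event["user"]][event["category"]] += 1

    def top_interest(user):
        """The category an ad server would target for this user right now."""
        counts = interest_counts[user]
        return counts.most_common(1)[0][0] if counts else None

    for e in [{"user": "u1", "category": "shoes"},
              {"user": "u1", "category": "shoes"},
              {"user": "u1", "category": "books"}]:
        on_click(e)
    print(top_interest("u1"))   # -> "shoes"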

Hardware

Batch Processing can be executed with standard computer specifications. However, processing large batches requires the majority of a system’s storage and processing resources while the batch runs.

Stream Processing necessitates a sophisticated computer architecture and high-end hardware. It requires less storage, however, since only the current or most recent set of data packets needs to be held at any one time, which keeps the computational requirements of each step low.

Performance

The time it takes for your data to appear in your database or data warehouse after an event occurs is known as Data Latency. Latency means delay, and it determines the performance of Data Processing.

In Batch Processing, Latency can range from minutes to hours to days.

In Stream Processing, Latency must be in seconds or milliseconds.

Data Set

Batch Processing is the simultaneous processing of a large amount of data. Data size is known and finite in Batch Processing.

Stream Processing is a real-time analysis method for streaming data. In Stream Processing, the data size is unknown in advance and potentially unbounded.
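
A small sketch of the difference in plain Python; the sensor stream is invented, and the 100-element window is arbitrary:

    import itertools
    import random
    import time

    # Batch: the dataset is known and finite, so it can be processed all at once.
    readings = [random.random() for _ in range(1_000)]
    print("batch mean:", sum(readings) / len(readings))

    # Stream: the dataset is unbounded; the generator never finishes, so results
    # must be produced incrementally over windows of the data.
    def sensor_stream():
        while True:
            yield random.random()     # stand-in for a live sensor reading
            time.sleep(0.01)

    total, count = 0.0, 0
    for value in itertools.islice(sensor_stream(), 100):   # one window
        total += value
        count += 1
    print("windowed stream mean:", total / count)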

Analysis

Batch Processing is used to perform complex computations and analyses over a longer period.

Stream Processing is used for simpler reporting and computations that must be delivered immediately.

Technology Choices

For Batch Processing, there are a variety of technologies to choose from:

  • Azure Synapse Analytics is a Big Data analytics service that connects enterprise data warehousing and analytics.
  • Azure Data Lake Analytics is an on-demand analytics job service used to make big data easier to understand.
  • HDInsight is a cloud-based open-source analytics service that includes Hadoop, Apache Spark, Apache Kafka, and other open-source frameworks.
  • Azure Databricks allows us to use open-source libraries and includes the most recent version of Apache Spark.
  • Azure Distributed Data Engineering Toolkit is used to provision Spark on-demand on Docker clusters in Azure.

For Stream Processing, there are a variety of technologies to choose from:

  • Azure Stream Analytics is a real-time analytics and event-processing engine that can analyze and process large amounts of fast-moving data from a variety of sources.
  • HDInsight with Storm: Apache Storm is a distributed, fault-tolerant, and open-source computation system that works with Apache Hadoop to process data streams in real-time.
  • Azure Databricks and Apache Spark
  • Apache Kafka Streams API
  • HDInsight with Spark Streaming: On HDInsight Spark clusters, Apache Spark Streaming provides data stream processing.

Response and Programming Platforms

In Batch Processing, the response is given only after the whole job is finished. Some examples of distributed programming platforms for Batch Processing are MapReduce, Spark, and GraphX.

In Stream Processing, the response is given immediately. Some examples of distributed programming platforms for Stream Processing are Spark Streaming and S4 (Simple Scalable Streaming System).
