From Batches to Streams: Different Ways for Ingesting Data (Part 1)

Mohamed Awnallah
9 min read · Apr 21, 2023


The World of Data Processing

TL;DR

This article covers the main data processing approaches: batch and stream processing. It delves into the different flavors of stream processing (hard real-time, soft real-time, and near real-time), as well as the micro-batching, windowing, and checkpointing stream processing models. It also explores hybrid processing and the key trends shaping data processing approaches.

Table of Contents (TOC)

  • I. Introduction
  • II. Batch Processing
  • III. Stream Processing
  • IV. Hybrid Processing
  • V. The Key Trends of Data Processing Approaches
  • VI. Conclusion

Introduction

In today’s digital age, data is everywhere, growing at an unprecedented pace. Organizations across industries are struggling to keep up with the increasing volume, variety, and velocity of data (3 V’s). Processing large amounts of data efficiently and effectively has become a critical challenge for businesses of all sizes. Two popular approaches to data processing are batch and stream processing. Batch processing refers to processing data in large, discrete chunks, while stream processing involves processing data in a continuous flow.

Whether you are a data analyst, data engineer, or simply interested in the world of data processing, this article will provide valuable insights into the world of batch and stream processing and help you make informed decisions about which approach is best suited for your business needs.

Batch Processing

Batch processing involves processing a large volume of data, collected over a period of time, in a single run. A batch job executes at regular intervals or on demand against a bounded dataset. This type of processing is typically used for historical analysis, such as generating reports, performing analytics, and running machine learning algorithms. Batch processing frameworks such as Apache Hadoop MapReduce and Apache Spark provide distributed computing capabilities to process large datasets in parallel. These frameworks are ideal for data that is not time-sensitive and can wait for processing to complete; in other words, batch processing prioritizes throughput over latency.

[Image: Batch Processing]
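
To make this concrete, here is a minimal sketch of a batch job in plain Python (not a real framework): the entire bounded dataset is available up front, and a single run produces the report. The `daily_sales` figures are invented for illustration.

```python
# A bounded, historical dataset: daily sales figures collected over a week.
daily_sales = [120, 95, 143, 88, 167, 110, 131]

def run_batch_job(records):
    """Process the whole bounded dataset in one run and emit a summary report."""
    return {
        "total": sum(records),
        "average": round(sum(records) / len(records), 2),
        "peak": max(records),
    }

report = run_batch_job(daily_sales)
print(report)  # {'total': 854, 'average': 122.0, 'peak': 167}
```

In a real framework such as Spark, the same logic would be distributed across many workers, but the shape of the job is the same: read everything, compute, write the result once.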

Stream Processing

Stream processing, on the other hand, involves processing data in real time as it is generated. A streaming job runs continuously against an unbounded dataset. This type of processing is used for time-sensitive data that requires immediate analysis, such as financial transactions, sensor data, and social media feeds. Stream processing frameworks such as Apache Kafka and Apache Storm provide real-time capabilities by breaking the data down into smaller chunks and processing them as they arrive. Stream processing is ideal for data that must be analyzed immediately; as we will see in the later sections, these systems are classified as hard, soft, or near real-time. Overall, stream processing prioritizes latency over throughput.

[Image: Stream Processing]
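
By contrast, a stream processor consumes events one at a time from a source that never ends. A minimal plain-Python sketch, where a generator stands in for a real unbounded source (such as a Kafka topic) and the temperature threshold is invented:

```python
import itertools

def sensor_stream():
    """Simulate an unbounded stream of sensor readings (yields forever)."""
    for i in itertools.count():
        yield {"reading_id": i, "temperature": 20 + (i % 5)}

def process_stream(stream, max_events=None):
    """React to each event as it arrives, flagging anomalies immediately."""
    alerts = []
    for event in itertools.islice(stream, max_events):
        if event["temperature"] >= 24:
            alerts.append(event["reading_id"])
    return alerts

# A real streaming job never terminates; we cap it here for demonstration only.
print(process_stream(sensor_stream(), max_events=10))  # [4, 9]
```

The key difference from the batch sketch: there is no "whole dataset" to aggregate over, so the job must act on each event the moment it arrives.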

Hard Real-time Processing Systems

Hard real-time processing systems are designed to process data with a guaranteed response time. These systems are used in applications where even a small delay can have catastrophic consequences, such as in industrial control systems, aircraft control systems, medical equipment, anti-lock brakes, and pacemakers. In hard real-time processing systems, the response time is guaranteed and typically measured in microseconds or even nanoseconds. To achieve such low latency, hard real-time processing systems must have a predictable execution environment and the ability to prioritize tasks based on their urgency. These systems typically run on specialized hardware and are designed to be fault-tolerant, with redundant components and failover mechanisms to ensure no downtime.

[Image: Pacemaker]

Soft Real-time Processing Systems

Soft real-time processing systems are designed to process data with a response time that is not guaranteed but is still typically within a certain range. These systems are used in applications where some delay is acceptable, but the response time must still be relatively fast. Examples include online shopping systems, online gaming systems, traffic control systems, and online stock quotes. In soft real-time processing systems, the response time is typically measured in milliseconds or seconds. To achieve low latency, soft real-time processing systems must have a predictable execution environment and the ability to prioritize tasks. These systems typically run on commodity hardware and are designed to be highly available, with redundant components and failover mechanisms to ensure maximum uptime, although the degree of fault tolerance and availability may vary depending on the specific requirements of the system.

[Image: Online Stock Quotes]

Near Real-Time Processing Systems

Near real-time processing systems fall somewhere between hard and soft real-time processing systems. These systems are designed to process data with a response time that is close to real-time but not quite real-time. Examples of near real-time processing systems include Skype video, home automation, fraud detection systems, social media monitoring systems, and network monitoring systems. In near real-time processing systems, the response time is typically measured in seconds or minutes. These systems typically run on commodity hardware and are designed to be highly available with redundant components and failover mechanisms to ensure maximum uptime. However, unlike hard real-time processing systems, near real-time processing systems may not have a guaranteed response time, and may not be suitable for applications where even a slight delay can have catastrophic consequences.

[Image: Home Automation]

The Overlap between Soft Real-Time and Near Real-Time

The line between soft and near real-time becomes blurry, and it is very subjective, depending on the end consumer and the specific use case. For example, if someone posts a tweet on Twitter, the response time required to display the tweet may depend on the user’s expectations and preferences.

Some users may prefer to see the tweet immediately after it is posted and consider a few seconds delay as unacceptable, while others may be more tolerant of a delay of a few minutes. Therefore, the response time required to display the tweet can vary depending on the user’s preferences, and what may be considered near-real-time for one user may be considered soft real-time for another.

Similarly, in other use cases such as fraud detection or online stock quotes, the timing requirements may also vary depending on the specific use case and the expectations of the users. Therefore, the line between soft and near real-time can be subjective and depends on the end consumer’s expectations and preferences.

Overall, it is important to consider the specific requirements of each use case and the expectations of the end consumer when defining the response time requirements for soft and near-real-time systems. This can help ensure that the system’s timing requirements are appropriately defined to meet the needs of the end consumer and provide a satisfactory user experience.

[Image: The Overlap between Soft and Near Real-Time]

Micro Batching Stream Processing Model

Micro-batching is a stream processing model that processes data in small, finite batches. Incoming data is grouped into small batches, typically spanning a few seconds, and each batch is processed as an ordinary batch job. This approach allows for near real-time processing while avoiding the complexity and overhead of true record-at-a-time stream processing. This is why Apache Spark, whose Structured Streaming engine runs in micro-batch mode by default, is considered a near real-time stream processing engine rather than a true stream processing engine.

[Image: Micro-batching Stream Processing Model]
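
The grouping step can be sketched in a few lines of plain Python. Here batches are bounded by count rather than by a time interval, purely to keep the illustration deterministic; real engines typically cut batches on a timer:

```python
def micro_batches(stream, batch_size):
    """Group an incoming event stream into small, finite batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

events = range(1, 11)  # stand-in for an incoming event stream
for batch in micro_batches(events, batch_size=4):
    # Each small batch is then processed as an ordinary batch job.
    print(sum(batch))  # 10, then 26, then 19
```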

Windowing and Checkpointing Stream Processing Models

Windowing and checkpointing are important features of stream processing systems that enable efficient and fault-tolerant processing of continuous data streams. Apache Flink is an example of a system that supports both as part of its streaming model.

Windowing is a technique used in streaming processing to divide the continuous data stream into discrete, bounded subsets called windows. By dividing the stream into windows, we can perform computations on subsets of the data stream instead of the entire stream. This approach allows us to efficiently process large amounts of data in real-time while still retaining flexibility in how we analyze the data.
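
As a rough illustration (plain Python, not Flink's actual windowing API), fixed-size tumbling windows can be assigned by truncating each event's timestamp to the start of its window. The events and the 10-second window size are invented:

```python
from collections import defaultdict

def tumbling_windows(events, window_size_s):
    """Assign each timestamped event to a fixed-size (tumbling) window."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size_s) * window_size_s
        windows[window_start].append(value)
    return dict(windows)

# (timestamp_seconds, value) pairs drawn from a continuous stream
events = [(0, 5), (3, 7), (11, 2), (14, 8), (21, 1)]
per_window_sums = {w: sum(vs) for w, vs in tumbling_windows(events, 10).items()}
print(per_window_sums)  # {0: 12, 10: 10, 20: 1}
```

Each aggregate is computed over a bounded window of the stream rather than over the unbounded whole, which is what makes the computation feasible in real time.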

Checkpointing is another important feature of stream processing systems, used to ensure fault tolerance in the event of failures. It involves taking periodic snapshots of the system’s state and saving them to a durable storage system. After a failure, the system recovers from the last checkpoint and resumes processing from where it left off. This approach ensures that data processing continues even in the event of hardware or software failures.
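
A toy sketch of the idea in plain Python. A real system like Flink snapshots distributed operator state in coordination with the stream's offsets, not a local JSON file; the stream, state shape, and checkpoint interval here are all invented:

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "stream_checkpoint.json")

def save_checkpoint(state):
    """Snapshot the operator's state to durable storage."""
    with open(CHECKPOINT, "w") as f:
        json.dump(state, f)

def load_checkpoint():
    """Recover the last snapshot, or start fresh if none exists."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"offset": 0, "running_total": 0}

def process(stream, checkpoint_every=3):
    """Resume from the last checkpointed offset and keep a running total."""
    state = load_checkpoint()
    for i, value in enumerate(stream[state["offset"]:], start=state["offset"]):
        state["running_total"] += value
        state["offset"] = i + 1
        if state["offset"] % checkpoint_every == 0:
            save_checkpoint(state)  # periodic snapshot
    return state

# Fresh run: no prior checkpoint, so the whole stream is processed.
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)
print(process([1, 2, 3, 4, 5]))  # {'offset': 5, 'running_total': 15}
```

If the process crashed after the snapshot at offset 3, rerunning `process` would reload that state and replay only the remaining events, arriving at the same final total.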

[Image: Windowing in Apache Flink]

Hybrid Processing

Hybrid processing combines real-time and batch processing techniques to handle different types of workloads. In hybrid processing, real-time processing handles data that requires immediate attention, such as data streams from sensors, while batch processing handles large-scale tasks that can be scheduled and processed in batches. For example, when processing financial data, it may be necessary to perform batch processing on historical data to identify trends and perform predictive analytics, while also performing real-time stream processing to detect fraud and other anomalies. This approach allows for a balance between the need for real-time data processing and the efficiency of batch processing. Hybrid processing frameworks such as Apache Flink, Apache Spark, and Apache Beam provide both batch and stream processing capabilities in a single platform.

[Image: Hybrid Processing]
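
A minimal sketch of the fraud-detection example above, in plain Python: a batch layer fits a simple threshold from historical data, and a streaming check applies that threshold to each incoming transaction. The figures and the three-standard-deviations rule are invented for illustration:

```python
from statistics import mean, stdev

# Batch layer: learn "normal" behaviour from a bounded historical dataset.
historical_amounts = [20.0, 35.5, 18.2, 42.0, 27.3, 31.1, 25.0, 38.4]

def batch_fit(amounts):
    """Batch job over historical data: derive a simple anomaly threshold."""
    return mean(amounts) + 3 * stdev(amounts)

# Speed layer: check each incoming transaction in real time.
def stream_check(transactions, threshold):
    """Flag any transaction that exceeds the batch-derived threshold."""
    return [t for t in transactions if t > threshold]

threshold = batch_fit(historical_amounts)       # recomputed periodically, offline
live_transactions = [22.5, 310.0, 29.9]         # incoming stream (simulated)
print(stream_check(live_transactions, threshold))  # flags the 310.0 transaction
```

The division of labour is the point: the expensive model fit runs as a scheduled batch job, while the cheap per-event check runs with low latency on the live stream.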

The Key Trends of Data Processing Approaches

The big data processing landscape has evolved rapidly over the past decade, with new frameworks and technologies emerging to address the challenges of processing and analyzing large volumes of data. Looking to the future, several trends and developments are likely to shape the big data processing landscape in the coming years.

One key trend is the growing importance of real-time data processing. As more data is generated in real time from sources such as IoT devices, social media, and streaming services, big data processing frameworks will need to evolve to handle this data in a timely and efficient manner. Frameworks such as Apache Flink and Apache Beam are already well-positioned to address this trend, with support for real-time data processing and streaming analytics.

Conclusion

In conclusion, both batch and stream processing frameworks are essential for big data processing. Batch processing is ideal for historical analysis, while stream processing is ideal for real-time data analysis. Hybrid processing frameworks provide the best of both worlds by enabling both batch and stream processing in a single platform. It is essential to choose the right processing framework based on the specific needs of your data processing requirements.

[Image: Batch Processing vs. Stream Processing vs. Hybrid Processing]

We value your feedback and would love to hear your thoughts on this article. What did you find most helpful or insightful? What could we have done better? Let us know in the comments below.

We look forward to sharing the next parts with you as this is the 1st part of the three-part series and hearing your thoughts along the way. Thank you for reading and for your time!

Read Also

From Batches to Streams: How to Navigate the Data Product Life Cycle: A Comprehensive Guide (Part 2)

Credits

- Written by Mohamed Awnallah

- Reviewed by Riku Driscoll, Zacharias Voulgaris, Stanley Ndagi


Mohamed Awnallah

Data Engineer with a strong understanding of the Data Product Life Cycle and fully passionate about contributing to Open Source.