Big Data Processing Frameworks

I started working on Big Data in 2008, before the term “Big Data” even existed. It was barebones HDFS and MapReduce. In those days, data was ingested into HDFS, replicated manually, and then MapReduce algorithms were used to collate the right data.

Hadoop started gaining popularity in 2010, and by 2011 a wave of new open source products had simplified life for developers. In 2011 and 2012, we saw many products being open-sourced, with Hortonworks and Cloudera creating wrappers around them. 2013 saw Pivotal and MapR making rapid strides. Cloudera came up with the Enterprise Data Hub and Impala, while Hortonworks introduced YARN and then Tez in 2014.

I have been talking about three categories of processing frameworks for big data. These frameworks are used for computation and for analyzing data:

· Batch

· Streaming

· Batch + Streaming or hybrid frameworks.

We started with batch processing, moved to micro-batching, and then to separate batch and streaming frameworks. Now we have technologies supporting hybrid frameworks in Big Data.

Batch processing: Tasks that require very large volumes of data are often best handled by batch operations. Whether the datasets are processed directly from permanent storage or loaded into memory, batch systems are built with large quantities in mind and have the resources to handle them. Because batch processing excels at handling large volumes of persistent data, it frequently is used with historical data.

The trade-off for handling large quantities of data is longer computation time. Because of this, batch processing is not appropriate in situations where processing time is especially significant.

How does Hadoop do batch processing?

MapReduce is Hadoop’s native batch processing engine.

· Reading the dataset from the HDFS filesystem

· Dividing the dataset into chunks and distributing them among the available nodes

· Applying the computation on each node to its subset of the data (the intermediate results are written back to HDFS)

· Redistributing the intermediate results to group them by key

· “Reducing” the values for each key by summarizing and combining the results calculated by the individual nodes

· Writing the calculated final results back to HDFS
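
The phases above can be sketched as a single-process word count in plain Python; `map_fn`, `shuffle`, and `reduce_fn` are hypothetical stand-ins for Hadoop's distributed map, shuffle, and reduce steps, not a real Hadoop API:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce phases; in real Hadoop each
# phase is distributed across nodes and intermediate results pass
# through HDFS rather than local memory.

def map_fn(line):
    """Map: emit (key, value) pairs for one chunk of the dataset."""
    return [(word, 1) for word in line.split()]

def shuffle(mapped):
    """Redistribute the intermediate results, grouping them by key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: summarize and combine the values collected for one key."""
    return key, sum(values)

lines = ["big data batch", "batch processing"]          # the "dataset"
mapped = [pair for line in lines for pair in map_fn(line)]
results = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
# results == {"big": 1, "data": 1, "batch": 2, "processing": 1}
```

The round trips to HDFS between these phases are a big part of why the full distributed process is slow.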


· This process is fairly slow.

· MapReduce uses batch techniques and can process huge quantities of data.

· Writing MapReduce programs is complex, and the model has a steep learning curve.

· Compatibility and integration with other frameworks and engines mean that Hadoop can often serve as the foundation for multiple processing workloads using diverse technology.

Stream processing:

There is a wealth of interesting work happening in the stream processing area — ranging from open source frameworks like Apache Spark, Apache Storm, Apache Flink, and Apache Samza, to proprietary services such as Google's Dataflow and AWS Lambda.

Let us understand stream processing basics:

Stream processors define operations that will be applied to each individual data item as it passes through the system.

· The total dataset is only defined as the amount of data that has entered the system so far.

· Processing is event-based and does not “end” until explicitly stopped. Results are immediately available and will be continually updated as new data arrives.
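
These two properties can be illustrated with a tiny sketch; the names here are illustrative, not a real streaming API:

```python
# Minimal sketch of item-at-a-time stream processing: each event is
# handled the moment it arrives, and the result is updated and
# available immediately.

running_totals = []   # the result observable after each event
total = 0

for event in [3, 1, 4]:           # in a real stream this never "ends"
    total += event                # operation applied to each item
    running_totals.append(total)  # result immediately available

# running_totals == [3, 4, 8]; the "total dataset" is only what
# has entered the system so far
```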

Stream processing is a good fit for data where you must respond to changes or spikes and where you’re interested in trends over time.

A batch processing framework like MapReduce needs to solve a bunch of hard problems:

  • It has to multiplex a large number of transient jobs over a pool of machines, and efficiently schedule resource distribution in the cluster.
  • To do this, it has to dynamically package up your code and physically deploy it to the machines that will execute it, along with configuration, libraries, and whatever else is needed to run it.
  • It has to manage processes and try to guarantee isolation between “jobs” sharing the cluster.

Kafka is an important component in stream processing. As we go ahead, we need to understand Kafka, trident and how Samza uses Kafka capabilities for streaming.

So what does Kafka Streams do?

Kafka Streams is much more focused on the problems it solves. It does the following:

  • Balance the processing load as new instances of your app are added or existing ones crash
  • Maintain local state for tables
  • Recover from failures

Apache Samza processing model:

Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka messaging system. While Kafka can be used by many stream processing systems, Samza is designed specifically to take advantage of Kafka’s unique architecture and guarantees. It uses Kafka to provide fault tolerance, buffering, and state storage.

Samza relies on Kafka’s semantics to define the way that streams are handled. Kafka uses the following concepts when dealing with data:

Topics: Each stream of data entering a Kafka system is called a topic. A topic is basically a stream of related information that consumers can subscribe to.

Partitions: In order to distribute a topic among nodes, Kafka divides the incoming messages into partitions. The partitioning is based on a key, such that messages with the same key are guaranteed to be sent to the same partition. Partitions have guaranteed ordering.

Brokers: The individual nodes that make up a Kafka cluster are called brokers.

Producer: Any component writing to a Kafka topic is called a producer. The producer provides the key that is used to partition a topic.

Consumers: A consumer is any component that reads from a Kafka topic. Consumers are responsible for maintaining information about their own offset, so that they know which records have already been processed if a failure occurs.
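
A toy sketch of two of these guarantees — keyed partitioning and consumer-managed offsets. The hash function and names are assumptions for illustration only, not the real Kafka client API:

```python
# Illustrative sketch: messages with the same key land in the same
# partition, and a consumer tracks its own offset for recovery.

NUM_PARTITIONS = 4

def partition_for(key):
    # Kafka hashes the producer-supplied key; same key -> same partition.
    return sum(key.encode()) % NUM_PARTITIONS   # stand-in for the real hash

partitions = [[] for _ in range(NUM_PARTITIONS)]

def produce(key, value):
    partitions[partition_for(key)].append((key, value))

for key, value in [("user-1", "login"), ("user-2", "click"), ("user-1", "logout")]:
    produce(key, value)

# All "user-1" events share one partition, in production order.
p = partition_for("user-1")
user1_events = [v for k, v in partitions[p] if k == "user-1"]

# A consumer remembers the last offset it processed, so it can
# resume from that point after a failure.
offset = 0
for record in partitions[p]:
    offset += 1   # "committed" after each record is processed
```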

· Samza affords some unique guarantees and features not common in other stream processing systems.

· For example, Kafka already offers replicated storage of data that can be accessed with low latency.

· It also provides a very easy and inexpensive multi-subscriber model to each individual data partition. All output, including intermediate results, is also written to Kafka and can be independently consumed by downstream stages.

· Samza’s strong relationship to Kafka allows the processing steps themselves to be very loosely tied together. An arbitrary number of subscribers can be added to the output of any step without prior coordination. This can be very useful for organizations where multiple teams might need to access similar data. Teams can all subscribe to the topic of data entering the system, or can easily subscribe to topics created by other teams that have undergone some processing. This can be done without adding additional stress on load-sensitive infrastructure like databases.

· Samza can handle load spikes better. When a lot of data is written at once, the influx can arrive faster than downstream components are able to process in real time, leading to processing stalls and potentially data loss. Kafka is designed to hold data for very long periods of time, which means that components can process at their own pace and can be restarted without consequence.
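
The multi-subscriber model described above can be sketched as independent consumers reading the same durable log, each at its own offset, without coordinating. The names are illustrative, not the Samza or Kafka API:

```python
topic_log = ["event-0", "event-1", "event-2"]   # durable, replayable log

class Subscriber:
    def __init__(self, name):
        self.name = name
        self.offset = 0            # each team tracks its own position

    def poll(self, n):
        """Read up to n records from the team's current position."""
        records = topic_log[self.offset:self.offset + n]
        self.offset += len(records)
        return records

analytics = Subscriber("analytics-team")
billing = Subscriber("billing-team")

fast = analytics.poll(3)   # one team consumes everything at once
slow = billing.poll(1)     # another proceeds at its own pace
# fast == ["event-0", "event-1", "event-2"]; slow == ["event-0"]
```

Because each subscriber only holds an offset into a shared log, adding a new team puts no extra load on the producers or on other consumers.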

Apache Storm stream processing model:

Storm, a distributed computation framework for event stream processing, began life as a project of BackType, a marketing intelligence company bought by Twitter in 2011. Twitter soon open-sourced the project and put it on GitHub, but Storm ultimately moved to the Apache Incubator and became an Apache top-level project in September 2014.

Storm is designed for massive scalability, supports fault-tolerance with a “fail fast, auto restart” approach to processes, and offers a strong guarantee that every tuple will be processed. Storm defaults to an “at least once” guarantee for messages, but offers the ability to implement “exactly once” processing as well.

Storm is written primarily in Clojure and is designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters and the Storm scheduler distributes work to nodes around the cluster, based on the topology configuration.

Storm stream processing works by orchestrating these DAGs in a framework it calls topologies. The topologies are composed of:

Streams: Conventional data streams. This is unbounded data that is continuously arriving at the system.

Spouts: Sources of data streams at the edge of the topology. These can be APIs, queues, etc. that produce data to be operated on.

Bolts: Bolts represent a processing step that consumes streams, applies an operation to them, and outputs the result as a stream.
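
A toy word-count topology can illustrate how spouts and bolts chain together. The class names are purely illustrative; real Storm topologies are defined through its Java/Clojure API:

```python
class Spout:
    """Source of a data stream at the edge of the topology."""
    def __init__(self, items):
        self.items = items
    def emit(self):
        yield from self.items

class SplitBolt:
    """Consumes a stream of sentences, emits a stream of words."""
    def process(self, stream):
        for sentence in stream:
            yield from sentence.split()

class CountBolt:
    """Consumes words, emits running (word, count) tuples."""
    def __init__(self):
        self.counts = {}
    def process(self, stream):
        for word in stream:
            self.counts[word] = self.counts.get(word, 0) + 1
            yield word, self.counts[word]

# Wire spout -> split -> count, mimicking a tiny DAG.
spout = Spout(["storm streams", "storm bolts"])
count = CountBolt()
output = list(count.process(SplitBolt().process(spout.emit())))
# count.counts == {"storm": 2, "streams": 1, "bolts": 1}
```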

One of the strengths of the Storm ecosystem is a rich array of available spouts specialized for receiving data from all types of sources. While you may have to write custom spouts for highly specialized applications, there’s a good chance you can find an existing spout for an incredibly large variety of sources — from the Twitter streaming API to Apache Kafka to JMS brokers to everything in between.

In order to achieve exactly-once, stateful processing, an abstraction called Trident is also available. To be explicit, Storm without Trident is often referred to as Core Storm. Trident significantly alters the processing dynamics of Storm, increasing latency, adding state to the processing, and implementing a micro-batching model instead of an item-by-item pure streaming system.

Trident topologies are composed of:

Stream batches: These are micro-batches of stream data that are chunked in order to provide batch processing semantics.

Operations: These are batch procedures that can be performed on the data.
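
The micro-batching idea behind stream batches can be sketched as follows; `micro_batches` is a hypothetical helper for illustration, not part of the Trident API:

```python
# Chunk a stream into small batches, each tagged with a batch id so an
# operation can be applied with batch semantics (and replayed as a
# unit on failure).

def micro_batches(stream, batch_size):
    batch, batch_id = [], 0
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch_id, batch
            batch, batch_id = [], batch_id + 1
    if batch:                      # final partial batch
        yield batch_id, batch

stream = iter(range(7))
batches = list(micro_batches(stream, 3))
# batches == [(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6])]
```

Waiting to fill each batch is also where Trident's added latency comes from, compared with Core Storm's item-by-item processing.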

For pure stream processing workloads with very strict latency requirements, Storm is probably the best mature option. It can guarantee message processing and can be used with a large number of programming languages. Because Storm does not do batch processing, you will have to use additional software if you require those capabilities. If you have a strong need for exactly-once processing guarantees, Trident can provide that. However, other stream processing frameworks might also be a better fit at that point.

Hybrid processing systems: batches/micro-batches and stream processors:

Some processing frameworks can handle both batch and stream workloads. While projects focused on one processing type may be a close fit for specific use-cases, the hybrid frameworks attempt to offer a general solution for data processing. They not only provide methods for processing over data, they have their own integrations, libraries, and tooling for doing things like graph analysis, machine learning, and interactive querying.

Spark: Built using many of the same principles of Hadoop’s MapReduce engine, Spark focuses primarily on speeding up batch processing workloads by offering full in-memory computation and processing optimization.

Spark Batch processing:

Unlike MapReduce, Spark processes all data in-memory, only interacting with the storage layer to initially load the data into memory and at the end to persist the final results. All intermediate results are managed in memory.

While in-memory processing contributes substantially to speed, Spark is also faster on disk-related tasks because of the holistic optimization that can be achieved by analyzing the complete set of tasks ahead of time. It achieves this by creating Directed Acyclic Graphs, or DAGs, which represent all of the operations that must be performed, the data to be operated on, and the relationships between them, giving the processor a greater ability to intelligently coordinate work.

To implement in-memory batch computation, Spark uses a model called Resilient Distributed Datasets, or RDDs, to work with data. These are immutable in-memory structures that represent collections of data. Operations on RDDs produce new RDDs. Each RDD can trace its lineage back through its parent RDDs and ultimately to the data on disk. Essentially, RDDs are a way for Spark to maintain fault tolerance without needing to write back to disk after each operation.
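
The lineage idea can be sketched with a toy class; `ToyRDD` is purely illustrative and does not reflect the actual Spark API:

```python
# Immutable in-memory collections whose transformations record a
# lineage, so a lost result can be recomputed from its parents
# instead of being restored from disk.

class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data        # only the root holds raw data
        self.parent = parent
        self.fn = fn             # transformation that produced this RDD

    def map(self, fn):
        """Return a NEW RDD; the original is never mutated."""
        return ToyRDD(parent=self, fn=lambda xs: [fn(x) for x in xs])

    def compute(self):
        """Replay the lineage from the root data."""
        if self.parent is None:
            return list(self._data)
        return self.fn(self.parent.compute())

root = ToyRDD(data=[1, 2, 3])
doubled = root.map(lambda x: x * 2).map(lambda x: x + 1)
# doubled.compute() == [3, 5, 7]; "losing" the derived result costs
# nothing, since it can always be recomputed from the lineage.
```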

Spark Stream Processing:

Stream processing capabilities are supplied by Spark Streaming. Spark itself is designed with batch-oriented workloads in mind. To deal with the disparity between the engine design and the characteristics of streaming workloads, Spark implements a concept called micro-batches. This strategy treats a stream of data as a series of very small batches that can be handled using the native semantics of the batch engine.

Spark Streaming works by buffering the stream in sub-second increments. These are sent as small fixed datasets for batch processing. In practice, this works fairly well, but it does lead to a different performance profile than true stream processing frameworks.
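
A rough sketch of that buffering strategy, with illustrative timestamps and function names (not the Spark Streaming API):

```python
# Events are grouped by sub-second arrival interval, and each
# interval's buffer is handed to the batch engine as a small fixed
# dataset.

def bufferize(events, interval):
    """events: (arrival_time, payload) pairs, in arrival order."""
    batches, current, window_end = [], [], interval
    for t, payload in events:
        while t >= window_end:          # flush every completed interval
            batches.append(current)
            current, window_end = [], window_end + interval
        current.append(payload)
    batches.append(current)             # final partial buffer
    return batches

events = [(0.1, "a"), (0.3, "b"), (0.6, "c"), (1.1, "d")]
batches = bufferize(events, 0.5)
# batches == [["a", "b"], ["c"], ["d"]]
```

The wait to flush each buffer is precisely the latency cost discussed below in the advantages and limitations.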

Advantages and Limitations

· The obvious reason to use Spark over Hadoop MapReduce is speed. Spark can process the same datasets significantly faster due to its in-memory computation strategy and its advanced DAG scheduling.

· Another of Spark’s major advantages is its versatility. It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster. It can perform both batch and stream processing, letting you operate a single cluster to handle multiple processing styles.

· Beyond the capabilities of the engine itself, Spark also has an ecosystem of libraries that can be used for machine learning, interactive queries, etc. Spark tasks are almost universally acknowledged to be easier to write than MapReduce, which can have significant implications for productivity.

· Adapting the batch methodology for stream processing involves buffering the data as it enters the system. The buffer allows it to handle a high volume of incoming data, increasing overall throughput, but waiting to flush the buffer also leads to a significant increase in latency. This means that Spark Streaming might not be appropriate for processing where low latency is imperative.

· Since RAM is generally more expensive than disk space, Spark can cost more to run than disk-based systems. However, the increased processing speed means that tasks can complete much faster, which may completely offset the costs when operating in an environment where you pay for resources hourly.

· One other consequence of the in-memory design of Spark is that resource scarcity can be an issue when deployed on shared clusters. In comparison to Hadoop’s MapReduce, Spark uses significantly more resources, which can interfere with other tasks that might be trying to use the cluster at the time. In essence, Spark might be a less considerate neighbor than other components that can operate on the Hadoop stack.

· Spark is a great option for those with diverse processing workloads.

· Spark batch processing offers incredible speed advantages, at the cost of high memory usage.

· Spark Streaming is a good stream processing solution for workloads that value throughput over latency.

Apache Flink:

Apache Flink is a stream processing framework that can also handle batch tasks. It considers batches to simply be data streams with finite boundaries, and thus treats batch processing as a subset of stream processing. This stream-first approach to all processing has a number of interesting side effects.

For Flink, I shall elaborate on the Stream processing model first since the batch processing model is an extension of the stream processing model.

Flink stream processing model

Flink’s stream processing model handles incoming data on an item-by-item basis as a true stream. Flink provides its DataStream API to work with unbounded streams of data.

Stream processing tasks take snapshots at set points during their computation to use for recovery in case of problems. For storing state, Flink can work with a number of state backends with varying levels of complexity and persistence.

Additionally, Flink’s stream processing is able to understand the concept of “event time”, meaning the time that the event actually occurred, and can handle sessions as well. This means that it can guarantee ordering and grouping in some interesting ways.
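
The event-time idea can be sketched by windowing on the timestamp carried in each event rather than on arrival order; names here are illustrative, not the Flink DataStream API:

```python
from collections import defaultdict

# Events are grouped by the timestamp they carry (when they occurred),
# not by arrival order, so late or out-of-order arrivals still land in
# the right window.

def event_time_windows(events, window_size):
    """events: (event_time, payload) pairs; returns window_start -> payloads."""
    windows = defaultdict(list)
    for event_time, payload in events:
        window_start = (event_time // window_size) * window_size
        windows[window_start].append(payload)
    return dict(windows)

# "b" arrives last but occurred early, so it joins the first window.
arrivals = [(12, "a"), (27, "c"), (14, "b")]
windows = event_time_windows(arrivals, 20)
# windows == {0: ["a", "b"], 20: ["c"]}
```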

Flink batch processing model

Flink’s batch processing model in many ways is just an extension of the stream processing model. Instead of reading from a continuous stream, it reads a bounded dataset off of persistent storage as a stream. Flink uses the exact same runtime for both of these processing models.

Flink offers some optimizations for batch workloads. For instance, since batch operations are backed by persistent storage, Flink removes snapshotting from batch loads. Data is still recoverable, but normal processing completes faster.

Another optimization involves breaking up batch tasks so that stages and components are only involved when needed. This helps Flink play well with other users of the cluster. Preemptive analysis of the tasks gives Flink the ability to also optimize by seeing the entire set of operations, the size of the data set, and the requirements of steps coming down the line.

Advantages and Limitations

· Flink is currently a unique option in the processing framework world. While Spark performs batch and stream processing, its streaming is not appropriate for many use cases because of its micro-batch architecture. Flink’s stream-first approach offers low latency, high throughput, and real entry-by-entry processing.

· Flink manages many things by itself. Somewhat unconventionally, it manages its own memory instead of relying on the native Java garbage collection mechanisms for performance reasons. Unlike Spark, Flink does not require manual optimization and adjustment when the characteristics of the data it processes change. It handles data partitioning and caching automatically as well.

· Flink analyzes its work and optimizes tasks in a number of ways. Part of this analysis is similar to what SQL query planners do within relational databases, mapping out the most effective way to implement a given task. It is able to parallelize stages that can be completed in parallel, while bringing data together for blocking tasks. For iterative tasks, Flink attempts to do computation on the nodes where the data is stored for performance reasons. It can also do “delta iteration”, or iteration on only the portions of data that have changed.

· In terms of user tooling, Flink offers a web-based scheduling view to easily manage tasks and view the system. Users can also display the optimization plan for submitted tasks to see how it will actually be implemented on the cluster. For analysis tasks, Flink offers SQL-style querying, graph processing and machine learning libraries, and in-memory computation.

· Flink operates well with other components. It is written to be a good neighbor if used within a Hadoop stack, taking up only the necessary resources at any given time. It integrates with YARN, HDFS, and Kafka easily. Flink can run tasks written for other processing frameworks like Hadoop and Storm with compatibility packages.

· One of the largest drawbacks of Flink at the moment is that it is still a very young project. Large-scale deployments in the wild are not yet as common as for other processing frameworks, and there hasn't been much research into Flink's scaling limitations. With its rapid development cycle and features like the compatibility packages, Flink deployments may become more common as organizations get the chance to experiment with it.
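
The “delta iteration” mentioned above can be sketched with minimum-label propagation on a tiny graph, where each superstep revisits only the vertices that changed in the previous one. This is purely illustrative, not the Flink DataSet API:

```python
# Each superstep processes only the "workset" of elements that changed
# in the previous step, not the whole dataset.

def min_label_propagation(edges, labels):
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    workset = set(labels)              # initially everything may change
    while workset:
        changed = set()
        for v in workset:
            for n in neighbors.get(v, []):
                if labels[v] < labels[n]:
                    labels[n] = labels[v]
                    changed.add(n)     # only changed vertices re-enter
        workset = changed              # the shrinking delta
    return labels

labels = min_label_propagation([(1, 2), (2, 3)], {1: 1, 2: 2, 3: 3})
# labels == {1: 1, 2: 1, 3: 1}: the minimum label reached every vertex
```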


There are plenty of options for processing within a big data system.

For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is likely less expensive to implement than some other solutions.

For stream-only workloads, Storm has wide language support and can deliver very low latency processing, but can deliver duplicates and cannot guarantee ordering in its default configuration.

Samza integrates tightly with YARN and Kafka in order to provide flexibility, easy multi-team usage, and straightforward replication and state management.

For mixed workloads, Spark provides high speed batch processing and micro-batch processing for streaming. It has wide support, integrated libraries and tooling, and flexible integrations.

Flink provides true stream processing with batch processing support. It is heavily optimized, can run tasks written for other platforms, and provides low latency processing, but is still in the early days of adoption.

The best fit for your situation will depend heavily upon the state of the data to process, how time-bound your requirements are, and what kind of results you are interested in. There are trade-offs between implementing an all-in-one solution and working with tightly focused projects, and there are similar considerations when evaluating new and innovative solutions over their mature and well-tested counterparts.


From Ian HellStrom
