Apache Apex in a Nutshell

22 min read, Feb 20, 2018

Hello and welcome to another story from the Packt Expert Network. This overview of Apache Apex was written by Thomas Weise, Munagala V. Ramanath, David Yan, and Kenneth Knowles, authors of the book Learning Apache Apex.

The world is producing, at unprecedented levels, data from a rapidly growing number of sources including mobile devices, sensors, industrial machines, financial transactions, web logs, etc. Often, these streams of data can offer valuable insights if processed, in flight, quickly and reliably, and businesses are finding it increasingly important to take action on this “data in motion” in order to remain competitive. MapReduce was among the first technologies to enable processing of very large data sets on clusters of commodity hardware. Apache Spark then took over to replace heavy reliance on disk I/O with a more efficient, memory-based approach. Still, the downside of batch processing systems is the high latencies they entail as they accumulate data into batches, sometimes over hours, and cannot address those use cases that require fast time to insight for continuous data in motion.

Such requirements can be handled by newer stream processing systems, which can process data in “real time,” sometimes with latency as low as a few milliseconds. Apache Storm was the first ecosystem project to offer this capability, albeit with prohibitive trade-offs such as reliability vs. latency. Today, there are newer frameworks that enable low latency, high throughput, reliability, and a unified architecture that can be applied to both streaming and batch use cases. One such framework is Apache Apex, a next generation platform for processing data in motion.

Data sets can be classified as unbounded or bounded. Bounded data is finite; it has a beginning and end. Unbounded data is a “type of ever-growing, essentially infinite data set.” The distinction is independent of how the data is processed. Often unbounded data is equated to stream processing and bounded data to batch processing, but this is changing. We will see later how state of the art stream processors like Apache Apex can be used for, and are very capable of, processing both unbounded and bounded data, so there is no need for a batch processing system just because the data set happens to be finite.

Most big data sets (high volume) that are eventually processed by big data systems are unbounded (like user activity on internet sites, mobile devices, IoT and sensors). The goal is to convert data into meaningful business insights and competitive advantage and increasingly, businesses are relying on high velocity data processing since the value of data diminishes as it ages.

How were unbounded data sets processed in the past without stream-processing platforms? The answer is that, in order to be consumable by a batch processor, they had to be divided into bounded data sets, often at intervals of hours. Before processing could begin, the earliest events would wait for a long time for their batch to be complete and ready for processing. At the time of processing, data would, therefore, already be old and less valuable.

Stream processing

Stream processing means processing each event as soon as it is available. Because there is no waiting for more input after an event arrives, there is no artificially added latency (unlike with batching), which is important for real-time use cases. But stream processing is not limited to real-time data: as we will see, there are benefits to applying this continuous processing in a uniform manner to historical data as well.

Consider line-oriented data stored in a file. By reading the file a line at a time and processing each line as soon as it is read, subsequent processing steps can be performed while the file is still being read, instead of waiting for the entire input to be read before initiating the next stage. Stream processing is a pipeline in which each item can be acted upon immediately. Apart from low latency, this can also lead to a more uniform and predictable pattern of resource consumption (memory, CPU, network) with steady (vs. bursty) throughput, provided the operations performed don’t inherently require any blocking. The following diagram illustrates a common word-counting pipeline:

Data flows through the pipeline as individual events and all processing steps are active at the same time. In a distributed system operations are performed on different nodes, and data flows through the system, allowing for parallelism and high throughput. Processing is decentralized and without any inherent bottlenecks, in contrast to architectures that attempt to move processing to where the data resides.
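The line-at-a-time idea can be sketched in plain Java, without any Apex dependency. This is only an illustration under stated assumptions: the class name StreamingWordCount and the sample input are made up for this example, and everything runs in a single thread, whereas a real stream processor would distribute the steps across nodes.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; not part of the Apex API.
class StreamingWordCount {

    // BufferedReader.lines() is lazy: each line is split and counted
    // as soon as it has been read, before the next line is fetched.
    static Map<String, Long> countWords(BufferedReader reader) {
        Map<String, Long> counts = new TreeMap<>();
        reader.lines()                                                  // step 1: read a line
              .flatMap(line -> Arrays.stream(line.trim().split("\\s+"))) // step 2: split into words
              .filter(word -> !word.isEmpty())
              .forEach(word -> counts.merge(word.toLowerCase(), 1L, Long::sum)); // step 3: count
        return counts;
    }

    public static void main(String[] args) {
        BufferedReader reader = new BufferedReader(new StringReader("to be or\nnot to be"));
        System.out.println(countWords(reader)); // prints {be=2, not=1, or=1, to=2}
    }
}
```

The same method works unchanged for a large file or a socket, because nothing in it requires the whole input to be present before counting starts.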

Stream processing can perform operations such as sum, average, max, or min over groups of events known as windows, as illustrated by the following diagram:

Windowing slices the stream at temporal boundaries to demarcate finite data sets on which operations can be performed. All data belonging to a window needs to be observed before a result can be emitted, and windowing provides these boundaries. There are different strategies to define windows over a data stream. In general, the final result of an operation for a given window is only known after all its data elements are processed. However, windowing doesn’t always mean that processing can only start once all input has arrived. Sometimes even the intermediate result of a windowed computation is of interest and can be made available for downstream consumption (and subsequently refined with the final result).
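As a minimal sketch of slicing a stream at temporal boundaries, the following plain-Java example sums values per fixed-size window and emits a result each time a boundary is crossed. It assumes events arrive in timestamp order; the Event class and the "windowStart:sum" output format are hypothetical choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; assumes events arrive ordered by timestamp.
class TumblingWindowSum {

    // Hypothetical event: a timestamp in milliseconds and a numeric value.
    static final class Event {
        final long timestampMillis;
        final long value;

        Event(long timestampMillis, long value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    // Returns one "windowStart:sum" entry per window, emitted when the window closes.
    static List<String> windowSums(List<Event> events, long windowSizeMillis) {
        List<String> results = new ArrayList<>();
        long currentWindow = 0;
        long sum = 0;
        boolean open = false;
        for (Event e : events) {
            long windowStart = Math.floorDiv(e.timestampMillis, windowSizeMillis) * windowSizeMillis;
            if (open && windowStart != currentWindow) {
                results.add(currentWindow + ":" + sum); // boundary crossed: emit the final result
                sum = 0;
            }
            currentWindow = windowStart;
            open = true;
            sum += e.value; // the running sum is the intermediate result of the open window
        }
        if (open) {
            results.add(currentWindow + ":" + sum); // end of input closes the last window
        }
        return results;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(new Event(100, 1), new Event(900, 2), new Event(1500, 5));
        System.out.println(windowSums(events, 1000)); // prints [0:3, 1000:5]
    }
}
```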

Stream processing systems

The first open source stream processing framework was Apache Storm. Since then, several other Apache projects for stream processing have emerged. The next generation of streaming-first architectures like Apache Apex and Apache Flink come with stronger capabilities and are more broadly applicable. They are not only able to process data with low latency; they also provide for state management (for data that an operation may require across multiple events), strong processing guarantees (correctness), fault-tolerance, scalability and high performance. Users can now also expect such frameworks to come with comprehensive libraries of connectors, other building blocks and APIs that make development of non-trivial streaming applications productive and allow for predictable project implementation cycles. Equally important, next-gen frameworks should cater to aspects such as operability, security and the ability to run on shared infrastructure (multi-tenancy) to satisfy DevOps requirements for successful production launch and uptime. Given the vastly expanded capabilities of these modern streaming platforms, stop-gap measures such as Lambda Architecture (which required building and maintaining two parallel workflows, one for batch and one for streaming) are no longer needed.

What is Apex and why is it important

According to its official project Web site at the Apache Software Foundation, Apex’s mission is to be a modern stream processing framework that can process data in-motion with low latency in a way that is highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable. Apex is written in Java, and Java is the primary application development environment.

In a typical streaming data pipeline, events from sources are stored and transported through a system such as Apache Kafka. The events are then processed by a stream processor and the results are delivered to sinks, which are frequently databases, distributed file systems or message buses that link to downstream pipelines or services:

In this end-to-end scenario, we see Apex as the processing component. The processing can be complex logic, with operations performed in sequence or in parallel in a distributed environment.

Apex runs on cluster infrastructure and currently supports and depends on Apache Hadoop, for which it was originally written. Support for other cluster resource managers is on the roadmap.

Apex supports integration with many external systems out of the box with connectors that are maintained and released by the project, including but not limited to the systems shown above. The most frequently used connectors are probably Kafka and file readers. Frequently used sinks for the computed results are files and databases, but results can also be delivered directly to a front-end system, such as for real-time reporting directly from the Apex application, a use case that we will look at later.

Apex had its first production deployments in 2014 and today is used in mission critical deployments in various industries for processing at scale. Use cases range from very low latency processing in the real-time category to large scale batch processing. Some of the organizations that use Apex can be found on the Powered by Apache Apex page on the Apex project web site at https://apex.apache.org/powered-by-apex.html.

Apex is a platform and framework on top of which specific applications (or solutions) are built. As such, Apex is applicable to a wide range of use cases, including real-time machine learning model scoring, real-time ETL (Extract, Transform, and Load), predictive analytics, anomaly detection, real-time recommendations, and systems monitoring:

As organizations realize the financial and competitive importance of making data-driven decisions in real time, the number of industries and use cases will grow.

Apex application model and API

An Apex application is represented by a Directed Acyclic Graph (DAG), which expresses processing logic as operators (vertices) and streams (edges). Streams are unbounded sequences of events, called tuples. The logic that can be executed is arranged in the DAG in sequence or in parallel.

The resulting graph must be acyclic, meaning that any given tuple is processed only once by an operator. An exception to this is iterative processing, also supported by Apex, whereby the output of an operator becomes the input of a predecessor (or upstream operator), introducing a loop in the graph. This construct is frequently found in machine learning.

Operators are the functional building blocks of Apex; they can contain custom code specific to a single use case or generic functionality that can be applied broadly. The Apex library (to be introduced later) contains reusable operators, including connectors that read from various sources, and operators that perform filtering, transformations and output to various destinations. Operators receive events via input ports and emit events via output ports. Operators that don’t receive events via ports are called input operators and considered the roots of the DAG, as they receive events from external systems. The full DAG is visible to the engine, which means it can be translated into an end-to-end, fault-tolerant, scalable execution layer.

In the following sections, we will introduce the different APIs that Apex offers to specify applications. All of these representations are eventually translated into the native DAG, which is the input for the Apex engine to launch an application.

Apex DAG Java API

The low-level DAG API is defined by the Apex engine. Any application that runs on Apex, irrespective of the original specification format, will be translated into this API. It is sometimes also referred to as compositional, as it represents the logical DAG, which will be translated into a physical plan and mapped to the execution layer by the Apex runtime.

The following is the Word Count Example Application written with the DAG API:

    LineReader lineReader = dag.addOperator("input", new LineReader());
    Parser parser = dag.addOperator("parser", new Parser());
    UniqueCounter counter = dag.addOperator("counter", new UniqueCounter());
    ConsoleOutputOperator cons = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("lines", lineReader.output, parser.input);
    dag.addStream("words", parser.output, counter.data);
    dag.addStream("counts", counter.count, cons.input);

The developer is provided with a DAG handle (in this case, dag), through which operators are added and then connected with streams.

The flow of data is defined through streams, which are connections between ports. Ports are the endpoints of operators to receive data (input ports) or emit data (output ports). Each operator can have multiple ports and each port is connected to at most one stream (ports can also be optional, in which case they don’t have to be connected in the DAG). We will look at ports in more detail when discussing operator development. For now it is sufficient to know that ports provide the type-safe endpoints through which the application developer specifies the data flow by connecting streams.
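To make the idea of type-safe ports more concrete, here is a small plain-Java sketch. It does not use the Apex classes: OutputPort, Parser, Collector and run are hypothetical names invented for this illustration, and the real Apex port types look different. The point is only that generics let the compiler reject a stream that connects ports of mismatched tuple types, and that an output port accepts at most one stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative sketch only; these are not the Apex port classes.
class PortSketch {

    // An output port emits tuples of type T to at most one connected stream.
    static final class OutputPort<T> {
        private Consumer<T> sink;

        void connect(Consumer<T> inputPort) {
            if (sink != null) {
                throw new IllegalStateException("port is already connected to a stream");
            }
            sink = inputPort;
        }

        void emit(T tuple) {
            if (sink != null) { // an unconnected (optional) port silently drops tuples
                sink.accept(tuple);
            }
        }
    }

    // Operator with one input port (the input method) and one output port.
    static final class Parser {
        final OutputPort<String> output = new OutputPort<>();

        void input(String line) {
            for (String word : line.split(" ")) {
                output.emit(word);
            }
        }
    }

    // Operator with an input port only; it collects what it receives.
    static final class Collector {
        final List<String> received = new ArrayList<>();

        void input(String word) {
            received.add(word);
        }
    }

    static List<String> run(String... lines) {
        OutputPort<String> source = new OutputPort<>();
        Parser parser = new Parser();
        Collector collector = new Collector();
        source.connect(parser::input);           // stream "lines"
        parser.output.connect(collector::input); // stream "words"
        for (String line : lines) {
            source.emit(line);
        }
        return collector.received;
    }

    public static void main(String[] args) {
        System.out.println(run("a b", "c")); // prints [a, b, c]
    }
}
```

Connecting parser.output to a consumer of Integer would fail at compile time, which is the kind of check typed ports are meant to give the application developer.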

High-level Stream Java API

The high-level Apex Stream Java API provides an abstraction from the lower level DAG API discussed above. It is a declarative, fluent style API that is easier to learn for someone new to Apex. Instead of identifying individual operators, the developer works with methods on the stream interface to specify the transformations. The API will internally keep track of the operator(s) needed for each of the transformations and eventually translate it into the lower level DAG. The high level API is part of the Apex library and outside of the Apex engine, which does not need to know about it.

Here is the Word Count example application written with the high level API (using Java 8 syntax):

    StreamFactory.fromFolder("/tmp")
        .flatMap(input -> Arrays.asList(input.split(" ")), name("Words"))
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption().accumulatingFiredPanes().withEarlyFiringsAtEvery(1))
        .countByKey(input -> new Tuple.PlainTuple<>(new KeyValPair<>(input, 1L)), name("countByKey"))
        .map(input -> input.getValue(), name("Counts"))
        .print(name("Console"))
        .populateDag(dag);

Windowing is supported and stateful transformations can be applied to a windowed stream, as shown with countByKey above.
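The effect of a global window that keeps accumulating and fires after every tuple can be approximated in plain Java as follows. This is a hedged sketch of the observable behavior, not of how the Apex library implements it: the class RunningCountSketch and the "word=count" output format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; mimics accumulating panes with a firing after each tuple.
class RunningCountSketch {

    // For every incoming word, updates the accumulated count for that key
    // and immediately emits the refined result downstream.
    static List<String> countByKey(List<String> words) {
        Map<String, Long> state = new HashMap<>(); // state of the single global window
        List<String> emitted = new ArrayList<>();
        for (String word : words) {
            long count = state.merge(word, 1L, Long::sum);
            emitted.add(word + "=" + count);
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(countByKey(Arrays.asList("a", "b", "a"))); // prints [a=1, b=1, a=2]
    }
}
```

Each emitted value supersedes the previous one for the same key, which is what makes the early results usable before the input ends.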

In addition to the transformations that are directly available through the Stream API, the developer can also use other (possibly custom) operators through the addOperator(..) and endsWith(..) methods. For example, if output should be written to JDBC, the connector from the library can be integrated using these generic methods instead of requiring the stream API to have a method like toJDBC.

It is also possible to extend the Stream API with custom methods to provide new transformations without exposing the details of the underlying operator. An example for this extension mechanism can be found in the malhar-stream module unit tests.

For readers interested in exploring the API further, there is a set of example applications in the apex-malhar repository under https://github.com/apache/apex-malhar/tree/master/examples/highlevelapi

SQL

SQL is widely used for data transformation and access, not only with traditional relational databases, but also in the Apache big data space with projects like Hive, Drill, Impala and several others. They all let the user process bounded data at rest using familiar SQL syntax, without other programming skills. SQL can be used for ETL purposes but the most common use is for querying data, either directly or through the wide range of SQL compatible BI tools.

SQL is relatively new in the stream processing area, where it is used as a declarative approach to define streaming applications. Apex is using Apache Calcite for its SQL support, which has already been adopted by several other big data processing frameworks. Instead of every project coming up with its own declarative API, Calcite aims to make SQL the common language. Calcite accepts standard SQL, translates it into relational algebra, facilitates query planning and optimization and allows for integration of any data source that can provide collections of records with columns (files, queues, and so on).

With unbounded data sources, the processing of SQL becomes continuous and it is necessary to express windows on the stream that define boundaries at which results can be computed and emitted. Calcite provides streaming SQL extensions to support unbounded data (https://calcite.apache.org/docs/stream.html).

The SQL support in Apex currently covers select, insert, where clause, inner join, and scalar functions. Endpoints (sources and sinks) can be file, Kafka or streams that are defined with the DAG API (“fusion style”), and CSV is supported as data format. A sample SQL query (that also demonstrates the use of user-defined functions CALLTYPE and COST) that can be processed by Apex is:

    INSERT INTO OutputSchema
    SELECT STREAM TSTAMP, CALLTYPE(type), origin, destination, COST(origin, destination, duration)

Here is a simple example to illustrate the translation of SQL into an Apex DAG:

Support for windowing and GROUP BY (for aggregations) is on the road map, based on the scalable window and accumulation state management of the Apex library, which the book covers in detail.

Windowing and time

Streams of unbounded data require windowing to establish boundaries to process data and emit results. Processing always occurs in a window, and there are different types of windows and strategies to assign individual data records to windows.

Often the relationship of data processed and time is explicit, when the data contains a timestamp that identifies the event time, or when an event occurred. This is usually the case with streaming sources that emit individual events. However, there are also cases where time can be derived from a container. For example, when data arrives batched in hourly files, and time may be derived from the file name instead of individual records. Sometimes data may arrive without any timestamp, and the processor at the source would need to assign a timestamp based on arrival time or processing time in order to perform stateful windowed operations.

The windowing support provided by the Apex library that we discuss here largely follows the Apache Beam model. It is flexible and broadly applicable to different use cases. It is also completely different from, and not to be confused with, the Apex engine’s native arrival time based streaming window mechanism. The latter can be applied to use cases that don’t require event time handling: it assumes that the stream can be sliced into fixed time intervals (default 500ms), at which the engine performs callbacks that the operator can use to (globally) perform aggregate operations over multiple records that arrived in that interval. The intervals are aligned with the internal checkpointing mechanism and suitable for processing optimizations such as flushing files or batching writes to a database. It cannot support transformation and other processing based on event time, because events in the real world don’t necessarily arrive in order and perfectly aligned with these internal intervals:

The example above shows a sequence of events, each with a timestamp, in their processing order. Note the difference between processing time and event time. It should be possible to process the same sequence at different times with the same result. That’s only possible when the transformations understand event time and are capable of maintaining the computational state (potentially multiple open windows at the same time with high key cardinality). The example shows how the state tracks multiple windows (global, 4:00 and 5:00) and performs counting regardless of processing time.
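A small plain-Java sketch can show why assigning events to windows by event time makes the result independent of arrival order. The class EventTimeCountSketch and its helper at(hour, minute) are hypothetical names for this illustration, and timestamps are simplified to milliseconds since midnight.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; counts events per hourly event-time window.
class EventTimeCountSketch {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;
    static final long MINUTE_MILLIS = 60L * 1000L;

    // Builds an event timestamp from an hour and minute of the day.
    static long at(int hour, int minute) {
        return hour * HOUR_MILLIS + minute * MINUTE_MILLIS;
    }

    // Keeps one counter per window; several windows can be open at the same time.
    static Map<Long, Long> countPerHour(List<Long> eventTimes) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long eventTime : eventTimes) {
            long windowStart = Math.floorDiv(eventTime, HOUR_MILLIS) * HOUR_MILLIS;
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The 5:05 event arrives before the 4:50 event.
        List<Long> arrivalOrder = Arrays.asList(at(4, 10), at(5, 5), at(4, 50));
        List<Long> eventOrder = Arrays.asList(at(4, 10), at(4, 50), at(5, 5));
        System.out.println(countPerHour(arrivalOrder).equals(countPerHour(eventOrder))); // prints true
    }
}
```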

The Apex library has support for flexible, windowed, stateful transformations based on the WindowedOperator, which can handle:

Keyed or non-keyed data
Windows: Global window (can be used for batch), fixed duration time window, sliding time window, session windows.
Timestamp Extractor: Timestamp Extractor derives the window timestamp from the data tuple.
Watermarks: Watermarks are control tuples that include a timestamp. A watermark conveys that all windows that lie completely before the given timestamp are considered late, and the rest of the windows are considered early.
Allowed Lateness: Allowed Lateness specifies the lateness horizon from the watermark. Data with a timestamp that lies beyond the lateness horizon is dropped and the window state is purged.
Accumulation: Accumulation defines the mutation of state upon arrival of an incoming tuple (this is the business logic). The following accumulation modes are supported: Accumulating, Discarding, and Accumulating and Retracting.
Triggers: Triggers cause results of the accumulation to be emitted downstream. There are two types of triggers: time-based triggers and count-based triggers.
Window Propagation: It is possible to chain multiple windowed transformations and have only the most upstream instance assign the windows that all downstream instances inherit.
Merging of Streams: Through merge accumulation the user can implement their custom merge or join accumulation based on their business logic. Examples of this type of accumulation are Inner Join and Cogroup.
State storage: Accumulations are stateful and the state can be large. Each window (or each window/key pair when keyed) has its own state, and how the state is stored and checkpointed is likely to be the most important factor for performance. Apex provides large scale, incremental state management, by default with DFS as backing store, which requires no additional external systems.
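The interplay of watermarks and allowed lateness described in the list above can be sketched in plain Java. This is a simplified model under stated assumptions, not the WindowedOperator implementation: the class LatenessSketch, its method names, and the rule that a window is purged once its end lies at or before the lateness horizon are choices made for this illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; a simplified model of watermark and lateness handling.
class LatenessSketch {
    private final long windowSize;
    private final long allowedLateness;
    private final Map<Long, Long> windowCounts = new TreeMap<>(); // window start -> count
    private long watermark = Long.MIN_VALUE; // no watermark seen yet
    private long dropped = 0;

    LatenessSketch(long windowSize, long allowedLateness) {
        this.windowSize = windowSize;
        this.allowedLateness = allowedLateness;
    }

    // Returns true if the tuple was counted, false if it lies beyond the lateness horizon.
    boolean onTuple(long timestamp) {
        if (watermark != Long.MIN_VALUE && timestamp < watermark - allowedLateness) {
            dropped++;
            return false;
        }
        long windowStart = Math.floorDiv(timestamp, windowSize) * windowSize;
        windowCounts.merge(windowStart, 1L, Long::sum);
        return true;
    }

    // Advances the watermark and purges state of windows behind the lateness horizon.
    void onWatermark(long timestamp) {
        watermark = Math.max(watermark, timestamp);
        long horizon = watermark - allowedLateness;
        windowCounts.keySet().removeIf(start -> start + windowSize <= horizon);
    }

    Long countFor(long windowStart) {
        return windowCounts.get(windowStart); // null once the window state is purged
    }

    long droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        LatenessSketch op = new LatenessSketch(10, 5);
        op.onTuple(3);
        op.onTuple(12);
        op.onWatermark(12);                  // horizon is 7, window [0, 10) stays open
        System.out.println(op.onTuple(8));   // prints true: late, but within allowed lateness
        op.onWatermark(20);                  // horizon is 15, window [0, 10) is purged
        System.out.println(op.onTuple(9));   // prints false: beyond the lateness horizon
    }
}
```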

The windowing features are also available through the high level stream API. For more information see: http://apex.apache.org/docs/malhar/operators/windowedOperator/

Apache Beam

The book includes a chapter about Apache Beam. Apache Beam is a new programming model and library for portable massive-scale data processing — both batch and streaming. Using Beam, data processing pipelines can be written once, and then be executed on various data processing engines (called Runners), including Apache Apex. With that, Beam provides an alternative to the Apex native APIs to write a pipeline that can execute on Apex:

A Beam pipeline is independent of what language it is written in. The library with which the data processing pipeline is described is called an SDK. In addition to the basic wiring together of a pipeline, an SDK specifies how user-defined functions (UDFs), such as custom Java functions that map over all elements of a data stream, are defined. Beam has SDKs for Java, Python, and Go, with varying levels of maturity. The Java SDK is well-supported on all runners, because most existing runners are also JVM-based and can easily execute Java UDFs directly. A major effort is underway to adapt all runners to use Beam’s portability framework, which will make the Python and Go SDKs available to all runners, and all runners available to Python and Go aficionados.

Beam’s vision is to provide portability of any SDK on any data processing engine, including multiple languages, in the same pipeline. This unlocks reuse of big data processing components across languages. For example, in Beam Java, there are many mature connectors to external systems, including HBase, Kinesis, Hadoop InputFormat, Tika, MongoDb, Cassandra, Elasticsearch, Apache Solr, MQTT, Kafka, Redis, and Google Cloud Platform (BigQuery, Pub/Sub, Bigtable, Datastore, Spanner). There is also a Beam SQL library based on Apache Calcite. Python and Go users can immediately have access to this growing library of transforms. Conversely, Python provides tf.Transform, a connector to TensorFlow.

Value proposition of Apex

Over the last 10 years there has been a lot of hype around Hadoop, but the success rate of projects has not kept up. Challenges include:

A very large number of tools and vendors with often confusing positioning, making it difficult to evaluate and identify the right options
Complexity in development and integration, a steep learning curve, and long time to production
Scarcity of skill set: experts in the technology are difficult to hire
Production-readiness: often the primary focus is on features and functionality while operational aspects are sidelined, which is a problem for business critical systems.

“Big Data success is not about implementing one piece of technology (like Hadoop or anything else), but instead requires putting together an assembly line of technologies, people and processes.”

So how does Apex help to succeed with stream data processing use cases?

Since its inception, the project was focused on enterprise readiness as a key architectural requirement, including aspects like:

Fault tolerance and high availability of all components, automatic recovery from failures, and ability to resume applications from previous state.
Stateful processing architecture with strong processing guarantees (end-to-end exactly-once) to enable mission critical use cases that depend on correctness.
Scalability and superior performance with high throughput and low latency. Ability to process millions of events per second without compromising fault-tolerance, correctness and latency.
Security, multi-tenancy and operability, including a REST API with metrics for monitoring, etc.
A comprehensive library of connectors for integration with the external systems typically found in enterprise architecture. The library is an integral part of the project, maintained by the community and guaranteed to be compatible with the engine.
Ability for code reuse in the JVM environment, and Java as the primary development language, which has a very rich ecosystem and large developer base that is accessible to the kinds of customers who require big data solutions.

With several large scale, mission critical deployments in production, some of which we discussed above, Apex has proven that it can deliver.

Apex requires a cluster to run on, and as of now, this means a Hadoop cluster with YARN and HDFS. Apex will likely support Mesos in the near future and other cluster managers as they gain adoption in the target enterprise space. Running on top of a cluster allows Apex to provide features such as dynamic scaling and resource allocation, automatic recovery and support for multi-tenancy. For users that already have Hadoop clusters as well as the operational skills and processes to run the infrastructure, it is easy to deploy an Apex application, as it does not require installation of any additional components on cluster nodes. If no existing Hadoop cluster is available, there are several options to get started with varying degrees of upfront investment, including cloud deployment such as Amazon EMR, installation of any of the distros (Cloudera, Hortonworks, MapR) or just a Docker image on a local laptop for experimentation.

Big data applications in general are not trivial, especially not the pipelines that solve complex use cases and have to run in production 24/7 without downtime. When working with Apex, the development process, APIs, library and examples are tailored to enable a Java developer to become productive and see results quickly. By using readily available connectors for sources and sinks, it is possible to quickly build an initial proof of concept (PoC) application that consumes real data, does some of the required processing and stores results. The more involved custom development for use case specific business logic can then occur in iterations. The process of building an Apex application will be covered in detail in the next chapter.

Apex separates functionality (or business logic) and engine behavior. Aspects such as parallelism, operator chaining/locality, checkpointing and resource allocations for individual operators can all be controlled through configuration and modified without affecting the code or triggering a full build/test cycle. This allows benchmarking and tuning to take place independently. For example, it is possible to run the same packaged application with different configurations to test tradeoffs, such as lower parallelism / longer time to completion (batch use case), etc.
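The separation of business logic from engine behavior can be illustrated with a plain-Java sketch in which the degree of parallelism is just a parameter. The names PartitionSketch, count and run are hypothetical, and hashing words to partitions is an assumption for this example rather than a description of how Apex partitions streams.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; the partition count is configuration, not logic.
class PartitionSketch {

    // Business logic: count the words seen by one partition.
    static Map<String, Long> count(List<String> words) {
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    // Engine behavior: distribute tuples by key hash, run the logic per partition, merge results.
    static Map<String, Long> run(List<String> words, int partitions) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String word : words) {
            buckets.get(Math.floorMod(word.hashCode(), partitions)).add(word);
        }
        Map<String, Long> merged = new TreeMap<>();
        for (List<String> bucket : buckets) {
            count(bucket).forEach((key, value) -> merged.merge(key, value, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(run(words, 1).equals(run(words, 4))); // prints true
    }
}
```

Changing the partition count affects how the work is spread out, but not the count method or the result, which is why such settings can be tuned without touching the code.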

Stateful processing

Apex is a native streaming architecture. As discussed, this allows processing of events as soon as they arrive without any artificial delay, which enables real-time use cases with very low latency. Another important capability is stateful processing. This applies to stateful transformations and windowing, which may require a potentially very large amount of computational state. But state also needs to be tracked in connectors for correct interaction with external systems. For example, the Apex Kafka connector will keep track of partition offsets as part of its checkpointed state so that it can correctly resume consumption after recovery from failure. Similarly, state is required for reading from files and other sources. And for sources that don’t allow for replay, it is even necessary to retain all consumed data in the connector until it has been fully processed in the DAG.
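The role of offsets in checkpointed state can be sketched with a simulated crash in plain Java. This is not the Apex Kafka connector: the class OffsetCheckpointSketch, the in-memory list standing in for a replayable log, and the fixed checkpoint interval are all assumptions for this illustration. The offset and the count are saved together, so restoring both and replaying from the saved offset neither loses nor double-counts records.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; simulates recovery from a checkpointed offset.
class OffsetCheckpointSketch {

    static long countWithRecovery(List<String> log, int checkpointInterval, int failAtRecord) {
        long offset = 0;       // live state: next offset to read
        long count = 0;        // live state: records processed so far
        long savedOffset = 0;  // last checkpoint
        long savedCount = 0;
        boolean failed = false;
        while (offset < log.size()) {
            if (!failed && offset == failAtRecord) {
                // Simulated crash: live state is lost and restored from the checkpoint.
                offset = savedOffset;
                count = savedCount;
                failed = true;
                continue;
            }
            log.get((int) offset); // consume the record at the current offset
            offset++;
            count++;
            if (offset % checkpointInterval == 0) {
                savedOffset = offset; // offset and count are checkpointed together
                savedCount = count;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h", "i", "j");
        System.out.println(countWithRecovery(log, 4, 6)); // prints 10
    }
}
```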

Given this need to track state, it is helpful to understand how a stateful streaming architecture can provide support for this compared to a “stateless” batch processor:

On top we see an example of processing in Spark Streaming, and below it we see an example in Apex. Based on its underlying batch architecture, Spark Streaming processes a stream by dividing it into small batches (micro-batches) that typically last from 500ms to a few seconds. A new task needs to be scheduled for every micro-batch, which leads to a busy scheduler and some limit on the batch frequency. Once scheduled, each task needs to be initialized. Such initialization could include opening connections to external resources, loading data that is needed for processing and so on.

In classical batch processing, tasks may last for the entire bounded input data set. Any computational state remains internal to the task and there is typically no special consideration for fault-tolerance required, since whenever there is a failure, the task can restart from the beginning.

With unbounded data and streaming that’s different. A stateful operation like counting would need to maintain the current count and it would need to be transferred across task boundaries. As long as the state is small, this may be manageable. However, when transformations are applied to large key cardinality, the state can easily grow to a size that makes it impractical to simply swap in and out (cost of serialization, I/O etc.). The user is now confronted with an additional problem that the micro-batch platform doesn’t solve: efficiently managing the state. And solving this correctly can be difficult, especially when accuracy, consistency and fault-tolerance are important.

In a stateful and fault tolerant streaming architecture (like Apex and Flink) this problem does not exist. The equivalent of the task in Spark Streaming (in Apex that’s an operator) is initialized only once, at launch time. Subsequently, state can be accumulated and held in-memory as long as it is needed for the computation. And access to that memory is fast, which helps in further reducing the latency (in addition to not having to wait for a batch of input to accrue).

So what about fault-tolerance? The streaming platform is responsible for checkpointing the state. It can do so efficiently and provides everything needed to guarantee that the state can be restored and is consistent in the event of failure. Unlike the early days of Apache Storm with its per tuple processing and acknowledgement overhead, the next generation streaming architectures come with fault tolerance mechanisms that do not compromise performance and latency. How Apex solves this will be covered in Chapter 5, Fault Tolerance and Reliability. However, users of the Apex platform do not need to be concerned with fault-tolerance, and instead can focus on their business logic.

Fault tolerance

Apex was built from the ground up to enable data processing pipelines that are highly available and provide strong processing guarantees for accurate results. From its very first release in 2013, Apex supported exactly-once semantics based on distributed checkpointing and full fault-tolerance, including the application master. In the book we will look at how these important properties are achieved and what the different components in the stack contribute to them:

The need for the distributed data processing platform to be resilient
Failure scenarios in Apex and how they are handled
Consistent, distributed checkpointing and how it works
Efficient, incremental, large-scale state saving
Why accuracy of the processing is important and what guarantees Apex provides
An example application for end-to-end exactly-once results

Apex is a reliable platform for stream processing that is highly available and provides accuracy of results. Two of the most important aspects for production quality stream data processing are fault tolerance and correctness guarantees and these turn out to be strong points and key differentiators of Apex.
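One common way to obtain exactly-once results at a sink when upstream data is replayed after recovery is to make writes idempotent per window. The following plain-Java sketch illustrates that general technique under stated assumptions; it is not presented as the mechanism the Apex connectors use, and the Sink class, its in-memory storage, and the monotonically increasing window IDs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; an idempotent sink keyed by window ID.
class IdempotentSinkSketch {

    static final class Sink {
        private final List<String> stored = new ArrayList<>();
        private long lastCommittedWindow = -1;

        // Writes the results of one window unless that window was already committed.
        boolean write(long windowId, List<String> results) {
            if (windowId <= lastCommittedWindow) {
                return false; // replayed window after recovery: skip to avoid duplicates
            }
            stored.addAll(results);
            lastCommittedWindow = windowId; // a real store would commit this atomically with the data
            return true;
        }

        List<String> contents() {
            return stored;
        }
    }

    public static void main(String[] args) {
        Sink sink = new Sink();
        sink.write(0, Arrays.asList("a"));
        sink.write(1, Arrays.asList("b"));
        sink.write(1, Arrays.asList("b")); // replay of window 1 is ignored
        sink.write(2, Arrays.asList("c"));
        System.out.println(sink.contents()); // prints [a, b, c]
    }
}
```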

Performance

Even with big data scale out architectures on commodity hardware, efficiency matters. Better efficiency of the platform lowers cost. If the architecture can handle a given workload with a fraction of the hardware, then it will result in reduced Total Cost of Ownership (TCO). Apex provides several advanced mechanisms to optimize efficiency, such as stream locality and parallel partitioning, which will be covered later in the book.

Apex is capable of very low latency processing (< 10 ms) and is well suited for use cases such as real-time threat detection. Apex can be used to meet a latency Service Level Agreement (SLA) in conjunction with speculative execution (processing the same event multiple times in parallel to prevent delay) thanks to a unique feature: the ability to recover a path or subset of operators without resetting the entire DAG.

Only a fraction of use cases may have such latency requirements. However, it is generally desirable to avoid unnecessary trade-offs. If a platform can deliver high throughput (millions of events per second) with low latency and everything else is equal, why not choose it over one that forces a tradeoff? Benchmarking studies have shown Apex to be highly performant in providing high throughput while maintaining very low latency.

Overall, Apex has characteristics that positively impact time to production, quality, and cost. It is a particularly good fit for use cases that require:

High performance and low latency, possibly with SLA
Large scale, fault tolerant state management and end-to-end exactly-once processing guarantees
Computationally complex production pipelines where accuracy, functional stability, security and certification are critical and ad-hoc changes not desirable.

We hope this gives you a great overview of Apache Apex and its benefits. You can learn more from our book, Learning Apache Apex.