Why Apache Heron? Part 2
This is part 2 of a series of blog posts that highlights key features of Apache Heron (incubating*), a next-generation stream processing system developed at Twitter. In part 1 of the series, we discussed how Heron provides multi-tenancy, native support for containers, resource isolation at all levels, a flexible resource allocation model, and flexible deployment scenarios, and we showed how some of these features allow for greatly improved developer productivity. In this post, we’ll continue the discussion by showing how Heron achieves seamless support for different processing semantics, how its modular design makes it efficient, extensible, and scalable, and we’ll also shine light on its security model, multi-language API, and easy operability.
Seamless support for different processing semantics
When we say “delivery semantics,” what we man is the guarantee that tuples — the core data type underlying Heron topologies — are delivered to Heron applications by the underlying streaming system. Different streaming applications require different types of processing guarantees. For some applications, a certain minimal amount of data loss might be tolerable, for example an application computing Twitter trends. A billing application, on the other hand, would require zero data loss. In order to accommodate different requirements, Heron supports three types of delivery semantics: at most once, at least once, and effectively once.
- At-most-once semantics — In this mode, Heron provides best-effort delivery of tuples. If there are failures, such as hardware failures, some tuples might be lost and will not be retried. This tends to be best for applications for which a certain amount of data loss is acceptable.
- At-least-once semantics — When running in this mode, Heron will guarantee that tuples will be processed by the application at least once. Heron keeps track of the tuples as they are produced by different topology components and ensures that they are all retired at the end of processing. If failures and/or timeout occur, Heron will retry those tuples.
- Effectively-once semantics — This mode, commonly referred to as “exactly-once” by many other streaming platforms (please stay tuned for a follow-up blog post discussing the differences), guarantees that tuples that cause state changes are effectively processed once. To be more concrete, Heron guarantees that each individual tuple will contribute to any state change only once. Heron will ensure, via its distributed state checkpointing mechanism, that if a tuple fails to be processed, the state of all components in the application will be rolled back to a previous successful checkpoint. Heron will also start reading from incoming data streams at the correct offset that corresponds to the checkpoint that all the states in the application were rolled back to.
A streaming system should be highly efficient. If the system is inefficient, it will use more resources to achieve the same throughput and efficiency as a more efficient system. Heron was designed with efficiency in mind right from the start for large data processing environments. Based on detailed performance evaluation in production settings, Heron can deliver throughput of 170 million tuples per second on 17 machines, with each machine consisting of 12 cores and 2 hyper-threads per core, resulting in 24 hyper-threads processing data.
Since the amount of data in the world is growing by an estimated 40% per year, enterprises need to have systems and solutions in place to accommodate this growth. Scaling a system requires fundamentally different thinking in system design and how to efficiently use resources. Heron was designed for high scalability along multiple dimensions:
- in the ability to run large streaming topologies with a large number of containers
- in the ability to run a topology with a large number of components
- in the ability to process a lot of data
Heron has demonstrated scalability in all three dimensions in production, running topologies consisting of 800–1000 containers each as well as topologies consisting of 50–60 components in each of the topology’s directed acyclic graphs (DAGs).
Support for mixed workloads
Heron topologies can be used for a highly diverse set of streaming applications. Hence, the SLA requirements for these applications can vary. A critical stock trading application, for example, might require a few milliseconds of latency, while an application that computes Twitter trends might require high throughput with a few seconds of latency. Heron provides a rich set of parameters that allows you to fine-tune performance in accordance with SLA requirements on a per-topology basis. Unlike other systems that depend on a single set of parameters for all processing jobs, Heron topologies can be tuned on a topology-by-topology basis, which ensures that changing the parameters for one topology won’t affect other topologies. Several topologies with different SLA-derived parameters can thus easily co-exist in the same cluster.
Modular design and extensibility
The “Big Data” ecosystem is highly complex, and continuous innovation has led to a profusion of infrastructure components. In this continuously evolving ecosystem, we wanted to ensure that Heron can adapt and still retain its well-tested, battle-hardened, productionized core. Heron was designed with extensibility in mind, and pretty much any component can be replaced with a different implementation. For example, a state manager that provides distributed synchronization provides multiple implementations, one for Apache Bookkeeper and another for local filesystems. Similarly, the checkpoint manager for storing snapshots provides implementations for Hadoop and again for local filesystems.
Let’s say that you want to use etcd as a state manager. You can add an etcd implementation using the well-defined state manager interface and the rest of the system remains unchanged. Similarly, if you want to use Ceph for storing checkpoints, you can add another checkpoint manager implementation and the system will start using the implementation if you apply the proper config setting.
Heron provides two tools for managing topologies: Heron Tracker and Heron UI. Heron Tracker provides REST API endpoints for querying the state of topologies and for fetching operational metrics. The REST API facilitates writing custom tools to provide visibility into the state of topologies. In fact, Heron UI uses the Heron Tracker REST API. Furthermore, each topology emits a wide variety of metrics using a process called the metrics manager. The metrics manager has several plugins for exporting metrics to third party systems. The currently supported plugins:
Heron topology metrics are continuously monitored to automatically identify issues and problems that might arise in running topologies. These issues and problems are bubbled up to the user so that timely action can be taken.
Diverse use cases
Heron is currently being used for a wide range of use cases across multiple organizations, such as Twitter, Google, Microsoft, and ndustrial.io. This broad set of use cases includes but is not limited to:
- Real-time extract transform load (ETL) operations
- Real time business intelligence
- Real-time product safety and spam detection
- Real-time fraud and anomaly detection
- Real-time trend discernment
- Real-time machine learning
- Real-time media
- Real-time operations
- Real-time log processing
- Real-time model predictive control
These use cases have evolved over a period of time and via feedback about Heron. Heron features have been continuously expanded to accommodate new use cases.
Proven in production
Heron has been running in production for more than three years at Twitter. Heron currently processes around 250 billion events per day in production and is being used by numerous internal services. It has processed more than 275 trillion events up to date in total and it runs on thousands of machines powering several critical applications at Twitter.
Heron is a next-generation streaming system designed to address several shortcomings in other open source streaming systems. In this two-part series, we’ve highlighted several features of Heron that make it more than ready for enterprise-wide deployment. In this post, we summarized additional capabilities such as multiple processing semantics, scalability, efficiency, support for mixed workloads, operational maturity, and how its modular design helps in quickly adding new functionality.
If you’re interested in Heron, you may want to participate in the community via:
- The Heron Slack channel (you can self-register at https://heronstreaming.herokuapp.com)
- The Heron email list
*Apache Heron is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by Apache Incubator PMC. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.