Why Apache Heron? Part 2

Seamless support for different processing semantics

When we say “delivery semantics,” what we mean is the guarantee with which tuples — the core data type underlying Heron topologies — are delivered to Heron applications by the underlying streaming system. Different streaming applications require different processing guarantees. For some applications, such as one computing Twitter trends, a small amount of data loss might be tolerable; a billing application, on the other hand, requires zero data loss. To accommodate these different requirements, Heron supports three types of delivery semantics: at most once, at least once, and effectively once (a configuration sketch follows the list below).

  • At-most-once semantics — In this mode, Heron provides best-effort delivery of tuples. If there are failures, such as hardware failures, some tuples might be lost and will not be retried. This tends to be best for applications for which a certain amount of data loss is acceptable.
  • At-least-once semantics — When running in this mode, Heron guarantees that tuples will be processed by the application at least once. Heron keeps track of tuples as they are produced by different topology components and ensures that they are all retired at the end of processing. If failures and/or timeouts occur, Heron will retry those tuples.
  • Effectively-once semantics — This mode, commonly referred to as “exactly once” by many other streaming platforms (stay tuned for a follow-up blog post discussing the difference), guarantees that tuples that cause state changes are effectively processed once. More concretely, Heron guarantees that each individual tuple contributes to any state change only once. Via its distributed state checkpointing mechanism, Heron ensures that if a tuple fails to be processed, the state of all components in the application is rolled back to the previous successful checkpoint, and it then resumes reading the incoming data streams at the offsets corresponding to that checkpoint.
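
As a concrete illustration, here is a minimal sketch of how a topology opts into one of these modes using Heron’s Java API. It assumes the org.apache.heron.api package of recent releases (older releases used com.twitter.heron) and the setTopologyReliabilityMode and setTopologyStatefulCheckpointIntervalSecs setters; check the Config class of your release, and treat the interval value as purely illustrative.

```java
import org.apache.heron.api.Config;

public class DeliverySemanticsConfig {

  // Builds a per-topology Config that requests effectively-once processing.
  // The other reliability modes are ATMOST_ONCE and ATLEAST_ONCE.
  public static Config effectivelyOnce() {
    Config conf = new Config();
    conf.setTopologyReliabilityMode(Config.TopologyReliabilityMode.EFFECTIVELY_ONCE);

    // Effectively-once relies on periodic distributed state checkpoints; the
    // interval (seconds) bounds how far processing rolls back after a failure.
    // The value here is illustrative.
    conf.setTopologyStatefulCheckpointIntervalSecs(60);
    return conf;
  }
}
```

The resulting Config is then submitted together with the topology, for example via HeronSubmitter.submitTopology.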

Efficiency

A streaming system should be highly efficient: an inefficient system needs more resources than an efficient one to achieve the same throughput. Heron was designed with efficiency in mind from the start for large data processing environments. Based on detailed performance evaluation in production settings, Heron can deliver a throughput of 170 million tuples per second on 17 machines, each with 12 cores and 2 hyper-threads per core, i.e. 24 hardware threads per machine (roughly 10 million tuples per second per machine).

Scalability

Since the amount of data in the world is growing by an estimated 40% per year, enterprises need to have systems and solutions in place to accommodate this growth. Scaling a system requires fundamentally different thinking about system design and about how to use resources efficiently. Heron was designed for high scalability along multiple dimensions:

  • in the ability to run large streaming topologies with a large number of containers
  • in the ability to run a topology with a large number of components
  • in the ability to process large volumes of data

Support for mixed workloads

Heron topologies can be used for a highly diverse set of streaming applications, and the SLA requirements for those applications vary accordingly. A critical stock trading application, for example, might require latencies of a few milliseconds, while an application that computes Twitter trends might require high throughput and tolerate latencies of a few seconds. Heron provides a rich set of parameters for tuning performance to meet SLA requirements on a per-topology basis. Unlike systems that apply a single set of parameters to all processing jobs, Heron lets each topology be tuned independently, so changing the parameters for one topology will not affect other topologies. Several topologies with different SLA-derived parameters can thus easily co-exist in the same cluster.
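
To make per-topology tuning concrete, here is a hedged sketch of two configurations built with Heron’s Java Config API. The setter names (setNumStmgrs, setMaxSpoutPending) follow the Heron API but should be verified against your release, and the specific values are purely illustrative.

```java
import org.apache.heron.api.Config;

public class WorkloadTuning {

  // A latency-sensitive topology: keep few tuples in flight so queues stay
  // short, and spread the work across more containers (stream managers).
  public static Config lowLatencyConfig() {
    Config conf = new Config();
    conf.setNumStmgrs(4);
    conf.setMaxSpoutPending(100);
    return conf;
  }

  // A throughput-oriented topology: allow many tuples in flight and accept
  // higher per-tuple latency.
  public static Config highThroughputConfig() {
    Config conf = new Config();
    conf.setNumStmgrs(2);
    conf.setMaxSpoutPending(100_000);
    return conf;
  }
}
```

Because each topology is submitted with its own Config, the two sets of parameters never interfere with each other.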

Modular design and extensibility

The “Big Data” ecosystem is highly complex, and continuous innovation has led to a profusion of infrastructure components. In this continuously evolving ecosystem, we wanted to ensure that Heron can adapt while retaining its well-tested, battle-hardened, productionized core. Heron was designed with extensibility in mind, and pretty much any component can be replaced with a different implementation. For example, the state manager that provides distributed synchronization has multiple implementations, one for Apache BookKeeper and another for local filesystems. Similarly, the checkpoint manager for storing snapshots provides implementations for Hadoop and, again, for local filesystems.
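
To make the idea of swappable components concrete, the sketch below shows the general shape of such a plugin seam. The interface and class names are hypothetical illustrations, not Heron’s actual SPI; in Heron, the concrete implementation to load is chosen through configuration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names, NOT Heron's actual SPI; this only illustrates the
// pattern: the core codes against an interface, and the concrete
// implementation is selected through configuration.
interface CheckpointStore {
  void store(String topology, String checkpointId, byte[] state);
  byte[] fetch(String topology, String checkpointId);
}

// A trivial in-memory backend standing in for a "local" implementation; a
// production backend would write to a local filesystem, HDFS, and so on.
class InMemoryCheckpointStore implements CheckpointStore {
  private final Map<String, byte[]> data = new HashMap<>();

  public void store(String topology, String checkpointId, byte[] state) {
    data.put(topology + "/" + checkpointId, state);
  }

  public byte[] fetch(String topology, String checkpointId) {
    return data.get(topology + "/" + checkpointId);
  }
}

public class ExtensibilityDemo {
  public static void main(String[] args) {
    // In Heron the implementation class would come from configuration;
    // it is hard-coded here purely for illustration.
    CheckpointStore store = new InMemoryCheckpointStore();
    store.store("my-topology", "checkpoint-1", new byte[] {1, 2, 3});
    System.out.println(store.fetch("my-topology", "checkpoint-1").length); // 3
  }
}
```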

Operational maturity

Heron provides two tools for managing topologies: Heron Tracker and Heron UI. Heron Tracker provides REST API endpoints for querying the state of topologies and for fetching operational metrics; this API makes it easy to write custom tools that provide visibility into the state of topologies, and Heron UI itself is built on top of it. Furthermore, each topology emits a wide variety of metrics through a process called the metrics manager, which has several plugins for exporting those metrics to third-party monitoring systems.
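
As an example of what the Tracker’s REST API enables, the following sketch queries it over HTTP using plain Java. It assumes a Tracker reachable at localhost:8888 (its usual default port) and the /topologies endpoint, which lists the topologies known to the Tracker; adjust host, port, and path for your deployment.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TrackerQuery {
  public static void main(String[] args) throws Exception {
    // Query the (assumed) Heron Tracker endpoint that lists topologies.
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8888/topologies"))
        .GET()
        .build();
    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // JSON listing the known topologies
  }
}
```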

Diverse use cases

Heron is currently being used for a wide range of use cases across multiple organizations, such as Twitter, Google, Microsoft, and ndustrial.io. This broad set of use cases includes but is not limited to:

  • Real-time extract transform load (ETL) operations
  • Real-time business intelligence
  • Real-time product safety and spam detection
  • Real-time fraud and anomaly detection
  • Real-time trend discernment
  • Real-time machine learning
  • Real-time media
  • Real-time operations
  • Real-time log processing
  • Real-time model predictive control

Proven in production

Heron has been running in production at Twitter for more than three years. It currently processes around 250 billion events per day in production and is used by numerous internal services. In total, it has processed more than 275 trillion events to date, and it runs on thousands of machines powering several critical applications at Twitter.

Conclusion

Heron is a next-generation streaming system designed to address several shortcomings of other open source streaming systems. In this two-part series, we’ve highlighted several features of Heron that make it more than ready for enterprise-wide deployment. In this post, we summarized additional capabilities such as multiple processing semantics, scalability, efficiency, support for mixed workloads, operational maturity, and a modular design that helps in quickly adding new functionality.

Acknowledgements

We would like to acknowledge Viktor Klang for coining the term “effectively once.”

Karthik Ramasamy

Karthik is co-founder and CEO of Streamlio, a modern platform for connecting, processing and storing fast-moving data built on Apache Pulsar.