Why Apache Heron? Part I

Karthik Ramasamy
Oct 24, 2019

Apache Heron (incubating*) is a distributed stream processing engine developed at Twitter to address several shortcomings encountered when running Storm in production (Storm was open sourced by Twitter in 2011). Twitter announced Heron in June 2015 and open sourced it in May 2016. Heron is a battle-tested, production-ready, real-time stream processing engine that powers real-time use cases in several large organizations. Its architecture differs fundamentally from that of other streaming systems. In this first part of a two-part series, we highlight the key features of Heron and explain why Heron is the right choice for a variety of use cases.

In a previous post, we introduced several Heron concepts and terminology. To recap, a streaming job or application that you run in Heron is called a topology. A topology consists of spouts and bolts: spouts inject streams of data into the topology while bolts perform computations on that data. Now let’s take a deep dive into the features of Heron that set it apart from other streaming systems.
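To make the spout and bolt vocabulary concrete, here is a minimal sketch in plain Python. It deliberately does not use Heron's API: the function names (sentence_spout, split_bolt, count_bolt, run_topology) are hypothetical, and ordinary generators stand in for the streams that Heron would move between processes. It only illustrates the shape of a topology, with a spout feeding data into a chain of bolts.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Spout: injects a stream of tuples (here, sentences) into the topology."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Bolt: consumes sentences and emits one lowercase word per tuple."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str]) -> Counter:
    """Bolt: consumes words and keeps a count per word."""
    counts: Counter = Counter()
    for word in stream:
        counts[word] += 1
    return counts


def run_topology(sentences: Iterable[str]) -> Counter:
    """Wire spout -> split bolt -> count bolt, like edges in a topology graph."""
    return count_bolt(split_bolt(sentence_spout(sentences)))


# Illustrative input only; a real spout would read from a queue or a log.
word_counts = run_topology(["Heron runs topologies", "topologies have spouts and bolts"])
print(word_counts.most_common(1))
```

In a real deployment each spout and bolt would run as many parallel instances rather than as a single function call, which is what the rest of this post is about.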

Developer Productivity

Heron was developed to address several shortcomings of existing open source streaming systems. To begin with, Heron topologies are easy to troubleshoot and debug. When a streaming job (topology) misbehaves, whether because of faulty user code, failing hardware, or changes in load, it is important to determine the root cause quickly, since downtime can lead to lost revenue and a variety of other headaches. Heron provides UI and tracking tools for easily locating the logs of a particular instance of a spout or bolt and for identifying the sequence of events that led to the error. Furthermore, Heron topologies continuously monitor themselves and automatically notify the developer when something is not right. This helps developers quickly identify where the underlying issue lies and take timely action.

Multi-tenancy

Right from the start, Heron was designed to support multi-tenancy and thus to solve problems unique to larger organizations with many teams, divisions, and use cases. Heron topologies can run alongside other critical jobs in a shared cluster. Heron topologies are identified by team so that each team can manage its own topologies and enforce resource limits provided by the scheduler. Furthermore, Heron can manage multiple clusters within a datacenter and also across geographically dispersed datacenters.

Use of Containers

The advent of containers has revolutionized the way software is packaged, distributed, and run. A container is a lightweight, standalone, executable software package that includes code, runtime, system tools and libraries, and the settings needed to run all of the above. Right from the start, Heron was designed to run in containers and to support different containerization technologies, including Linux cgroups and Docker containers. In fact, a Heron topology is essentially a set of containers that can be scheduled by a scheduler (such as Kubernetes or Mesos/DCOS). This allows Heron to run in multi-tenant clusters, isolated from containers that belong to other Heron topologies or to other services. A Heron topology specifies the resources for each container in terms of cores, memory, and disk space, as well as the total number of containers needed. The scheduler allocates these resources during topology submission, before the topology is run.

Isolation

Each Heron topology can be composed of several spouts and bolts and each spout or bolt can have several instances. These instances are packed into containers that are run by the scheduler. Heron provides isolation at the topology level, the container level, and the instance level. Each topology is on its own and any effects of one topology do not affect the others. Similarly, containers that comprise each topology are independent and interact with each other using a well-defined protocol. Each container can be killed and restarted independently and the topology will automatically recover. Similarly, each instance of a spout or bolt does not interfere with other instances since they run inside their own processes. Each instance could die and come up again without affecting the others. Such isolation allows for easy debugging, granular performance modeling, and resiliency.
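As a rough illustration of how instances end up in containers, the sketch below places the instances of each spout and bolt into a fixed number of containers in round-robin order, then shows which instances are untouched when one container is killed. This is a simplified stand-in written for this post, not Heron's packing code: the function names and the example component names are hypothetical, and real packing also has to account for the resources each instance needs.

```python
from typing import Dict, List


def pack_round_robin(parallelism: Dict[str, int], num_containers: int) -> List[List[str]]:
    """Assign every spout/bolt instance to a container in round-robin order."""
    if num_containers < 1:
        raise ValueError("num_containers must be at least 1")
    containers: List[List[str]] = [[] for _ in range(num_containers)]
    slot = 0
    for component, count in parallelism.items():
        for index in range(count):
            containers[slot % num_containers].append(f"{component}-{index}")
            slot += 1
    return containers


def unaffected_instances(containers: List[List[str]], failed: int) -> List[str]:
    """Instances that keep running while container `failed` is restarted."""
    return [
        instance
        for container_id, instances in enumerate(containers)
        if container_id != failed
        for instance in instances
    ]


# Hypothetical topology: 2 spout instances, 3 + 2 bolt instances, 3 containers.
plan = pack_round_robin({"spout": 2, "splitter": 3, "counter": 2}, num_containers=3)
for container_id, instances in enumerate(plan):
    print(container_id, instances)
print(unaffected_instances(plan, failed=1))
```

Because each instance lives in exactly one container, restarting a container only interrupts the instances listed for it; the rest of the plan is unchanged.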

The use of containers and isolation is illustrated in Figure 1. In the figure, four nodes (machines) run containers of different topologies. Topology 1 runs in three containers, on Node 1, Node 2, and Node 3, whereas Topology 2 runs in four containers, one per node.

Figure 1. Topologies and their containers running in isolation when sharing machines

Multi-language API

Currently, Heron provides two fundamentally different APIs for defining stream processing logic: a procedural API and a functional API. Both of these APIs are available in two popular languages: Java and Python. The procedural API for Heron requires you to explicitly create spouts and bolts, while the functional API enables you to create stream processing logic in a style more reminiscent of functional programming, with operations like maps, flatMaps, filters, and the like.

The Java procedural API is compatible with Apache Storm while the Python procedural API is compatible with the popular Streamparse Python API for Storm from Parsely. The functional API is supported in Java and Python as a domain-specific language (DSL). One of the major advantages of Heron is that it’s able to support these new functional APIs due to its extensible architecture and due to its support for language-specific instances.
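To give a feel for the functional style, here is a small self-contained sketch in plain Python. The Stream class below is a hypothetical stand-in written for this post, not Heron's functional API; it only shows how map, flat_map, and filter operations chain together instead of being spelled out as explicit spouts and bolts.

```python
from typing import Callable, Generic, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Stream(Generic[T]):
    """Hypothetical lazy stream used for illustration; not part of Heron."""

    def __init__(self, source: Iterable[T]) -> None:
        self._source = source

    def map(self, fn: Callable[[T], U]) -> "Stream[U]":
        """Transform every element with fn."""
        return Stream(fn(item) for item in self._source)

    def flat_map(self, fn: Callable[[T], Iterable[U]]) -> "Stream[U]":
        """Replace every element with zero or more elements produced by fn."""
        return Stream(out for item in self._source for out in fn(item))

    def filter(self, predicate: Callable[[T], bool]) -> "Stream[T]":
        """Keep only the elements for which predicate returns True."""
        return Stream(item for item in self._source if predicate(item))

    def collect(self) -> List[T]:
        """Run the pipeline and gather the results into a list."""
        return list(self._source)


long_words = (
    Stream(["Heron runs topologies", "a topology has spouts and bolts"])
    .flat_map(str.split)               # sentence -> words
    .map(str.lower)                    # normalize case
    .filter(lambda word: len(word) > 3)
    .collect()
)
print(long_words)
```

The pipeline reads top to bottom as a description of what happens to the data, whereas the procedural style asks you to name each processing step as its own spout or bolt and wire them together yourself.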

Flexible Resource Allocation

Each Heron topology requires resources in the form of CPU, memory, and disk, as well as a network that topology components can use to communicate with one another. Heron provides multiple ways to express the resource needs of a topology. If you are interested in just getting the topology running, you can specify the number of containers and Heron will provide the rest in terms of defaults. If you want to use your resources more efficiently, however, you can allocate resources at the component level, where you can specify precisely how much CPU, memory, and disk space you’d like a spout or bolt instance to use.
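The two ways of expressing resource needs can be sketched as simple arithmetic. The Python below is illustrative only: the Resources type, the function names, and in particular the DEFAULT_PER_CONTAINER values are assumptions made for this example and are not Heron's actual defaults or configuration keys.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Resources:
    cpu: float    # cores
    ram_mb: int   # memory in MB
    disk_mb: int  # disk space in MB

    def __add__(self, other: "Resources") -> "Resources":
        return Resources(
            self.cpu + other.cpu,
            self.ram_mb + other.ram_mb,
            self.disk_mb + other.disk_mb,
        )

    def scaled(self, factor: int) -> "Resources":
        return Resources(self.cpu * factor, self.ram_mb * factor, self.disk_mb * factor)


# Hypothetical per-container default, chosen only for this example.
DEFAULT_PER_CONTAINER = Resources(cpu=1.0, ram_mb=1024, disk_mb=2048)


def simple_request(num_containers: int) -> Resources:
    """Coarse mode: give only a container count and rely on defaults."""
    return DEFAULT_PER_CONTAINER.scaled(num_containers)


def component_level_request(components: Dict[str, Tuple[int, Resources]]) -> Resources:
    """Fine-grained mode: sum (parallelism x per-instance resources) per component."""
    total = Resources(cpu=0.0, ram_mb=0, disk_mb=0)
    for parallelism, per_instance in components.values():
        total = total + per_instance.scaled(parallelism)
    return total


coarse = simple_request(num_containers=3)
fine = component_level_request({
    "spout": (2, Resources(cpu=0.5, ram_mb=512, disk_mb=256)),
    "bolt": (4, Resources(cpu=1.0, ram_mb=1024, disk_mb=512)),
})
print(coarse)
print(fine)
```

The coarse request is quicker to write, while the component-level request lets a lightweight spout ask for less than a heavy bolt, which is where the efficiency gain comes from.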

Flexible Deployment

Heron can be run in two modes: local mode and cluster mode. In cluster mode, Heron can co-exist with other frameworks in an existing cluster. Alternatively, you can deploy Heron in local mode in a dedicated cluster. For such deployments, a scheduler needs to be installed. Heron supports multiple schedulers.

Adding a new scheduler to Heron is very easy due to the modular design and architecture. Furthermore, Heron can be deployed in server mode or serverless mode.

  • Server mode will run an API server that is used for submitting, killing, activating, deactivating, and updating topologies. The API server manages all configuration and runs as yet another job in the scheduler.
  • Serverless mode does not require the API server. Instead, the Heron client on the user’s machine manages the configuration and submits the job directly to the underlying scheduler.

Conclusion

Heron is a next-generation stream processing system that uses a fundamentally novel architecture to address the shortcomings of several open source systems. In this blog post, we discussed in detail some key differentiators for Heron: developer productivity, multi-tenancy, use of containers at run time, isolation, the multi-language API, a flexible resource allocation model, and flexible deployment. In the next blog post, we’ll examine other features in detail, such as processing semantics, efficiency, scalability, and many more.

If you’re interested in Heron, you’ll want to participate in:

And of course for more information about the Heron project in general, go to heron.apache.org and follow the project on Twitter @heronstreaming.

*Apache Heron is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by Apache Incubator PMC. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Karthik Ramasamy

Karthik is co-founder and CEO of Streamlio, a modern platform for connecting, processing and storing fast-moving data built on Apache Pulsar.