Choosing a Reactive Framework for the JVM

Rebecca Relyea
priceline labs
Jun 10, 2019 · 10 min read

The JVM community is experiencing a Cambrian explosion of reactive frameworks. So how do you know if you need a reactive framework at all? And if you do, which reactive framework? At priceline, we faced these questions recently in three different teams and each team chose a different reactive framework. This blog post will convey the information our teams used to make their own choices. There are many excellent articles explaining what a reactive framework or system is, and we suggest you read one or more of them if you want to brush up on the fundamental definitions.

Priceline’s team building a new booking engine needed excellent resiliency and support for an event-driven domain model with a lot of shared state. For example, a vacation package booking might include a flight, a couple of hotel reservations, and a car reservation. Reserving each of these items requires one or more requests to a different supplier with a lot of variability between the supplier APIs. Some supplier APIs are synchronous, some asynchronous, some are fast, and some are slow. The entire booking process for one vacation package can actually span multiple days so the shared state must be maintained and survive server reboots during that time.

Priceline’s team building an API to serve static content for hotels knew that the Hotel Content API would be IO-intensive, need to combine multiple streams of data, and would have little to no shared state. The Hotel Content API team prioritized low latency and high scalability. The team had a lot of experience with Spring and wanted to leverage Spring for a quick delivery on the initial release.

Priceline’s team building a Hotel Pricing API had very similar requirements to the Hotel Content API team above, but the Hotel Pricing team wanted to use Scala and was not as eager to use Spring as the Hotel Content team was.

All teams saw debate and confusion in various blog posts and experience reports about what benefits and costs to expect when using a reactive paradigm. It helped us to understand exactly what problems the different reactive frameworks were designed to solve and to take a look at the experience reports of other systems that took the reactive approach. So let’s put some context around the reactive framework choice by briefly reviewing a timeline of the challenges and solutions that brought the reactive frameworks to where they are today.

1998 — Ericsson uses the Actor Model for unprecedented uptime

Ericsson achieved an impressive 99.9999999% uptime using the Actor Model (designed by Carl Hewitt in 1973) as implemented in Erlang. This implementation, running on the Erlang virtual machine, BEAM, is widely considered to be the first popular, successful reactive framework. It was developed to support the Ericsson phone network, where system resiliency was critical.

1999 — Dan Kegel sets new capacity goals for web servers

Dan Kegel coins the “C10K Problem.” Kegel’s complaint was that web servers running Apache were falling over when the number of simultaneous connections approached 10,000. He was looking for better scalability. Linux developers noted a few things about the way the kernel handled networking at the time that contributed to this problem:

  1. Apache was using one thread (or OS process) per network socket connection (file descriptor). For each packet that came in, the kernel had to inspect every process to see where to route the packet. Also, constantly loading and unloading threads burns a lot of CPU on context switching.
  2. Another networking API (select/poll) was available to multiplex connections over fewer threads but wasn’t much better because for each packet that came in, the kernel still had to inspect every file descriptor.

In both cases, the software was using a thread scheduler as a packet scheduler, which would never deliver the scale modern applications demand.

2002 — Linux releases the epoll API

The 2.5.44 release of Linux included a new API: epoll. The epoll API delivered effectively constant socket look-up time. If networking software used epoll and multiplexed connections across at most a handful of threads (as opposed to one thread per request), it could achieve significantly better resource utilization and comfortably handle 10K simultaneous connections on a single server. This solution reduced the latency of packet routing within Linux, which enabled better scalability of open connections.
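
To make that concrete, here is a minimal sketch of the multiplexing idea on the JVM, using a java.nio Selector (which sits on top of epoll on Linux). The port and buffer size are arbitrary; the point is that a single thread services every connection.

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class MultiplexedEchoServer {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();                  // backed by epoll on Linux
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));             // arbitrary port for this sketch
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                                // wakes up only when some channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    if (client != null) {
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    }
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(256);
                    int read = client.read(buffer);
                    if (read == -1) {
                        client.close();                       // peer hung up; the key is cancelled on close
                    } else {
                        buffer.flip();
                        client.write(buffer);                 // echo the bytes back
                    }
                }
            }
        }
    }
}
```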

2004 — First web servers built on top of epoll

NGINX, a non-blocking web server, and Netty, a non-blocking network application framework for the JVM, release their first versions, both built on top of epoll. Their main goal was to improve scalability.

2009 — Node.js is released

Node.js, an asynchronous, event-driven JavaScript runtime, releases its first version, also built on top of epoll. The designers of Node.js decided to use a single thread for the main event loop, which reduced the need for locks within applications built for it. The reduction in lock usage is a major factor contributing to its reputation for low latency. The use of epoll is a major factor contributing to its scalability.

2010 — Akka is released

Akka, a toolkit for building highly concurrent, distributed, and resilient message-driven applications in Java and Scala, is first released. With its use of the Actor Model, similar to Erlang's above, Akka aimed at horizontal scalability and resilience. Akka actors also aimed to support event-driven domains.

2011 — LMAX processes 6 million TPS on 1 thread and Vert.x is released

LMAX builds a proprietary reactive system for trade processing and claims it processes 6 million transactions per second on a single thread. They prioritized low latency and used special lock-free data structures as well as an asynchronous design to model their event-driven domain.

Vert.x, a polyglot toolkit for building reactive applications on the JVM, is first released.

2012 — New reactive APIs for JVM and WhatsApp’s success with Erlang

RxJava, a library for composing asynchronous and event-based programs, is first released. RxJava was aiming for scalability and support for event-driven domains.

Non-blocking Java Servlet API is released.

WhatsApp handles 2.8 million simultaneous connections per host using Erlang actors. Their non-blocking design aimed to support their growth by scaling out more cost effectively and improving resiliency.

2016 — Netflix goes reactive on the JVM

Netflix’s Zuul goes reactive. The designers expected, and observed, scalability improvements, particularly with respect to persistent connections. They had hoped to see other performance improvements but only observed efficiency gains in systems that were I/O-bound, not in systems that were CPU-bound.

Reactive Themes

You have probably noticed the emphasis I placed above on four themes: scale, latency, resilience, and event-driven domains. Let’s take a look at how a reactive approach addresses each of these themes:

Scale

Reducing the total number of threads your application uses frees up memory within your JVM (recall that each JVM thread reserves roughly 1MB of stack by default) and in the OS. It also reduces the context-switching burden on the CPU. If your aim is to handle more than 10 thousand simultaneous requests per server, using a reactive paradigm that leverages epoll will be critical. Applications with persistent front-end HTTP connections (WebSockets) often find themselves wrestling with this challenge.

Latency

Reducing the total number of threads and leveraging epoll can help with latency, but success in reducing latency often requires careful attention to lock usage as well. The team at LMAX originally started with an actor model for their system but found that they needed to switch to entirely lock-free data structures to meet their performance requirements. Similarly, Netflix did not see significant latency improvements merely from migrating Zuul to a reactive framework based on Netty. So a reactive framework can help reduce latency, but additional, significant tuning effort will likely be required.

Resilience

Many reactive frameworks make back-pressure a first-class citizen in their APIs. Often, applications that are dealing with “big data,” “fast data,” or real-time data streams struggle with resilience and benefit from explicitly handling back-pressure. Akka can even transparently transfer requests to other nodes in a cluster (they call this location transparency) when one node is overwhelmed. It can also grow and shrink the cluster on demand.
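
As a small illustration of back-pressure being first class in an API, the Project Reactor sketch below (timings are arbitrary) declares exactly what should happen when a fast producer outruns a slow consumer:

```java
import java.time.Duration;
import reactor.core.publisher.Flux;

public class BackpressureStrategies {
    public static void main(String[] args) throws InterruptedException {
        Flux.interval(Duration.ofMillis(1))                  // fast producer: one tick per millisecond
                .onBackpressureDrop(tick ->                  // explicit policy when downstream demand runs out
                        System.out.println("no demand, dropping tick " + tick))
                .concatMap(BackpressureStrategies::slowCall, 1) // slow, demand-driven consumer
                .subscribe(System.out::println);

        Thread.sleep(2_000);                                 // keep the demo JVM alive briefly
    }

    // Simulates a slow downstream I/O call (~20 ms per element)
    private static Flux<String> slowCall(Long tick) {
        return Flux.just("processed tick " + tick).delayElements(Duration.ofMillis(20));
    }
}
```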

Event-Driven Domains

If you think the most natural representation of your domain is one that will be dominated by events, you may find a reactive framework to be a good fit. A highly interactive front end, with a constant interplay of events streaming between producers and consumers, may nudge your back-end design toward an event-driven model. At priceline, much of what happens to a given booking in the real world happens asynchronously, and this was one of the principal reasons we chose Akka as the reactive framework for our booking engine.

Free Lunch?

Everyone wants scalability, low latency, and resilience, so should any and all applications go reactive? Well, there are some drawbacks to consider before diving in. We’ll briefly mention here that when you begin multiplexing requests across only a handful of threads, race conditions can crop up. There may be more shared state that requires synchronization, and your call stacks will not be as helpful. Thread-local variables can no longer be relied on to carry a request context, and of the frameworks we use at priceline, only Project Reactor provides a first-class alternative. The Netflix Zuul post covered the drawbacks well and in greater detail, so if you want a deeper dive into the drawbacks, that’s a good place to turn.
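
To illustrate that alternative, recent versions of Reactor let you attach a request context to the subscription itself instead of to a thread; the key and value below are made up:

```java
import reactor.core.publisher.Mono;

public class RequestContextSketch {
    public static void main(String[] args) {
        // Read the context downstream, where a thread-local would have been used before
        Mono<String> handler = Mono.deferContextual(ctx ->
                Mono.just("handling request " + ctx.get("requestId")));

        handler
                .contextWrite(ctx -> ctx.put("requestId", "req-42")) // attach context at subscription time
                .subscribe(System.out::println);
    }
}
```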

As always, context matters. If your scale is already bottlenecked by some other part of your system, maybe a database or a really big heap, going reactive won’t change much.

Examples

This article is a tour — not a tutorial. We will show you different features from the most common reactive styles to help you determine which style would be a good fit for your project. We will show you some of the nicer features of Project Reactor and Akka since those are the two reactive frameworks we use most at priceline. We’ll see how both offer support for asynchronous streams and how Akka supports the actor model.

Streams or Actors?

I always advise teams choosing between reactive frameworks to first try a framework based on streams. Streams are type-safe, they compose easily, and they make the data-flow within an application easy to see. However, if an application has a lot of shared state to manage or if the state of the data-flow needs to be persisted, Actors may be a better fit. The priceline booking engine had a lot of shared state and persistence challenges, so we used Actors for the booking engine. Another article on the details of the booking engine implementation will follow.

Streams — Project Reactor

The Hotel Content API chose Project Reactor as its reactive framework. Project Reactor describes itself as “A fourth-generation Reactive library for building non-blocking applications on the JVM based on the Reactive Streams Specification.” At priceline, we really like its seamless integration with Spring and its support for a request context. The ramp-up time is also relatively low, particularly if you’re a Java developer who has been using Java Streams for a while, because the two APIs look similar:
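
Here is the same small pipeline written twice as a sketch of that similarity, first with plain Java Streams and then with Reactor’s Flux (the hotel names are placeholder data):

```java
import java.util.List;
import java.util.stream.Collectors;
import reactor.core.publisher.Flux;

public class StreamsVsReactor {
    public static void main(String[] args) {
        List<String> hotels = List.of("Aria", "Bellagio", "Wynn");

        // Plain Java Streams: synchronous, pull-based
        List<String> javaStreamResult = hotels.stream()
                .map(String::toUpperCase)
                .filter(name -> name.startsWith("A"))
                .collect(Collectors.toList());
        System.out.println(javaStreamResult);

        // Project Reactor: the same operators, but asynchronous and demand-driven
        Flux.fromIterable(hotels)
                .map(String::toUpperCase)
                .filter(name -> name.startsWith("A"))
                .subscribe(System.out::println);
    }
}
```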

However, with reactive streams, there is throttling, back-pressure, and a more flexible lifecycle that allows multiple subscribers to come and go. Here’s what back-pressure can look like in action when consuming from RabbitMQ:
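
A rough sketch of this using the reactor-rabbitmq library (the queue name is made up); the prefetch of 50 tells the broker to keep at most 50 unacknowledged messages in flight:

```java
import reactor.core.publisher.Flux;
import reactor.rabbitmq.AcknowledgableDelivery;
import reactor.rabbitmq.ConsumeOptions;
import reactor.rabbitmq.RabbitFlux;
import reactor.rabbitmq.Receiver;

public class RabbitBackpressure {
    public static void main(String[] args) {
        Receiver receiver = RabbitFlux.createReceiver();

        // qos(50): the broker keeps at most 50 unacknowledged messages in flight
        Flux<AcknowledgableDelivery> deliveries =
                receiver.consumeManualAck("hotel.updates", new ConsumeOptions().qos(50));

        deliveries.subscribe(delivery -> {
            handle(new String(delivery.getBody()));
            delivery.ack(); // acknowledging frees a slot, letting the broker send the next message
        });
    }

    private static void handle(String body) {
        System.out.println("processed: " + body);
    }
}
```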

This little example shows the RabbitMQ consumer taking at most 50 unacknowledged messages at a time. The broker will not send more until the subscriber has acknowledged messages from the current batch of 50.

Streams — Akka

Akka also has a streaming API, and this is what the Hotel Pricing API team chose as a reactive framework. In our experience, it takes a bit longer to come up to speed on Akka streams than on Reactor streams because you’ll need to learn a bit about actors, which are the underpinnings of the Akka streaming API. Here’s the classic factorial sequence taken from the Akka Streams tutorial:
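
A sketch along those lines, adapted here to the Java DSL with an explicit buffer and throttle so the back-pressure stands out:

```java
import java.math.BigInteger;
import java.time.Duration;

import akka.NotUsed;
import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.OverflowStrategy;
import akka.stream.javadsl.Source;

public class FactorialStream {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("factorials");
        ActorMaterializer materializer = ActorMaterializer.create(system);

        Source<Integer, NotUsed> numbers = Source.range(1, 100);

        // scan emits the running factorial as each number arrives
        Source<BigInteger, NotUsed> factorials =
                numbers.scan(BigInteger.ONE, (acc, next) -> acc.multiply(BigInteger.valueOf(next)));

        factorials
                .buffer(10, OverflowStrategy.backpressure()) // hold at most 10 elements, then back-pressure upstream
                .throttle(1, Duration.ofSeconds(1))          // downstream accepts only one element per second
                .runForeach(System.out::println, materializer);
    }
}
```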

Note the buffer and throttle calls near the end of the pipeline; they are the explicit back-pressure controls.

Actors — Akka

Streams are ideal for describing the flow and transformation of data through a system. If someone is looking for a reactive framework, I usually recommend streams as the first approach. However, there are some features supplied by Actors that can be quite useful and that streams do not (yet) provide. It was these features that led the booking engine team to choose Akka Actors as its reactive framework.

The first benefit of actors is the concurrency guarantee they provide for shared, mutable state. Let’s take a look at a diagram to understand that a little more:

Actor Hierarchy and Mailboxes

Every actor is a Java or Scala object with its own mailbox. The Akka framework guarantees that, at any given time, exactly one thread will take a message out of an actor’s mailbox and invoke the actor’s message handler with it. Because of this guarantee, you can put mutable state inside an actor and never again have to worry whether you have handled the synchronization correctly. This is a very powerful and simple guarantee that you do not get with streams. If you have shared, mutable state in your streaming API, you have to look for tools outside of streams to make it thread-safe. Let’s look at a code sample:
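
Here is a minimal sketch of a classic (untyped) actor holding mutable state; the BookingActor and its ReserveRoom message are made up for illustration:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class BookingActor extends AbstractActor {

    // Mutable state lives safely inside the actor: Akka guarantees only one
    // thread at a time processes a message from this actor's mailbox.
    private int roomsReserved = 0;

    public static Props props() {
        return Props.create(BookingActor.class, BookingActor::new);
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(ReserveRoom.class, msg -> {
                    roomsReserved += msg.count;
                    getSender().tell("reserved " + roomsReserved + " rooms so far", getSelf());
                })
                .matchAny(other -> System.out.println("unhandled message: " + other))
                .build();
    }

    // Hypothetical message type
    public static final class ReserveRoom {
        final int count;
        public ReserveRoom(int count) { this.count = count; }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("booking");
        ActorRef booking = system.actorOf(BookingActor.props(), "booking-actor");
        booking.tell(new ReserveRoom(2), ActorRef.noSender());
    }
}
```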

As you can see from the example above, and unlike streams, Akka Actors are effectively dynamically typed since an actor can receive a message of any type. Actors also do not compose the way that streams do.

The second big benefit of Akka Actors is the extensive support they come with for persistence. Given that actors are really meant to encapsulate state, it’s not surprising that it is often useful to save that state. Akka Actors have storage-agnostic APIs to persist themselves, and there are many integrations available.
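
As a rough sketch of the shape of those APIs (the command and event types are made up), a persistent actor writes events to a journal and replays them on restart to rebuild its state:

```java
import akka.persistence.AbstractPersistentActor;

public class BookingLedger extends AbstractPersistentActor {

    private int itemsBooked = 0; // state rebuilt from the journal on restart

    @Override
    public String persistenceId() {
        return "booking-ledger-1"; // hypothetical id
    }

    @Override
    public Receive createReceiveRecover() {
        // Replayed events rebuild the in-memory state
        return receiveBuilder()
                .match(ItemBooked.class, evt -> itemsBooked += 1)
                .build();
    }

    @Override
    public Receive createReceive() {
        // Commands are validated, then persisted as events before state changes
        return receiveBuilder()
                .match(BookItem.class, cmd ->
                        persist(new ItemBooked(cmd.itemId), evt -> {
                            itemsBooked += 1;
                            getSender().tell("booked items: " + itemsBooked, getSelf());
                        }))
                .build();
    }

    // Hypothetical command and event types
    public static final class BookItem {
        final String itemId;
        public BookItem(String itemId) { this.itemId = itemId; }
    }

    public static final class ItemBooked implements java.io.Serializable {
        final String itemId;
        public ItemBooked(String itemId) { this.itemId = itemId; }
    }
}
```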

Akka also supports location-transparent clustered actors and has a nice implementation of finite state machines.

Summary

Adoption of a reactive programming style has been increasing since the late 1990s and shows no signs of slowing down. In fact, as cloud adoption makes pricing based on utilization a reality, we have yet another reason to be efficient with our hardware usage. Here is a table comparing the three frameworks priceline uses, drawing on the points above, that you may find helpful in making your own choice.
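
Framework         Style     Typing                   Notable strengths
Project Reactor   Streams   Type-safe, composable    Spring integration, request context support, quick ramp-up for Java developers
Akka Streams      Streams   Type-safe, composable    Built-in back-pressure; ramp-up includes learning actor basics
Akka Actors       Actors    Effectively dynamic      Safe shared mutable state, persistence, clustering with location transparency, FSMs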
