Monitoring an Akka Streams graph

PriceRunner
Jul 8, 2020

Background

We have multiple applications that consume messages from Kafka and process them with Akka Streams as the underlying streaming framework. Some of these applications are rather simple in their implementation, while others have many stream operators (both asynchronous and synchronous). Some operators are trivial, perhaps only applying a filter, while others are more complex and may involve integration with underlying applications. As a consequence, the individual operators add anywhere from <1 ms to >100 ms to the overall processing time.

The reason we find it valuable to monitor our Akka Streams graphs is that it helps us discover potential bottlenecks. Measuring only the overall processing time or throughput is not enough to detect bottlenecks; we need a more fine-grained view of the stream.

So, to solve the problem of discovering bottlenecks in an Akka Streams graph, we first have to answer the question “what is a bottleneck?” and then figure out how to detect it. I will present a solution that my team has come up with and that we are currently evaluating.

Where are the bottlenecks?

Akka Streams has built-in support for backpressure. This means that downstream operators in a stream signal upstream that they are not able to process objects at the same rate as upstream operators make them available. An obvious bottleneck is wherever a stream operator applies backpressure upstream; at that point, the operator is clearly not able to cope with the load it is being exposed to.

The next question to answer was “how do we identify where backpressure is applied?”. We began by defining two metrics:

  • Upstream latency
    The time from when the downstream operator requests a new object until the upstream operator supplies it.
  • Downstream latency
    The time from when the upstream operator pushes a new object downstream until the downstream operator requests another one.

An additional, supporting metric is throughput (objects passing through a cross section of the stream per unit of time). The idea behind these metrics is to detect where backpressure is applied. The theory is that if downstream latency is high (i.e. the downstream operator takes a long time to process an object before requesting a new one from the upstream operator) while upstream latency is low (i.e. the upstream operator is ready to supply new objects as soon as they are requested), we have detected backpressure, and the downstream operator may be a candidate for investigation and performance tuning. A high downstream latency must of course be tracked throughout the entire stream, since the latency might be caused by an operator further down the stream. It is paramount to track down the source of the high latency.
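As a toy illustration of how these two metrics are read together (all timestamps below are made up), a low upstream latency combined with a high downstream latency points at the downstream operator as the bottleneck:

```scala
// Hypothetical timestamps, in nanoseconds, recorded at one cross section of the stream.
val lastPull = 1000000L   // t = 1.0 ms: downstream requested a new object
val lastPush = 1100000L   // t = 1.1 ms: upstream supplied it almost immediately
val nextPull = 51100000L  // t = 51.1 ms: downstream asked for the next object

val upstreamLatency   = lastPush - lastPull // 0.1 ms: upstream keeps up easily
val downstreamLatency = nextPull - lastPush // 50 ms: the downstream operator is the suspect
```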

Akka Streams implementation

So, how did we go about implementing the measuring of these metrics in the Akka Streams graph? We created a custom GraphStage that we call an ObservationPoint. The GraphStage API is Akka’s abstraction layer over the key concepts of the Reactive Streams specification. An ObservationPoint is meant to be positioned between two stream operators, as depicted below.

Observation point between two operators that are individually parallelized.

I will not present the code in its entirety here, because I believe the concept and idea are more interesting, more prone to errors and hence more relevant to discuss; the code itself is rather simple. A custom GraphStage in Akka Streams works as shown in the picture below.

The solid square represents our ObservationPoint, which is implemented as a custom GraphStage.

The logic:

  1. onPush is invoked on the ObservationPoint when the upstream operator pushes a new object.
  2. The object is immediately pushed further downstream and the timestamp of the last push is recorded.
  3. onPull is invoked on the ObservationPoint when the downstream operator is ready to accept a new object.
  4. The pull request is forwarded upstream, a new object is pulled from upstream, and the timestamp of the last pull is recorded.
  • Upstream latency calculation: lastPush - lastPull
  • Downstream latency calculation: lastPull - lastPush
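A minimal sketch of what such a stage can look like, assuming a hypothetical StreamMetrics recorder (the trait and its method names are made up for illustration and are not the actual production code):

```scala
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

// Hypothetical sink for the recorded measurements.
trait StreamMetrics {
  def recordUpstreamLatency(nanos: Long): Unit
  def recordDownstreamLatency(nanos: Long): Unit
  def incrementThroughput(): Unit
}

class ObservationPoint[T](metrics: StreamMetrics) extends GraphStage[FlowShape[T, T]] {
  val in: Inlet[T]   = Inlet("ObservationPoint.in")
  val out: Outlet[T] = Outlet("ObservationPoint.out")
  override val shape: FlowShape[T, T] = FlowShape(in, out)

  override def createLogic(attributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      private var lastPull = 0L
      private var lastPush = 0L

      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          lastPush = System.nanoTime()
          // Time from the last pull to this push = upstream latency.
          if (lastPull != 0L) metrics.recordUpstreamLatency(lastPush - lastPull)
          metrics.incrementThroughput()
          push(out, grab(in)) // pass the object straight through
        }
      })

      setHandler(out, new OutHandler {
        override def onPull(): Unit = {
          lastPull = System.nanoTime()
          // Time from the last push to this pull = downstream latency.
          if (lastPush != 0L) metrics.recordDownstreamLatency(lastPull - lastPush)
          pull(in) // forward the demand upstream
        }
      })
    }
}
```

Note that the stage adds no buffering of its own: every push and pull is passed straight through, so the measured latencies belong to the neighbouring operators rather than to the observation point itself.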

We also record the number of objects that flow through the ObservationPoint in order to measure throughput. By positioning a number of observation points throughout the stream, we are able to diagnose it. The resulting code could look like this:

Two observation points (A & B) positioned between the business logic flows.
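A sketch of that wiring, building on the ObservationPoint above and using stand-in business-logic flows and a print-based StreamMetrics (both made up for illustration):

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

object MonitoredStream extends App {
  implicit val system: ActorSystem = ActorSystem("monitored-stream")

  // Stand-ins for the real business-logic flows, each individually parallelized.
  val businessLogicA = Flow[Int].map(_ * 2).async
  val businessLogicB = Flow[Int].map(_ + 1).async

  // A trivial StreamMetrics that just prints; a real one would feed a metrics registry.
  def printing(name: String): StreamMetrics = new StreamMetrics {
    def recordUpstreamLatency(nanos: Long): Unit   = println(s"$name upstream: $nanos ns")
    def recordDownstreamLatency(nanos: Long): Unit = println(s"$name downstream: $nanos ns")
    def incrementThroughput(): Unit                = ()
  }

  Source(1 to 100)
    .via(businessLogicA)
    .via(new ObservationPoint[Int](printing("A"))) // observation point A
    .via(businessLogicB)
    .via(new ObservationPoint[Int](printing("B"))) // observation point B
    .runWith(Sink.ignore)
}
```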

Johan Wiström
Backend tech lead
