Running a Lagom microservice on Akka Cluster with split-brain resolver

StashAway Tech Team
StashAway Engineering
Dec 2, 2017

Akka Cluster split-brain problem solutions — without ConductR

TL;DR

The split-brain problem, a state in which a cluster breaks apart into isolated network partitions, can be addressed with readily available open-source solutions for Akka Cluster. Depending on your cluster configuration, you might be able to find a workable solution without too much effort.

https://github.com/mbilski/akka-reasonable-downing worked reasonably well for our use-case (pun intended).

Introduction

At StashAway we run two separate stacks: a Frontend stack based on Node.js and React, and a Backend stack based on Lightbend’s microservice framework Lagom, which promotes Event Sourcing and CQRS (https://www.lagomframework.com/).

Our Backend services are mostly offline, meaning they connect to banks and brokers and process data largely in batch style. However, some of the microservices that handle our customer portfolio data (e.g. to show recent transactions and holdings inside a portfolio) are proxied straight through our Frontend stack and are therefore critical to the uptime of our app.

To ensure high availability of those services (uptime, no-downtime deployments and scalability), we have long wanted to upgrade them to a cluster setup. Lagom natively runs on Akka Cluster, which conveniently provides such a setup out of the box.
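For readers less familiar with Akka Cluster, the setup essentially boils down to a handful of settings. The sketch below shows what a static three-node cluster configuration looks like in plain Akka; the host names are placeholders we made up for illustration, and port 2551 matches the examples later in this post. In a Lagom service most of this is wired up for you.

import com.typesafe.config.ConfigFactory

// Minimal sketch of a static three-node Akka Cluster configuration.
// Host names are illustrative placeholders; the actor system name and
// port 2551 match the examples used later in this post.
val clusterConfig = ConfigFactory.parseString(
  """
  akka {
    actor.provider = "cluster"
    remote.netty.tcp {
      hostname = "node-1"
      port     = 2551
    }
    cluster.seed-nodes = [
      "akka.tcp://PortfolioService@node-1:2551",
      "akka.tcp://PortfolioService@node-2:2551",
      "akka.tcp://PortfolioService@node-3:2551"
    ]
  }
  """)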

We are, however, not using ConductR (https://conductr.lightbend.com/), Lightbend’s commercial product that helps run Lagom in production. One of the key features ConductR offers is a resolver for the so-called “split-brain problem”.

The split-brain problem

The split-brain problem is a prevalent problem in cluster architectures: it describes a system that has been broken apart into two or more network partitions that are unable to communicate with each other.

The resulting worst-case scenario is that those partitions each form their own cluster and keep running in parallel, leading to duplicated singletons and race conditions when writing data, which can ultimately cause inconsistencies in the event log.

Lagom uses Akka Persistence as the underlying persistence layer, which handles the sharding of persistent entities across the cluster. In simplified terms, an entity’s ID is hashed to a shard, and the shards are distributed across the available cluster nodes so that the computation is spread horizontally. In a split-brain scenario, the same entity might receive commands on Node 1 and Node 2 in parallel and emit events that leave the entity in an inconsistent state.
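To make this concrete, here is a simplified sketch of how an entity ID ends up on a shard, modeled on Akka’s default HashCodeMessageExtractor; the shard count below is an arbitrary number chosen for illustration.

// Simplified sketch of how Cluster Sharding maps an entity to a shard,
// modeled on Akka's default HashCodeMessageExtractor. Shards are then
// allocated to the available cluster nodes.
val maxNumberOfShards = 100

def shardId(entityId: String): String =
  (math.abs(entityId.hashCode) % maxNumberOfShards).toString

// If two network partitions both end up hosting the shard of the same
// entity, that entity can receive commands on two nodes in parallel --
// exactly the inconsistency described above.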

As consistency is one of the primary objectives of our Trading system, such a split-brain scenario must be avoided at all costs.

Strategies for split-brain resolution

When you read about split-brain resolution strategies, the usual recommendation is to resolve a split-brain situation either manually or automatically. Looking at our requirements, it is pretty obvious that manual resolution would not satisfy our uptime and consistency constraints.

Akka offers a feature for automatic resolution called “auto-downing”, but it should never be used in production (it is not clear to us why it is even available). Auto-downing lets you configure a timeout after which an unreachable node is “downed”, i.e. removed from the cluster. The problem is that in a network partition scenario you cannot actually tell the unreachable node to shut down; you merely remove it from one side of the partition. The other side of the partition does exactly the same, and voilà: you have two “healthy” clusters that continue to run in split-brain mode.
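For the record, this is the single configuration switch that enables auto-downing; we show it only to make the warning concrete, not as a recommendation.

import com.typesafe.config.ConfigFactory

// The auto-downing setting the paragraph above warns against.
// Shown for illustration only -- do not use this in production.
val autoDowningConfig = ConfigFactory.parseString(
  """
  # After 10s of unreachability the leader marks the node as "down".
  # In a network partition BOTH sides do this, and you end up with
  # two independent clusters.
  akka.cluster.auto-down-unreachable-after = 10s
  """)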

For a cluster with a static number of nodes there is, however, an easy way out: you define a quorum, in other words the number of nodes that constitutes a “healthy” cluster. In a cluster of 3 nodes the quorum is 2; in a cluster of 5 nodes it is 3, and so on. Each side of the network partition can then decide whether or not it “owns” the majority of the nodes, and shut itself down if it does not.
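A minimal sketch of that decision, assuming the total cluster size is fixed and known up front:

// Quorum decision for a cluster with a statically known number of nodes.
// Each partition counts the members it can still reach and compares the
// count against the quorum; the minority side shuts itself down.
val totalNodes = 3
val quorum     = totalNodes / 2 + 1 // 2 for 3 nodes, 3 for 5 nodes, ...

def shouldShutDown(reachableMembers: Int): Boolean =
  reachableMembers < quorum

// In a 2-vs-1 split, the side that still reaches 2 members stays up and
// the side that only reaches itself terminates.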

Reinventing the wheel?

It would be possible to write this logic ourselves, but since we are running a production system rather than a research project, we want to spend our time as efficiently as possible. We therefore investigated the first two open-source solutions that a quick Google search reveals:

https://github.com/TanUkkii007/akka-cluster-custom-downing/
This library explains nicely the different approaches one can take towards split-brain resolution. However, it comes with a lot of code that we don’t really need, and it took us a bit too much time to reason about the correctness of that code.

https://github.com/mbilski/akka-reasonable-downing
I already like the name of this library :) It does only one thing, and it does it very well: implement the quorum-based shutdown of a minority cluster. The relevant code is contained in a single class and is very easy to comprehend. We were, however, worried about one particular line of code:

} else if (cluster.state.unreachable.nonEmpty && isLeader) {

Pay attention to the second part of the if-condition, the check whether or not the node is a leader. Let’s first take a look at how Akka leader election works.

Akka Cluster leader election

Akka uses a gossip convergence mechanism to disperse information about the cluster topology among its nodes. From the docs (https://doc.akka.io/docs/akka/current/common/cluster.html?language=scala#leader):

After gossip convergence a leader for the cluster can be determined. There is no leader election process, the leader can always be recognised deterministically by any node whenever there is gossip convergence. The leader is just a role, any node can be the leader and it can change between convergence rounds.

And:

Gossip convergence cannot occur while any nodes are unreachable. The nodes need to become reachable again, or moved to the down and removed states

This made us assume that if we encounter a network partition and do not turn on auto-downing, we should in effect still have a cluster with a single leader, as no new leader should be determined without gossip convergence (which cannot be reached while a node is unreachable). This expected behaviour is what we call Version A.

However, after testing this scenario we discovered that the system actually behaves according to what we call Version B: the lower network partition does indeed elect Node 3 (or Node 2) as its own leader.

Now we understood that the code from the repo above was in fact doing the right thing: it relies on each partition having a leader that can take care of shutting the partition down if it is below quorum size.
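To illustrate the idea (this is our own simplified sketch using plain Akka Cluster APIs, not the library’s actual code), a leader-driven quorum check could look roughly like this:

import akka.actor.Actor
import akka.cluster.Cluster
import scala.concurrent.duration._

// Sketch only: each partition's leader periodically checks whether it still
// reaches a quorum of the statically configured cluster size, and downs
// either the unreachable members (majority side) or itself (minority side).
class QuorumDowning(quorumSize: Int) extends Actor {
  private val cluster = Cluster(context.system)
  import context.dispatcher

  private val tick =
    context.system.scheduler.schedule(5.seconds, 5.seconds, self, "check")

  def receive: Receive = {
    case "check" =>
      val state     = cluster.state
      val isLeader  = state.leader.contains(cluster.selfAddress)
      val reachable = state.members.count(m => !state.unreachable.contains(m))

      if (state.unreachable.nonEmpty && isLeader) {
        if (reachable >= quorumSize)
          state.unreachable.foreach(m => cluster.down(m.address)) // majority: down the others
        else
          cluster.down(cluster.selfAddress) // minority: down ourselves
      }
  }

  override def postStop(): Unit = tick.cancel()
}

The real library is of course more careful than this sketch, which is exactly why we preferred reusing it over writing our own.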

We find the docs not very clear on this leader-per-partition behaviour. It is important to understand that you will very easily run into a split-brain problem if you simply choose to ignore it; you definitely have to do something about it.

Simulating the network partition

In order to test the split-brain problem and its solution, we had to create a network partition manually. We did so by starting three instances of our service in cluster mode. We also added akka-http-management, which exposes an endpoint under /members that you can use to inspect the status of your cluster (a sketch of how we wired it up follows the sample output). It returns a result like this:

{
  "selfNode": "akka.tcp://PortfolioService@…:2551",
  "leader": "akka.tcp://PortfolioService@…:2551",
  "oldest": "akka.tcp://PortfolioService@…:2551",
  "unreachable": [],
  "members": [
    { "node": "akka.tcp://PortfolioService@…:2551", "nodeUid": "..", "status": "Up", "roles": [] },
    { "node": "akka.tcp://PortfolioService@…:2551", "nodeUid": "…", "status": "Up", "roles": [] },
    …
  ]
}
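For reference, exposing the endpoint only takes a couple of lines. The package name and management port depend on the akka-management / cluster-http version you use, so treat this as a sketch of what we did rather than a copy-paste snippet.

import akka.actor.ActorSystem
import akka.cluster.Cluster
// Package name as of the akka-management version current at the time of
// writing; check the docs of the version you use.
import akka.cluster.http.management.ClusterHttpManagement

val system  = ActorSystem("PortfolioService")
val cluster = Cluster(system)

// Serves cluster information over HTTP, including GET /members,
// on the configured management hostname and port.
ClusterHttpManagement(cluster).start()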

We then used iptables to simulate a network partition, logging into one of the containers and running:

iptables -A OUTPUT -p tcp --dport 2551 -j DROP
iptables -A INPUT -p tcp --dport 2551 -j DROP

These rules block all inbound and outbound traffic on port 2551, preventing the node from talking to the other cluster nodes. We then monitored the output of the /members endpoint on each node and confirmed that our setup works as intended.

Summary

As you can see, the final solution turned out to be relatively straightforward. We are going to roll this out to production and will gather more experience with the Akka Cluster setup along the way.

As we are running this cluster on Kubernetes, we will also publish a follow-up post soon to describe the issues we encountered on the infrastructure side of things.

We are constantly on the lookout for great tech talent to join our team — visit our website to learn more and feel free to reach out to us!
