Akka Cluster on Kubernetes: traps and pitfalls

In this article I'm going to describe the traps and pitfalls we encountered while moving from manually managed single-instance services to scaled Akka Cluster applications deployed in Kubernetes. The applications themselves were generally cluster-aware but had never been scaled out. As it turned out, the existing solutions you can find across the web still require some tuning or simply miss corner cases. I assume the reader is familiar with basic Akka and Kubernetes terms. I won't cover every subtopic related to deploying an Akka app in k8s; you can check those out elsewhere. I'll describe a cluster configuration with a static number of nodes, because things like split brain resolution get far trickier with a dynamic node count.

Initial cluster deployment begins with the so-called seed nodes. In the simplest scenario you provide a static list of seed nodes in the application config. In my examples I'll stick to this approach, though it's also possible to form a cluster in Kubernetes using akka-discovery.
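
A minimal sketch of such a config, assuming classic remoting on port 2552, an actor system named my-clustered-app, and the StatefulSet / headless service names used later in this article (pod and namespace names are illustrative):

akka.cluster {
  seed-nodes = [
    "akka.tcp://my-clustered-app@my-clustered-app-0.my-clustered-app-cluster.default.svc.cluster.local:2552",
    "akka.tcp://my-clustered-app@my-clustered-app-1.my-clustered-app-cluster.default.svc.cluster.local:2552",
    "akka.tcp://my-clustered-app@my-clustered-app-2.my-clustered-app-cluster.default.svc.cluster.local:2552"
  ]
  min-nr-of-members = 3
}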

  1. No cluster is alive yet.
  2. Every seed node will try to join the others, not accepting incoming connections itself, for akka.cluster.seed-node-timeout (5s by default).
  3. The first node in the list will join itself to form the initial cluster. That's why you should provide the same seed-nodes list to all cluster members; otherwise a split brain may occur right during the bootstrap process. When using akka-discovery, the node with the alphabetically lowest pod name performs the self-join.
  4. The other nodes eventually join the cluster, retrying the connection at the same interval.
  5. The application entities (Shard Regions and Cluster Singletons) won't come alive until the minimum number of members (akka.cluster.min-nr-of-members) has joined.

It's also worth noting that you should not start non-seed (worker) nodes until the initial cluster is up. The docs say you can ensure this with akka.cluster.bootstrap.required-contact-point-nr, though I haven't tried it.

In Kubernetes we have two general options for deploying our app: a ReplicaSet or a StatefulSet. The latter gives each pod a stable identifier that can be used to address it on the Kubernetes network, which is exactly what we need so that seed nodes can be listed and addressed.

kind: StatefulSet
spec:
  serviceName: my-clustered-app-cluster
  replicas: 3
  updateStrategy:
    type: RollingUpdate
  podManagementPolicy: OrderedReady
  template:
    metadata:
      labels:
        app: my-clustered-app
    spec:
      containers:
      - name: my-clustered-app
        image: my-clustered-app:1.0
        ports:
        - name: remoting
          containerPort: 2552
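
Keep in mind that a StatefulSet needs a governing headless Service to give the pods their stable DNS names. A minimal sketch (the service name matches serviceName above; the selector label matches the app label used for anti-affinity later):

kind: Service
metadata:
  name: my-clustered-app-cluster
spec:
  clusterIP: None
  selector:
    app: my-clustered-app
  ports:
  - name: remoting
    port: 2552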

A StatefulSet with the OrderedReady pod management policy deploys pods one by one, waiting for each pod's readiness probe before starting the next. We use the Akka Remoting port binding as the readiness probe. The app usually spins up really fast, so the probe delay and period can be set to the minimum.

readinessProbe:
  tcpSocket:
    port: 2552
  initialDelaySeconds: 0
  periodSeconds: 1

The app is not always ready to accept cluster connections as soon as the port is bound, though. So I found it useful to reduce the retry timeout at the networking layer to speed up bootstrapping.

akka.remote.retry-gate-closed-for = 1s

Having the cluster deployed across several data centers (or availability zones) brings more things to consider. You must ensure that every zone has a seed node; otherwise, if the AZ with the seed nodes goes down or is cut off by a network split, the other nodes won't be able to join the cluster. This can be achieved with pod (anti-)affinity; combined with a node label it lets you control which zone a particular pod ends up in. Recently the Lightbend team introduced a multi-dc configuration, but in my opinion it's far from useful for traditionally sharded apps: you are forced to make a single explicit choice between consistency and availability.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - my-clustered-app
      topologyKey: your-awesome-node-label

There's still a possibility of a cluster split. If the AZ with the first seed node goes down and then comes back up while still cut off from the network, that node will join itself, thinking there is no cluster yet, and will keep waiting forever for other nodes to arrive. Although this is safe as long as the main cluster is up (remember, application entities won't be started on top of an incomplete cluster), we still need a custom liveness probe to periodically restart such a "singleton" node.

The probe, with the corresponding akka-http endpoint, looks like this:

import akka.cluster.Cluster
import akka.http.scaladsl.model.StatusCodes
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route
import cats.implicits._
import com.evotor.common.livenessProbe.LivenessProbe._
import monix.eval.Task
import monix.execution.Scheduler

trait LivenessProbe[F[_]] {
  def getStatus: F[Either[Error, Unit]]
}

object LivenessProbe {
  type Error = String
}

final class ClusterLivenessProbe(cluster: Cluster, minNrOfMembers: Int)
    extends LivenessProbe[Task] {

  def getStatus: Task[Either[Error, Unit]] = Task.eval {
    val members = cluster.state.members
    if (members.size >= minNrOfMembers) {
      ().asRight
    } else {
      s"Cluster has less members than required. Current members: ${members.mkString(",")}".asLeft
    }
  }
}

object ClusterLivenessProbe {
  def apply(cluster: Cluster, minNrOfMembers: Int): ClusterLivenessProbe =
    new ClusterLivenessProbe(cluster, minNrOfMembers)
}

object LivenessProbeRoute {

  def apply(probe: LivenessProbe[Task]): Route =
    pathPrefix("alive") {
      extractExecutionContext { ec =>
        implicit val scheduler: Scheduler = Scheduler(ec)
        get {
          onSuccess(probe.getStatus.runAsync) {
            case Right(_)    => complete(StatusCodes.OK)
            case Left(error) => complete(StatusCodes.custom(500, "Cluster's not healthy", error))
          }
        }
      }
    }

  def clusterLiveness(cluster: Cluster, minNrOfMembers: Int): Route = {
    val probe = ClusterLivenessProbe(cluster, minNrOfMembers)
    apply(probe)
  }
}
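
A rough wiring sketch, assuming the classic akka-http bindAndHandle API of that era and illustrative names; it exposes the route on port 9000, which the Kubernetes livenessProbe below points at:

import akka.actor.ActorSystem
import akka.cluster.Cluster
import akka.http.scaladsl.Http
import akka.stream.ActorMaterializer

object Main extends App {
  implicit val system: ActorSystem = ActorSystem("my-clustered-app")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val cluster = Cluster(system)

  // Serve the /alive endpoint on the port the Kubernetes livenessProbe will hit.
  Http().bindAndHandle(
    LivenessProbeRoute.clusterLiveness(cluster, minNrOfMembers = 3),
    interface = "0.0.0.0",
    port = 9000
  )
}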

And the corresponding Kubernetes config:

livenessProbe:
  httpGet:
    path: /alive
    port: 9000
  initialDelaySeconds: 20
  periodSeconds: 3

The initial delay should not be too small: no less than twice akka.cluster.seed-node-timeout, otherwise it may break the bootstrap process.

In the real world all kinds of technical trouble are possible, so a group of nodes may fall out of the living cluster. This situation is called Split Brain. The worst scenario is a network split in which the disconnected nodes keep working and processing requests. The cluster leader will notice their absence and try to reallocate the load, eventually starting sharded actors or cluster singletons that may still be running on the absent nodes. This may cost you days of untangling corrupted data. To fight such cases, Split Brain Resolver functionality is available. The goal of all its strategies is essentially the same: define and keep a cluster quorum while downing all split-off islands. While there's a commercial SBR available from Lightbend, you can check out a code example here; it's really simple, at least for a static cluster size.
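
For illustration, a minimal sketch of the static-quorum idea (not the Lightbend SBR and not production-ready; the class and parameter names are made up): each side of a partition counts the Up members it can still reach and downs itself if it cannot see a quorum.

import akka.cluster.{Cluster, Member, MemberStatus}

// Hypothetical helper illustrating the static-quorum decision for a fixed-size cluster.
final class StaticQuorumDecision(cluster: Cluster, quorumSize: Int) {

  // Up members on "our" side of a potential split: everything not marked unreachable.
  private def reachableUpMembers: Set[Member] =
    cluster.state.members.filter(_.status == MemberStatus.Up) -- cluster.state.unreachable

  // The minority island (fewer reachable members than the quorum) should down itself;
  // the majority side keeps running and downs the unreachable nodes instead.
  def shouldDownSelf: Boolean = reachableUpMembers.size < quorumSize
}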

One more config option to consider here is akka.cluster.down-removal-margin. You should enable it. It adds a consistency margin not only for the emergency shutdown performed by the SBR, but also for Cluster Singleton and Shard Region migration during a rolling update.

akka.cluster.down-removal-margin = 3s

Even with all these things addressed, we still had cluster instability: rolling updates caused the cluster to crash and re-bootstrap. As it turned out, there were two problems. First, cluster members did not perform a graceful handover; prior to Akka 2.5 this was a tricky task, while 2.5 introduced fully automatic Coordinated Shutdown, so we had to update. Second, with the default Coordinated Shutdown timeouts the members still had too little time to complete the cluster-exiting phase, so we had to increase it.

akka.cluster.coordinated-shutdown.phases.cluster-exiting.timeout = 30s

Along with the Kubernetes termination grace period:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 35

A well-behaved application should shut itself down completely after receiving SIGTERM, so the config line below is also essential.

akka.cluster.coordinated-shutdown.exit-jvm = on

It's good to remember that since version 2.5 the cluster does not use a database to store its state by default. The state lives in Akka Distributed Data instead, which stays alive as long as the whole cluster is alive.
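
If this matters for your Cluster Sharding setup, it may be worth pinning the state-store mode explicitly rather than relying on the default (the key below is the standard Cluster Sharding setting; verify the default for your Akka version):

akka.cluster.sharding.state-store-mode = ddata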

From an architectural point of view, clustering should not cause you trouble by itself. It is still better for every service to have its own Akka cluster for more resilience, following general microservice architecture.

I hope this article will make your hacking easier.

List of useful resources: