Hpc with scala and akka Actors

In this brief article I will show you how to accelerate the computation of the Mandelbrot set using akka actors in scala (running on aws with 16 vCPU). This is a classic hello-world in parallel programming. I reduced the completion time to 2 seconds, starting from the 24 seconds needed by the sequential version.

We will use an analytic and technological agnostic approach to evaluate the performance gain, that can be reused to compare any kind of parallel programming technologies.

This computation produce a set of complex numbers which has a highly convoluted fractal boundary when plotted: here you can find a visualisation that I realised using scala swing:

Mandelbrot Set visualisation using scala Swing

Basically we are generating a grid of 600x400 complex numbers and then mapping each of them to a colour. This is an hello world in parallel programming (you can find many example using classic HPC technologies like openMP, mpi, etc..) because this process is very easy to parallelize: each pixel can be computed in parallel because it is independent from the others.

The parallel implementation is based on the sequential one found on the rosetta code web site, in the scala section: https://rosettacode.org/wiki/Mandelbrot_set

I split the computation between a Supervisor actor and a set of Workers actors. The first is responsible of coordinating the workers and assembling the result, the workers instead will compute the colour for each pixel.

The messages exchanged are:

package com.dariobalinzo.model
import scala.swing.Color

case object StartComputation
case class PixelJobRequest(x: Int, y: Int)
case class PixelResult(x: Int, y: Int, color: Color)

The StartComputation message triggers the computation, and we have a message for requesting a Pixel computation and a message for the Pixel result.

The code of the Supervisor looks like this:

package com.dariobalinzo.actors

import ...

class Supervisor(val numberOfWorkers: Int) extends Actor {

private val height = 400
private val length = 600
private val bitMap = new RgbBitmap(length, height)
private var remaining = height * length
private var jobRequestingActor: ActorRef = _

val router = {
val routees = Vector.fill(numberOfWorkers) {
val worker = context.actorOf(Props[Worker])
context watch worker
ActorRefRoutee(worker)
}
Router(RoundRobinRoutingLogic(), routees)
}

override def receive: Receive = {
case StartComputation => {
for (y <- 0 until height; x <- 0 until length) {
router.route(PixelJobRequest(x, y), self)
}
jobRequestingActor = sender()
}
case PixelResult(x, y, color) => {
bitMap.setPixel(x, y, color)
remaining = remaining - 1
if (remaining == 0) {
jobRequestingActor ! bitMap
}
}
}

}

The Supervisor handles two types of messages: the start computation and the PixelResult. When the first is received, we route the job requests among the Workers actors using an akka router. I used a simple RoundRobinRoutingLogic to dispatch the job among the workers in a round robin way.

The code of the Worker Actor looks like:

package com.dariobalinzo.actors

import ...
class Worker extends Actor {

override def receive: Receive = {
case PixelJobRequest(x, y) => {
val complex = Complex(xMin + x * cx, yMin + y * cy)
val iterationResult = Mandelbrot.itMandel(complex, maxIterations, bailout = 4)
val pixelColor = Mandelbrot.getColor(iterationResult, maxIterations)
sender() ! PixelResult(x, y, pixelColor)
}
}
}

object Worker {
val maxIterations = 1000
val xMin = -2.0
val xMax = 1.0
val yMin = -1.0
val yMax = 1.0
val height = 400
val length = 600
val cx = (xMax - xMin) / length
val cy = (yMax - yMin) / height
}

The Mandelbrot set computation is encapsulated inside the Mandelbrot object. For a given pixel (x,y) we produce an output (x,y,colour). This part of the computation is cpu-intensive.

The Main class instead is only a driver program that trigger the computation, measures the time and display the results.

Since the number of workers is parametric, we can study the performance gain obtained with a given number of Worker akka actors.

The results are displayed here in this graph:

Completion Time (seconds) in function of the number of worker actors (1 to 20)

I’ve done the experiment on aws with c5.4xlarge ec2 instance: a machine with 16 vCPU. The sequential computation time is 24 seconds, instead the parallel computation with more than 13 workers is about 2 seconds only!

The performance is unchanged with more than 13 workers due to several factors: first we have to take in consideration hardware limitations (virtual cpu), a light overhead of the akka freamework needed for threads management and actors run-time. Last but not least, I have used a simple round robin router, that may lead to load unbalance among the workers: in fact each pixel needs different time to be computed and we might have overloaded an unlucky worker actor with many heavy pixel tasks (the total completion time will be the maximum of each worker computation time).

A good way to see how far we are from an ideal parallelization is to divide the completion time of the sequential computation, by the completion time of each experiment. We obtain in this way the scalability graph:

Ideal scalability (red) vs actual scalability (blue)

Ideally using a parallelism of degree n, we would like to go n times faster! However our implementation is not so bad: we obtained a scalability of 12 using 13 parallel workers (going 12 times faster than the sequential solution).

I think that this kind of graph is very useful (it is technology agnostic and can be reused with every parallel technology): we can measure analytically how well we are taking advantages from the parallelism and can lead us to correctly size the number of resources allocated.

Obtain performance near the ideal is difficult for high parallelism degree, and a lot depends on the type of computation (is it cpu-intensive, or I/O intensive? fine-grained or coarse-grained?) and on the way the job is divided among the parallel executors (are you scheduling the jobs with a static or dynamic approach?).