Scale out PI computation with kotlin coroutines

Kotlin 1.1 brings some interesting new features. Among them, Kotlin coroutines will have a deep impact on our codebases. Instead of explaining all the possibilities offered by this library, I will show you some ways of using it for specific use cases.

In this first post on coroutines, we are going to use them to scale out some intensive CPU operations on multiple cores. This post is inspired by Akka introduction.

Computing Pi with Leibniz series

Among various way of computing PI, we use Leibniz series to calculate it.

It’s not the best implementation (some are converging faster) but it is quite simple (to code and understand) and highly parallelizable.

Computing on the main thread

Implementing the sum of elements is straightforward. We use a range type (start..end) and the extension function sumByDouble to avoid a loop.

Executing this method with elements from 0 and 1.000.000.000 computes PI with 9 significant digits.

On my laptop, the computation takes 4500 ms. But we can observe that only one core of CPU is used.

Using Kotlin Coroutines to fan out computation

We are going to reuse the previous function sumLeibnizBetween but making it asynchronous.

This code just wraps our previous function in a coroutine. Coroutines can be seen as lightweight threads. You can creates millions of them without problems. The async(CommonPool) gives the context of coroutines: they are created inside threads of java ForkJoinPool. This pool manages its worker threads taking account the number of processors.

Now we use this asynchronous function to dispatch our PI computation in parallel computations.

This function spans out the computation to chunkCount coroutines, each one computing chunkSize Leibniz elements.

Every call to asyncSumLeibnizBetween returns a Deferred<Double>. The last line of the function waits for every asynchronous returns and sum them up.

The runBlocking call is a bridge between regular blocking code and coroutines. It blocks the current thread until coroutines completion. It allows the code to come back from asynchronism to synchronism.

With fanOutSumLeibniz function the computation of 1.000.000.000 elements (10.000 chunks of 100.000 elements) spreads on almost all the cores (8 on this computer) and is 3.75 times faster (1200 ms).

All the code is available on github.