Parallel Programming with Scala — Leetcode #113 “Path Sum II”

Florian Baierl
Aug 8, 2021 · 7 min read


Recently, I stumbled across the Leetcode problem #113 “Path Sum II” and wondered whether it could be solved more efficiently by leveraging parallel programming in a multi-core environment. The following article walks through how to design, implement and benchmark a parallel algorithm that uses a variable number of threads to solve this problem.

Problem Description

Given a binary tree and a sum, find all root-to-leaf paths where each path’s sum equals the given sum.

Constraints:

  • The number of nodes in the tree is in the range [0, 5000].
  • -1000 <= Node.val <= 1000
  • -1000 <= targetSum <= 1000

Example:

// input:
      5
     / \
    4   8
   /   / \
  11  13  4
 / \     / \
7   2   5   1

sum = 22

// desired result:
[
  [5,4,11,2],
  [5,8,4,5]
]

Preparation

First, we need a function to enable us to run two tasks in parallel. For that reason, we define parallel like this:

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

def parallel[A, B](taskA: => A, taskB: => B)
                  (implicit ec: ExecutionContext): (A, B) = {
  val ta = Future(taskA)                  // taskA execution starts (on another thread)
  val b = taskB                           // taskB execution starts (on the calling thread)
  val a = Await.result(ta, Duration.Inf)
  (a, b)
}

When called, parallel will run taskA and taskB at the same time and return both their results as a tuple once both tasks have finished. For this to work, parallel leverages by-name parameters. This allows us to control exactly when taskA and taskB are run (see comments).
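For example (assuming the `parallel` definition above and an implicit `ExecutionContext` are in scope), the following evaluates both sums concurrently; the sample values are arbitrary:

```scala
import scala.concurrent.ExecutionContext.Implicits.global

val (a, b) = parallel(
  (1 to 1000).sum, // evaluated inside a Future (separate thread)
  (1 to 2000).sum  // evaluated on the calling thread
)
// a == 500500, b == 2001000
```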

Additionally, we copy the TreeNode class from the assignment and add some useful extensions to it. Luckily, Scala supports implicit conversions, which let us add functionality to the TreeNode class even after it has been defined (which for the Leetcode assignments happens outside the scope of our solution). TreeNodeExtension conveniently hides some tedious null checks behind leftOpt and rightOpt and adds two convenience functions: isLeaf and prettyPrint.

// defined by Leetcode
class TreeNode(_value: Int = 0,
               _left: TreeNode = null,
               _right: TreeNode = null) {
  var value: Int = _value
  var left: TreeNode = _left
  var right: TreeNode = _right
}

implicit def treeExtension: TreeNode => TreeNodeExtension = TreeNodeExtension

case class TreeNodeExtension(node: TreeNode) {
  def isLeaf: Boolean = node.left == null && node.right == null
  def leftOpt: Option[TreeNode] = Option(node.left)
  def rightOpt: Option[TreeNode] = Option(node.right)
  def prettyPrint: String =
    s"(${node.value}, " +
    s"${leftOpt.map(_.prettyPrint).getOrElse("null")}, " +
    s"${rightOpt.map(_.prettyPrint).getOrElse("null")})"
}

Lastly, I find the List[List[Int]] data type used for the result type a bit cumbersome, so I added a new type Path defined like this:

type Path = List[Int]

This way we can express a list of paths in a more idiomatic way: List[Path].

Writing a sequential solution

The problem we want to tackle is of ‘medium’ difficulty and the sequential solution is quite straightforward. We can calculate all root-to-leaf paths that sum up to targetSum sequentially (running on a single thread) using a simple recursive function:

def pathSumSeq(root: TreeNode,
               targetSum: Int,
               acc: List[Path] = Nil,
               current: Path = Nil): List[Path] = {
  def nextPath: Path = current :+ root.value
  (root.leftOpt, root.rightOpt) match {
    case (Some(left), Some(right)) =>
      pathSumSeq(left, targetSum - root.value, acc, nextPath) ++
        pathSumSeq(right, targetSum - root.value, acc, nextPath)
    case (Some(left), _) =>
      pathSumSeq(left, targetSum - root.value, acc, nextPath)
    case (_, Some(right)) =>
      pathSumSeq(right, targetSum - root.value, acc, nextPath)
    case (_, _) if root.isLeaf && root.value == targetSum =>
      acc :+ nextPath
    case _ =>
      acc
  }
}
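As a quick sanity check, we can rebuild the tree from the example at the top and verify the expected result (this snippet assumes TreeNode, the implicit extension and pathSumSeq from above are in scope):

```scala
// the example tree: 5 -> (4 -> (11 -> (7, 2))), (8 -> (13, 4 -> (5, 1)))
val root = new TreeNode(5,
  new TreeNode(4,
    new TreeNode(11, new TreeNode(7), new TreeNode(2))),
  new TreeNode(8,
    new TreeNode(13),
    new TreeNode(4, new TreeNode(5), new TreeNode(1))))

// prints List(List(5, 4, 11, 2), List(5, 8, 4, 5))
println(pathSumSeq(root, 22))
```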

Writing a parallel solution

After designing the sequential solution pathSumSeq we can easily derive a parallel version of this algorithm: pathSumPar. Notice how, if we encounter a node with a left and a right child node, pathSumSeq recursively calls itself twice, once for each child. These calculations could trivially be done in parallel since we do not need to share any data between them. Luckily for us, we have defined just the right method for that (parallel):

val (a, b) = parallel(
  pathSumPar(left, targetSum - root.value, acc, nextPath),
  pathSumPar(right, targetSum - root.value, acc, nextPath))
a ++ b // result

This will spawn a new thread that will follow the left tree while the current thread will follow the path down the right tree (see definition of parallel).

It does not make much sense to start a new thread every single time we encounter a node with two children. I am currently writing this on a rather old laptop with an Intel Core i7-4600U CPU with 2 cores and 4 hardware threads. It would therefore make sense to have exactly 4 threads working on the calculations.

We can easily ask for the number of available cores by calling:

val cores: Int = Runtime.getRuntime.availableProcessors()

So all that is left now is to write the final parallel version of our algorithm:

def pathSumPar(parallelism: Int)
              (root: TreeNode,
               targetSum: Int,
               acc: List[Path],
               current: Path)
              (implicit ec: ExecutionContext): List[Path] = {
  if (parallelism <= 0)
    pathSumSeq(root, targetSum, acc, current) // *1
  else {
    def nextPath: Path = current :+ root.value
    (root.leftOpt, root.rightOpt) match {
      case (Some(left), Some(right)) =>
        val parF = pathSumPar(parallelism - 2) _ // *2
        val (a, b) = parallel( // *3
          parF(left, targetSum - root.value, acc, nextPath),
          parF(right, targetSum - root.value, acc, nextPath))
        a ++ b
      case (Some(left), _) =>
        val parF = pathSumPar(parallelism) _
        parF(left, targetSum - root.value, acc, nextPath) // *4
      case (_, Some(right)) =>
        val parF = pathSumPar(parallelism) _
        parF(right, targetSum - root.value, acc, nextPath) // *4
      case (_, _) if root.isLeaf && root.value == targetSum =>
        acc :+ nextPath
      case _ => acc
    }
  }
}

We added a new input parameter parallelism, which is decreased by 2 (*2) every time we use our parallel method. Once parallelism hits 0, we fall back to our normal sequential pathSumSeq function (*1) to traverse the remaining paths to the leaves in a sequential (non-parallel) manner. As long as we have not ‘run out of’ parallelism, we start a new thread any time we encounter a node with two children (*3). If we encounter a node with only one child, we do not decrease parallelism and recursively call pathSumPar again with the child node (*4).

The following graphic shows how parallelism changes at each level, starting with 4. The original calling thread traverses down the path 5 -> 8 -> 4, spawning new threads for the left children at nodes 5 and 8. The thread spawned at node 8 traverses down to 13, where it will not spawn another thread since it has run out of parallelism (parallelism = 0). Instead, it uses the sequential algorithm to continue the traversal down to the leaves.

        5                parallelism: 4
      /   \
     4     8             parallelism: 2
    / \   / \
  11  10 13  4           parallelism: 0, use sequential algorithm
  / \    / \
 7   2  9   1

Finally, we can call pathSumPar with parallelism = /* number of available cores */, like this:

def pathSum(root: TreeNode, targetSum: Int): List[Path] = {
  import scala.concurrent.ExecutionContext.Implicits.global
  val cores = Runtime.getRuntime.availableProcessors()
  if (root == null)
    Nil
  else
    pathSumPar(cores)(root, targetSum, Nil, Nil)
}

Notice that for our Futures to work, we need an implicit ExecutionContext. We can use the default one by adding:

import scala.concurrent.ExecutionContext.Implicits.global
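The global context is backed by a thread pool sized to the number of available processors, which suits our use case. If you ever need tighter control, you can also supply your own context; a minimal sketch (not part of the original solution) using a fixed-size pool:

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// One thread per available core; shut the pool down via
// pool.shutdown() once you are done with it.
val cores = Runtime.getRuntime.availableProcessors()
val pool = Executors.newFixedThreadPool(cores)
implicit val customEc: ExecutionContext =
  ExecutionContext.fromExecutorService(pool)
```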

Benchmarking

To benchmark our new parallel algorithm, we first need a helper function that can construct arbitrarily large input trees:

import scala.util.Random

private def constructTestTree(depth: Int): TreeNode = {
  val max = 1000
  val min = -1000
  val seed = 123
  val rand = new Random(seed)

  def construct(rand: Random, depth: Int): TreeNode = {
    val value = rand.nextInt(max - min) + min
    if (depth - 1 <= 0) new TreeNode(value)
    else new TreeNode(value, construct(rand, depth - 1), construct(rand, depth - 1))
  }

  construct(rand, depth)
}

This method will create complete binary trees with 2^depth - 1 nodes. Notice that we are using a fixed seed for the random number generator. This makes sure that we always construct the same input tree for our tests.
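To convince ourselves of the node count, a tiny helper (hypothetical, not part of the solution; it assumes TreeNode and constructTestTree are in scope) can count the nodes of a small test tree:

```scala
// countNodes is a throwaway helper for this check only
def countNodes(node: TreeNode): Int =
  if (node == null) 0
  else 1 + countNodes(node.left) + countNodes(node.right)

// depth 4 => 2^4 - 1 = 15 nodes
println(countNodes(constructTestTree(4))) // prints 15
```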

For benchmarking the performance of our solution, we can use ScalaMeter, a microbenchmarking and performance regression testing framework for the JVM platform. We need a bit of configuration for the framework to execute correctly:

val standardConfig = config(
  Key.exec.minWarmupRuns := 20,
  Key.exec.maxWarmupRuns := 40,
  Key.exec.benchRuns := 15,
  Key.verbose := false
) withWarmer Warmer.Default()

This will make sure that we are executing 15 benchmark runs after ‘warming up’ for 20 to 40 rounds.

We can then construct an input tree and benchmark our two algorithms, printing out the results:

test("the parallel algorithm should be faster than the sequential algorithm for large data input") {

  val standardConfig = config(
    Key.exec.minWarmupRuns := 10,
    Key.exec.maxWarmupRuns := 30,
    Key.exec.benchRuns := 15,
    Key.verbose := false
  ) withWarmer Warmer.Default()

  val randTree = constructTestTree(23) // 2^23 - 1 = 8388607 nodes

  val seqtime = standardConfig measure {
    pathSumSeq(randTree, 7)
  }

  val partime = standardConfig measure {
    pathSum(randTree, 7)
  }

  val speedup = seqtime.value / partime.value
  println(s"sequential time: $seqtime")
  println(s"parallel time: $partime")
  println(s"speedup: $speedup")

  assert(speedup > 1.5)
}

For an input tree with 8388607 nodes, this gives us the following result on our Intel Core i7-4600U machine:

// running on an Intel Core i7–4600U CPU (dual core)
sequential time: 1634.2193456000002 ms
parallel time: 962.4488567333333 ms
speedup: 1.69798045284893

So in the end, we have beaten the sequential algorithm.

To be honest though, I would have expected a greater speedup. For that reason, I tried running the exact same algorithm on a device with a quad core CPU (Intel Core i7–6820HQ), and as expected, the speedup increased:

// running on an Intel Core i7–6820HQ CPU (quad core)
sequential time: 1923.6032400000001 ms
parallel time: 809.2381533333333 ms
speedup: 2.3770545569020505

As to why the older dual-core CPU outperformed the quad core on the sequential run, several other factors could be responsible:

  • Different operating systems (the quad core device is running Windows, the dual core device is running Ubuntu)
  • Different JVM versions
  • Background workload

But since we are only interested in the relative speedup of the dual core vs quad core runtime, I’ll leave that discussion for another time.

By the way: Leetcode uses a data set with at most 5000 nodes. For inputs that small, unfortunately, our parallel algorithm does not significantly outperform its non-parallel counterpart. Also, I am not sure whether Leetcode even runs submissions in a multi-core environment.

The full code can be found in my GitHub repository.
