Understanding the implementation of the Blelloch Algorithm (Work-Efficient Parallel Prefix Scan)

Shivam Mohan · Published in Nerd For Tech · Oct 9, 2022
(Figure: Parallel Prefix Scan)

Blelloch Algorithm

The Blelloch parallel prefix scan algorithm consists of two phases:

  1. Reduction Phase/Up-sweep: The up-sweep is the first phase, whose objective is to build partial results for parts of the array in parallel, in a reduction-tree pattern. In this phase, we divide the array among the participating threads, and each thread pair-wise combines the elements of its assigned portion using the scan operator.
(Figure: Up-sweep)

2. Down-sweep: The down-sweep is the second phase of the algorithm, executed after the up-sweep is complete. The up-sweep leaves partial results for subsets of the array; to finish the scan, we first replace the root with the identity element of the operator (0 for addition) and then repeat the following step down the tree:
On every iteration, each node passes its own value to its left child, and passes to its right child the result of applying the operator to its own value and the left child's value from the up-sweep tree.

(Figure: Down-sweep)

Note: The above algorithm performs an exclusive parallel prefix scan; however, an exclusive scan can easily be converted to an inclusive scan with one additional operation (combining each output element with the corresponding input element). A minimal single-threaded sketch of both phases follows.
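The sketch below illustrates the up-sweep and down-sweep described above, assuming the operator is addition and the input length is a power of two. It is shown single-threaded for clarity and the function name is illustrative (not taken from the linked repository); in a parallel implementation, the iterations of each inner loop are independent and would be distributed across threads.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Exclusive prefix sum via Blelloch's up-sweep / down-sweep,
// shown single-threaded for clarity. Each level's inner loop is
// data-independent, so a parallel version would spread its
// iterations across threads. Assumes data.size() is a power of two.
void blelloch_exclusive_scan(std::vector<int>& data) {
    const std::size_t n = data.size();

    // Up-sweep (reduction): build partial sums in a binary-tree pattern.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            data[i] += data[i - stride];
        }
    }

    // Replace the root with the identity element of the operator (0 for +).
    data[n - 1] = 0;

    // Down-sweep: each node passes its value to its left child, and passes
    // (its value + the left child's old value) to its right child.
    for (std::size_t stride = n / 2; stride >= 1; stride /= 2) {
        for (std::size_t i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = data[i - stride];
            data[i - stride] = data[i];   // left child gets parent's value
            data[i] += left;              // right child gets parent + old left
        }
    }
}

int main() {
    std::vector<int> v{3, 1, 7, 0, 4, 1, 6, 3};
    blelloch_exclusive_scan(v);           // exclusive scan: 0 3 4 11 11 15 16 22
    for (int x : v) std::cout << x << ' ';
    std::cout << '\n';
    // Inclusive scan = exclusive result combined once more with the input,
    // i.e. inclusive[i] = exclusive[i] + original[i].
}
```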

Implementation

To implement a work-efficient parallel prefix scan along the lines of the algorithm proposed by Blelloch, let us look at two approaches.

In the first approach, we divide the entire array equally among all the participating threads, and each thread is responsible for the up-sweep and down-sweep computations on its portion of the array, a portion that changes after every iteration.
Although this is quite straightforward to implement, the approach has some visible caveats that can hurt performance severely.

The first problem is that the amount of parallelism shrinks at every step: after each iteration, the portion assigned to some threads doubles while other threads have no elements left to work on and sit idle. As a result, for larger inputs the time taken to complete the parallel scan (with ever fewer active threads) is much higher than that of the sequential version.

The second problem is that the approach only works when the number of threads is a power of two, so the thread count has to be reduced to the nearest power of two below it. Although this makes the implementation relatively easy and does not change the worst-case complexity, it is not efficient to throw away up to half of the processors when using them could improve performance significantly.

Hence, let us try a different approach, taking a hint from Blelloch's paper, where he outlines a way to perform this work-efficient scan when the number of processors is fixed and the number of elements is much larger than the number of processors, i.e., n >> p.

(Figure: Processor Sum)

The approach we finally take is as follows:

Assuming the number of input elements is n and the number of threads is t, perform the following steps.

  1. Divide the input array among the t threads (which results in t blocks of roughly n/t elements each) and compute the scan operation on these blocks in parallel.
  2. After this step, we have a processor-scan array whose size equals the number of threads; each of its elements is the total (the result of the scan operation) of the corresponding block.
  3. On this processor-scan array, we perform the up-sweep and down-sweep.
  4. The up-sweep and down-sweep of the processor-scan array produce an array whose elements can be used as offsets when computing the prefix scan of the individual blocks.
  5. In the final step, we use these offsets to compute the prefix scan result for all the blocks in parallel, as sketched below.
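Below is a minimal sketch of these steps using std::thread and addition as the operator. It is illustrative rather than the exact implementation in the linked repository: it assumes n is divisible by t, produces an inclusive scan, and scans the t-element processor-scan array sequentially, where the full algorithm would apply the up-sweep and down-sweep from the earlier sketch (with t small, either choice is cheap).

```cpp
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Sketch of the fixed-thread-count approach (n >> t), assuming n is
// divisible by t and the operator is +.
// Phase 1: each thread scans its own block of n/t elements.
// Phase 2: the t block totals are turned into exclusive offsets
//          (done sequentially here; the full algorithm uses the
//          up-sweep/down-sweep on this t-element array).
// Phase 3: each thread adds its block's offset in parallel.
std::vector<int> parallel_inclusive_scan(const std::vector<int>& in, std::size_t t) {
    const std::size_t n = in.size();
    const std::size_t block = n / t;
    std::vector<int> out(n);
    std::vector<int> block_total(t);

    // Phase 1: per-block inclusive scan, one thread per block.
    std::vector<std::thread> workers;
    for (std::size_t p = 0; p < t; ++p) {
        workers.emplace_back([&, p] {
            std::size_t lo = p * block, hi = lo + block;
            std::partial_sum(in.begin() + lo, in.begin() + hi, out.begin() + lo);
            block_total[p] = out[hi - 1];   // total of this block
        });
    }
    for (auto& w : workers) w.join();
    workers.clear();

    // Phase 2: exclusive scan of the block totals gives each block's offset.
    std::vector<int> offset(t, 0);
    for (std::size_t p = 1; p < t; ++p)
        offset[p] = offset[p - 1] + block_total[p - 1];

    // Phase 3: apply the offsets to all blocks in parallel.
    for (std::size_t p = 0; p < t; ++p) {
        workers.emplace_back([&, p] {
            std::size_t lo = p * block, hi = lo + block;
            for (std::size_t i = lo; i < hi; ++i) out[i] += offset[p];
        });
    }
    for (auto& w : workers) w.join();
    return out;
}

int main() {
    std::vector<int> v{1, 2, 3, 4, 5, 6, 7, 8};
    auto scanned = parallel_inclusive_scan(v, 4);   // 1 3 6 10 15 21 28 36
    for (int x : scanned) std::cout << x << ' ';
    std::cout << '\n';
}
```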

Complexity analysis

We can calculate the complexity of the steps outlined in the above approach:

1. Complexity of the per-thread scan: We divide the array into t blocks of n/t elements each and perform the scan operation on them in parallel, hence the complexity of this step is O(n/t).

2. Complexity of up-sweep & down-sweep: The complexity of the up-sweep and down-sweep is O(log n), where n is the number of elements. In our case the number of elements is t, so the complexity of this step is O(log t).

3. Complexity of the final calculation: In the final step, we consume the offsets from the array produced by the up-sweep and down-sweep and use them to compute the prefix scan for all t blocks of the array in parallel. The complexity of this step is again O(n/t).

Therefore the combined complexity of this approach is:

= O(n/t + log(t) + n/t)
= O(2n/t + log(t))
= O(n/t + log(t))
= O(n/t), since the log(t) term is dominated when n >> t
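For example, with n = 2^20 elements and t = 8 threads, n/t = 131,072 while log(t) = 3, so the log(t) term is negligible and the running time is governed by the O(n/t) block work.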

Work Efficient

A parallel algorithm is said to be ‘work efficient’ if it does asymptotically no more work than the sequential version.
We can calculate the number of operations performed in this algorithm as shown below:

1. Processor scan: The number of operations done at this step is (n/t) * t = n.
2. Up-sweep & down-sweep: The number of operations done at this step is 2*(t−1).
3. Final calculation: The number of operations done at this step is again (n/t) * t = n.

Total operations = n + 2(t−1) + n; when n >> t,
Total operations ≈ 2n
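For example, with n = 1,000,000 elements and t = 8 threads, the total is 1,000,000 + 2·7 + 1,000,000 = 2,000,014 ≈ 2n, which is within a constant factor of the roughly n operations of a sequential scan.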

Hence the above implementation of the algorithm is work efficient.

References:

  1. Guy E. Blelloch, “Prefix Sums and Their Applications”
  2. “Parallel Prefix Sum (Scan) with CUDA”

Code: https://github.com/shivmoha/parallel-prefix-scan
