Block Sparse Matrix-Vector Multiplication with CUDA

Georgii Evtushenko · GPGPU · Dec 28, 2019

In the previous post, we discussed sparse matrix-vector multiplication. It was shown that it's possible to take advantage of knowledge about the positions of zeroes by storing matrices in special data structures. Although we improved performance and reduced memory requirements, we haven't used all the information about the positions of zeroes. In some applications, non-zeroes are gathered in blocks. Knowledge about these blocks gives us more room for optimization. In this post, I'm going to discuss the efficiency of block sparse matrix-vector multiplication on the GPU. To show some real-life application results, I've developed a matrix structural analysis application, which is used to simulate the Golden Gate Bridge structure.

Block Compressed Sparse Row (BCSR)

BCSR is one of the most popular block sparse matrix formats. In BCSR, all blocks have the same size. To understand this format, imagine a sparse matrix with a block size equal to one. In this case, the CSR and BCSR representations are equivalent. Increasing the block size doesn't affect the column and row pointer arrays; it only extends the values array. That is, the column and row pointer arrays index whole blocks, and the blocks themselves are stored contiguously in the values array. This modification of CSR dramatically reduces the memory required for indices.

Figure 1: Example of Block Compressed Sparse Row (BCSR) matrix format

To access a block row's data, we need to get the number of blocks before that block row (row_ptr[block_row]) and multiply this value by the square of the block size. It's possible to store blocks in row-major or column-major order. As I'll show further, the order of elements within blocks can affect performance. I'll start with row-major order.
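To make the layout concrete, a BCSR matrix can be described by a structure like the following (a minimal sketch; the field names are mine, not necessarily those of the post's source code):

```cuda
struct bcsr_matrix
{
  int bs;            // block size: each block is bs x bs
  int n_block_rows;  // number of block rows
  int *row_ptr;      // n_block_rows + 1 entries; row_ptr[i] is the number of blocks before block row i
  int *col_ids;      // block column index of each stored block
  float *values;     // blocks stored back to back, bs * bs elements each
};

// The values of block row `block_row` start at:
//   values + row_ptr[block_row] * bs * bs
```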

In listing 1, I assign one thread per non-block row. To test the performance of the BCSR implementations, I've generated N-diagonal block sparse matrices. These matrices suit us for now because every block row contains the same number of blocks, which excludes load-balancing issues.
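A minimal sketch of such a thread-per-row kernel over row-major blocks (my reconstruction, reusing the field names from the snippet above, rather than the exact listing) could look like this:

```cuda
// One thread per non-block row; launch with at least n_block_rows * bs
// threads, e.g. <<<(n_rows + 255) / 256, 256>>>.
__global__ void bcsr_spmv_row_major (
    int bs, int n_block_rows,
    const int *row_ptr, const int *col_ids,
    const float *values, const float *x, float *y)
{
  const int row = blockIdx.x * blockDim.x + threadIdx.x; // non-block row
  const int block_row = row / bs;
  const int local_row = row % bs;

  if (block_row >= n_block_rows)
    return;

  float sum = 0.0f;
  for (int block = row_ptr[block_row]; block < row_ptr[block_row + 1]; block++)
    {
      // row-major block: the thread's row of the block is contiguous in memory
      const float *block_values = values + block * bs * bs + local_row * bs;
      const int first_col = col_ids[block] * bs;

      for (int j = 0; j < bs; j++)
        sum += block_values[j] * x[first_col + j];
    }

  y[row] = sum;
}
```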

The best speedup of this BCSR SpMV is about 65.8 (for a block size of 16). Although the kernel achieves coalesced access to the matrix data for small block sizes, performance drops for a block size of 32. I used float matrices in this post, so with a block size of 32 each thread in a warp reads 32 consecutive floats (exactly one 128-byte cache line of its own), and the warp's accesses are no longer coalesced.

Figure 2 — row-major order BCSR SpMV performance

In contrast, the cuSPARSE implementation of SpMV for block sparse matrices doesn't show such a dramatic performance drop. That means we have some room for optimization. The obvious solution is to assign a warp per non-block row, which would give us fully coalesced memory access.
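A sketch of this warp-per-row variant (again a reconstruction rather than the original listing): each of the 32 lanes strides over the columns of its row, so consecutive lanes read consecutive values, and a shuffle-based reduction combines the partial sums.

```cuda
// One warp (32 threads) per non-block row.
__global__ void bcsr_spmv_warp_per_row (
    int bs, int n_block_rows,
    const int *row_ptr, const int *col_ids,
    const float *values, const float *x, float *y)
{
  const int row  = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  const int lane = threadIdx.x % 32;
  const int block_row = row / bs;
  const int local_row = row % bs;

  if (block_row >= n_block_rows)
    return; // the whole warp exits together, so the reduction below is safe

  const int first_block = row_ptr[block_row];
  const int row_n_cols  = (row_ptr[block_row + 1] - first_block) * bs;

  float sum = 0.0f;
  for (int col = lane; col < row_n_cols; col += 32) // 32 consecutive columns per step
    {
      const int block = first_block + col / bs;
      const int local_col = col % bs;
      sum += values[block * bs * bs + local_row * bs + local_col]
           * x[col_ids[block] * bs + local_col];
    }

  // intra-warp reduction of the partial sums
  for (int offset = 16; offset > 0; offset /= 2)
    sum += __shfl_down_sync (0xffffffff, sum, offset);

  if (lane == 0)
    y[row] = sum;
}
```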

This optimization improves performance for matrices with big block sizes. Although the intra-warp reduction is cheap, the new version loses performance on small blocks, where a non-block row contains too few elements to keep all 32 lanes busy.

Figure 3 — row-major order BCSR SpMV performance (warp per non-block row)

The less obvious optimization is to use column-major order within each block.

This change allows coalesced access to a block: threads with consecutive row indices access consecutive elements of each block column.
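In the thread-per-row kernel, only the addressing changes. Roughly:

```cuda
// column-major block: element (local_row, j) lives at offset j * bs + local_row,
// so threads with consecutive local_row values read consecutive addresses
for (int j = 0; j < bs; j++)
  sum += values[block * bs * bs + j * bs + local_row] * x[first_col + j];
```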

Figure 4 — column-major order BCSR SpMV performance

Now it might be noticed that threads read the vector x consecutively, and threads of the same block row read the same segments of it. We could try to cache it in shared memory.
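A sketch of that idea, under the simplifying assumption that each CUDA thread block handles exactly one block row (blockDim.x == bs), which makes __syncthreads inside the loop safe; this is an illustration, not necessarily the post's exact kernel:

```cuda
// Launch: bcsr_spmv_shared_x<<<n_block_rows, bs, bs * sizeof (float)>>> (...);
__global__ void bcsr_spmv_shared_x (
    int bs, const int *row_ptr, const int *col_ids,
    const float *values, const float *x, float *y)
{
  extern __shared__ float x_cache[]; // bs floats

  const int block_row = blockIdx.x;  // one block row per CUDA block
  const int local_row = threadIdx.x; // blockDim.x == bs

  float sum = 0.0f;
  for (int block = row_ptr[block_row]; block < row_ptr[block_row + 1]; block++)
    {
      // stage the segment of x that every thread of this block row will read
      x_cache[local_row] = x[col_ids[block] * bs + local_row];
      __syncthreads ();

      for (int j = 0; j < bs; j++)
        sum += values[block * bs * bs + j * bs + local_row] * x_cache[j];
      __syncthreads ();
    }

  y[block_row * bs + local_row] = sum;
}
```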

But it doesn’t affect performance at all. Is it the limit? I don’t think so.

Figure 5 — column-major order BCSR SpMV performance (shared x)

We could allocate a warp per block row in another way. The new algorithm is claimed to be more efficient at handling matrices whose block sizes are not powers of two. In this algorithm, a warp iterates through its block row's values column by column, covering 32/bs columns at a time. This limits the maximum block size to the warp size. Although that restricts the matrix block size, the approach is quite interesting to try.

The algorithm must compute a block index and a horizontal position within the block for every element it reads. In the paper where I found this algorithm, it's noted that this comes at the cost of increased latency from integer operations.
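A sketch of the column-by-column idea: the warp strides over its block row's values, which are contiguous thanks to the column-major layout, and a division and a modulo recover each element's block and position. To keep the sketch short I combine partial sums with atomics, which assumes y is zero-initialized; the real kernel can reduce within the warp instead.

```cuda
// One warp per block row; y must be zeroed before the launch.
__global__ void bcsr_spmv_column_by_column (
    int bs, int n_block_rows,
    const int *row_ptr, const int *col_ids,
    const float *values, const float *x, float *y)
{
  const int block_row = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  const int lane = threadIdx.x % 32;

  if (block_row >= n_block_rows)
    return;

  const int first_block = row_ptr[block_row];
  const int n_values = (row_ptr[block_row + 1] - first_block) * bs * bs;
  const float *row_values = values + first_block * bs * bs;

  for (int idx = lane; idx < n_values; idx += 32) // 32 / bs columns per step
    {
      const int block     = idx / (bs * bs); // block index within the row
      const int local     = idx % (bs * bs);
      const int local_col = local / bs;      // horizontal position within the block
      const int local_row = local % bs;      // column-major: rows are adjacent

      atomicAdd (y + block_row * bs + local_row,
                 row_values[idx] * x[col_ids[first_block + block] * bs + local_col]);
    }
}
```

Is it possible that the column-by-column algorithm's performance (fig. 6) is limited by these integer division operations?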

Figure 6 — column-major order BCSR SpMV performance (column by column)

The NVIDIA best practices guide states that "integer division and modulo operations are particularly costly and should be avoided or replaced with bitwise operations whenever possible". To improve performance, I've tried the int_fastdiv library, which replaces run-time division by an invariant divisor with precomputed multiply-and-shift sequences.
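As far as I understand the library, the magic numbers are precomputed on the host, and the overloaded operators turn each in-kernel division into multiplications and shifts; usage looks roughly like this:

```cuda
#include "int_fastdiv.h"

// host side: precompute the magic numbers once per matrix
const int_fastdiv bs_div (bs);
const int_fastdiv bs_sqr_div (bs * bs);

// device side (the kernel receives the int_fastdiv objects by value):
const int block     = idx / bs_sqr_div;
const int local     = idx % bs_sqr_div;
const int local_col = local / bs_div;
const int local_row = local % bs_div;
```

Unfortunately, it doesn't improve performance much (fig. 7).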

Figure 7 — column-major order BCSR SpMV performance (column by column, fastdiv)

Is it possible that shift operations solve our performance issue? If the block size is a power of two, every division and modulo in the index computation can be replaced with a shift or a bitwise mask.
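For power-of-two block sizes, the index computation collapses into shifts and masks:

```cuda
// bs == 1 << log_bs; valid only for power-of-two block sizes
const int block     = idx >> (2 * log_bs); // idx / (bs * bs)
const int local     = idx & (bs * bs - 1); // idx % (bs * bs)
const int local_col = local >> log_bs;     // local / bs
const int local_row = local & (bs - 1);    // local % bs
```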

Although it does outperform int_fastdiv (fig. 8), it’s still not enough.

Figure 8 — column-major order BCSR SpMV performance (column by column, po2)

Is there a way to optimize the integer divisions further? Well, it's better not to perform them at all. Let's perform the division at compile time.
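Making the block size a template parameter lets the compiler see every divisor at compile time. A sketch of the templated column-by-column kernel, with a hypothetical host-side dispatch over the instantiated block sizes:

```cuda
template <int bs>
__global__ void bcsr_spmv_template (
    int n_block_rows, const int *row_ptr, const int *col_ids,
    const float *values, const float *x, float *y) // y zero-initialized
{
  const int block_row = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
  const int lane = threadIdx.x % 32;

  if (block_row >= n_block_rows)
    return;

  const int first_block = row_ptr[block_row];
  const int n_values = (row_ptr[block_row + 1] - first_block) * bs * bs;
  const float *row_values = values + first_block * bs * bs;

  for (int idx = lane; idx < n_values; idx += 32)
    {
      // bs is a compile-time constant: the compiler strength-reduces these
      // divisions into shifts and multiplications and can unroll bs loops
      const int block     = idx / (bs * bs);
      const int local     = idx % (bs * bs);
      const int local_col = local / bs;
      const int local_row = local % bs;

      atomicAdd (y + block_row * bs + local_row,
                 row_values[idx] * x[col_ids[first_block + block] * bs + local_col]);
    }
}

// inside a host-side launcher: dispatch to the compiled instantiations
switch (bs)
  {
    case  4: bcsr_spmv_template< 4><<<grid, block>>> (n_block_rows, row_ptr, col_ids, values, x, y); break;
    case  8: bcsr_spmv_template< 8><<<grid, block>>> (n_block_rows, row_ptr, col_ids, values, x, y); break;
    case 16: bcsr_spmv_template<16><<<grid, block>>> (n_block_rows, row_ptr, col_ids, values, x, y); break;
    // ...
  }
```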

The templated version is quite close to our best version. Its performance comes not only from the integer division optimization: the compiler also unrolls the inner loops.

Figure 9 — column-major order BCSR SpMV performance (column by column, template)

I think it's time to stop using the ideal case of N-diagonal matrices. To test the proposed kernels on real-life matrix structures, I've selected the area of matrix structural analysis.

BCSR with non-uniform pattern and matrix structural analysis

Structural analysis is the process of predicting the displacements within a given structure under a prescribed loading condition. I'm not going to describe matrix structural analysis here. Instead, I'm going to describe the way a mesh affects the matrix structure. To get an idea of how the global stiffness matrix is formed, we need to understand the structure of the member stiffness matrix k of each element in the mesh:
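For a plane truss element, k takes the standard textbook form

$$k = \frac{EA}{L} \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix},$$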

where L denotes the length, E the Young's modulus of elasticity, and A the cross-sectional area of the element. Also, we need a transformation matrix T.
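For the plane truss case, the standard transformation between local and global coordinates is

$$T = \begin{bmatrix} \cos\theta & \sin\theta & 0 & 0 \\ 0 & 0 & \cos\theta & \sin\theta \end{bmatrix},$$

where θ is the angle between the element and the global x-axis.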

To get the member stiffness matrix of a particular element in global coordinates, we compute:
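$$K = T^{\mathsf{T}} k T$$

This is the standard congruence transformation that rotates the member stiffness matrix into global coordinates.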

That gives us a 4x4 matrix. To assemble it into the global stiffness matrix, we need to consider the positions of K's elements. The first two rows of K correspond to the element's first node. The first two columns of these rows are added into the first node's diagonal block, and the second two columns into the first node's off-diagonal block. The same logic applies to the second two rows and the element's second node. That gives us a global stiffness block matrix whose block size equals 2 and whose row and column counts equal the node count of the mesh. It's possible to increase the block size of the stiffness matrix by considering a plane frame structure instead of a plane truss. To compute a plane frame structure, the k and T matrices have to be changed:
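The standard plane-frame forms, which give every node three degrees of freedom and hence 3x3 blocks, are

$$k = \begin{bmatrix}
\frac{EA}{L} & 0 & 0 & -\frac{EA}{L} & 0 & 0 \\
0 & \frac{12EI}{L^3} & \frac{6EI}{L^2} & 0 & -\frac{12EI}{L^3} & \frac{6EI}{L^2} \\
0 & \frac{6EI}{L^2} & \frac{4EI}{L} & 0 & -\frac{6EI}{L^2} & \frac{2EI}{L} \\
-\frac{EA}{L} & 0 & 0 & \frac{EA}{L} & 0 & 0 \\
0 & -\frac{12EI}{L^3} & -\frac{6EI}{L^2} & 0 & \frac{12EI}{L^3} & -\frac{6EI}{L^2} \\
0 & \frac{6EI}{L^2} & \frac{2EI}{L} & 0 & -\frac{6EI}{L^2} & \frac{4EI}{L}
\end{bmatrix},
\qquad
T = \begin{bmatrix}
\cos\theta & \sin\theta & 0 & 0 & 0 & 0 \\
-\sin\theta & \cos\theta & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & \cos\theta & \sin\theta & 0 \\
0 & 0 & 0 & -\sin\theta & \cos\theta & 0 \\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix},$$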

where I denotes the moment of inertia. To vary the stiffness matrix size, I've written a parametrized Golden Gate Bridge structure generator.

Figure 10 — Generated Golden Gate Bridge structure (calculated displacements)

There is a repeating structure in this mesh, so it's easy to vary the segment count or the length of the bridge to control the global stiffness matrix size.

Figure 11 — Generated Golden Gate Bridge repeating structure

The Golden Gate Bridge produces the matrix structure shown in figure 12. It has a more irregular pattern than a simple N-diagonal matrix.

Figure 12 — Golden Gate Bridge stiffness matrix structure

The performance of the different BSpMV kernels is presented in figure 13. In my experiments, the column-major kernel with a thread per non-block row outperformed the other options.

Figure 13 — Golden Gate Bridge stiffness matrix BSpMV performance

It’s important to note that load disbalance questions weren’t considered in the scope of this post. I hope to consider this questions in one of the following posts. As always, you could find the source codes for this post on github.
