Sparse Matrix-Vector Multiplication with CUDA

Georgii Evtushenko
Published in Analytics Vidhya
Nov 16, 2019 · 10 min read

Introduction

Standard methods for discretizing differential equations usually lead to systems of linear equations. A general feature of the produced systems is that the number of entries in each equation depends on local topological features of the discretization. Thus, the matrices generated by these systems contain a lot of zeroes (fig. 1). It’s possible to take advantage of the known positions of zeroes by storing matrices in special data structures. The abstract data type for these structures is called a sparse matrix. While I was reading about yet another matrix format, I decided to update my comparison of the performance of different matrix formats. This post provides a review of the efficiency of basic sparse matrix data structures in the context of sparse matrix-vector multiplication (SpMV) on the GPU.

Figure 1: A simple finite element mesh model

Data Structures for Sparse Matrices

In general, SpMV performance is limited by memory bandwidth. The storage format used for a sparse matrix defines the SpMV algorithm, and each of these algorithms has its own granularity, which impacts performance. The primary distinction among sparse matrix representations is the sparsity pattern, or the structure of the non-zero entries, for which they are best suited. However, I’ll start with general sparse matrix formats.

To assess the efficiency of SpMV for different sparse matrix formats, I’ve collected performance data on general matrices from the Florida Sparse Matrix Collection. All of the experiments were run on a system with an NVIDIA RTX 2080 GPU paired with an Intel Core i7–7700K CPU. Each of the measurements is an average (arithmetic mean) over 30 trials. Before measuring performance, both CPU and GPU frequencies were fixed.
The speedup was computed by dividing the single-thread CSR SpMV execution time by the GPU execution time.

CSR

The Compressed Sparse Row (CSR) format is a general sparse matrix format. CSR consists of three arrays: row pointers (row_ptr), columns of non-zeroes, and matrix values (fig. 2). The non-zero values of each row are stored consecutively in a one-dimensional values array. The row_ptr array is used to divide the values array into separate rows; its size is equal to n_rows + 1, and its last entry stores the total number of non-zeroes (nnz) in the matrix. This allows fast querying of the number of non-zeroes in a particular row (row_ptr[row + 1] − row_ptr[row]). For each non-zero value, its column index is stored in the columns array.

Figure 2: Example of Compressed Sparse Row (CSR) matrix format
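
To make the layout concrete, a minimal CSR container could look like the sketch below (the struct and helper names are my own illustration following the article’s terminology, not code from the post):

```cuda
#include <vector>

template <typename data_type>
struct csr_matrix
{
  unsigned int n_rows = 0;

  std::vector<unsigned int> row_ptr; // size n_rows + 1; last entry == nnz
  std::vector<unsigned int> columns; // column index of each non-zero
  std::vector<data_type> values;     // non-zero values, stored row by row

  // Fast query of the number of non-zeroes in a particular row.
  unsigned int nnz_in_row (unsigned int row) const
  {
    return row_ptr[row + 1] - row_ptr[row];
  }
};
```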

Let’s assume for simplicity that there are four threads in each CUDA thread block. The general CSR SpMV implementation works at the granularity of one thread per row (fig. 3). Hence, the matrix in figure 2 is processed by three thread blocks. This implementation is usually referred to as CSR-Scalar (list. 1).

Figure 3: CSR-Scalar work distribution
Listing 1: Naive SpMV kernel for the CSR-Scalar sparse matrix format
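
A minimal sketch of such a kernel is shown below; the kernel and parameter names are my choices, but the one-thread-per-row logic follows the description above:

```cuda
template <typename data_type>
__global__ void csr_spmv_scalar_kernel (
  unsigned int n_rows,
  const unsigned int *col_ids,
  const unsigned int *row_ptr,
  const data_type *data,
  const data_type *x,
  data_type *y)
{
  const unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < n_rows)
  {
    const unsigned int row_start = row_ptr[row];
    const unsigned int row_end = row_ptr[row + 1];

    // Each thread walks its own row, so rows of very different lengths
    // end up on threads of the same warp.
    data_type sum = 0;
    for (unsigned int element = row_start; element < row_end; element++)
      sum += data[element] * x[col_ids[element]];
    y[row] = sum;
  }
}
```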

The presented implementation of the CSR SpMV algorithm on the GPU is usually considered very inefficient. The reasons for this inefficiency are poor load balancing, thread divergence, and the memory access pattern. As shown in figure 3, only half of the block’s threads have non-zeroes to process. Thus, a single dense row can arbitrarily delay the execution while all the other cores are idle. Moreover, as shown in figure 3, adjacent threads access matrix values in a strided way. When concurrent threads simultaneously access memory addresses that are far apart in physical memory, there is no chance for the hardware to combine the accesses. Performance results for the naive CSR-Scalar implementation are presented in table 1.

Table 1: CSR-Scalar speedup

The speedup distribution is shown in the figures below. To answer the question of how naive the described implementation really is, I’ve compared it with the CSR implementation of the NVIDIA CUDA Sparse Matrix library (cuSPARSE) (tab. 2), which has a better average speedup.

Table 2: CSR (cuSPARSE) speedup
CSR-Scalar speedup (float)
CSR-Scalar speedup (double)

These results show that there is room for optimizing CSR SpMV. The first possible optimization is to assign a warp per row instead of a single thread. This algorithm (list. 3) is called CSR-Vector. The vector kernel accesses indices and data contiguously (fig. 4), and therefore overcomes the principal deficiency of the scalar approach. Unlike the previous CSR implementation, which uses one thread per matrix row, this optimization requires coordination among threads within the same warp.

CSR cuSPARSE speedup (float)
CSR cuSPARSE speedup (double)
Figure 4: CSR-Vector work distribution

In the case of CSR-Vector, the reduction can be implemented using warp-level primitives (list. 2). In that case, the data exchange is performed between registers and is more efficient than going through shared memory, which requires a load, a store, and an extra register to hold the address.

Listing 2: Warp reduction
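
A minimal warp reduction sketch using __shfl_down_sync (assuming a full 32-thread mask; the function name is my choice) could look like this:

```cuda
template <typename data_type>
__device__ data_type warp_reduce (data_type val)
{
  // Tree reduction over the registers of a full 32-thread warp;
  // after the loop, lane 0 holds the sum of all lanes' values.
  for (int offset = warpSize / 2; offset > 0; offset /= 2)
    val += __shfl_down_sync (0xFFFFFFFF, val, offset);
  return val;
}
```
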
Listing 3: SpMV kernel for the CSR-Vector
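
A sketch of the vector kernel, reusing the warp_reduce above (names are mine; one 32-thread warp is assigned per row):

```cuda
template <typename data_type>
__global__ void csr_spmv_vector_kernel (
  unsigned int n_rows,
  const unsigned int *col_ids,
  const unsigned int *row_ptr,
  const data_type *data,
  const data_type *x,
  data_type *y)
{
  const unsigned int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
  const unsigned int warp_id = thread_id / 32; // one warp per row
  const unsigned int lane = thread_id % 32;

  data_type sum = 0;
  if (warp_id < n_rows)
  {
    const unsigned int row_start = row_ptr[warp_id];
    const unsigned int row_end = row_ptr[warp_id + 1];

    // Lanes stride through the row, so adjacent lanes read adjacent
    // elements of data and col_ids (coalesced accesses).
    for (unsigned int element = row_start + lane; element < row_end; element += 32)
      sum += data[element] * x[col_ids[element]];
  }

  sum = warp_reduce (sum);

  if (lane == 0 && warp_id < n_rows)
    y[warp_id] = sum;
}
```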

CSR-Vector has a better speedup (tab. 3) and speedup distribution than CSR-Scalar (for both float and double matrices) and than the cuSPARSE implementation (for float matrices).

Table 3: CSR-Vector speedup
CSR-Vector speedup (float)
CSR-Vector speedup (double)

However, CSR-Scalar outperforms CSR-Vector on about 33% of float matrices with a 10000 nnz lower limit, and on 40% of float matrices with a 100000 nnz lower limit. On those matrices, CSR-Scalar shows an average speedup of 8.57, while CSR-Vector shows only 4.80.

To discover further improvements to the CSR SpMV implementation, we need to consider the first part of the matrix from figure 2. In the first four rows of the matrix, there is only one non-zero value per row. In that case, all threads of the warp except the first are idle, and it’s possible for the naive CSR SpMV implementation to outperform the vector implementation. There is an SpMV algorithm for the CSR matrix format that doesn’t depend on the nnz/row ratio. CSR-Adaptive changes its behavior depending on the nnz in each row (list. 4). After selecting a non-zeroes-per-block value (nnz_per_wg), an additional array (row blocks) that stores the boundary rows of each block is constructed. If several rows contain few non-zeroes, they’ll be gathered into one block. A CUDA thread block is then assigned to each block of rows. The case of multiple rows in one block of rows is called CSR-Stream. If there is only one row in a block of rows, CSR-Vector is called. If this row exceeds nnz_per_wg, then the CSR-VectorL variant is used. The main difference between CSR-Vector and CSR-VectorL is that CSR-VectorL allows multiple thread blocks to process a single row, reducing their results with atomic operations.

Listing 4: SpMV kernel for the CSR-Adaptive sparse matrix format
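
The full CSR-Adaptive kernel is fairly long, so here is only a sketch of the host-side row-blocks construction described above (the function and all names are my own; a real implementation would also cap the number of rows per block by the shared-memory capacity):

```cuda
// Greedily packs rows into blocks of at most nnz_per_wg non-zeroes.
// Returns the number of row blocks; row_blocks must be able to hold
// n_rows + 1 entries in the worst case.
unsigned int fill_row_blocks (
  unsigned int n_rows,
  const unsigned int *row_ptr,
  unsigned int *row_blocks,
  unsigned int nnz_per_wg)
{
  unsigned int block = 0;
  unsigned int block_begin = 0; // first row of the block being built

  row_blocks[0] = 0;
  for (unsigned int row = 0; row < n_rows; row++)
  {
    const unsigned int nnz_so_far = row_ptr[row + 1] - row_ptr[block_begin];

    if (nnz_so_far > nnz_per_wg)
    {
      if (row == block_begin)
      {
        // A single row exceeds the budget: it gets a block of its own,
        // to be handled by the CSR-Vector / CSR-VectorL path.
        row_blocks[++block] = row + 1;
        block_begin = row + 1;
      }
      else
      {
        // Close the block before this row (CSR-Stream case) and
        // re-examine the row against the freshly opened block.
        row_blocks[++block] = row;
        block_begin = row;
        row--;
      }
    }
  }

  if (block_begin < n_rows)
    row_blocks[++block] = n_rows;
  return block;
}
```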

The CSR-Vector and CSR-VectorL parts are quite similar, so I won’t include a listing here. Figure 5 illustrates the memory access pattern of the CSR-Stream part. It stores partial sums in the GPU’s shared memory and then reduces them. The partial results in the cache in figure 5 are calculated with x filled with ones. The source code of CSR-Stream is presented in listing 5.

Figure 5: CSR-Stream memory access pattern
Listing 5: CSR-Stream implementation
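
A sketch of the CSR-Stream path of such a kernel follows (the template parameter NNZ_PER_WG and all names are my assumptions; the single-row CSR-Vector / CSR-VectorL branch is omitted):

```cuda
template <typename data_type, unsigned int NNZ_PER_WG>
__global__ void csr_adaptive_spmv_kernel (
  const unsigned int *row_blocks,
  const unsigned int *col_ids,
  const unsigned int *row_ptr,
  const data_type *data,
  const data_type *x,
  data_type *y)
{
  const unsigned int block_row_begin = row_blocks[blockIdx.x];
  const unsigned int block_row_end = row_blocks[blockIdx.x + 1];

  __shared__ data_type cache[NNZ_PER_WG];

  if (block_row_end - block_row_begin > 1)
  {
    // CSR-Stream: all threads load the block's products into shared
    // memory with coalesced accesses...
    const unsigned int block_data_begin = row_ptr[block_row_begin];
    const unsigned int nnz = row_ptr[block_row_end] - block_data_begin;

    for (unsigned int i = threadIdx.x; i < nnz; i += blockDim.x)
      cache[i] = data[block_data_begin + i] * x[col_ids[block_data_begin + i]];
    __syncthreads ();

    // ...then one thread per row reduces that row's partial products.
    for (unsigned int row = block_row_begin + threadIdx.x;
         row < block_row_end;
         row += blockDim.x)
    {
      data_type sum = 0;
      for (unsigned int i = row_ptr[row]; i < row_ptr[row + 1]; i++)
        sum += cache[i - block_data_begin];
      y[row] = sum;
    }
  }
  else
  {
    // A single (long) row in this block: the CSR-Vector / CSR-VectorL
    // path discussed above would run here (omitted for brevity).
  }
}
```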

On the discussed set of matrices, where CSR-Scalar outperformed CSR-Vector, CSR-Adaptive shows a better speedup: it outperforms CSR-Scalar on those 291 matrices. Although CSR-Adaptive might be outperformed by CSR-Vector on some long-row matrices, it has a better speedup on average (tab. 4). The main advantage of CSR-Adaptive is that you won’t need to change the code that generates a matrix if your code already uses CSR. The matrix formats presented below don’t have this quality.

Table 4: CSR-Adaptive speedup
CSR-Adaptive speedup (float)
CSR-Adaptive speedup (double)

ELL

The problem of non-coalesced memory accesses in CSR can be addressed by applying data padding and transposition to the sparse matrix data (fig. 6). The Ellpack-Itpack (ELL) sparse matrix format assumes that each row contains at most elements_in_rows elements and that elements_in_rows is small. All rows are zero-padded to that length. Unlike CSR, no row pointers array is needed. ELL is most efficient when the maximum number of non-zeroes per row does not substantially differ from the average.

Figure 6: Example of ELL matrix format

The kernel for the ELL matrix format is presented in listing 6. With the element padding of the ELL format, it’s easy to get the position of a thread’s next element within its row by simply adding the number of rows in the matrix. The padding also fixes the number of iterations for each thread, so there is no control flow divergence in warps. Eliminating control flow divergence and enabling memory coalescing allow the ELL SpMV kernel to outperform the CSR-Scalar implementation on many matrices (tab. 5).

Listing 6: ELL implementation
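
A sketch of such a kernel, assuming a column-major layout where padded entries hold a zero value and a valid column index (names are mine):

```cuda
template <typename data_type>
__global__ void ell_spmv_kernel (
  unsigned int n_rows,
  unsigned int elements_in_rows, // padded width of every row
  const unsigned int *col_ids,   // column-major, n_rows * elements_in_rows
  const data_type *data,         // same layout; padded entries hold zeroes
  const data_type *x,
  data_type *y)
{
  const unsigned int row = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < n_rows)
  {
    data_type sum = 0;
    for (unsigned int element = 0; element < elements_in_rows; element++)
    {
      // Column-major layout: the row's next element is n_rows entries
      // away, so adjacent threads access adjacent addresses.
      const unsigned int offset = row + element * n_rows;
      sum += data[offset] * x[col_ids[offset]];
    }
    y[row] = sum;
  }
}
```
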
Table 5: ELL speedup
ELL speedup (float)
ELL speedup (double)

The obvious disadvantage of the ELL format is the padding itself. In the case of a matrix with a few long rows, the ELL format results in an excessive number of padded elements. There are a lot of matrices in the Florida Collection that couldn’t fit into the 8 GB of my GPU’s memory because of ELL’s padding. In some cases, this leads to a situation where CSR-Scalar outperforms the ELL implementation. To eliminate this issue, it’s possible to move the extra non-zeroes of long rows out of the ELL matrix into a separate matrix. It is important to note that the extracted matrix would have an irregular structure: many rows would likely be missing from it, so using CSR for it would be inefficient. One of the formats that can handle this case is COO.

COO

The coordinate (COO) matrix format is the simplest one. For each non-zero it stores both its column and row indices. Therefore, COO doesn’t provide a mapping from rows to their elements, which leads to the need for atomic operations in the COO kernel (list. 7).

Listing 7: COO implementation
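
A sketch of the element-per-thread COO kernel (names are mine; note that atomicAdd on double requires compute capability 6.0 or newer):

```cuda
template <typename data_type>
__global__ void coo_spmv_kernel (
  unsigned int n_elements,
  const unsigned int *col_ids,
  const unsigned int *row_ids,
  const data_type *data,
  const data_type *x,
  data_type *y) // must be zero-initialized (or hold a previous partial result)
{
  const unsigned int element = blockIdx.x * blockDim.x + threadIdx.x;

  if (element < n_elements)
  {
    // Several threads may land on the same row, hence the atomic update.
    atomicAdd (y + row_ids[element], data[element] * x[col_ids[element]]);
  }
}
```
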
Table 6: COO speedup
COO speedup (float)
COO speedup (double)

The COO SpMV implementation works at the granularity of one thread per element (list. 7). Atomic updates to the result vector reduce performance: the wider the rows stored in COO format, the more serialized SpMV becomes. This fact can be noticed in figure 7. To improve the performance of this format, it’s possible to slice the matrix into chunks whose row count fits into shared memory.

Figure 7: The dependence of the COO speedup on the average nnz

The matrix format that uses shared memory to improve the performance of atomic operations in COO SpMV is called Sliced COO (SCOO). To reduce shared memory bank conflicts, SCOO allocates multiple lanes in shared memory for updating the intermediate results of a single row. Reducing the slice size increases the lane size, and thus more shared memory lanes are available.
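
A simplified SCOO sketch, assuming elements are grouped by slice so that slice s covers rows [s * SLICE_ROWS, (s + 1) * SLICE_ROWS); all names and the lane mechanics here are my reconstruction of the idea, not the original code:

```cuda
template <typename data_type, unsigned int SLICE_ROWS, unsigned int LANES>
__global__ void scoo_spmv_kernel (
  unsigned int n_rows,
  const unsigned int *slice_ptr, // element range of each slice (n_slices + 1)
  const unsigned int *col_ids,
  const unsigned int *row_ids,
  const data_type *data,
  const data_type *x,
  data_type *y)
{
  // LANES copies of each row's partial sum reduce shared-memory conflicts.
  __shared__ data_type partial[SLICE_ROWS][LANES];

  const unsigned int slice_row_begin = blockIdx.x * SLICE_ROWS;

  for (unsigned int i = threadIdx.x; i < SLICE_ROWS * LANES; i += blockDim.x)
    partial[i / LANES][i % LANES] = 0;
  __syncthreads ();

  // Atomics go to fast shared memory instead of global memory.
  const unsigned int lane = threadIdx.x % LANES;
  for (unsigned int element = slice_ptr[blockIdx.x] + threadIdx.x;
       element < slice_ptr[blockIdx.x + 1];
       element += blockDim.x)
    atomicAdd (&partial[row_ids[element] - slice_row_begin][lane],
               data[element] * x[col_ids[element]]);
  __syncthreads ();

  // Reduce the lanes of each row and write the final values.
  for (unsigned int row = threadIdx.x; row < SLICE_ROWS; row += blockDim.x)
    if (slice_row_begin + row < n_rows)
    {
      data_type sum = 0;
      for (unsigned int l = 0; l < LANES; l++)
        sum += partial[row][l];
      y[slice_row_begin + row] = sum;
    }
}
```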

Table 7: SCOO speedup

Hybrid

It’s possible to use the ELL matrix format for the regular part of the matrix and COO for the elements removed from extra-long rows. This scheme significantly reduces the number of padded elements in the ELL part. The approach is often called hybrid. There are different options for combining the results of the ELL and COO SpMV; in this post, I use the atomic variant (list. 8).

Listing 8: Hybrid implementation
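
A sketch of the atomic variant, reusing the ELL and COO kernels sketched above (the wrapper function, its parameters, and the block size are my own illustration):

```cuda
template <typename data_type>
void hybrid_spmv (
  unsigned int n_rows, unsigned int ell_elements_in_rows,
  const unsigned int *ell_col_ids, const data_type *ell_data,
  unsigned int coo_n_elements,
  const unsigned int *coo_col_ids, const unsigned int *coo_row_ids,
  const data_type *coo_data,
  const data_type *x, data_type *y)
{
  const unsigned int block_size = 512;

  // Regular part: the ELL kernel overwrites y[row] with the row's sum.
  ell_spmv_kernel<<<(n_rows + block_size - 1) / block_size, block_size>>> (
    n_rows, ell_elements_in_rows, ell_col_ids, ell_data, x, y);

  // Long-row leftovers: the COO kernel atomically accumulates into y.
  coo_spmv_kernel<<<(coo_n_elements + block_size - 1) / block_size, block_size>>> (
    coo_n_elements, coo_col_ids, coo_row_ids, coo_data, x, y);
}
```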

Although the average performance results (tab. 8) are quite close to CSR-Adaptive SpMV, the hybrid format requires extra actions for splitting the matrix, which might require rewriting a matrix-generation code base.

Table 8: Hybrid speedup
Hybrid speedup (float)
Hybrid speedup (double)

Conclusion

To conclude this post, I would like to show you some misleading results. I’ve selected some matrices (tab. 9) to show the obvious fact that there is no universal matrix format. The leader changes even when only the data type changes. In my next post, I’m going to focus on block matrix formats generated by real applications. The source code and a PDF version of this post are available on GitHub.

Table 9: Structure of the selected matrices
Speedup of the selected matrices (float)
Speedup of the selected matrices (double)
