101 Ways to Store a Sparse Matrix

Max Grossman
18 min read · Oct 28, 2015


First, an aside: the motivation behind this post was some recent research in sparse matrix-dense vector multiplication, and the lack of an up-to-date plain English introduction to various sparse matrix formats. I write this post in the hope that it can be useful to other CS folks who want to better understand how sparse matrices are actually stored and the trade-offs between different formats. If you’re just curious, feel free to read the first section and stop there. If you want a wide-but-not-deep survey of modern sparse matrix formats, the full post will be useful. If you want a wide-but-also-deep survey, stay tuned for future posts on performance characterization of different formats.

All non-trivial software applications operate on collections of objects. These objects may represent customers, particles, anything with one or more attributes. The importance of storing and analysing large collections of objects has spawned fundamental fields in computer science like databases, data analytics, distributed systems, and more.

Objects are to software development as matrices are to applied math. Matrices are a unifying data representation. Matrices store a collection of items (one per row), with attributes for each item (across columns). They can be used to store relations between items, just as objects can store object references. The importance of matrices in real world scientific and engineering problems cannot be overstated: they are the key abstraction that applied mathematicians use to express their problems across many domains.

Matrices can be divided into two classifications: dense and sparse. The “sparsity” of a matrix refers to how many non-zero values it stores, relative to the total number of cells. There’s no hard-and-fast threshold for when a matrix flips between being dense or sparse. The simplest approach we can take is to say a matrix is sparse when representing it in a storage format specialized for sparsity gets us some savings, in either space or time.

Let’s start by looking at how dense matrices are stored. If we’re talking in C, we can easily represent a dense matrix as an array:

double matrix[10][10];

Referencing matrix[0][0] gets us the leftmost element in the topmost row in the matrix.

If the size of a matrix is not fixed, we can use dynamic memory allocation to get the same effect by flattening the array:

double *matrix = (double *)malloc(nrows * ncols * sizeof(double));

(Often you’ll see the number of rows in a matrix denoted by “M” and the number of columns denoted by “N”. I use nrows and ncols here to keep things straightforward for people not familiar with this convention.)
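To make the flattened layout concrete, here is a minimal sketch of the lookup this implies (dense_get is a hypothetical helper name, not from any library), assuming the row-major ordering that C uses:

// Fetch cell (i, j) from a dense, row-major, flattened matrix.
// Bounds checking is omitted for brevity.
double dense_get(const double *matrix, int ncols, int i, int j) {
    return matrix[i * ncols + j];
}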

So storing this dense matrix costs us nrows * ncols * sizeof(double) bytes. Say we have a square matrix with 1,000,000 elements on each side, and sizeof(double) is 8 bytes. Then we’re paying 8TB to store this matrix in memory. With machines today topping out at around 1TB of memory it isn’t inconceivable that in the near future storing an 8TB matrix in memory will be possible. But for most of us mere mortals with a few GB of memory in our laptops, this dense representation really isn’t viable for large matrices.

Fortunately, many real-world matrices are sparse, i.e. contain many more zeros than non-zeros. Using specialized formats for sparse matrices can gain you both significant time and space savings over naively storing them in a dense format.

The simplest sparse format is the Coordinate format (COO). In COO you represent a matrix with three arrays: a rows array, a columns array, and a values array. For every non-zero value in the original sparse matrix, there is an entry at index i in the rows array, columns array, and values array that stores the row, column, and value of that non-zero item. Suppose you have the matrix below:

              columns
             0     1
row 0  [   1.0   2.0  ]
row 1  [   0.0   4.0  ]

Storing this matrix in COO would produce the following three arrays:

unsigned rows[3]    = { 0,   0,   1   };
unsigned columns[3] = { 0,   1,   1   };
double values[3]    = { 1.0, 2.0, 4.0 };

indicating that at cell (0, 0) the value 1.0 is stored, at cell (0, 1) the value 2.0 is stored, and at cell (1, 1) the value 4.0 is stored. Note that we store no explicit information on (1, 0) and so its value is implicitly set to 0.0.

Now say I asked for the value stored at cell (1, 1) in the original matrix. Finding it is a simple matter of iterating over the entries in rows and columns, checking at each if rows[i] == 1 and columns[i] == 1. If they both match the target coordinates, the result is stored in values[i]. If no exact match is found, the result must be 0.0.
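As a sketch, that scan might look like the following (coo_get is a hypothetical helper name; it assumes the three COO arrays above plus the matrix’s nnz count):

// Scan a COO matrix for cell (r, c). Returns the stored value,
// or 0.0 if no entry matches (implicit zero).
double coo_get(const unsigned *rows, const unsigned *columns,
               const double *values, int nnz, unsigned r, unsigned c) {
    for (int i = 0; i < nnz; i++) {
        if (rows[i] == r && columns[i] == c) {
            return values[i];
        }
    }
    return 0.0;
}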

Let’s go back to our massive 1,000,000 x 1,000,000 matrix and guess that 5% of all cells have a non-zero value in them, meaning 50,000,000,000 non-zero values in total. Using COO (and assuming sizeof(unsigned) == 4), storing this massive matrix then costs us:

(sizeof(unsigned) + sizeof(unsigned) + sizeof(double)) * 5E10
=> 16 bytes * 5E10
=> 800 GB

That’s a pretty significant savings over the 8TB we had to pay with the dense format! Note that the COO format’s size is not simply the dense size scaled by the sparsity: at 16 bytes per non-zero, 5% non-zeroes only brought the matrix size down to 10% of the dense format’s size. Still, at least now we’re only two orders of magnitude above what a modern laptop can store!

We can even derive the tipping point where using the COO format saves space relative to the dense format, and use that as a working definition for when a matrix is “sparse” vs. “dense”. We need to find when storing 16 bytes for each non-zero value is cheaper than storing 8 bytes for every cell, or:

                 16 * nnz < 8 * nrows * ncols

where nnz is a common shorthand for the total “number of non-zeroes” in a matrix. If we also define a value sparsity that is the percentage of cells that are non-zero, we can derive:

16 * (sparsity * nrows * ncols) < 8 * nrows * ncols
=> sparsity * nrows * ncols < 0.5 * nrows * ncols
=> sparsity < 0.5

Whenever the sparsity of a matrix is below 50% (i.e. less than half of its values are non-zero), we can get space savings from storing it in the sparse COO format, relative to the dense format. Note that while this property does not vary with nrows or ncols, it does vary with the # of bytes needed to store a value in each format. This relation could be more abstractly defined as:

             # of bytes per value in dense
sparsity < -------------------------------
             # of bytes per value in COO

So COO is great because it allows us to save space on some classes of sparse matrices, relative to the dense format. But what’s wrong with COO? Well, a lot of things as it turns out. First of all, the space to store each element has doubled. Sure, we get space savings on some matrices, but those 16 bytes per non-zero still seem wasteful, especially since all of the extra space is used to store data that is very repetitive: every value in the rows array is >= 0 and < nrows, with a similar property holding for the columns array. Additionally, say I asked you to give me the value stored at cell (980, 1020). If we don’t assume any sorting of the rows or columns arrays, that single lookup requires looking at every single non-zero value. In the dense format, you would just jump to offset 980 * ncols + 1020 in your array and be done with it.

Fortunately, applied mathematicians and computer scientists don’t believe in too many choices ever being a bad thing. Because sparse matrices are so important, decades of work have gone into storing, accessing, and computing on them efficiently. As a result, there’s a huge amount of choice available in storage formats for sparse matrices. The remainder of this post will briefly explain the most common formats in use today, as well as the reasoning/intuition behind using one or the other. In particular, we’ll look at Compressed Row (CSR), Block Compressed Row (BSR), Diagonal (DIA), and Ellpack-Itpack (ELL). Some less significant (but still important) formats which we’ll be skipping are Compressed Column Storage (CSC), Sliced ELL (SELL), Extended BSR (BSRX), and Hybrid (HYB). Understanding each of these will be straightforward once you understand CSR, BSR, DIA, and ELL.

Before diving into these formats, I need to apologize for being a bit hand-wavy about the efficiency of actually working with these formats for useful things (like matrix-matrix multiply, matrix-vector multiply, etc.). We’ll reason about the complexity of a lookup in these different formats and their space efficiency, but understanding the performance trade-offs will require another post and some real performance numbers on different sparse matrix implementations. However, the information below is a comprehensive introduction to modern sparse matrix formats and a necessary prerequisite to reasoning about their performance in the real world.

CSR: CSR is a simple, incremental improvement over COO. One of the most wasteful parts of COO is its rows array. Say you have a three-row matrix and you store it in COO:

unsigned rows[5]    = { 0,   0,   0,   1,   2   };
unsigned columns[5] = { 0,   3,   5,   0,   3   };
double values[5]    = { 3.0, 4.0, 1.0, 1.0, 2.0 };

That’s a lot of redundant data being stored in rows. We’re using 5 integers to differentiate between just three values: row 0, 1, or 2. How can we compress this? Well, how about keeping columns and values sorted by row, but only storing the number of items in each row, rather than the row index for each item:

unsigned row_counts[3] = { 3,   1,   1   };
unsigned columns[5]    = { 0,   3,   5,   0,   3   };
double values[5]       = { 3.0, 4.0, 1.0, 1.0, 2.0 };

Nice, that saved us two integers! Now, row_counts[i] tells us how many non-zero values are in row i, so row_counts only has to be as long as nrows rather than nnz. As long as the average number of non-zero values per row is >1 that should save us space.

Now, what if I asked you to do a lookup on this format? Say we want to fetch the value at cell (2, 3). First, you’d need to figure out where row 2 starts in the columns and values arrays. You can do that by recognizing that the values belonging to row i in columns and values start where the values of row i-1 end, i.e. at offset sum(row_counts[0:i-1]). So to find the start of row 2 you would sum row_counts[0] and row_counts[1] to get 4. You’d then go to offset 4 in columns and check the value there. It matches the column we’re searching for (3), so you’d fetch values[4] and return that.

To do that lookup, you had to scan through all of row_counts to figure out the offset of values from the target row in the columns and values arrays. That scan is going to get pretty expensive on our matrix with 1,000,000 rows. Instead, what if we just store the row offsets directly, so that calculating the offset of row i in columns or values just requires checking row_offsets[i]:

unsigned row_offsets[3] = { 0,   3,   4   };
unsigned columns[5]     = { 0,   3,   5,   0,   3   };
double values[5]        = { 3.0, 4.0, 1.0, 1.0, 2.0 };

In this format, row_offsets[i] == sum(row_counts[0:i-1]). Now, looking for (2, 3) only requires checking row_offsets[2] to find the offset of row 2 in columns and values. The rest of the process is the same as above.

And that’s CSR! It is identical to COO in its storage of columns and values, but stores offsets of each row in columns and values, rather than the actual row index for each non-zero item (as in COO). This is great for a couple of reasons. First, it limits the length of row_offsets to just nrows. As long as the average number of non-zeroes per row is > 1, this saves us space relative to COO. CSR also allows us to quickly find all of the values belonging to a row i by just jumping to offset row_offsets[i] in columns and values. This is much more efficient than the scan we had to do in COO.

One minor extension that all standard libraries for CSR use is to add one element to the end of row_offsets: the total number of non-zero entries in the matrix. This extension would change our previous row_offsets to:

unsigned row_offsets[4] = { 0, 3, 4, 5 };

While this change adds one integer to the row_offsets array, it allows us to calculate the length of row i as row_offsets[i+1] - row_offsets[i] without special-casing the last row. Additionally, since the number of non-zero entries in the matrix is a value you need to store either way, this doesn’t really cost us any space. We simply store nnz in row_offsets[nrows] rather than in a separate scalar variable.
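Putting the pieces together, a full CSR lookup might look like the following sketch (csr_get is a hypothetical helper name; it assumes the extended row_offsets array, and the scan within a row is linear, though a binary search would also work if each row’s columns are kept sorted):

// CSR lookup for cell (r, c). Returns 0.0 if no entry exists.
double csr_get(const unsigned *row_offsets, const unsigned *columns,
               const double *values, unsigned r, unsigned c) {
    // All of row r's non-zeroes live in the half-open range
    // [row_offsets[r], row_offsets[r + 1]) of columns and values.
    for (unsigned i = row_offsets[r]; i < row_offsets[r + 1]; i++) {
        if (columns[i] == c) {
            return values[i];
        }
    }
    return 0.0;
}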

BSR: We saw that CSR was an incremental improvement/extension over COO that enables more efficient storage of row information and faster lookups. Likewise, BSR is an extension to CSR that focuses on optimizing the performance of computation with these matrices, at the cost of some space. We won’t really go into the details of why BSR might be faster than CSR beyond saying that it aims to improve the regularity, locality, and load balance of common matrix computations.

Say we’re given a matrix of size 4 x 4:

[ 1 2 0 0 ]
[ 1 0 0 0 ]
[ 0 0 4 5 ]
[ 0 0 3 6 ]

We can clearly store this in a dense, COO, or CSR format. However, we can also observe that there’s some structure to this matrix. In particular, there are two non-zero “sub-matrices” of size 2 x 2 starting at (0, 0) and (2, 2). It turns out that this blocked pattern is common in real world matrices. BSR is a format optimized to handle matrices that exhibit this blocking.

BSR starts by splitting the input matrix into blocks of size B x B. For this example matrix, a natural block size would be B=2. Doing this conversion on the example matrix gives us:

[ 1 2 ]  [ 0 0 ]
[ 1 0 ]  [ 0 0 ]

[ 0 0 ]  [ 4 5 ]
[ 0 0 ]  [ 3 6 ]

We can think of this new matrix as a 2 x 2 matrix of matrices. BSR then takes this matrix of matrices, and stores it in CSR format! First, it uses a row_offsets array and columns array just like CSR to store the indices of non-zero blocks in the blocked matrix, instead of the original matrix. Since blocks (0, 1) and (1, 0) are all zeroes in the blocked matrix we ignore them just like we ignore zero cells in CSR.

unsigned block_size = 2;
unsigned block_row_offsets[3] = { 0, 1, 2 };
unsigned block_columns[2] = { 0, 1 };

These arrays tell us that we have two rows-of-blocks in our matrix, each with one non-zero block. The non-zero block in block-row 0 is in block-column 0, and the non-zero block in block-row 1 is in block-column 1. block_size tells us that each row-of-blocks has a height of 2, and each column-of-blocks has a width of 2.

The values array in BSR stores all of the values in each non-zero block (even if there are some zeros!). Since we’re in C, we’ll store the blocks in row-major order here:

unsigned block_size = 2;
unsigned block_row_offsets[3] = { 0, 1, 2 };
unsigned block_columns[2] = { 0, 1 };
double values[2 * block_size * block_size] =
    { 1.0, 2.0, 1.0, 0.0,   // block (0, 0) in row-major order
      4.0, 5.0, 3.0, 6.0 }; // block (1, 1) in row-major order

Note that even though the last element of block (0, 0) is a zero, we still store it in BSR to keep all block sizes equal. As a result, you may actually end up having to store some zero values in BSR that you didn’t in CSR. We really shouldn’t think about BSR as a space optimization over CSR or COO, but rather as a time optimization. We won’t look at the performance benefits in detail here, but we will dive into them in a later post.

Now if I asked for value (2, 3) in the original matrix, the lookup is a two-step process. First, we need to find the block that (2, 3) belongs to. Then, we need to jump to that block (if it exists) and find the intra-block offset that stores (2, 3). If the block doesn’t exist, the value must be 0.0.

Finding the block is straightforward: integer-divide the target (row, column) coordinates by the block size to get the (block-row, block-column) coordinates of its containing block. This produces (1, 1) for (2, 3), given a block size of 2.

We can then jump to the values stored in block (1, 1) using block_row_offsets, block_columns, and the techniques we learned for CSR lookups. In particular, we find that the values for block (1, 1) are stored at an offset of 1 * block_size * block_size in values. We then calculate an intra-block coordinate of (0, 1) by subtracting the block’s starting coordinate of (2, 2) from the target coordinate of (2, 3), and calculate the intra-block offset in row-major order as 0 * block_size + 1. Combining the block offset of 1 * block_size * block_size and the intra-block offset of 0 * block_size + 1 we get an offset in values of block_size*block_size+1 = 5. values[5] stores 5.0, which we can check against the original matrix as the correct value for cell (2, 3).

This lookup is a bit more complex to understand and implement than looking up values in a CSR matrix, but in practice it is no less efficient, and it is still much better than a full COO scan.
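For the curious, the two-step lookup might be sketched as follows (bsr_get is a hypothetical helper name; it assumes square blocks of size B, stored contiguously in row-major order as in the example above):

// BSR lookup for cell (r, c). Returns 0.0 if the containing
// block stores no non-zeroes.
double bsr_get(const unsigned *block_row_offsets,
               const unsigned *block_columns, const double *values,
               unsigned B, unsigned r, unsigned c) {
    const unsigned block_row = r / B;
    const unsigned block_col = c / B;
    // CSR-style scan over the non-zero blocks in this block-row
    for (unsigned i = block_row_offsets[block_row];
         i < block_row_offsets[block_row + 1]; i++) {
        if (block_columns[i] == block_col) {
            // Offset of the i-th stored block, plus the
            // intra-block offset in row-major order
            return values[i * B * B + (r % B) * B + (c % B)];
        }
    }
    return 0.0;
}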

DIA: We’ve already covered several matrix storage formats, so let’s put them in context. The dense format is the most data-agnostic format we’ve seen so far. It makes no assumptions about what it’s storing, and simply stores everything. COO and CSR are less agnostic; data stored in the COO or CSR format should be sparse, and if it isn’t you’ll pay for it in wasted space and time. BSR constrained the problem even more by focusing on sparse matrices that are sparse in a blocked pattern.

Now we come to the most constrained format yet, DIA. DIA (or Diagonal) focuses on efficiently storing and computing on sparse matrices that are heavily diagonal, i.e. most of the non-zero values in the matrix are stored along diagonals:

[ 1 2 0 0 ]
[ 3 1 2 0 ]
[ 0 3 1 2 ]
[ 0 0 3 1 ]

At a high level, DIA takes the original matrix and creates a new, transformed matrix that stores each non-zero diagonal of the original matrix as a row of the new matrix. The matrix above has three non-zero diagonals:

  1. One diagonal starting at (0,0) and terminating at (3,3), filled with 1s and with 4 entries.
  2. One diagonal starting at (0,1) and terminating at (2,3), filled with 2s and with 3 entries.
  3. One diagonal starting at (1,0) and terminating at (3,2), filled with 3s and with 3 entries.

To store it in DIA, the matrix above would be translated into a new matrix with 3 rows (for 3 non-zero diagonals) and 4 columns (for a maximum diagonal length of 4):

[ 0 3 3 3 ]
[ 1 1 1 1 ]
[ 2 2 2 0 ]

Zeros are padded to diagonals stored in the new matrix when they are shorter than the maximum diagonal length. In addition to an array to store this matrix, the only other array needed is a diagonal_ids array that stores unique diagonal IDs for each non-zero diagonal. These diagonal IDs and the stored non-zero diagonals are sufficient to recover the original matrix:

unsigned max_diagonal_length = min(nrows, ncols);
int diagonal_ids[3] = { -1, 0, 1 };
double values[3 * max_diagonal_length] =
    { 0.0, 3.0, 3.0, 3.0,
      1.0, 1.0, 1.0, 1.0,
      2.0, 2.0, 2.0, 0.0 };

diagonal_ids[i] stores the diagonal ID for the diagonal stored in the ith row of values. The diagonal starting at (0, 0) is usually given diagonal ID zero. All diagonals beneath it are negative, decreasing as they get further from the zero diagonal. All diagonals above it are positive, increasing as they get further away. (Note that diagonal_ids must be a signed type to store these negative IDs.) These diagonal IDs make it easy to calculate the diagonal a given set of coordinates belongs to: cell (r, c) belongs to diagonal c - r. For example, cell (2, 1) in the original matrix belongs to diagonal c - r = 1 - 2 = -1.

Lookups for a coordinate (r,c) in the original matrix are surprisingly straightforward in the DIA format. First, we check for a match between the diagonal ID c-r and any entries in diagonal_ids. As you should only store highly diagonal matrices in DIA format, the length of diagonal_ids should be small, keeping this scan fast. If a match is found at diagonal_ids[i], the value at cell (r,c) in the original matrix is stored in values in row i, column r.
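A sketch of that lookup (dia_get is a hypothetical helper name; it assumes values is stored row-major with one padded diagonal per row, using the padding convention shown above, which keeps the entry for matrix row r at column r of its diagonal):

// DIA lookup for cell (r, c). Returns 0.0 if no stored diagonal
// has ID c - r.
double dia_get(const int *diagonal_ids, const double *values,
               int n_diagonals, int max_diag_length, int r, int c) {
    const int target_id = c - r; // cell (r, c) lies on diagonal c - r
    for (int i = 0; i < n_diagonals; i++) {
        if (diagonal_ids[i] == target_id) {
            // Row i of values, column r
            return values[i * max_diag_length + r];
        }
    }
    return 0.0;
}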

We’ve already stated that the DIA format is the least flexible format we’ve looked at so far. How bad could it be? Suppose we stored our 1,000,000 x 1,000,000 matrix in the DIA format, and suppose entry (0,999999) was non-zero. Sitting in the top right corner of the matrix, (0,999999) is a single-element diagonal. However, because we are storing in DIA format we have to store a full 1,000,000 entries for every diagonal, no matter how long it actually is. This leads to a waste of 999,999 * 8 bytes for just that one diagonal, for something that could be stored in 16 bytes in COO.

However, using the DIA format for extremely diagonal matrices can produce both space savings and more regular, well-performing computation than the formats we’ve looked at so far.

ELL: We now come to the final format, ELL. Like DIA, ELL is a relatively specialized format with few similarities to CSR, COO, and BSR. At a high-level, ELL packs a matrix into as few columns as is allowed by the densest row. Consider the matrix below:

[ 1 2 3 0 ]
[ 0 4 5 0 ]
[ 0 6 0 7 ]
[ 8 0 0 0 ]

Converting this matrix to ELL starts by finding the row with the most non-zero values. In this case, row 0 has 3 non-zero values. ELL creates two arrays to store this matrix. A values array is created with the dimensions nrows x max_nnz_per_row, where max_nnz_per_row is the value determined by the previous step. A columns array is created with the same dimensions. ELL packs each row of the original matrix into the corresponding row of the values matrix by deleting all zeroes from the original row, packing the remaining non-zero values into the leftmost cells of the new row, and padding any leftover space in that row with zeroes. ELL stores the original column indices for each value in the columns array so that they can be recalled later. Doing this transformation on the matrix above would produce the following two matrices:

values = [ 1 2 3 ]    columns = [  0  1  2 ]
         [ 4 5 0 ]              [  1  2 -1 ]
         [ 6 7 0 ]              [  1  3 -1 ]
         [ 8 0 0 ]              [  0 -1 -1 ]

which could then be stored in arrays in row-major or column-major order. Note the -1 entries in columns: since padded cells have no original column index, a sentinel value like -1 is used to mark them.

Doing a lookup in ELL format is relatively straightforward: find the same row in the new columns array and scan it for the correct column number. If it is found, return the corresponding cell in the values array. If not, the value must be zero.
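A sketch of that lookup (ell_get is a hypothetical helper name; it assumes row-major storage and -1 as the padding sentinel in columns, as in the example above):

// ELL lookup for cell (r, c). Returns 0.0 if row r stores no
// entry for column c.
double ell_get(const int *columns, const double *values,
               int max_nnz_per_row, int r, int c) {
    for (int k = 0; k < max_nnz_per_row; k++) {
        const int col = columns[r * max_nnz_per_row + k];
        if (col == c) {
            return values[r * max_nnz_per_row + k];
        }
        if (col == -1) {
            break; // hit padding: no more entries in this row
        }
    }
    return 0.0;
}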

While ELL may be somewhat space inefficient by storing many unnecessary zeroes, past work has shown that ELL performs well on vector architectures, such as CPUs with vector units or GPUs. Generally, people storing their matrices in ELL are optimizing for these types of architectures.

There are a huge variety of sparse matrix storage formats available, and we’ve only covered a fraction here. The correct choice of sparse matrix format depends on your hardware (architecture, storage, memory, …), the sparse matrix library being used (MKL, Trilinos, CUSPARSE, CUSP, …), and the characteristics of your matrices (non-zeroes per row, sparsity, …). In a future post, the background provided on sparse matrices in this post will be used to reason about the observed performance of different formats when using various software implementations of sparse matrix computation running on different types of hardware. For more information in the meantime, see below for research papers studying the performance of different matrix formats on different architectures.

Recommended Readings (in no particular order):

  1. Bell, Nathan and Garland, Michael. “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”.
  2. Williams, Samuel and Oliker, Leonid and Vuduc, Richard and Shalf, John and Yelick, Katherine and Demmel, James. “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”.
  3. Liu, Xing and Smelyanskiy, Mikhail and Chow, Edmond and Dubey, Pradeep. “Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors”.
  4. Davis, John D and Chung, Eric S. “SpMV: A Memory-Bound Application on the GPU Stuck Between a Rock and a Hard Place”.
  5. Byun, Jong-Ho and Lin, Richard and Yelick, Katherine A and Demmel, James. “Autotuning Sparse Matrix-Vector Multiplication for Multicore”.

For those interested in a more programmatic representation of sparse matrix formats, below are some structs that could be used to minimally represent each format discussed above. For simplicity C arrays are used. In practice (e.g. when loading matrices from a file), these structures would be dynamically allocated.

// DENSE
typedef struct dense_matrix_ {
    int nrows;
    int ncolumns;
    double values[nrows][ncolumns];
} dense_matrix;

// COO
typedef struct coo_matrix_ {
    int nnz;
    unsigned rows[nnz];    // row index for each non-zero value
    unsigned columns[nnz]; // column index for each non-zero value
    double values[nnz];    // each non-zero value
} coo_matrix;

// CSR
typedef struct csr_matrix_ {
    int nrows;
    int nnz;
    // The offset in columns, values of each row's representation
    unsigned row_offsets[nrows + 1];
    unsigned columns[nnz]; // same as COO
    double values[nnz];    // same as COO
} csr_matrix;

// BSR
typedef struct bsr_matrix_ {
    int nrows;
    int ncolumns;
    int block_size; // evenly divides nrows and ncolumns
    int nnzb;       // # of non-zero *blocks*
    unsigned block_row_offsets[nrows / block_size + 1];
    unsigned block_columns[nnzb];
    double values[nnzb][block_size][block_size];
} bsr_matrix;

// DIA
typedef struct dia_matrix_ {
    int nrows;
    int ncolumns;
    int max_diag_length; // == min(nrows, ncolumns)
    int n_non_zero_diagonals;
    int diagonal_ids[n_non_zero_diagonals];
    double diagonal_values[n_non_zero_diagonals][max_diag_length];
} dia_matrix;

// ELL
typedef struct ell_matrix_ {
    int nrows;
    int max_nnz_per_row;
    int columns[nrows][max_nnz_per_row];
    double values[nrows][max_nnz_per_row];
} ell_matrix;
