Predefined sparsity

Sparse Matrices in Pytorch

Part 2: GPU runtimes

Sourya Dey
TDS Archive

--

In part 1, I analyzed the execution times for sparse matrix multiplication in Pytorch on a CPU. Here’s a quick recap:

  • A sparse matrix has a lot of zeroes in it, so it can be stored and operated on in ways different from a regular (dense) matrix
  • Pytorch is a Python library for deep learning which is fairly easy to use, yet gives the user a lot of control. Pytorch stores sparse matrices in the COOrdinate format and has a separate API called torch.sparse for dealing with them (a minimal example follows this list).
  • The CPU I used was my own Macbook Pro — mid 2014 with a 2.2 GHz Intel Core i7 processor and 16 GB of RAM.
  • The key finding from part 1 was: 2 dense matrices always multiply faster than a sparse and dense matrix unless the sparse matrix has very low density. ‘Very low’ seems to be 1.5% and below.
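
Here is that minimal example of the COO format and the torch.sparse API. The sizes and tensor names are placeholders chosen purely for illustration:

```python
import torch

# COO format: only the coordinates and values of the non-zero entries are stored
indices = torch.tensor([[0, 1, 2],    # row indices of the non-zeros
                        [0, 1, 2]])   # column indices of the non-zeros
values = torch.tensor([1.0, 2.0, 3.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))  # a 3x3 diagonal matrix

dense = torch.randn(3, 4)

# Sparse x dense multiplication goes through the torch.sparse API
result = torch.sparse.mm(sparse, dense)

# The equivalent dense x dense multiplication, for comparison
result_dense = torch.mm(sparse.to_dense(), dense)
```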

One of the key selling points of deep learning frameworks such as Pytorch and Keras is their deployability on GPUs, which massively speeds up computation. This comes with limitations though: Pytorch only supports CUDA-compatible GPUs. CUDA is a computing framework created by Nvidia which leverages parallel processing on compatible GPUs. Not all GPUs qualify; for example, my Macbook has Intel Iris Pro Graphics, which unfortunately is not CUDA-compatible. So I frequently use Amazon Web Services (AWS), which offers cloud computing resources. For the experiments in this article, I used an AWS p3.2xlarge instance, which has 1 Nvidia V100 GPU and is pretty powerful (as an example, training an 11-layer deep CNN on CIFAR-100 is 100 times faster on the p3 as compared to my Macbook).
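
As a quick aside before the experiments, this is roughly what checking for a CUDA-compatible GPU and moving work onto it looks like in Pytorch (a sketch for orientation, not part of the benchmark scripts):

```python
import torch

# Use the GPU if Pytorch can see a CUDA-compatible one, otherwise fall back to the CPU.
# On the p3.2xlarge this picks up the Nvidia V100; on my Macbook it falls back.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

x = torch.randn(1024, 1024, device=device)  # created directly on the chosen device
y = torch.randn(1024, 1024).to(device)      # or created on the CPU and moved over
z = torch.mm(x, y)                          # the multiplication runs on that device
```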

I repeated the same 3 experiments from Part 1 on the p3 instance with some minor changes, and compared these results to the CPU results. Let’s dive in!

Both size and density varying

As in part 1, all matrices are square. The sparse case is multiplying a diagonal matrix with a dense matrix, which means the density of the sparse matrix is 1/n (n = #rows). The dense case is multiplying 2 dense matrices; however, both dense matrices are randomly generated, unlike part 1 where one dense matrix was simply the ‘densified’ version of the sparse matrix. This does not affect the results qualitatively; I only did it because it made the timing code easier to set up (more on this later). Here are the results:

Accessing overheads dominate in the inset figure, which shows lower values of n on a magnified y-axis. Barring these, the trends are clear: a) the GPU is much faster than the CPU, and b) sparse matrices execute faster than dense ones, due to the extremely low densities of diagonal matrices.
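
For concreteness, here is roughly what the two cases being timed look like. This is only a sketch of the operations; the actual timing scripts (and the exact way the matrices are generated) are in the Github repo:

```python
import torch

n = 1024  # one of the sizes being swept

# Sparse case: an n x n diagonal matrix (density 1/n) times a dense matrix
idx = torch.arange(n)
diag = torch.sparse_coo_tensor(torch.stack([idx, idx]),  # coordinates of the diagonal
                               torch.randn(n), size=(n, n)).cuda()
dense_operand = torch.randn(n, n).cuda()
sparse_result = torch.sparse.mm(diag, dense_operand)

# Dense case: two independently generated random dense matrices
a = torch.randn(n, n).cuda()
b = torch.randn(n, n).cuda()
dense_result = torch.mm(a, b)
```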

Density fixed, size varying

The sparse matrices now have a fixed density of 12.5%, while the dense cases are the same as before. Here are the results:

While GPUs again perform better overall, this time it’s the sparse cases which take more time than their dense counterparts. Which brings us to the final experiment: the impact of density on sparse matrix computation times. As a side note, sparse matrices on GPUs seem to have very large overheads which totally dominate for low values of n. Whether this is a limitation of GPUs or of the COO format is unknown to me.
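
For reference, here is one simple way to generate a sparse COO matrix with a given density. It is a sketch under the assumption that coordinates are drawn uniformly at random, and may differ in detail from the generation code in the Github repo:

```python
import torch

n, density = 1024, 0.125
nnz = int(density * n * n)  # number of non-zero elements

# Draw nnz random (row, col) coordinates. Duplicates are merged by coalesce(),
# so the realised density can come out marginally below the target.
rows = torch.randint(0, n, (nnz,))
cols = torch.randint(0, n, (nnz,))
sparse = torch.sparse_coo_tensor(torch.stack([rows, cols]), torch.randn(nnz),
                                 size=(n, n)).coalesce().cuda()
dense = torch.randn(n, n).cuda()
product = torch.sparse.mm(sparse, dense)
```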

Size fixed, density varying

While low-density sparse diagonal matrices are faster to operate on than dense ones, this result flips for higher-density sparse matrices. This experiment varies the density for a sparse matrix of a given size and measures execution time. The sizes (values of n) I used are a bit different from part 1; here they are 64, 256, 1024, and 4096. The density is swept through powers of 2. As in part 1, the red dashed line shows computation time for the dense case, while the black dashed lines indicate the densities from the previous two experiments: diagonal matrices and 12.5%.

The curves for n = 64 and 256 behave a bit funnily near the beginning, showing linear-ish growth for very low density values. The only explanation I can think of is that the memory where the sparse matrix is stored is accessed differently when the number of non-zero elements is very low. It seems like the accessing overheads don’t set in until an appreciable density is reached, then they dominate for a while (the flat portion), before yielding to computation times (the exponential portion).

Nevertheless, we note once again that there are no practical gains to be had by using sparse matrices instead of dense ones. For n = 64, 256 and 1024, sparse multiplication always takes more time than dense, while for n = 4096, a density of only about 1.5% is enough for sparse multiplication to become slower than dense.

In summary, there are two major takeaways from this article. Firstly, GPUs (at least the modern and powerful ones) are significantly faster than CPUs and should always be used for linear algebra if possible. I say ‘if possible’ because AWS instances are not free. However, if you are a funded researcher, convincing your supervisor to pay for AWS GPU instances is well worth it if you’ll be doing linear-algebra-intensive tasks such as deep learning. Secondly, the major finding from part 1 holds on GPUs as well, i.e. 2 dense matrices always multiply faster than a sparse and a dense matrix unless the sparse matrix has very low density (< 1.5%).

Thus, there doesn’t seem to be much benefit to using sparse matrices in Pytorch. As far as my research on pre-defined sparsity is concerned, densities of 1.5% and below are too low to be of any use. These issues with sparse matrices have been raised by other Pytorch users as well, so here’s hoping that the devs come up with more efficient ways of handling them.

Measuring execution times

The code and images for this article can be found on Github here. If you browse the code, you’ll find that I used the timeit module provided by Python for measuring the execution time of code snippets. For the most accurate measurements, it is good practice to set up all variables in the setup argument to timeit.Timer(). This is as opposed to defining variables beforehand and then giving timeit.Timer() access to the global namespace, which wastes time.
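
Here is a sketch of that pattern, timing the sparse diagonal multiplication on the GPU. The sizes and repeat counts are illustrative, and I’ve added torch.cuda.synchronize() to make sure the GPU has actually finished before the timer stops; the scripts in the repo may differ in these details:

```python
import timeit

# Everything the timed statement needs is built inside setup, so tensor creation
# and the CPU-to-GPU transfer are excluded from the measured time.
setup = '''
import torch
n = 4096
idx = torch.arange(n).repeat(2, 1)  # coordinates of the diagonal
sp = torch.sparse_coo_tensor(idx, torch.randn(n), size=(n, n)).cuda()
dn = torch.randn(n, n).cuda()
'''
stmt = 'torch.sparse.mm(sp, dn); torch.cuda.synchronize()'

timer = timeit.Timer(stmt=stmt, setup=setup)
print(min(timer.repeat(repeat=5, number=100)) / 100)  # seconds per multiplication
```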

As an alternative to the timeit module, Jupyter notebooks provide the %timeit magic command, which is what I used in part 1. The results of the 2 approaches are the same (they had better be!), but %timeit is slightly easier to use since it automatically takes care of averaging across many runs. However, running Jupyter notebooks on AWS, while possible, is cumbersome and a lot could depend on the server speed. As a result, I stuck to good old Python scripting for part 2.

That’s all for now!

Sourya Dey is pursuing a PhD at the University of Southern California. His research deals with exploring complexity reduction in deep learning. You can read more about him on his website.
