Single-GPU CuPy Speedups
Array operations with NVIDIA GPUs can provide considerable speedups over CPU computing, but the amount of speedup varies greatly depending on the operation. The intent of this blog post is to visualize the performance of Chainer's CuPy across a variety of operations. Dask can certainly be plugged in to enable multi-GPU performance gains, as discussed in this post from March, but here we will look only at single-GPU performance with CuPy.
Hardware and Software Setup
Before going any further, note that the following hardware and software were used for all performance results described in this post:
- System: NVIDIA DGX-1
- CPU: 2x Intel Xeon E5–2698 v4 @ 2.20GHz
- Main memory: 1 TB
- GPU: NVIDIA Tesla V100 32 GB
- Python 3.7.3
- NumPy 1.16.4
- Intel MKL 2019.4.243
- CuPy 6.1.0
- CUDA Toolkit 9.2 (10.1 for SVD, see Increasing Performance section)
We have generated a graph comprising various operations. Most of them perform well on a GPU using CuPy out of the box. See the graph below:
We have recently started working on a simple suite to help us visualize performance quickly and reliably, as well as to automate some of the plotting; it is what was used to generate the plot above. The suite is still incomplete and lacks documentation, both of which we intend to improve over the coming days. If you're interested in exactly how the synthetic data was generated and which compute operations were compared, you can look at this file. This post won't go into much detail about how the suite works right now, but we intend to write about it in the near future, once it's a bit more mature and easier to use.
As seen in the graph, elementwise operations can reach a 270x speedup. Not bad for not having to write any parallel code by hand. However, the speedup is heavily affected by the nature of each operation. We won't go too deeply into why each operation performs differently in this post, but we may pick that up in a future post.
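To give a feel for what "no parallel code by hand" means in practice, here is a minimal sketch of an elementwise comparison. The array size and the specific operation are illustrative choices of ours, not the exact benchmark code from the suite; CuPy mirrors the NumPy API, so the same function runs on the GPU by swapping the imported module.

```python
import numpy as np

# Swap this for `import cupy as xp` to run the identical code on a GPU;
# CuPy mirrors the NumPy API, so nothing else needs to change.
xp = np

def elementwise(a):
    # A scalar operation applied independently to every element of the array.
    return a * 2.0 + 1.0

a = xp.arange(10, dtype=xp.float64)
result = elementwise(a)
print(result)
```

Because each output element depends only on the corresponding input element, this is exactly the kind of operation a GPU parallelizes trivially, which is why elementwise ops sit at the top of the speedup chart.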
Let us briefly describe each of the operations from the graph above:
- Elementwise: scalar operation applied to all elements of the array
- Sum: compute the sum of the entire array, reducing it to a single scalar, using CUB (still under development)
- Standard deviation: compute the standard deviation of the entire array, reducing it to a single scalar
- Array slicing: select every third element of the first dimension
- Matrix multiplication: multiplication of two square matrices
- FFT: Fast Fourier Transform of a matrix
- SVD: Singular Value Decomposition of a matrix (tall-and-skinny for the larger array)
- Stencil (not a CuPy operation!): uniform filtering with Numba
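The CuPy operations above can be sketched through the NumPy API, which CuPy mirrors. The shapes and parameters here are illustrative approximations, not the suite's exact configuration, and the stencil case is omitted since it uses Numba rather than CuPy.

```python
import numpy as np

xp = np  # replace with `import cupy as xp` for the GPU versions

a = xp.random.random((100, 100))  # small stand-in for the 10000x10000 case

elementwise = a * 2.0         # scalar op on all elements
total = a.sum()               # reduce the whole array to a single scalar
std = a.std()                 # standard deviation, also a full reduction
sliced = a[::3]               # every third element of the first dimension
matmul = a @ a                # square matrix multiplication
fft = xp.fft.fft2(a)          # Fast Fourier Transform of the matrix
u, s, vt = xp.linalg.svd(a)   # Singular Value Decomposition
```

Each of these is a single NumPy-style call, which is the point: the wide range of speedups in the graph comes entirely from how each operation maps to the GPU, not from any difference in user code.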
It's important to note that there are two array sizes: 800 MB (10000x10000) and 8 MB (1000x1000), double-precision floating-point (8 bytes per element) in both cases. The SVD array size is an exception, where the large case is actually a tall-and-skinny array of size 10000x1000, or 80 MB.
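The sizes follow directly from the element counts, since each float64 element occupies 8 bytes. A quick check (the helper function name here is ours, for illustration only):

```python
def array_nbytes(shape, itemsize=8):
    """Total bytes for a dense array of float64 (8-byte) elements."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize

print(array_nbytes((10000, 10000)) / 1e6)  # 800.0 MB
print(array_nbytes((1000, 1000)) / 1e6)    # 8.0 MB
print(array_nbytes((10000, 1000)) / 1e6)   # 80.0 MB, the tall-and-skinny SVD case
```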
When we first ran these operations, we actually saw a performance decrease in a couple of cases.
While it’s true that GPUs are not always faster, we did expect these operations in particular to be faster than their CPU counterparts. This had us puzzled.
Upon further investigation, we found that each of these issues had either already been fixed or was actively being fixed by others within the ecosystem.
- SVD: CuPy's SVD links to the official cuSolver library, which received a major speed boost for these kinds of solvers in CUDA 10.1 (thanks to Joe Eaton for pointing us to this!). We originally had CUDA 9.2 installed, where things were still quite a bit slower.
Note: Most of the results above still use CUDA 9.2. Only the SVD result uses CUDA 10.1.
- Sum: CuPy's sum code was genuinely quite slow. However, Akira Naruse already had an active pull request to speed this up using CUB, a library of collective primitives for CUDA.
We learned a lot from the work described here. In particular it was gratifying to see that the performance issues we saw had already been identified and corrected by other groups. This highlights one of the many benefits of working in an open source community: things get faster without you having to do all of the work.
To better understand scalability, it would be interesting to check the speedups at various other sizes. And in case it passed unnoticed: there was no Dask speed evaluation here, and that is definitely something that needs to be done as well!
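A size sweep like the one suggested above could be sketched as follows. This is a hypothetical loop of ours, not the suite's code; with CuPy, a device synchronization is needed before stopping the clock, since GPU kernels launch asynchronously.

```python
import time
import numpy as np

xp = np  # swap for `import cupy as xp` to time the GPU path
# (with CuPy, call xp.cuda.Stream.null.synchronize() before reading the clock)

# Time the same elementwise operation at several array sizes.
for n in (100, 1000):
    a = xp.random.random((n, n))
    start = time.perf_counter()
    b = a * 2.0 + 1.0
    elapsed = time.perf_counter() - start
    print(f"{n}x{n}: {elapsed:.6f} s")
```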
It's also important to have a standard way of comparing performance; for that, the suite should be improved, made more general-purpose, and properly documented as well.