Accelerated Python: CuPy

Faster Matrix Operations on GPUs

Dagang Wei
3 min read · May 29, 2024

This blog post is part of the series Accelerated Python.

Introduction

Matrix operations are fundamental in fields like data science, machine learning, and scientific simulations. While CPUs have long been workhorses for these calculations, modern GPUs offer immense potential for acceleration due to their parallel processing prowess. In this post, we’ll delve into using Python, CUDA, and your GPU to dramatically speed up matrix computations. We’ll focus on the seamless integration between NumPy and CuPy, and demonstrate the performance advantage with a benchmark example.

Why GPUs Excel at Matrix Operations

Central Processing Units (CPUs) are built around a handful of powerful cores optimized for executing instructions with low latency. Graphics Processing Units (GPUs), by contrast, were conceived for highly parallel tasks like rendering complex scenes: they pack thousands of simpler cores that operate on different parts of a problem simultaneously. Because a matrix operation decomposes into many independent calculations, it is perfectly suited to harness the parallelism of GPUs.

CUDA: Bridging the Gap to GPU Programming

CUDA (Compute Unified Device Architecture) is NVIDIA’s platform for general-purpose GPU programming. It provides a rich set of tools and libraries, allowing you to write code that executes directly on the GPU. CUDA enables you to:

  • Offload Workloads: Shift computationally heavy tasks from the CPU to the GPU.
  • Optimize Data Transfers: Efficiently manage data movement between CPU and GPU memory.
  • Leverage Hardware Accelerators: Utilize specialized GPU units for operations like matrix multiplication and linear algebra routines.
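The offload-and-transfer pattern from the list above can be sketched in a few lines with CuPy's `asarray`-style API. This is a minimal sketch, not the post's benchmark; the NumPy fallback exists only so the example also runs on machines without a CUDA GPU:

```python
import numpy as np

try:  # use CuPy when a CUDA device is available
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no GPU is present
    xp = cp
except Exception:  # no CuPy / no GPU: fall back to NumPy on the CPU
    xp = np

# Offload: build the operands in device memory (host memory under the fallback)
a = xp.arange(6, dtype=xp.float64).reshape(2, 3)
b = xp.ones((3, 2), dtype=xp.float64)

# The heavy work runs on the GPU when xp is cupy
c = a @ b

# Transfer: copy the result back to host memory when the CPU needs it
c_host = xp.asnumpy(c) if hasattr(xp, "asnumpy") else c
```

The `xp` alias is the common trick for writing one code path that targets either library; only the two transfer points (`asarray` in, `asnumpy` out) are CuPy-specific.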

Python’s Dynamic Duo: NumPy and CuPy

Python offers fantastic libraries for numerical computing:

  • NumPy: The cornerstone of numerical operations in Python, NumPy provides powerful array objects and a wide range of mathematical functions.
  • CuPy: The GPU-accelerated counterpart to NumPy. CuPy mirrors NumPy’s API, making it incredibly easy to port your existing NumPy code to run on the GPU. This seamless integration lets you tap into GPU acceleration with minimal changes to your codebase.
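That API parity means a single function body can serve both libraries. The helper below, `normalize_rows`, is a hypothetical example written against a module parameter `xp`, which can be bound to either `numpy` or `cupy`:

```python
import numpy as np

def normalize_rows(x, xp=np):
    # Works unchanged with xp=numpy or xp=cupy, because CuPy
    # mirrors NumPy's function names and semantics.
    norms = xp.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

a = np.array([[3.0, 4.0], [6.0, 8.0]])
out = normalize_rows(a)  # CPU path via NumPy
# With CuPy installed: normalize_rows(cp.asarray(a), xp=cp)
# runs the identical code on the GPU.
```

CuPy also ships `cp.get_array_module(x)`, which returns the right module for a given array, so library code can dispatch automatically instead of taking `xp` as a parameter.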

Benchmarking the Performance Difference

Let’s look at a practical example of how CuPy outperforms NumPy on matrix multiplication. The code is available in this Colab notebook.

import numpy as np
import cupy as cp
import time

# Matrix size (adjust for your hardware)
size = 2000
iterations = 100 # Number of iterations

# Create NumPy matrices
a_cpu = np.random.rand(size, size)
b_cpu = np.random.rand(size, size)

# Create CuPy matrices on the GPU
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)

# Time CPU calculation
start_cpu = time.time()
for _ in range(iterations):
    c_cpu = a_cpu @ b_cpu
time_cpu = time.time() - start_cpu

# Warm up the GPU so one-time setup costs don't skew the timing
c_gpu = a_gpu @ b_gpu
cp.cuda.Stream.null.synchronize()

# Time GPU calculation
start_gpu = time.time()
for _ in range(iterations):
    c_gpu = a_gpu @ b_gpu

# GPU kernels launch asynchronously; wait for them all to finish
cp.cuda.Stream.null.synchronize()

time_gpu = time.time() - start_gpu

print(f"CPU time (for {iterations} iterations): {time_cpu:.4f} seconds")
print(f"GPU time (for {iterations} iterations): {time_gpu:.4f} seconds")
print(f"Speedup: {time_cpu / time_gpu:.2f}x")

Output:

CPU time (for 100 iterations): 6.2800 seconds
GPU time (for 100 iterations): 0.1214 seconds
Speedup: 51.72x

In this example, we time matrix multiplications on both the CPU and GPU for multiple iterations. You’ll likely see a significant speedup with the GPU, especially for larger matrices.

Key Takeaways

  • CuPy’s Simplicity: CuPy’s API compatibility with NumPy makes transitioning your code remarkably straightforward.
  • Performance Gains: For larger datasets and computationally intensive operations, the GPU can offer substantial acceleration.
  • Data Transfer: Keep in mind that moving data between the CPU and GPU incurs overhead. For smaller operations, this overhead might outweigh the GPU’s computational benefits.
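A back-of-the-envelope way to see when transfer overhead matters: an n×n float64 matmul performs about 2n³ floating-point operations but only moves about 3n²·8 bytes across the bus (two inputs plus the result), so the compute-to-transfer ratio grows linearly with n. The function below just evaluates those standard operation counts; the numbers are not measured values:

```python
def flops_per_byte(n):
    # Dense n x n matmul: roughly 2*n**3 multiply-adds...
    flops = 2 * n ** 3
    # ...versus two float64 inputs and one output crossing the bus
    bytes_moved = 3 * n ** 2 * 8
    return flops / bytes_moved

small = flops_per_byte(64)    # low ratio: transfer time can dominate
large = flops_per_byte(2000)  # high ratio: compute dwarfs the copies
```

This is why the benchmark above uses size = 2000: at that scale the arithmetic dominates, while for tiny matrices the same code can run slower on the GPU than on the CPU.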

Conclusion

GPUs have revolutionized scientific computing, offering massive performance gains for tasks like matrix operations. With Python libraries like PyCUDA, Numba, and CuPy, harnessing this power has become more accessible than ever. By understanding the core concepts of CUDA and carefully optimizing your code, you can significantly accelerate your matrix computations and unlock new possibilities in your projects.
