Image generated by Stable Diffusion XL, depicting the Flash and his lightning-fast speed.

Supercharging Data Science | Using GPU for Lightning-Fast Numpy, Pandas, Sklearn, and Scipy

Learn how to harness the full potential of your GPU to turbocharge Numpy, Pandas, and Sklearn, and save valuable time in your data science workflows.

Ahmad Anis · Published in Red Buffer · 7 min read · May 24, 2023

Table of Contents

  • Why GPUs
  • Numpy and Scipy on GPU using CuPy
  • Numpy on GPU using JAX
  • SKLearn on GPU via Rapids CuML
  • Pandas on GPU via Rapids CuDF
  • Conclusion

Why GPUs: Boosting Numpy, Pandas, and Scikit-learn Performance

As a data scientist or machine learning engineer, you’re likely familiar with Numpy, Pandas, and Scikit-learn — the essential libraries that underpin much of the data processing and modeling in Python. However, as datasets grow in size and complexity, the limitations of CPU-based computation become increasingly apparent. This is where GPUs come in. By harnessing the power of parallel processing, GPUs can dramatically accelerate the performance of Numpy, Pandas, and Scikit-learn, revolutionizing your data science workflows. In this article, we’ll explore why GPUs are the future of data science and how they can unlock the full potential of these essential libraries.

Photo by Nana Dua on Unsplash

Numpy and Scipy on GPU using CuPy

CuPy is an open-source array library that speeds up numerical computations by running them on a graphics processing unit (GPU). It is free to use with Python and exposes a NumPy/SciPy-compatible API, taking advantage of the GPU's massively parallel architecture, which can perform certain kinds of calculations much faster than a regular CPU.

Let’s install it.

$ pip install cupy

However, installing with pip requires you to set up the NVIDIA CUDA Toolkit manually, which can be tricky, so let's instead install it from conda-forge, which handles this for us automatically.

$ conda install -c conda-forge cupy
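
After installation, you can run a quick sanity check to confirm that CuPy can see your GPU. This is a minimal sketch that assumes you have at least one CUDA-capable device at index 0:

import cupy as cp

# number of CUDA devices visible to CuPy
print(cp.cuda.runtime.getDeviceCount())

# compute capability of the first GPU (e.g. "86" for an RTX 30-series card)
print(cp.cuda.Device(0).compute_capability)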

Here is a sample code that performs a simple dot product. The speed difference between Numpy and CuPy will shock you :)

import time
import cupy as cp
import numpy as np

# create a random numpy array of size 1000x1000
a = np.random.rand(1000, 1000)

# create a random cupy array of size 1000x1000
b = cp.random.rand(1000, 1000)

# start timer
start = time.time()

# perform repeated dot products on the numpy array
for i in range(1000):
    a = np.dot(a, a)

# print time taken
print("Time taken by numpy: ", time.time() - start)

# start timer
start = time.time()

# perform repeated dot products on the cupy array
for i in range(1000):
    b = cp.dot(b, b)

# CuPy launches kernels asynchronously, so wait for the GPU to finish before stopping the timer
cp.cuda.Device(0).synchronize()

# print time taken
print("Time taken by cupy: ", time.time() - start)
Results: CuPy clearly outperforms Numpy

As you can see, CuPy outperforms Numpy by a big margin. You can confirm that CuPy is actually running on the GPU:

In [1]: print(b.device)
<CUDA Device 0>

Note: It’s important to note that for smaller arrays (such as a 1000x1000 array), it’s generally faster to use plain Numpy instead of CuPy. This is because the time it takes to convert the array from the CPU to the GPU can be larger than the time it takes for Numpy to perform the operation itself. However, for larger arrays, utilizing CuPy’s GPU-accelerated computing capabilities can result in a significant speed boost.
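
To make that transfer cost explicit, here is a minimal sketch of moving data between the CPU (host) and the GPU (device) with cp.asarray and cp.asnumpy:

import numpy as np
import cupy as cp

a_cpu = np.random.rand(1000, 1000)

# copy the array from host (CPU) memory to device (GPU) memory
a_gpu = cp.asarray(a_cpu)

# do the heavy work on the GPU
result_gpu = a_gpu @ a_gpu

# copy the result back to host memory as a regular Numpy array
result_cpu = cp.asnumpy(result_gpu)

Each of these copies crosses the PCIe bus, which is exactly the overhead that makes small arrays faster on plain Numpy.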

You can check the CuPy documentation for a SciPy implementation on GPUs.
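
As a small illustration of those SciPy-compatible routines (a minimal sketch using cupyx.scipy.ndimage, CuPy's GPU counterpart of scipy.ndimage), here is a Gaussian blur running entirely on the GPU:

import cupy as cp
from cupyx.scipy import ndimage as gpu_ndimage

# a random "image" living in GPU memory
img_gpu = cp.random.rand(2048, 2048)

# GPU equivalent of scipy.ndimage.gaussian_filter
blurred_gpu = gpu_ndimage.gaussian_filter(img_gpu, sigma=3)
print(blurred_gpu.shape)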

Numpy on GPU using JAX

JAX is a high-performance numerical computing library from Google built around three core ideas: a NumPy-like API, composable function transformations (automatic differentiation with grad, just-in-time compilation with jit, and automatic vectorization with vmap), and execution on accelerators such as GPUs and TPUs via XLA.

It is designed to have a NumPy-like API, so most, if not all, NumPy functions are available in JAX with the same API style; the difference is that they run blazingly fast on GPUs, TPUs, and even multiple accelerators. Let's revisit the example above and compare the times of JAX, CuPy, and Numpy.

You can install JAX with the following commands:

$ pip install --upgrade pip

# CUDA 12 installation
# Note: wheels only available on linux.
$ pip install --upgrade "jax[cuda12_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Let’s create some helper functions to check how JAX, Numpy, and CuPy perform a simple dot product.

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"Time taken by {func.__name__} is {end - start} seconds")
        return result
    return wrapper


@timeit
def numpy_dot(a, b):
    return a @ b

@timeit
def cupy_dot(a, b):
    result = a @ b
    # block until the GPU kernel finishes so the timing is meaningful
    cp.cuda.Device(0).synchronize()
    return result

@timeit
def jax_dot(a, b):
    # block_until_ready() forces JAX's asynchronous dispatch to complete
    return (a @ b).block_until_ready()

Now we will simply create some arrays and call these functions.

import jax
import jax.numpy as jnp

# test numpy cupy and jax dot on a big array
a = np.random.rand(10000, 10000)
b = np.random.rand(10000, 10000)
numpy_dot(a, b)

a = cp.random.rand(10000, 10000)
b = cp.random.rand(10000, 10000)
cupy_dot(a, b)

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (10000, 10000))
b = jax.random.normal(key, (10000, 10000))
jax_dot(a, b)
Results: JAX & CuPy clearly outperform Numpy

While JAX and CuPy are almost equally fast here, JAX can be made even faster with its just-in-time compilation and distributed execution features, which you can explore in detail in the documentation.
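
As a quick, hedged illustration of that compilation speed-up (timings will vary with your hardware), you can wrap the dot product in jax.jit; block_until_ready() makes sure the asynchronous computation has actually finished before the timer stops:

import time
import jax

@jax.jit
def jitted_dot(a, b):
    return a @ b

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (10000, 10000))
b = jax.random.normal(key, (10000, 10000))

# the first call compiles the function, so warm it up outside the timer
jitted_dot(a, b).block_until_ready()

start = time.time()
jitted_dot(a, b).block_until_ready()
print(f"Time taken by jitted_dot: {time.time() - start} seconds")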

SKLearn on GPU via Rapids CuML

Rapids provides a library, CuML, with an API design similar to Sklearn's that runs blazingly fast on GPUs. If you are using Colab, you can install CuML and Rapids with the Rapids setup script; otherwise, follow the official installation guide, which is out of the scope of this article due to its complexity.

Here is a small example on a dummy dataset that shows the huge difference in training time between Sklearn and CuML once the GPU is involved.

import numpy as np
from cuml import RandomForestClassifier as RandomForestClassifierGPU
import cupy as cp
from sklearn.ensemble import RandomForestClassifier
import time


# Generate some example data
X = np.random.rand(100000, 300) # 100000 rows, 300 columns
y = np.random.randint(0, 2, 100000)

# Convert data to CuPy arrays
X_gpu = cp.array(X)
y_gpu = cp.array(y)

Now simply fit both the GPU-based RandomForestClassifier and the plain Sklearn RandomForestClassifier on the dataset to see the huge time difference.

# Create and train a RandomForestClassifier on GPU
clf = RandomForestClassifierGPU()
start_time = time.time()
clf.fit(X_gpu, y_gpu)
end_time = time.time()

# Calculate time taken for training on GPU
time_taken_gpu = end_time - start_time

# Convert data back to Numpy arrays for CPU comparison
X_cpu = np.array(X)
y_cpu = np.array(y)

# Create and train a RandomForestClassifier on CPU
clf_cpu = RandomForestClassifier()
start_time = time.time()
clf_cpu.fit(X_cpu, y_cpu)
end_time = time.time()

# Calculate time taken for training on CPU
time_taken_cpu = end_time - start_time

# Calculate speedup using GPU
speedup = time_taken_cpu / time_taken_gpu

print(f"Time taken for training on CPU: {time_taken_cpu:.2f} seconds")
print(f"Time taken for training on GPU: {time_taken_gpu:.2f} seconds")
print(f"Speedup using GPU: {speedup:.2f} times")
Huge Speed Difference in using Random Forest

We can clearly see that using CuML improved performance by roughly 230 times. As mentioned earlier, on smaller datasets this difference might be negligible or even reversed, but on huge datasets it makes a significant difference.
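
Once trained, the GPU model is used just like its Sklearn counterpart. Here is a minimal sketch reusing clf and X_gpu from above; by default CuML mirrors the input type, so the predictions come back as a GPU array:

# predict on the GPU
preds_gpu = clf.predict(X_gpu)

# move the predictions back to host memory if a Numpy array is needed
preds_cpu = cp.asnumpy(preds_gpu)
print(preds_cpu[:10])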

Pandas on GPU via Rapids CuDF

Just like Rapids provides a Sklearn-like API for machine learning on GPUs, it also provides a Pandas-like API, CuDF, for data preprocessing and ETL on GPUs. If you installed CuML in the previous step using the script, CuDF will have been installed automatically. We can do a small comparison on a big dataset to see how it performs compared to Pandas.

import pandas as pd
import cudf
import numpy as np
import time

# create a dummy dataset
df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))

# convert to cudf
gdf = cudf.from_pandas(df)

# time the operation
start = time.time()
# use some operation that can be run in parallel
gdf['A'] = gdf['A'] + gdf['B']
end = time.time()
print(f'CuDf took {end-start} seconds')

start = time.time()
df['A'] = df['A'] + df['B']
end = time.time()
print(f'Pandas took {end-start} seconds')
CuDf clearly outperforms Pandas

Here we can see that even on a simple operation like adding two columns of a DataFrame, CuDF (0.01 s) clearly outperforms Pandas (0.15 s).
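
The same pattern applies to heavier operations such as group-bys, and you can always bring the result back to Pandas when downstream tools need it. A minimal sketch, reusing gdf from above:

# group-by aggregation executed on the GPU
grouped = gdf.groupby('A')['B'].mean()

# convert the CuDF result back to a Pandas object
grouped_pd = grouped.to_pandas()
print(grouped_pd.head())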

Conclusion

In this article, you have learned how to leverage the power of the GPU for your daily tasks using a tech stack very close to the one you already use (Numpy, Pandas, Scipy, Sklearn), gaining impressive speedups and saving a lot of time.

It is important to understand that moving a dataset from the CPU to the GPU is an expensive operation, so you should only do it once your dataset is sufficiently large; otherwise, stick with the CPU, since Numpy, Pandas, and Sklearn are all still very efficient there.
