Accelerate Your Machine Learning: A Hands-On Introduction to RAPIDS in Python

Ahmedabdullah
Published in Tensor Labs
7 min read · Nov 20, 2023
Photo by Johannes Plenio on Unsplash

Hey there, fellow ML engineer! It’s been a while since my last article, but now that I’m back I plan to make full use of what I learned while I was away. One problem I came across during that time was the speed of many machine learning libraries. Even when a GPU is available, libraries that do not natively support it leave it idle, so it is like not having a GPU at all. Take pandas: it is an amazing library and usually performs well, but on large datasets even its native operations slow down. Frustrated by this, I set out to find ways to speed up Python libraries by using the GPU, and that is how I landed on RAPIDS.

RAPIDS is an open-source suite of data science and machine learning libraries designed to accelerate the end-to-end data science and analytics workflow on NVIDIA GPUs. The name “RAPIDS” stands for “Rapid Analytics on Platforms In Data Science.” The library provides GPU-accelerated implementations of common data science and machine learning algorithms, offering significant performance improvements over traditional CPU-based approaches.

Understanding RAPIDS

RAPIDS aims to seamlessly integrate with popular Python data science libraries like NumPy, pandas, and scikit-learn, allowing users to transition to GPU-accelerated workflows with minimal code modifications. By harnessing the parallel processing capabilities of NVIDIA GPUs, RAPIDS significantly speeds up computations, making it especially valuable for working with large datasets and computationally intensive tasks in data science and machine learning. Some key components of RAPIDS include:

  1. cuDF (GPU DataFrame): This component provides a GPU-accelerated DataFrame library similar to pandas, enabling efficient data manipulation and analysis on large datasets.
  2. cuML (GPU Machine Learning): cuML offers GPU-accelerated implementations of machine learning algorithms, including linear regression, k-means clustering, decision trees, and more. This allows data scientists to train models faster and handle larger datasets.
  3. cuGraph (GPU Graph Analytics): Designed for graph analytics, cuGraph provides GPU-accelerated graph algorithms, making it suitable for tasks involving network analysis and graph-based computations.
  4. cuSpatial (GPU Spatial Operations): This component focuses on GPU-accelerated spatial operations, providing efficient tools for geospatial analytics and spatial data processing.
  5. cuSignal (GPU Signal Processing): cuSignal is dedicated to GPU-accelerated signal processing, making it beneficial for tasks related to signal analysis and processing.

Common use cases where RAPIDS is a lifesaver

Large-Scale Data Manipulation:

  • Challenge: Handling and manipulating large datasets efficiently using traditional CPU-based methods can be slow and resource-intensive.
  • RAPIDS Solution: RAPIDS’ cuDF provides a GPU-accelerated DataFrame library, enabling rapid and efficient data manipulation on large-scale datasets, leading to significantly faster processing times.

Machine Learning Model Training:

  • Challenge: Training complex machine learning models on substantial datasets can be time-consuming on CPUs, limiting the scalability of model development.
  • RAPIDS Solution: cuML in RAPIDS accelerates machine learning workflows by providing GPU-accelerated implementations of common algorithms. This results in faster model training times, allowing data scientists to iterate and experiment more quickly.

Graph Analytics:

  • Challenge: Analyzing and processing large-scale graphs or networks with traditional methods can be computationally intensive and time-consuming.
  • RAPIDS Solution: cuGraph enables GPU-accelerated graph analytics, making it a lifesaver for tasks such as social network analysis, fraud detection, and recommendation systems by significantly speeding up graph-based computations.

Geospatial Analytics:

  • Challenge: Analyzing and processing geospatial data for applications like mapping, location-based services, and environmental monitoring can be slow using conventional CPU-based methods.
  • RAPIDS Solution: cuSpatial in RAPIDS provides GPU-accelerated spatial operations, allowing for faster and more efficient geospatial analytics. This is particularly beneficial for handling large-scale geospatial datasets.
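To make the model-training use case above a little more concrete, here is a minimal sketch of k-means clustering that prefers cuML and falls back to scikit-learn when RAPIDS is not installed. The names `import_first_available` and `fit_kmeans` are made up for this illustration. The sketch relies on `KMeans`, `n_clusters`, `random_state`, and `labels_` being available in both `cuml.cluster` and `sklearn.cluster`, and it assumes that a missing library shows up as an `ImportError`.

```python
import importlib


def import_first_available(module_names):
    """Return the first module from module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
    raise ImportError(f"None of these modules could be imported: {module_names}")


def fit_kmeans(X, n_clusters=3):
    """Fit k-means on X with cuML (GPU) if present, else scikit-learn (CPU)."""
    cluster = import_first_available(["cuml.cluster", "sklearn.cluster"])
    model = cluster.KMeans(n_clusters=n_clusters, random_state=0)
    model.fit(X)
    return model
```

Calling `fit_kmeans(X)` with a numeric 2D array or DataFrame returns the fitted model, and `model.labels_` holds the cluster assigned to each row. The calling code is the same on both backends, which is the point of the similar APIs.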

How easy is it to use, and how do we do it?

Personally, I have only used cuML and cuDF for the tasks where I needed RAPIDS, but feel free to explore the other libraries for your own use cases. Now that we know what RAPIDS is and what it does, the next question is how to use it. No need to worry: the rest of this article explains how to get started with RAPIDS and put the GPU to work on the everyday Python tasks that keep you waiting for hours. So without further ado, let’s get started.

Setting Up Your Environment

Before diving into RAPIDS, we need to set up our environment to take advantage of GPU acceleration. First, make sure the latest NVIDIA GPU drivers are installed on your system; visit the official NVIDIA website to download and install the appropriate drivers for your GPU. RAPIDS relies on the CUDA toolkit for GPU acceleration, so install it by following the instructions on the official NVIDIA website. RAPIDS provides pre-built conda packages for easy installation. Create a conda environment and install RAPIDS using the following commands:


conda create -n rapids python=3.8
conda activate rapids
conda install -c rapidsai -c nvidia -c conda-forge -c defaults cuml=21.12 python=3.8
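Once the install finishes, it can be worth checking that the driver and the packages are actually visible from Python. The sketch below is one way to do that using only the standard library. The function names `list_gpus` and `installed_packages` are made up for this illustration, and the check assumes that `nvidia-smi` is on your `PATH` when the driver is installed.

```python
import importlib.util
import shutil
import subprocess


def list_gpus():
    """Return one 'name, driver_version' string per GPU reported by nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not on PATH
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []  # nvidia-smi is present but could not query a GPU
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def installed_packages(names=("cudf", "cuml")):
    """Map each package name to True if Python can find it in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("GPUs:", list_gpus() or "none detected")
    print("Packages:", installed_packages())
```

An empty GPU list or a `False` next to `cudf` or `cuml` means the environment is not ready yet, and the examples that follow will not run on the GPU.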

Now that our libraries are set up, let’s see how gentle the learning curve is. The similarities with pandas are greater than you might expect.

Transitioning from pandas to cuDF

Let’s look at a few common pandas operations and their cuDF equivalents, which run on the GPU.

DataFrame Creation

Let’s kick things off with the basics — creating DataFrames. Pandas users, you’ll feel right at home here. The syntax for creating DataFrames in cuDF is remarkably similar to what you’re already familiar with.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

pandas_df = pd.DataFrame(data)

Now, let’s do the same with cuDF:

import cudf

data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

cudf_df = cudf.DataFrame(data)

Looks familiar, right? The transition is seamless, allowing you to leverage your existing pandas knowledge. Let’s have a look at a few more examples.
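Because the two libraries are this close, moving an existing frame between them is short as well. The sketch below wraps the two cuDF calls involved, `cudf.from_pandas` and `DataFrame.to_pandas`. The helper names `to_gpu` and `to_cpu` are made up for this illustration.

```python
def to_gpu(pandas_df):
    """Copy a pandas DataFrame into GPU memory as a cuDF DataFrame."""
    import cudf  # imported lazily so this module loads on machines without RAPIDS

    return cudf.from_pandas(pandas_df)


def to_cpu(df):
    """Return a pandas object: cuDF frames are copied back, anything else passes through."""
    return df.to_pandas() if hasattr(df, "to_pandas") else df
```

Each conversion copies the data between host and GPU memory, so it usually pays to convert once, do as much work as possible on the GPU, and convert back only for libraries that need pandas.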

Data Manipulation

One of the key strengths of both Pandas and cuDF is their powerful data manipulation capabilities. From selecting columns to filtering rows, the syntax remains consistent.

# Selecting columns
pandas_selected_columns = pandas_df[['Name', 'Age']]
cudf_selected_columns = cudf_df[['Name', 'Age']]

# Filtering rows
pandas_filtered_rows = pandas_df[pandas_df['Age'] > 30]
cudf_filtered_rows = cudf_df[cudf_df['Age'] > 30]

Missing Data Handling

Handling missing data is a crucial aspect of data analysis. Both libraries offer similar functionality for dealing with NaN values. The syntax is almost identical, making it easy to switch between the two libraries.

# Dropping NaN values
pandas_no_nan = pandas_df.dropna()
cudf_no_nan = cudf_df.dropna()

# Filling NaN values
pandas_filled = pandas_df.fillna(0)
cudf_filled = cudf_df.fillna(0)

GroupBy Operations

Grouping data for aggregation is a common operation in data analysis. Pandas and cuDF share a common syntax for GroupBy operations.

# GroupBy and aggregation
# Select the numeric column first: 'Name' holds strings and cannot be averaged
pandas_grouped = pandas_df.groupby('City')['Age'].mean()
cudf_grouped = cudf_df.groupby('City')['Age'].mean()
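The three-row frame above is far too small to show any speed difference. To see how the two libraries compare on your own hardware, a rough benchmark along these lines can help. This is a sketch under assumptions: `best_time` and `compare_groupby` are made-up names, the data is synthetic, and the row and group counts are arbitrary.

```python
import time


def best_time(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_groupby(n_rows=1_000_000, n_groups=100):
    """Time the same groupby-mean in pandas and cuDF on synthetic data."""
    import numpy as np
    import pandas as pd
    import cudf

    rng = np.random.default_rng(0)
    pandas_df = pd.DataFrame({
        "key": rng.integers(0, n_groups, n_rows),
        "value": rng.random(n_rows),
    })
    cudf_df = cudf.from_pandas(pandas_df)

    # One untimed call so one-time GPU start-up cost is not counted
    cudf_df.groupby("key")["value"].mean()

    return {
        "pandas_s": best_time(lambda: pandas_df.groupby("key")["value"].mean()),
        "cudf_s": best_time(lambda: cudf_df.groupby("key")["value"].mean()),
    }
```

`compare_groupby()` needs a working RAPIDS install and returns the best time in seconds for each library. The gap depends heavily on the GPU, the data size, and the operation, so it is worth measuring with data shaped like your own.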

Now that we have seen how easy it is to transition between pandas and cuDF, let’s look at an example where I used cuDF and RAPIDS to speed up manifold creation.

Manifold Creation Example (UMAP and Plotly)

In AI, datasets often live in spaces with many variables, which makes direct analysis challenging. Manifold creation maps these high-dimensional datasets onto lower-dimensional structures while preserving essential relationships and intrinsic features.

Manifold creation is a crucial concept for representing and understanding data with many dimensions. Many of us have run into this problem: we have data with many dimensions and need a way to visualize it. Manifolds are mathematical abstractions that describe the underlying structure of high-dimensional data, helping AI systems navigate and learn from complex patterns.

Common techniques for creating 3D features or embeddings include t-SNE and UMAP. The problem is that on a CPU-only machine, using pandas natively, computing the 3D embedding takes a long time.

This is where RAPIDS helped me quite a lot, since my data had over 5 million samples. I wanted to convert the data into 3D space so I could represent it as a scatter plot using Plotly.

Let’s import the required libraries:


import cudf
import umap
import numpy as np
import plotly.express as px

And now let’s define some functions to read the data and to create UMAP features that represent the data in 3D space.

NEIGHBORS = 15  # assumed value (umap-learn's default); the original does not define it

def remove_string_columns(data):
    # Drop string (object dtype) columns, which UMAP cannot consume
    string_columns = data.select_dtypes(include="object").columns
    data = data.drop(columns=string_columns)
    return data

def read_data():
    # Load the CSV straight into GPU memory with cuDF
    data = cudf.read_csv("data.csv")
    data.dropna(subset=["targetColumn"], inplace=True)
    data = remove_string_columns(data)
    return data

def compute_umap_features(data):
    umap_model = umap.UMAP(n_components=3, n_neighbors=NEIGHBORS)
    # The umap package works on host arrays, so convert the cuDF frame to NumPy first
    umap_features = umap_model.fit_transform(data.to_pandas().values)
    data["umap_feature_1"] = umap_features[:, 0]
    data["umap_feature_2"] = umap_features[:, 1]
    data["umap_feature_3"] = umap_features[:, 2]
    return data
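Note that `import umap` pulls in the umap-learn package, which computes the embedding on the CPU. cuML ships its own UMAP, so a GPU variant of `compute_umap_features` might look like the sketch below. This is an alternative under assumptions, not the code used for this article. The names `umap_column_names` and `compute_umap_features_gpu` are made up, `n_neighbors=15` is only a default, and every column of `data` is assumed to be numeric.

```python
def umap_column_names(n_components):
    """Build the output column names: umap_feature_1, umap_feature_2, ..."""
    return [f"umap_feature_{i + 1}" for i in range(n_components)]


def compute_umap_features_gpu(data, n_components=3, n_neighbors=15):
    """Embed a numeric cuDF DataFrame with cuML's UMAP and append the result."""
    from cuml.manifold import UMAP  # imported lazily; requires a RAPIDS install

    # output_type="numpy" asks cuML to return the embedding as a NumPy array
    model = UMAP(
        n_components=n_components,
        n_neighbors=n_neighbors,
        output_type="numpy",
    )
    embedding = model.fit_transform(data)

    # Append one column per embedding dimension, as in the CPU version
    for i, name in enumerate(umap_column_names(n_components)):
        data[name] = embedding[:, i]
    return data
```

With this variant the cuDF frame goes directly into `fit_transform`, so the `data.to_pandas().values` conversion is not needed before fitting.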

Now that that’s done, let’s write the function that plots the data and connect everything into a pipeline.

def create_umap_3d_scatter_plot(data):
    # Plotly expects pandas data, so copy the cuDF frame back to host memory
    pd_data = data.to_pandas()

    fig = px.scatter_3d(
        pd_data,
        x="umap_feature_1",
        y="umap_feature_2",
        z="umap_feature_3",
        color="group",  # "group" must be a non-string column in data.csv
        title="UMAP 3D Scatter Plot",
    )

    # Dark theme and axis labels
    fig.update_layout(
        template="plotly_dark",
        scene=dict(
            xaxis_title="Dimension 1",
            yaxis_title="Dimension 2",
            zaxis_title="Dimension 3",
        ),
    )

    fig.update_traces(marker=dict(size=1, sizemode="diameter", opacity=0.6))
    fig.show()
    return fig

def run_pipeline():
    data = read_data()
    data = compute_umap_features(data)
    fig = create_umap_3d_scatter_plot(data)
    # Save the interactive figure as a standalone HTML file
    fig.write_html("sample_plot.html")

Let’s have a look at the results

My honest reaction after being stuck on this problem for about three days was something like this:

By combining RAPIDS and UMAP to speed up processing for millions of records, we can solve a lot of the everyday problems that AI engineers face.

Conclusion

So, in a nutshell, RAPIDS is like the superhero I didn’t know I needed in the world of machine learning. Dealing with slow libraries on regular computers was a headache, but RAPIDS came to the rescue with its GPU magic. Transitioning from pandas was surprisingly smooth — it’s like they speak the same language. The real-life examples? Mind-blowing. RAPIDS didn’t just speed things up; it opened doors to possibilities I hadn’t even thought about. Remember that three-day struggle? Well, thanks to RAPIDS, it turned into a ‘problem-solved’ victory dance. It’s like having a superpower for data science. Here’s to leaving slow processing in the dust and getting things done at the speed of GPUs! 🚀
Until next time 👋
