Minimal Pandas Subset for Data Scientists on GPU

Using a GPU to preprocess data

Data manipulation is a breeze with pandas. It has become such a standard that many parallelization libraries, such as Rapids and Dask, are being built to mirror pandas syntax.

Some time back, I wrote about the subset of pandas functionality I end up using often. In this post, I will talk about handling most of those data manipulation cases in Python on a GPU using cuDF, with a sprinkling of recommendations throughout.

PS: For benchmarking, all the experiments below were run on a machine with 128 GB of RAM and a Titan RTX GPU with 24 GB of memory.

What is Rapids cuDF, and why use it?

Built on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

Simply put, Rapids cuDF is a library that aims to bring pandas functionality to the GPU. Apart from cuDF, Rapids also provides cuML and cuGraph, which are used to work with machine learning algorithms and graphs on the GPU, respectively.
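To make the "pandas functionality on the GPU" idea concrete, here is a minimal sketch of the same filter-and-aggregate code running on either library. The toy data and the names `xdf`, `ON_GPU`, `filtered`, and `mean_sales` are made up for illustration, and the fallback to pandas when cuDF is not installed is my own convenience, not something cuDF does for you.

```python
# Sketch: the same DataFrame code written once, run on cuDF or pandas.
try:
    import cudf as xdf  # GPU DataFrame library from Rapids
    ON_GPU = True
except ImportError:  # cuDF is not installed: fall back to pandas on CPU
    import pandas as xdf
    ON_GPU = False

# Hypothetical toy data, for illustration only.
df = xdf.DataFrame({
    "city": ["A", "B", "A", "B", "A"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Filtering and aggregating use the familiar pandas syntax in both cases.
filtered = df[df["sales"] > 15.0]
mean_sales = filtered.groupby("city")["sales"].mean()

# With cuDF the result lives in GPU memory; copy it to a pandas Series
# on the host so it can be inspected like any other pandas object.
result = mean_sales.to_pandas() if ON_GPU else mean_sales
result = result.sort_index()
print(result)
```

Only the import line differs between the two paths, which is the point of keeping the API in line with pandas.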

Now, what is the advantage of this?

Published in Towards Data Science

Rahul Agarwal
