TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

DATA SCIENCE

How to Empower Pandas with GPUs

A quick introduction to cuDF, an NVIDIA framework for accelerating Pandas

Naser Tamimi
Published in TDS Archive
6 min read · Apr 7, 2024

Photo by BoliviaInteligente on Unsplash

Pandas remains a core tool for data analytics and machine learning, with extensive support for reading, transforming, cleaning, and writing data. Despite its popularity in data science projects, it handles large datasets inefficiently, which limits its use in production environments and in resilient data pipelines.

Like Apache Spark, Pandas loads data into memory for computation and transformation. Unlike Spark, however, Pandas is not a distributed compute platform, so all work runs on the CPU and memory of a single machine (single-node processing). This constraint limits Pandas in two ways:

  1. Pandas on a single machine cannot handle data that does not fit into that machine's memory.
  2. Even when the data fits into a single machine's memory, processing a relatively small dataset can take considerable time.
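To make the memory constraint concrete, here is a minimal sketch that measures how much RAM a DataFrame occupies. The column names, the row count, and the helper `frame_memory_mb` are illustrative assumptions and do not come from the article.

```python
import numpy as np
import pandas as pd


def frame_memory_mb(df: pd.DataFrame) -> float:
    """Return the in-memory size of a DataFrame in megabytes (hypothetical helper)."""
    # deep=True also counts the contents of object columns such as strings.
    return df.memory_usage(deep=True).sum() / 1e6


# Synthetic data for illustration only: two 8-byte numeric columns.
rng = np.random.default_rng(seed=0)
n_rows = 1_000_000
df = pd.DataFrame(
    {
        "user_id": rng.integers(0, 10_000, size=n_rows),  # int64
        "amount": rng.random(n_rows),  # float64
    }
)

size_mb = frame_memory_mb(df)
bytes_per_row = size_mb * 1e6 / n_rows
print(f"{n_rows:,} rows occupy about {size_mb:.1f} MB ({bytes_per_row:.0f} bytes per row)")
```

Because Pandas keeps the whole frame in memory, this footprint grows linearly with the row count, and intermediate copies made during transformations add to it. A machine's RAM therefore caps the dataset size well before the raw data alone would fill it.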

Pandas on Steroids

The first issue is addressed by frameworks such as Dask. Dask DataFrame helps you process large tabular data by parallelizing Pandas on a distributed cluster of computers. In many ways…

Written by Naser Tamimi

Data Engineer @ Expedia | Ex-Meta, Ex-Shell
