CodeX

Everything connected with Tech & Code. Follow to join our 1M+ monthly readers

This Pandas Alternative is 350X Faster When Processing 100 Million Rows

Data analytics in Python done right — Here’s how you can read and process 111 million rows in under 2 seconds!

5 min read · Apr 24, 2024

Photo by CHUTTERSNAP on Unsplash

Everyone and their mother knows Pandas. It’s a good library for newcomers to data analytics, but it’s among the slowest options when you need to process huge volumes of data.

Enter DuckDB: an open-source, embedded, in-process, relational OLAP DBMS. That’s a lot of jargon, but essentially, it’s an analytical columnar database that runs inside your application’s process (in memory by default) and is designed for speed and efficiency. It can be orders of magnitude faster than Pandas, especially when working with large datasets.

The best part? DuckDB has a Python library, meaning you can replace your slow Pandas aggregations in no time, especially if you know SQL.

Today you’ll see just how these two compare when aggregating more than 100 million rows of data. Let’s dig in!

Pandas vs. DuckDB Benchmark Setup

This section provides information on the dataset and the Pandas/DuckDB code used in the benchmark. For reference, I’m using an M2 Pro MacBook Pro (12-core CPU, 19-core GPU) with 16 GB of RAM, so your results may vary.



Published in CodeX



Written by Dario Radečić

Data scientist • Tech writer • Author • I help developers start writing online and automate content creation with AI • https://writingfordevs.substack.com
