The evil pandas — https://www.flickr.com/photos/skynoir/6902110466

Pandas — the evil in data science and its alternatives

Yi Jin

--

As an R user, I have never been a fan of pandas. I believe that its index (what is tidy data?) and its missing chaining capability (compared with PySpark and the tidyverse) make it harder to manipulate and crunch data in Python than in R or Spark.
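
To make the index complaint concrete, here is a minimal sketch with a made-up table (the column names and values are purely illustrative). After a groupby, the group key moves into the index rather than staying a regular column, so an extra step is needed to get back to a tidy, column-only layout:

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Rome"], "sales": [10, 20, 5]})

# After the aggregation, "city" is no longer a column: it has become the index.
by_city = df.groupby("city").agg(total=("sales", "sum"))
columns_after_groupby = list(by_city.columns)

# reset_index() turns the index back into an ordinary column.
tidy = by_city.reset_index()
columns_after_reset = list(tidy.columns)

print(columns_after_groupby)
print(columns_after_reset)
```

In the tidyverse or PySpark, the equivalent aggregation keeps the grouping key as an ordinary column, which is the contrast being drawn here.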

So some research was done at the end of 2022 to survey all the alternatives and attempt to make things right.

Speed Up Alternatives

  • Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.
  • Dask makes it easy to scale the Python libraries that you know and love like NumPy, pandas, and scikit-learn.
  • Vaex is a Python library for lazy out-of-core DataFrames (similar to pandas) for visualizing and exploring big tabular datasets. It can calculate statistics such as mean, sum, count, and standard deviation on an N-dimensional grid at up to a billion (10^9) objects/rows per second.
  • Polars is a lightning-fast DataFrame library for Rust and Python.
  • Ray is an open-source unified compute framework that makes it easy to scale AI and Python…
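
As a rough illustration of what "drop-in replacement" means for Modin, the sketch below changes only the import line and leaves the rest of the code as ordinary pandas code. It assumes Modin is installed together with one of its execution engines, and falls back to plain pandas otherwise so that it still runs. The data is made up.

```python
# Assumption: Modin and one of its engines are installed, e.g. via
# `pip install "modin[ray]"`. If not, fall back to plain pandas so the
# sketch still runs; the code below is the same either way.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({"city": ["Oslo", "Oslo", "Rome"], "sales": [10, 20, 5]})

# Ordinary pandas-style code: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(totals)
```

On a table this small there is nothing to gain. Any speed-up would only show on larger datasets, as the Modin description above suggests.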
