Top Dataframe Libraries in 2024

Thibaut Gourdel
6 min read · Jul 26, 2024

Since the inception of pandas in 2008, Python dataframe libraries have evolved significantly. With the arrival of new libraries such as DuckDB and Polars, users now have a wide array of choices. These newcomers not only strengthen the core functionality users already rely on but also open up use cases that were previously impractical. In this article, I will go over the main Python dataframe libraries, highlighting their pros and cons. I will also discuss some cross-cutting trends that will shape the future of these libraries, such as Apache Arrow, Apache Iceberg, and GPU acceleration.

First, let’s take a look at some numbers: the download statistics for each library over the last month as of July 2024. This will give us a glimpse of the popularity and usage of each library we will cover in this article.

  • pandas: 230,506,059
  • pyspark: 29,736,841
  • polars: 7,012,513
  • duckdb: 5,554,222
  • ibis: 269,441

🐼 Pandas

Let’s start with the king of Python dataframe libraries, pandas. It’s the most used library in the data space, especially within the data science community. Pandas is primarily used for data analysis, exploration, and manipulation. Its advantage lies in being one of the earliest Python dataframe libraries, which gives it the largest community and the most mature ecosystem. However, some of its early design choices now feel dated against today’s expectations for usability and scalability. While pandas remains the most widely used library with a broad and vibrant ecosystem, it is also adapting and evolving, playing catch-up with newer libraries. Pandas 2.0 introduced many welcome improvements, notably the adoption of Apache Arrow (see below for further details). Other initiatives, such as Modin, enhance pandas by making pandas code scalable, addressing some of its shortcomings. Pandas is, and will likely remain, the library of choice for data exploration and manipulation.
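As an illustration, here is a minimal sketch of opting into the Arrow-backed dtypes introduced in pandas 2.0 (the file and column names are placeholders):

```python
import pandas as pd

# Parse the CSV with the pyarrow engine and store columns as
# Arrow-backed dtypes instead of NumPy-backed ones (pandas >= 2.0)
df = pd.read_csv("events.csv", engine="pyarrow", dtype_backend="pyarrow")

print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow]
```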

PySpark

If pandas is the king of single-node dataframe libraries for data exploration and manipulation, PySpark reigns supreme in distributed data engineering workloads. PySpark is the Python interface for Spark, a distributed processing system designed for fast big data processing and analytics. It supports a wide range of applications, from batch processing to real-time streaming and machine learning. PySpark is considered a mature interface for Spark and is used by large organizations for production workloads, but it still lacks some advanced features compared to the Scala/Java interfaces. Its many components, like SQL, MLlib, and GraphX, provide a strong ecosystem and excellent interoperability with other libraries, including other dataframe libraries. However, Python's interpreted nature can introduce performance overhead, and serialization between the JVM and Python can cause additional latency. Spark's advantages become particularly clear when dealing with large volumes of data or throughput-constrained use cases.
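To give a feel for the API, here is a minimal PySpark sketch, assuming a local Spark installation (the file and column names are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# The SparkSession is the entry point; on a cluster, this same
# code distributes the computation across executors
spark = SparkSession.builder.appName("example").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
result = df.groupBy("country").agg(F.sum("amount").alias("total"))
result.show()
```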

🐻‍❄️ Polars

Polars is a strong challenger to pandas, even though it is currently behind in terms of usage. Nonetheless, it is quickly gaining traction. Polars is a dataframe library developed in Rust, one of the fastest-growing programming languages, which allows for high performance with fine-grained control over memory. It was also designed with pandas’ pitfalls in mind, aiming for better scalability, minimal overhead, and a friendlier API. The ecosystem is still nascent but is rapidly coming together. Polars 1.0.0 was released in July 2024, indicating that it has reached a certain level of maturity. Other libraries are beginning to adopt it as well, with Modin releasing initial support for scaling Polars beyond single-node processing.
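Here is a small sketch of Polars’ lazy API, which defers execution until collect() so the whole query plan can be optimized (file and column names are placeholders):

```python
import polars as pl

# scan_csv builds a lazy plan; nothing runs until .collect(),
# letting Polars apply predicate and projection pushdown
result = (
    pl.scan_csv("events.csv")
    .filter(pl.col("amount") > 100)
    .group_by("country")
    .agg(pl.col("amount").sum().alias("total"))
    .collect()
)
print(result)
```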

🦆 DuckDB

This library is not a dataframe library but is still very relevant in this context. DuckDB is an in-process SQL OLAP database management system. It can query data stored in various formats, including dataframes, which is one of the most common ways of using it. Indeed, DuckDB is often used alongside dataframe libraries like pandas or Polars to run SQL queries on top of dataframes. It has gained popularity thanks to its lightweight, embeddable nature, its speed and efficiency, and how easily it couples with dataframe libraries. DuckDB also recently released its 1.0.0, another sign of maturity in this space. It further stands out for its excellent support for file formats (CSV, Parquet) and its general ease of use. For these reasons, while DuckDB can replace dataframe libraries in some cases, it is most often a formidable complement to them.
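The common pattern of running SQL on a pandas DataFrame looks like this minimal sketch (the data is made up for illustration):

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"country": ["FR", "US", "FR"], "amount": [120, 80, 40]})

# DuckDB finds the DataFrame `df` in the local scope by name
# and queries it in place
result = duckdb.sql(
    "SELECT country, SUM(amount) AS total FROM df GROUP BY country"
).df()
print(result)
```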

🦢 Ibis

The Ibis framework is a slightly lesser-known library but a particularly interesting one. Notably, it is backed by Voltron Data, a company co-founded by pandas’ creator Wes McKinney. Ibis is also a dataframe library, but it provides a unified interface for working with various backends, meaning other dataframe libraries as well as SQL systems and more. Using the same code, data workloads can be executed across multiple platforms such as DuckDB, Polars, PostgreSQL, Snowflake, Spark, etc., offering full interoperability. The full list of supported backends is available in the Ibis documentation. In the end, Ibis offers an abstraction layer on top of other engines and SQL systems using a single dataframe syntax. Typically, this allows workloads to be developed locally and then deployed remotely without hassle, enabling smooth migration paths between backends.
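A small sketch using the DuckDB backend; swapping ibis.duckdb.connect() for another backend’s connect call would run the same expression elsewhere (file and column names are placeholders):

```python
import ibis

# Connect to an in-memory DuckDB instance as the execution backend
con = ibis.duckdb.connect()
t = con.read_csv("events.csv")

# Build a backend-agnostic expression, then execute it
expr = t.group_by("country").aggregate(total=t.amount.sum())
print(expr.execute())
```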

Note: I could have included more libraries, such as Dask and Ray, but these are general-purpose frameworks for distributed workloads and applications rather than dataframe libraries per se.

Other trends across dataframe libraries:

  • 🏹 Apache Arrow is a columnar memory format designed for flat and hierarchical data, optimized for efficient analytic operations on CPUs and GPUs. First, columnar storage reduces memory overhead for operations like filtering or aggregation, since only the relevant columns are read into memory. Second, Arrow’s primary advantage is a standardized in-memory format that offers very fast data access and avoids the serialization/deserialization overhead of exchanging data between systems. This makes Arrow particularly suitable for in-memory dataframe libraries: pandas adopted it with pandas 2.0, while Polars and DuckDB integrated it early on (see the interop sketch after this list). Many initiatives are currently underway to improve the overall interoperability of data access through Arrow, two examples being Apache Arrow Flight SQL, which accelerates database access, and ADBC (Arrow Database Connectivity).
  • 🧊 Apache Iceberg is an open table format designed to bring traditional data warehouse-like features to data lakes; alternative table formats include Apache Hudi and Delta Lake. These formats provide capabilities such as transactional consistency (ACID transactions), schema evolution tracking, time travel, partitioning, and versioning. An Iceberg table is represented by metadata files (manifests) that track changes over time. Table formats form the basis of the Lakehouse architecture, initiated by Delta Lake from Databricks, which enables data workloads to be both scalable and cost-effective. One of the benefits of leveraging Iceberg is the ability to use different engines (Spark, Hive, Trino, etc.) over the same data without moving it. In terms of integration with dataframe libraries, the ecosystems are not all at the same level of maturity: PySpark already supports Iceberg thanks to Spark’s support, and some work has been done to access Iceberg tables natively in Python via PyArrow and PyODBC, but the native Python implementation (PyIceberg) is still young and lacking some core capabilities (a PyIceberg sketch follows below).
  • GPU acceleration is becoming a real option for dataframe libraries! Libraries like cuDF offer a GPU-accelerated pandas API that can achieve significant speedups for data manipulation and analytics compared with CPUs. With the GenAI wave making GPUs more accessible than ever, this could be a great way to scale and execute high-performance processing (see the last sketch below).
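To make the Arrow interoperability point concrete, here is a small sketch of moving data between pandas, Polars, and DuckDB (the data is made up; conversions go through Arrow and avoid copies where the dtypes allow):

```python
import duckdb
import pandas as pd
import polars as pl

df = pd.DataFrame({"country": ["FR", "US"], "amount": [120, 80]})

# pandas -> Polars conversion goes through Arrow
pl_df = pl.from_pandas(df)

# DuckDB can return query results directly as an Arrow table
arrow_table = duckdb.sql("SELECT * FROM df").arrow()
print(pl_df, arrow_table.schema)
```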
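For Iceberg access from Python, a hypothetical PyIceberg sketch; it assumes a catalog named "default" is already configured (e.g., in ~/.pyiceberg.yaml) and that a table analytics.events exists:

```python
from pyiceberg.catalog import load_catalog

# Load a pre-configured catalog and one of its tables
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

# Scan the table and materialize it as an Arrow table
# (.to_pandas() is also available)
arrow_table = table.scan().to_arrow()
```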
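And for GPU acceleration, a minimal sketch of cuDF’s pandas accelerator mode; it requires a compatible NVIDIA GPU and the RAPIDS cudf package (file and column names are placeholders):

```python
# Enable the accelerator before importing pandas; supported
# operations then run on the GPU, the rest fall back to CPU pandas
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now transparently accelerated by cuDF

df = pd.read_csv("events.csv")
print(df.groupby("country")["amount"].sum())
```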

Amphi ETL is the first low-code graphical tool for designing data pipelines that generates Python code using dataframe libraries. It leverages common libraries such as pandas and DuckDB (with more to come) to generate code you can deploy anywhere.

Data pipeline designed with Amphi

Amphi is free and open source; give it a try:
https://github.com/amphi-ai/amphi-etl


Thibaut Gourdel

I write about data engineering and ETL. I'm building Amphi, a low-code Python-based ETL tool for data manipulation and transformation.