Should You Use Pandas for ETL?

Thibaut Gourdel
5 min read · Jun 10, 2024

Hi, I’m Thibaut, and I write about data engineering and ETL. If you enjoy my content, consider following me on Medium.

📗 A bit of history

Before we dive in, let’s take a step back: when and why was pandas created in the first place? Pandas was publicly released in 2009 by Wes McKinney, who was frustrated with the tools available at the time for basic data tasks. Python quickly gained tremendous popularity with the rise of data science in the 2010s, in part thanks to pandas’ ease of use. Pandas was developed and optimized for what we commonly call the last mile of data delivery: in this case, data exploration and analysis.

🐍 The Python gravity

With the rise of data science and machine learning, it was only a matter of time before Python was also adopted by the data engineering community. Data pipelines, and ETL workloads in particular, relied heavily on Java-based processes in past decades. However, the burden of managing different ecosystems with different libraries, and the lack of interoperability between them, now pushes a vast majority of teams to adopt Python for data pipelines.

🐼 What about Pandas?

It’s now time to tackle the subject of this article: is pandas a good fit for developing data pipelines? Pandas was designed for data exploration and analysis, not for building scalable data pipelines, but does that necessarily make it a no-go for ETL?

Let’s start where the shoe pinches. The drawback you’ll hear again and again is performance: pandas relies on in-memory dataframes (think tabular representations of the data) and runs on a single core. Pandas is not capable of handling unlimited volumes of data, that’s for sure. In addition, because of the way it was designed, pandas is considered slow in some cases.
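
One common workaround for the memory ceiling is to stream a file in chunks rather than load it whole. Here is a minimal sketch; the file name and column are made up for illustration:

```python
import pandas as pd

# Process a large CSV in fixed-size chunks instead of loading it all at once.
# "events.csv" and the "country" column are hypothetical placeholders.
totals = {}
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    # Aggregate each chunk, then fold the result into the running totals
    counts = chunk.groupby("country").size()
    for country, n in counts.items():
        totals[country] = totals.get(country, 0) + n

result = pd.Series(totals).sort_values(ascending=False)
print(result.head())
```

This keeps memory usage roughly constant regardless of file size, at the cost of writing the aggregation in two steps.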

Alright then, why should I use it?

Well, as usual, it’s never black or white. First, pandas is extremely popular and has been around for a long time. If you’re already used to pandas, it might just be easier to stick with it. Despite some criticism, its API remains accessible and easy to learn, especially for manipulating data. Being the oldest and most popular dataframe library also means it has the largest ecosystem around it: many libraries extend pandas’ capabilities, such as GeoPandas, and other data and AI libraries have built-in integrations with pandas. The amount of documentation, examples, and resources around pandas is simply unparalleled. By the way, this also means all the LLMs have been heavily exposed to pandas code and are very good at producing it, something to keep in mind. In conclusion, if the data you are moving is of a reasonable size, pandas is more than fine and can get you a long way.
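
To make this concrete, here is roughly what a small ETL job looks like in pandas. This is a minimal sketch: the file, column names, and connection string are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw orders from a CSV file (hypothetical source)
orders = pd.read_csv("raw_orders.csv", parse_dates=["order_date"])

# Transform: clean and enrich the data
orders = orders.dropna(subset=["customer_id"])  # drop incomplete rows
orders["order_month"] = orders["order_date"].dt.to_period("M").astype(str)
monthly = (
    orders.groupby(["customer_id", "order_month"], as_index=False)
          .agg(total_amount=("amount", "sum"), order_count=("order_id", "count"))
)

# Load: write the aggregated table to a database
# (hypothetical connection string; requires a Postgres driver like psycopg2)
engine = create_engine("postgresql://user:pass@localhost:5432/warehouse")
monthly.to_sql("monthly_orders", engine, if_exists="replace", index=False)
```

A whole extract-transform-load cycle in a dozen readable lines: this is the accessibility that keeps teams on pandas.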

But that’s not all: because of pandas’ hegemony, a lot of effort and innovation has gone into addressing its efficiency and performance shortcomings. First, the pandas project itself is gradually evolving and tackling some of these issues; pandas 2.0 introduced many welcome improvements meant to boost performance, notably the optional use of PyArrow. Second, a few libraries, such as mapply and pandarallel, extend pandas to enable multi-core execution of time-consuming operations. Third, recent open-source initiatives such as Modin make your pandas code scale across multiple cores by changing nothing but the import statement; under the hood, Modin leverages well-known distributed frameworks, namely Ray and Dask, while preserving the pandas API.
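
For instance, Modin’s drop-in import and pandas 2.0’s Arrow-backed I/O look like this. A sketch assuming Modin is installed with the Ray or Dask backend, and a hypothetical input file:

```python
# Modin: swap the import and the rest of your pandas code stays the same.
# Requires `pip install "modin[ray]"` (or "modin[dask]").
import modin.pandas as pd

df = pd.read_csv("big_file.csv")  # now reads in parallel across cores
summary = df.groupby("category")["value"].mean()

# pandas 2.0+: opt into Arrow-backed dtypes for faster, more memory-efficient I/O
import pandas
df_arrow = pandas.read_csv("big_file.csv", dtype_backend="pyarrow", engine="pyarrow")
```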

🔄 Any alternatives?

I couldn’t end this article without mentioning some of the alternatives. Yes, pandas can be replaced (or complemented) by other dataframe libraries such as Polars, DuckDB and PySpark.

  • Polars is quickly rising in the data space: it’s multithreaded, uses a vectorized query engine, and is written in Rust, which makes it significantly faster than pandas. Its adoption is still far from pandas’ and its ecosystem is nascent. There are also fundamental differences between the Polars and pandas APIs, making any migration from one to the other cumbersome.
  • DuckDB is a slightly different beast, as it is a portable analytical database: think SQLite for analytics. It provides a rich SQL dialect and great interoperability with other languages and libraries. If you like using SQL to transform your data, it is worth your time to take a look. I see DuckDB as a replacement for pandas in some cases, but also as a complement to pandas and/or Polars in others.
  • PySpark is the Python API for Spark. As you may know, Spark is the engine of choice for large-scale data processing, so if you need to process large volumes of data, you should consider it. Its dataframe API differs from pandas’, but for a few years now Spark has shipped a pandas API, which means you can reuse most of your pandas code and run it on Spark clusters. Quite a game changer in favor of pandas! A side-by-side sketch of all three libraries follows below.
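
To give a feel for these APIs, here is the same simple aggregation in Polars, DuckDB, and the pandas API on Spark. A sketch with a made-up file and columns, assuming each library is installed:

```python
import polars as pl
import duckdb
import pyspark.pandas as ps  # requires a Spark runtime (pip install pyspark)

# Polars: expression-based API, multithreaded out of the box
pl_out = pl.read_csv("sales.csv").group_by("region").agg(pl.col("amount").sum())

# DuckDB: SQL directly over files (or over in-memory pandas/Polars dataframes)
duck_out = duckdb.sql("""
    SELECT region, SUM(amount) AS amount
    FROM 'sales.csv'
    GROUP BY region
""").df()

# pandas API on Spark: pandas-style code executed on a Spark cluster
ps_out = ps.read_csv("sales.csv").groupby("region")["amount"].sum()
```

Note how the Spark version reads almost exactly like pandas, while Polars and DuckDB each ask you to think in their own idiom.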

📌 Conclusion

I do believe pandas can, and will, be used for ETL. First, because of the richness of its ecosystem and the ubiquity of its API, as demonstrated above. Second, because pandas’ shortcomings are being addressed by initiatives like the pandas API on Spark and Modin, which scale pandas. As usual, use the right tool for the right task: every dataframe library has its strengths and limitations, so don’t get caught up in the dataframe library wars!

Want to learn more? Read my next article: Unlocking the Power of Pandas on Major Cloud Platforms. This article explores the best ways to leverage Pandas for data engineering on major cloud providers: AWS, Azure, GCP, Snowflake and Databricks.

For Amphi ETL, a low-code ETL tool, we decided to use pandas as the primary framework for the reasons mentioned above. Its vast ecosystem makes it a no-brainer for an ETL tool where integration matters. Initiatives like the pandas API on Spark available in Databricks, Modin, and pandas’ support in major data platforms like Snowflake address the scalability issues. Amphi is free and open-source, give it a try!


Thibaut Gourdel

I write about data engineering and ETL. I'm building Amphi, a low-code, Python-based ETL tool for data manipulation and transformation.