Pandas 2.0 vs Pandas 1.3 — Performance Comparison

Apr 7, 2023

Pandas 1.3 vs Pandas 2.0 (with pyarrow). Measurements in seconds.

EDIT/ERRATUM: I made the mistake of combining parse_dates with the pyarrow dtype backend. With parse_dates removed, pyarrow is A LOT faster (40X) at reading the dataset: 15 secs without pyarrow vs 496ms with pyarrow as both dtype backend and engine.

Pandas 2.0 has recently been released, a long-awaited release that implements Apache Arrow as a backend for some data types.

It is still VERY early in the Arrow process, and it’s not used by default: you have to opt in explicitly by passing the dtype_backend argument to a read_* function.
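As a minimal sketch of what opting in looks like, assuming pandas 2.0 or later with pyarrow installed: the tiny inline CSV and the read_posts helper below are made up for illustration, and only the dtype_backend argument itself comes from pandas.

```python
import io

import pandas as pd

try:
    import pyarrow  # noqa: F401
    # dtype_backend only exists from pandas 2.0 onwards.
    HAS_ARROW = int(pd.__version__.split(".")[0]) >= 2
except ImportError:
    HAS_ARROW = False

# Tiny made-up CSV standing in for a real file (one missing score).
CSV_TEXT = "title,score\nShow HN,10\nAsk HN,\n"


def read_posts(use_arrow):
    """Read the sample CSV, optionally asking for Arrow-backed dtypes."""
    kwargs = {"dtype_backend": "pyarrow"} if use_arrow else {}
    return pd.read_csv(io.StringIO(CSV_TEXT), **kwargs)


default_df = read_posts(use_arrow=False)
arrow_df = read_posts(use_arrow=HAS_ARROW)

# Compare how each backend represents the same two columns.
print(default_df.dtypes)
print(arrow_df.dtypes)
```

Printing the dtypes side by side is a quick way to confirm that the Arrow backend was actually picked up rather than silently ignored.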

Still, I started testing it out and comparing it with the standard 1.3.5 implementation. I performed some experiments using both 1.3.5 and 2.0.0 and compared the performance. Rather than switching the dtype_backend argument, I explicitly tested the different versions with different virtualenvs.

The data

I used a large CSV that contains ALL the posts on Hacker News since 2006. It’s around 650MB and it’s available on Kaggle.

The experiment

I just ran a few simple operations, like creating columns, filtering, sorting and creating aggregations. The notebook with the experiments can be found here.
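For readers who want to try something similar, here is a rough sketch of how such operations could be timed. This is not the notebook's code: the time_op helper, the synthetic DataFrame, and its column names are all hypothetical stand-ins for the real dataset.

```python
import time

import pandas as pd


def time_op(label, func, repeats=3):
    """Run func a few times and return (label, best wall-clock seconds)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return label, best


# Synthetic stand-in data; the column names are made up.
df = pd.DataFrame({
    "author": ["a", "b", "c", "d"] * 25_000,
    "score": range(100_000),
})

timings = [
    time_op("create column", lambda: df.assign(double=df["score"] * 2)),
    time_op("filter", lambda: df[df["score"] > 50_000]),
    time_op("sort", lambda: df.sort_values("score", ascending=False)),
    time_op("aggregate", lambda: df.groupby("author")["score"].mean()),
]

for label, seconds in timings:
    print(f"{label}: {seconds:.4f}s")
```

Running the same script inside one virtualenv per pandas version keeps the comparison focused on the library version rather than on code differences.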

The Results (spoiler alert)

As expected, 2.0 performed better with strings and NaNs. There was a 16X improvement with an .isna() operation and a 3.2X improvement with .str.contains(). Other string methods, like .str.startswith() or .str.strip(), saw a 4X improvement.
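The operations mentioned above can be tried side by side on an object column and an Arrow-backed string column. This is only a sketch on three made-up titles, so it shows that both representations give the same answers, not the speedups themselves; it falls back to the plain "string" dtype if pyarrow is not installed.

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401
    STRING_DTYPE = "string[pyarrow]"
except ImportError:
    # Assumption: fall back to pandas' default string dtype without pyarrow.
    STRING_DTYPE = "string"

# Made-up sample titles, including one missing value.
object_titles = pd.Series(
    ["Show HN: pandas 2.0", None, "  Ask HN: Arrow?  "], dtype=object
)
arrow_titles = object_titles.astype(STRING_DTYPE)

results = {}
for name, titles in [("object", object_titles), ("arrow", arrow_titles)]:
    results[name] = {
        # Which rows are missing.
        "missing": titles.isna().tolist(),
        # Plain substring match; missing rows count as no match.
        "has_hn": titles.str.contains("HN", regex=False, na=False).tolist(),
        # Whitespace-trimmed titles, with the missing row dropped.
        "stripped": titles.str.strip().dropna().tolist(),
    }
    print(name, results[name])
```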

To understand why this is happening, check out this post from Marc Garcia (core pandas dev): https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i

On the other hand, filtering on numeric columns was slower (2 times slower on average).

But, surprisingly, aggregation operations were SIGNIFICANTLY (10X) SLOWER. The following aggregation took 40 seconds with Pandas 1.3.5, but ALMOST 7 minutes with Pandas 2.0.0.

Conclusion

As said before, it’s still WAY too early to judge the Arrow backend. I think a good rule of thumb would be to start using the string[pyarrow] type for strings (or StringArrays):

import pandas as pd

df = pd.read_csv("...", dtype={"Column": "string[pyarrow]"})

# Or explicitly:
df["Column"] = df["Column"].astype("string[pyarrow]")

Anything else is more a matter of trial and error for a while, until the performance of filtering and aggregation operations has been analyzed and polished.
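One way to apply that rule of thumb to a whole DataFrame is sketched below: convert only the text columns and leave numeric columns on their existing dtypes. The convert_text_columns helper and the sample data are hypothetical, and the sketch assumes every object column really holds strings, which is worth checking on real data.

```python
import pandas as pd

try:
    import pyarrow  # noqa: F401
    TARGET = "string[pyarrow]"
except ImportError:
    # Assumption: fall back to pandas' default string dtype without pyarrow.
    TARGET = "string"


def convert_text_columns(df, target=TARGET):
    """Return a copy where object columns use `target`; others are untouched."""
    text_cols = df.select_dtypes(include="object").columns
    return df.astype({col: target for col in text_cols})


# Made-up sample: one text column (with a missing value) and one numeric column.
posts = pd.DataFrame({
    "title": pd.Series(["Show HN", None, "Ask HN"], dtype=object),
    "score": [10, 3, 7],
})

converted = convert_text_columns(posts)
print(converted.dtypes)
```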

Written by Santiago Basulto