DataFrame IO Performance with Pandas, dask, fastparquet and HDF5

Bob Haffner
Jul 21, 2017 · 1 min read

EDIT: With the release of Pandas 0.21.0, reading and writing Parquet files is built in. See the docs for more details.

I was working with a fairly large CSV file for an upcoming blog post, and Pandas’ read_csv() was taking ~40 seconds to read it in. The file is 1.7 GB on disk with roughly 12 million rows, containing a month of the popular NYC Taxi data.
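A quick sketch of the kind of timing involved (the file here is a tiny synthetic stand-in for the 1.7 GB taxi CSV, so the numbers won’t match):

```python
import time
import pandas as pd

# Write a small synthetic CSV standing in for the taxi file
pd.DataFrame(
    {"trip_distance": [1.2, 3.4, 0.8], "fare_amount": [6.5, 14.0, 5.0]}
).to_csv("trips.csv", index=False)

start = time.perf_counter()
df = pd.read_csv("trips.csv")  # the real 12M-row file took ~40 s
elapsed = time.perf_counter() - start
print(f"read_csv took {elapsed:.3f}s for {len(df):,} rows")
```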

Forty seconds isn’t too bad the first time, but I knew I would be reliving that 40 seconds with every relaunch of my Jupyter notebook. Time to make things a little faster.

I’ve been impressed with HDF5 read performance in the past, so I decided to use it again here. I also remembered a blog post from Continuum Analytics introducing fastparquet, and this seemed like a good time to try it. And I might as well see what dask can do.

The short story: HDF5 has the fastest IO, with the compression nod going to fastparquet. Check out my Jupyter notebook below for the longer story.

Let me know if you have any suggestions on improving the performance or compression of any of the methods I tried.

Cheers,
Bob

