DataFrame IO Performance with Pandas, dask, fastparquet and HDF5
EDIT: With the release of Pandas 0.21.0, support for reading and writing Parquet files is built in. See the docs for more details.
I was working with a fairly large CSV file for an upcoming blog post, and Pandas’ read_csv() was taking ~40 seconds to read it in. The file is 1.7 GB on disk, with roughly 12 million rows covering a month of the popular NYC Taxi data.
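A rough sketch of that kind of timing, using a small synthetic table standing in for the taxi file (the columns and row count here are my own placeholders, not the real data):

```python
import io
import time

import numpy as np
import pandas as pd

# Generate a small synthetic CSV in memory; the real file has ~12 million rows.
n_rows = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "pickup_hour": rng.integers(0, 24, n_rows),
        "trip_distance": rng.uniform(0.5, 30.0, n_rows).round(2),
        "fare_amount": rng.uniform(2.5, 100.0, n_rows).round(2),
    }
)

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

# Time the parse itself, the step that cost ~40 seconds on the full file.
start = time.perf_counter()
parsed = pd.read_csv(buf)
elapsed = time.perf_counter() - start
print(f"read_csv of {len(parsed):,} rows took {elapsed:.3f}s")
```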
Forty seconds isn’t too bad the first time, but I knew I would be reliving that 40 seconds with every relaunch of my Jupyter notebook. Time to make things a little faster.
I’ve been impressed with HDF5 read performance in the past, so I decided to use it again here. I also remembered a blog post from Continuum Analytics introducing fastparquet, and this seemed like a good time to try it. And I might as well see what dask can do.
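The HDF5 route can be sketched with pandas’ built-in to_hdf()/read_hdf() (requires the PyTables package; the key name "taxi" and the blosc compression settings are my choices, not necessarily what the notebook uses):

```python
import os
import tempfile

import pandas as pd

# Placeholder data; the notebook runs this on the full taxi DataFrame.
df = pd.DataFrame(
    {
        "trip_distance": [1.2, 3.4, 0.8],
        "fare_amount": [6.5, 14.0, 5.0],
    }
)

path = os.path.join(tempfile.gettempdir(), "trips.h5")

# Write a compressed HDF5 file, then read it back.
df.to_hdf(path, key="taxi", mode="w", complib="blosc", complevel=9)
back = pd.read_hdf(path, key="taxi")
```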
The short story: HDF5 has the fastest IO, with the compression nod going to fastparquet. Check out my Jupyter notebook below for the longer story.
Let me know if you have any suggestions on improving the performance or compression of any of the methods I tried.
Cheers,
Bob
