agyaglikci
Sep 4, 2018 · 1 min read

Thanks for the write-up, Luciano. I’ve been using pandas for a year now, but I hadn’t heard of Dask before. I think it will be pretty useful for me.

I often need to collect some simple stats from my zipped logs, each of which is a few GB in size. The complexity of my Python code can reach O(n^2).

I recently noticed that when I implement a script in C++ instead, the execution time drops drastically, from the order of days to the order of minutes. CPU utilization also goes down a lot. I had assumed that pandas is optimized for vectorized processing and is built on compiled C code, so the performance difference was surprising. I hope to recover part of it with Dask’s parallelism.

I’d like to ask for your recommendations on other tools or libraries that can run simple queries on large datasets with near-C++ performance.

Thanks again!
