Thanks for the write-up, Luciano. I’ve been using pandas for a year now, but I hadn’t heard of dask before. I think it will be pretty useful for me.
I often need to collect some simple stats from my zipped logs, each of which is a few GB in size. The complexity of my Python code can reach O(n^2).
I recently noticed that when I implement a script in C++, the execution time drops drastically, from the order of days to the order of minutes. The CPU utilization also goes down a lot. I had assumed that pandas is optimized for vectorized processing and is built on compiled C code, so the performance difference was surprising. I hope to recover part of it with dask’s parallelism.
I’d like to ask for your recommendations on other tools/libraries that can run simple queries on large datasets with near-C++ performance.
Thanks again!
