Dask — Parallelism for Analytics at Scale
Dask is one of the wonderful tools in the Python ecosystem for scaling data workloads to datasets that do not fit in memory on a typical workstation. In this post I will outline why I find it useful and why it works so well at scaling the existing Python packages.
Dask at its heart is a parallel computing library for Python. While there are other parallel computing frameworks out there, such as Apache Spark, one of the key advantages of Dask is how it extends, almost incrementally, the traditional data structures such as arrays and dataframes made available by popular packages such as NumPy and Pandas for in-memory processing. Since Dask natively supports NumPy, Pandas and Scikit-learn workflows, it is a powerful tool for data scientists and data engineers to develop and customize data pipelines and models at scale. In addition, Dask supports both multi-core execution (which means you can make optimal use of existing compute power) and distributed parallel execution (similar to Spark, where data is processed in parallel across a cluster).
Data Structures and APIs in Dask
There are three main types of data structures, or collections, in Dask. They are as follows:
1. Dask Array — think NumPy arrays with similar syntax, but distributed across chunks.
2. Dask Bag — similar to the low-level PySpark Resilient Distributed Dataset (RDD) data structure.
3. Dask DataFrame — a large parallel dataframe composed of smaller Pandas DataFrames, extending the same syntax.
In terms of APIs there are other components built on Dask such as Dask-ML which is very much similar in syntax to Scikit-Learn. There is also the Delayed interface which can act as a wrapper or decorator function to extend general Pythonic code in Dask.
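The Delayed interface mentioned above can be sketched as follows. The function names here (`increment`, `add`) are my own illustrative choices; the `dask.delayed` decorator itself is real Dask API.

```python
import dask

@dask.delayed
def increment(x):
    return x + 1

@dask.delayed
def add(a, b):
    return a + b

# Nothing runs yet: calling delayed functions only records tasks in a graph.
a = increment(1)
b = increment(2)
total = add(a, b)

# compute() executes the graph, potentially running the two
# increment() calls in parallel: (1 + 1) + (2 + 1) = 5
result = total.compute()
```

Because ordinary Python functions are wrapped rather than rewritten, this is a convenient way to parallelize existing code that does not fit the array or dataframe collections.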
Under the hood, Dask works by generating task graphs, which are then executed on parallel hardware.
Dask has in general two broad types of scheduler:
- Single machine scheduler: This scheduler provides basic features on a local process or thread pool, and is the default. It is simple to use, but it only runs on a single machine and does not scale beyond it.
- Distributed scheduler: This scheduler can run locally or distributed across a cluster. It is more sophisticated and offers more features, but it also requires a bit more effort to set up.
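The choice of scheduler can be made per call. A minimal sketch, assuming Dask is installed (the distributed part is shown commented out, since it additionally requires the `distributed` package):

```python
import dask.array as da

x = da.ones((1000,), chunks=(100,))

# Single-machine schedulers, selected with the `scheduler` keyword:
r1 = x.sum().compute(scheduler="threads")      # thread pool (default for arrays)
r2 = x.sum().compute(scheduler="processes")    # process pool
r3 = x.sum().compute(scheduler="synchronous")  # single-threaded, good for debugging

# Distributed scheduler: once a Client is created, it becomes the default.
# from dask.distributed import Client
# client = Client()  # starts a local cluster; pass an address to join a remote one
# r4 = x.sum().compute()
```

In practice the synchronous scheduler is handy for stepping through problems with a debugger, while the distributed scheduler adds a diagnostic dashboard on top of cluster execution.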
This was just a very brief introduction to the world of Dask, and I hope it helps you see, in summary, how it extends the existing 'single-node' ecosystem to analyze data at scale in parallel.