Dask on Azure Databricks

Chris Gell
Published in BehindTheWires
Sep 30, 2019 · 5 min read

We use Azure Databricks for building data ingestion, ETL and Machine Learning pipelines. Databricks provides users with the ability to create managed clusters of virtual machines in a secure cloud environment, a Notebook interface to run code on those virtual machines, as well as a REST API to automate the running of notebooks and the creation of other resources within the environment.

Unsurprisingly, Databricks clusters are deployed with Apache Spark pre-configured (go here to learn more about Databricks). According to our friend Wikipedia,

“Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance”.

However, like many large organisations, we have teams with differing levels of experience with a wide variety of tools and frameworks. One size doesn’t fit all! The goal of our team is to create the best possible user experience for Data Scientists across our organisation, whilst leveraging the platforms we already have in place as much as possible.

For those not familiar, Dask is a pure Python framework that describes itself as providing advanced parallelism for analytics, enabling performance at scale; for more information please visit dask.org. Recently an internal project made heavy use of Dask during development, and the project team wanted to scale out their use of Dask from local desktops to clusters. To support this project, our goal was to enable users to create distributed Dask clusters without going through the effort of building our own infrastructure.

In this article, we wanted to share how we have leveraged Azure Databricks’ ability to let users create managed clusters of VMs in the cloud, and how we configure those clusters as distributed Dask environments. We will also outline our experiences as well as some shortcomings of the solution.

Dask Cluster Set-Up

There are different options for starting a Dask cluster; we decided to use the command-line approach outlined here. In the context of an Azure Databricks cluster (see diagram) we wanted to achieve the following:

(In the diagram, M = master node, W = worker node.)
  • The ‘dask-scheduler’ process running on the master (driver) node
  • ‘dask-worker’ processes running on the worker nodes

Fortunately for us, Databricks provides the ability to run scripts during the start-up of every VM inside the cluster, along with a bunch of handy environment variables. To set up our distributed Dask cluster we created the init script below, which:

  1. Creates a directory named after the cluster ID for storing log and PID files
  2. Installs Dask and a specific version of Pandas that works with Dask
  3. Starts the dask-scheduler or dask-worker processes and saves their PIDs (for reference should you need them)
#!/bin/bash
# CREATE DIRECTORY ON DBFS FOR LOGS
LOG_DIR=/dbfs/databricks/scripts/logs/$DB_CLUSTER_ID/dask/
HOSTNAME=`hostname`
mkdir -p $LOG_DIR
# INSTALL DASK AND OTHER DEPENDENCIES
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -y dask
conda install -y pandas=0.23.0
# START DASK - ON THE DRIVER NODE START THE SCHEDULER PROCESS,
# ON WORKER NODES START WORKER PROCESSES
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  dask-scheduler &>/dev/null &
  echo $! > $LOG_DIR/dask-scheduler.$HOSTNAME.pid
else
  dask-worker tcp://$DB_DRIVER_IP:8786 --nprocs 4 --nthreads 8 &>/dev/null &
  echo $! > $LOG_DIR/dask-worker.$HOSTNAME.pid
fi

To find out how to create an init script in Azure Databricks, please refer to the official documentation here.

Databricks Cluster Configuration

Below is our cluster configuration. Two points to highlight (a sketch of automating this via the REST API follows the list):

  1. Runtime: we needed to use the Conda-based runtime, which at the time of writing is still in beta
  2. Init script configuration: we need to configure our cluster to execute our init script on startup
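For readers who prefer to automate this, the sketch below shows roughly how such a cluster could be created through the Databricks Clusters REST API. This is a hypothetical example rather than our production set-up: the workspace URL, token, node type, runtime version string and init script path are all placeholders you would need to adjust.

import requests

# Placeholder values - swap in your own workspace URL, token, node type,
# runtime version (a Conda-based runtime) and init script location.
payload = {
    "cluster_name": "dask-cluster",
    "spark_version": "5.5.x-conda-scala2.11",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/scripts/dask-init.sh"}}
    ],
}

response = requests.post(
    "https://<your-workspace>.azuredatabricks.net/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
response.raise_for_status()
print(response.json())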

And that’s it. To use Dask in a notebook, add the code snippet below:

from dask.distributed import Client
# the notebook runs on the driver node, so the scheduler is reachable on localhost
c = Client('127.0.0.1:8786')
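Once the client is connected, work can be submitted to the cluster straight from the notebook. As a quick, purely illustrative sketch (the function and numbers are made up):

def square(x):
    return x ** 2

# c is the Client created above; each call runs on one of the dask-workers
future = c.submit(square, 10)
print(future.result())            # 100

futures = c.map(square, range(100))
print(sum(c.gather(futures)))     # 328350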

A quick and dirty check

Running a quick test and checking the Ganglia metrics UI in Databricks shows us that all worker nodes are being used to run computations.
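For reference, the kind of test we mean is a large, chunked computation that the scheduler fans out across every worker; the sketch below (the array sizes are arbitrary) is enough to light up all the nodes in Ganglia:

import dask.array as da
from dask.distributed import Client

c = Client('127.0.0.1:8786')

# a 20,000 x 20,000 random array split into 1,000 x 1,000 chunks
x = da.random.random((20000, 20000), chunks=(1000, 1000))
print((x + x.T).mean().compute())

# confirm the scheduler can see every worker node
print(len(c.scheduler_info()['workers']))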

Some things to consider

One of the most unclear aspects of using Dask was making sure libraries are consistent between the worker and driver nodes. Databricks has at least three different ways of installing libraries:

  1. Using the Databricks UI
  2. Using %sh magic commands to run pip install, /databricks/python/bin/pip install or conda install
  3. Using dbutils

After some trial and error we went with adding conda/pip install statements to the cluster init script. However, this approach did add a considerable amount of time to the cluster start-up.
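One check that helps here (a small sketch, not part of our init script) is Dask’s built-in version comparison, which asks the scheduler and every worker to report their installed packages:

from dask.distributed import Client

client = Client('127.0.0.1:8786')

# check=True raises a ValueError if the client, scheduler and workers
# report mismatched package versions (e.g. different pandas builds)
client.get_versions(check=True)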

The second point to consider is the Dask UI. Dask comes with a cluster monitoring dashboard (see here). The Azure Databricks Web App does proxy requests through to processes running on the cluster driver node. This is to support items such as:

  • RStudio Server on Azure Databricks (here)
  • TensorBoard on Azure Databricks (here)

However, there are limitations (such as no support for websockets) that prevent the Dask UI from working correctly this way. This is a shame, as Azure Databricks has the potential to be a very powerful and flexible general-purpose Data Science and Data Engineering platform for large enterprises. It is just not quite there yet.

Our recommendation is to start Dask without the UI running, to ensure that no potential security issues arise from processes on the driver node being accessible from the web.

Conclusions

So far, the overall experience of using Dask on Databricks has been pleasant. In a large enterprise, the ability to let users self-serve their own compute and configure it to use a variety of tools and frameworks, whilst leveraging the security and manageability provided by a PaaS solution, is very powerful. This exercise proves that we can leverage the Azure Databricks platform for more than Spark.

However, it is not a completely polished solution; managing library versions and the fact that the Dask UI doesn’t work properly take some of the shine away. Some of this might be down to our own knowledge of the platform, and as we learn more we will update this post. Equally, if any readers of this post can help us improve on what we have done, we’d love for you to get in touch.
