Custom Libraries in Databricks: a SparkR Example with LightGBM
Co-authors: Swami Venkatesh, PhD and Dagar Katyal
A problem and workaround for a custom solution
We were prompted to write this essay after working through an issue that was part of a larger project involving a custom-built machine-learning application for a Concurrency client. Specifically, the issue relates to leveraging custom libraries or packages in a way that conforms to a specific machine-learning design.
The Machine-learning Algorithm and Basic Design
This particular machine-learning application was required to identify and describe anomalous customer behavior throughout each day. Because the transactions occur B2B, with hundreds of customers each invoking hundreds of transactions per day, our preliminary analysis suggested that the best approach was to model the behavior of each customer individually and flag behavior that deviates significantly. The implication is that each customer requires its own mathematical model. The machine-learning design is beyond the scope of this essay, but a component of the solution involved the use of R for statistical modeling and Microsoft’s LightGBM library (see https://lightgbm.readthedocs.io/en/latest/).
A small (but important) distinction
So far, this article may have raised an eyebrow, given that Microsoft has built a custom gradient-boosting algorithm for Spark (https://github.com/Azure/mmlspark). Libraries like mmlspark are designed for distributed model training, whereas our architecture requires training many independent models across a distributed dataset partitioned by customer. This architecture is well suited for Spark since the feature space is fairly large and customer behavior histories are considerably long. As a result, our grid-search cross-validation algorithm takes many minutes per customer on a single virtual machine (3200% CPU utilization; 32 cores, 128 GB RAM; Standard_D32s_v3 in Azure), and this duration was prohibitively long for hundreds of customers. With Databricks, we can easily spin up and spin down a Spark cluster for highly parallelized model training, dramatically reducing total training time.
Loading a custom library (like lightgbm) onto all compute nodes
There are two primary ways to accomplish this: 1) building an “init script” (Figure 1), which can automate the downloading and installing of many packages on each node at cluster start-up, and 2) compiling the package remotely and storing it within the distributed file system (Databricks’ proprietary filesystem is appropriately named dbfs; Figure 2). We don’t recommend the first approach, simply because packages (and GitHub repos) can (and will) change, and init scripts will eventually fail given that they are written against a specific OS and a specific set of available packages. The second approach protects you from most of these pitfalls, and the package will always be the same.
An aside on init scripts
An “init script” is a shell script that runs during startup on each cluster node, before the Spark driver or JVM starts. You can use this shell script to carry out all the steps necessary to make the package load on every component of the cluster, including installing any dependencies the R package requires that are not included in the standard build of R. Init scripts are loaded on the Databricks page by clicking the ‘Clusters’ icon in the left panel and selecting the cluster in use. Under the ‘Configuration’ tab, click ‘Advanced Options’, where you will find the ‘Init Scripts’ tab. To link the path to your shell script, follow the instructions at https://docs.azuredatabricks.net/user-guide/clusters/init-scripts.html#cluster-scoped-init-scripts. Anecdotally, we tested this approach; since Databricks is essentially managing a set of Linux VMs, we wrote an init script, and the final shell script was about 20 lines of tech debt just to ensure that lightgbm would work. We abruptly abandoned this approach.
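For illustration, a minimal cluster-scoped init script might look like the sketch below. The dependency names and build steps are assumptions (they reflect the kind of steps LightGBM’s R package has required for an Ubuntu-based node, not an exact recipe), and this is precisely the kind of script that breaks when the upstream repo changes:

```
#!/bin/bash
# Hypothetical init script: build LightGBM's R package on each node at start-up.
# Runs before the Spark driver starts; any failure here aborts cluster creation.
set -e

# Build tools LightGBM needs (package names assumed for an Ubuntu-based node)
apt-get update && apt-get install -y cmake build-essential git

# Clone and build the R package from source -- a moving target, which is
# exactly why we abandoned this approach
git clone --recursive https://github.com/Microsoft/LightGBM.git /tmp/LightGBM
cd /tmp/LightGBM
Rscript build_r.R
```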
However, if you want to use a package that isn’t on CRAN (and isn’t built for a distributed system), here’s how you can do it. Remember, the package we want to load into the cluster needs to be distributed across the master and worker nodes to enable independent processing on each shard.
To cover your bases, we recommend compiling the package on a Linux OS, since all Databricks worker nodes run Linux (see https://github.com/Microsoft/LightGBM/tree/master/R-package for installation details). Once you’ve installed the package, navigate to the directory in which the package resides, which can be found via:
~$ R -e '.libPaths()'
[1] "/usr/local/lib/R/site-library"
~$ cd /usr/local/lib/R/site-library
You can check that the library is indeed present using something like:
~$ ls -lrt | grep lightgbm
Then, compress the folder using tar:
~$ tar -zcvf ~/lightgbm.tar.gz ./lightgbm
Now that we have the compressed archive file, we need to move it to our specific dbfs (see https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html#access-dbfs-with-the-databricks-cli for details on installing the Databricks CLI for transferring data). With the Databricks CLI, we can use the following command to move the archive from the local file system to the remote dbfs (assuming that the folder “r-packages” has already been created):
~$ dbfs cp ~/lightgbm.tar.gz dbfs:/r-packages/
Test the package in a Databricks R notebook
First, test that the package installs and loads successfully on the head node.
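DBFS paths are mounted under /dbfs on every node, so the archive can be installed directly from there. A minimal sketch (the dbfs:/r-packages/ path matches the CLI step above; everything else is standard R):

```
# Install the pre-built archive from DBFS on the driver (head) node.
# DBFS is mounted locally at /dbfs on each node.
install.packages("/dbfs/r-packages/lightgbm.tar.gz", repos = NULL, type = "source")

# If this succeeds, the package is usable on the driver
library(lightgbm)
```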
SparkR relies on its own user-defined function (UDF — more on this in a future essay) mechanism to run a more sophisticated function on a large dataset. In this case, a cross-validation function on a lightgbm model is to be run on the data describing the behavior of a single customer. Worker nodes cannot make use of custom packages (like lightgbm) unless the package is installed and loaded within the UDF itself (Figure 4).
Finally, we can test this by calling the UDF through gapplyCollect() and handing it a bogus dataframe with a column named “a” (Figure 5).
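In code, the pattern looks roughly like the sketch below. The column name “a” and the bogus dataframe are per the test described above; the install path matches the earlier CLI step, while the function name and body are illustrative placeholders:

```
library(SparkR)

# UDF to be shipped to the workers; the install/library calls must live
# *inside* the function so each worker installs and loads lightgbm itself
train_udf <- function(key, df) {
  install.packages("/dbfs/r-packages/lightgbm.tar.gz",
                   repos = NULL, type = "source")
  library(lightgbm)
  # ...per-customer model training / cross-validation would go here...
  data.frame(a = df$a, loaded = "lightgbm" %in% loadedNamespaces())
}

# Bogus dataframe with a single column named "a", just to test the plumbing
sdf <- createDataFrame(data.frame(a = 1:10))
result <- gapplyCollect(sdf, "a", train_udf)
```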
Now, the subsequent code we write within the UDF can leverage all capabilities of the lightgbm R-package.
Things to come
Given that you would invest tens of thousands of dollars in a distributed file system, you would want better compression of the data so that more of it can be stored in less space. Traditional storage in raw, human-readable file formats like CSV or JSON is inefficient and does not lend itself to parallel processing, which makes working with this data very hard. The Parquet file format, built for Hadoop and other distributed file systems, uses a flat columnar layout, which makes both storage and query performance more efficient. Keep an eye on this blogspace for more blogs on how to build a framework for forecasting supply stocking during a promotion period using Databricks!
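As a taste of what’s to come, persisting a SparkR DataFrame as Parquet is a one-liner; the path below is illustrative:

```
library(SparkR)

# Toy DataFrame; any SparkR DataFrame works the same way
df <- createDataFrame(mtcars)

# Persist as Parquet on dbfs; the columnar layout compresses well and splits cleanly
write.df(df, path = "dbfs:/data/mtcars.parquet", source = "parquet", mode = "overwrite")
```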