Databricks Best Practices: Revamped

Matt Weingarten
4 min read · Jan 10, 2023


Goodbye Yellow Brick Road (I’m starting to run out of song titles here)

Introduction

I’ve written quite a few posts on Databricks in the past, but the best practices post I made back in August was premature. Since then, I’ve discovered many new cool and exciting ways to take advantage of Databricks, so I thought it was worth putting together a summarized post detailing everything I’ve noted so far.

Jobs

  • Service principals should be the owners of all production jobs so that permissions stay intact whenever individuals leave the team or company. Make sure the principals have access to the underlying notebooks as well for this to work properly.
  • Jobs should be integrated with Git so that they pull the latest changes from version control. Service principals can be connected with Git by giving them proper token-level access to all the necessary repos where notebooks will be stored.
  • For jobs that have multiple tasks, task values should be used so that parameters only need to be specified at the beginning of a job. Typically, we’ve had the first task of a multi-task workflow place parameters into task values so that all other tasks can pull them as needed. Make sure to set the default and debugValue for these variables so that individual notebook-level testing can still take place.
  • Use job clusters so that all compute is focused on the job itself and not being shared like in interactive clusters. This also gives you the freedom to use interactive clusters as more of a sandbox environment while job clusters are meant for production solely.
  • Send notifications on job failure so that you don’t always have to be monitoring in the Databricks UI. We currently use a PagerDuty-connected email for this (since our PagerDuty alerts already go to a Slack channel), but Slack and other HTTP-based webhooks are supported as well. You can also send notifications when a duration threshold (essentially an SLA) is exceeded. Unlike Airflow’s SLA, this definition is accurate and therefore is definitely worth using.
  • Tag jobs so that it’s easier to find them in the UI, similar to what I’ve recommended in the past for Airflow.
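The task-values pattern described above can be sketched as follows. In a Databricks notebook, `dbutils.jobs.taskValues.set(...)` and `dbutils.jobs.taskValues.get(...)` provide this behavior; since `dbutils` only exists inside Databricks, this sketch uses a small local stand-in so the pattern (including the debugValue fallback for notebook-level testing) can be tried anywhere. All names here are illustrative.

```python
class _TaskValuesStandIn:
    """Local stand-in mimicking dbutils.jobs.taskValues (not a Databricks API)."""

    def __init__(self, running_in_job: bool):
        self._running_in_job = running_in_job
        self._values = {}

    def set(self, key, value):
        # In a real job, values set by one task become visible to downstream tasks.
        self._values[key] = value

    def get(self, taskKey, key, default=None, debugValue=None):
        # Outside a job run, the real API returns debugValue (and errors if it
        # was never provided); inside a job, it falls back to default when the
        # upstream task never set the key.
        if not self._running_in_job:
            if debugValue is None:
                raise ValueError("debugValue must be set for notebook-level testing")
            return debugValue
        return self._values.get(key, default)


# First task of the workflow publishes shared parameters once...
task_values = _TaskValuesStandIn(running_in_job=True)
task_values.set("run_date", "2023-01-10")

# ...and downstream tasks pull them as needed.
run_date = task_values.get(
    taskKey="setup", key="run_date", default="1970-01-01", debugValue="2023-01-01"
)

# During notebook-level testing (no job context), debugValue is returned instead.
debug_run_date = _TaskValuesStandIn(running_in_job=False).get(
    taskKey="setup", key="run_date", default="1970-01-01", debugValue="2023-01-01"
)
```

The key takeaway is that setting both default and debugValue lets the same notebook run unchanged in a production job and in ad-hoc testing.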
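Several of the other settings above (service principal as the run-as identity, Git integration, failure notifications, a duration threshold, and tags) can be expressed together in a Jobs API payload. Below is a hedged sketch of such a payload as a Python dict; the job name, service principal, repo URL, and email address are placeholders, and exact field names can vary by API version.

```python
# Illustrative Jobs API-style payload; all identifiers below are placeholders.
job_settings = {
    "name": "daily-aggregation",
    # Run as a service principal so permissions survive team turnover.
    "run_as": {"service_principal_name": "prod-service-principal-id"},
    # Pull notebooks from version control on every run.
    "git_source": {
        "git_url": "https://github.com/example-org/etl-notebooks",
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    # Route failures to a PagerDuty-connected email (webhooks also supported).
    "email_notifications": {
        "on_failure": ["pagerduty-intake@example.com"],
    },
    # Duration threshold (essentially an SLA): alert past ~2 hours.
    "health": {
        "rules": [
            {"metric": "RUN_DURATION_SECONDS", "op": "GREATER_THAN", "value": 7200}
        ]
    },
    # Tags make jobs easier to find in the UI.
    "tags": {"team": "data-eng", "env": "prod"},
}
```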

Libraries

  • Common code that can be reused between notebooks/jobs should be stored in libraries. These libraries can then be called as needed so that code doesn’t have to be duplicated.
  • When it comes to job clusters, make sure the needed libraries are installed as dependent libraries, as you won’t be able to have them baked into the cluster like you can with interactive clusters.
  • Have proper CI/CD so that library changes don’t modify the DBFS path for workspace libraries. By default, uploading the changes in the Databricks UI will result in a new random DBFS location being generated, which means that all job definitions and cluster specifications would need to be updated accordingly. If you can instead use S3 to host the libraries, that’s even better.
  • If possible, bake the libraries into the Docker image being used to build the cluster(s) (assuming Docker images are being used).
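Attaching dependent libraries to a job cluster can be sketched as part of the task definition in a Jobs API-style payload. The sketch below is illustrative; the notebook path, S3 wheel location, and package versions are placeholders, and hosting the wheel in S3 (rather than DBFS) follows the CI/CD point above.

```python
# Illustrative task definition with dependent libraries; paths are placeholders.
task_with_libraries = {
    "task_key": "transform",
    "notebook_task": {"notebook_path": "notebooks/transform"},
    # Ephemeral job cluster: libraries cannot be pre-installed on it the way
    # they can on an interactive cluster, so declare them here instead.
    "new_cluster": {
        "spark_version": "11.3.x-scala2.12",
        "node_type_id": "m5.xlarge",
        "num_workers": 2,
    },
    "libraries": [
        # Shared internal code, hosted in S3 so the path stays stable across releases.
        {"whl": "s3://example-bucket/libs/shared_utils-1.2.0-py3-none-any.whl"},
        # Third-party dependency pinned to a known-good version.
        {"pypi": {"package": "requests==2.28.1"}},
    ],
}
```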

Cost-Savings

Note that I talked about many of these practices in a previous post.

  • With clusters, use auto AZ so that they’re as reliable as possible. With Spot instances, this will also choose the cheapest AZ to launch the clusters in.
  • Spot instances provide significant cost-savings compared to on-demand. You should definitely be using this in non-production environments, and even for non-critical jobs in production environments too.
  • If using general-purpose SSD for EBS, gp3 volumes provide significant cost-savings compared to gp2, and can be switched on in the workspace settings. Just make sure account limits are configured accordingly.
  • Enable autoscaling local storage so that you don’t have to allocate a fixed number of EBS volumes on each cluster.
  • Enable autoscaling so that the cluster only uses the resources it needs when running.
  • Graviton-based clusters offer the best price-to-performance ratio for processors compared to other instance types.
  • The Photon runtime offers many performance benefits on top of regular Spark, along with a strong price-to-performance ratio. You will definitely want to test whether Photon provides benefits before switching it on, but it could be a gamechanger depending on your workload.
  • Right-size your clusters with the instance type and number of instances in your autoscaling configuration so that you’re keeping costs controlled.
  • Use fleet instance types to provide the highest availability for your jobs, resulting in fewer failures and hopefully lower costs as a result.
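Most of the cost levers above come together in the cluster spec itself. Below is a hedged sketch of a cost-conscious spec as a Python dict, using field names from the Databricks Clusters API on AWS; the instance type, autoscaling bounds, and Spark version are examples rather than recommendations, and gp3 is enabled in workspace settings rather than here.

```python
# Illustrative cost-conscious cluster spec (AWS); values are examples only.
cluster_spec = {
    "spark_version": "11.3.x-scala2.12",
    "runtime_engine": "PHOTON",          # test Photon on your workload first
    "node_type_id": "m6g.xlarge",        # Graviton example; fleet types such as
                                         # "m-fleet.xlarge" are another option
    "autoscale": {                       # right-size these bounds per workload
        "min_workers": 2,
        "max_workers": 8,
    },
    "enable_elastic_disk": True,         # autoscaling local storage: no fixed
                                         # EBS volume count per cluster
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # spot pricing, on-demand fallback
        "first_on_demand": 1,                  # keep the driver on-demand
        "zone_id": "auto",                     # auto AZ picks the best/cheapest AZ
    },
}
```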

Conclusion

This should serve as a good compilation of all my best practices for Databricks, for now at least. I’m sure we’ll have a Revamped Part II soon enough. Either way, I always look forward to learning more through trial and error.


Matt Weingarten

Currently a Data Engineer at Samsara. Previously at Disney, Meta and Nielsen. Bridge player and sports fan. Thoughts are my own.