Managing Databricks: Asset Bundles

Matt Weingarten
3 min read · Oct 13, 2023

Bricks on bricks

Introduction

Remember when I mentioned that Databricks was cooking up a process for better workflow management in their quarterly roadmap call? Well, it appears to finally be here (or in Public Preview at least).

Asset bundles are a way to follow proper software development practices when it comes to handling all your Databricks resources. I was naturally ecstatic to see this available and had to give it a spin.

Getting Started

I decided to take a crack at creating a job via asset bundles by following the corresponding documentation. I took an existing notebook, put it into a new directory, and then defined the job configuration in the databricks.yml file (interesting that Databricks went with YAML here when jobs are represented as JSON in the UI). For the job configuration details, I simply converted my already-existing job definition from JSON to YAML, and then deployed the asset bundle with one simple command. Sure enough, both my notebook and job showed up in Databricks very quickly. Easy!
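
To make this concrete, here's a minimal sketch of what the databricks.yml looked like. The bundle, job, and notebook names are placeholders, and the cluster settings are illustrative rather than copied from my actual job:

```yaml
# databricks.yml — minimal sketch; all names and cluster settings are placeholders
bundle:
  name: my_asset_bundle

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: run_notebook
          notebook_task:
            notebook_path: ./src/my_notebook.py
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: i3.xlarge
            num_workers: 2

targets:
  dev:
    workspace:
      host: https://<your-workspace-url>
```

The "one simple command" was `databricks bundle deploy -t dev` (with `databricks bundle validate` as a handy sanity check beforehand).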

As I’ve mentioned before, we already have a similar process in place for creating Databricks jobs. In fact, the job configuration is essentially the same since they’re both YAML-based and leverage the API to actually instantiate the jobs. Now let’s get into comparing them.

Asset Bundles: The Good

Sure, I’m not an expert on asset bundles (but then again, who would be, since they just became available?). But getting started was pretty straightforward, which definitely isn’t the case for a custom-built predecessor like ours. Letting Databricks own that heavy lifting allows platform teams to focus on other critical items.

Another benefit of this approach is that its use cases aren’t restricted to data engineers. Sure, the announcement blog post frames asset bundles around software engineering, but this is something basic enough for analysts to take advantage of, especially if proper templating is put in place for common job configurations. As someone who has been focusing on cost savings in Databricks, one of the biggest challenges I’ve faced is how many of our jobs aren’t configured through version control. It’s a lot easier to change a few lines of code and deploy than to edit a million things in the UI. If that’s value asset bundles can unlock for our teams, then all the more power to this offering.
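
As one illustration of what that templating could look like: to my understanding, bundles support variables with ${var.…} substitution, so a platform team could publish a skeleton like the sketch below and analysts would only need to touch a line or two. All the names here are hypothetical:

```yaml
# Hedged sketch — assumes bundle variable substitution; all names are hypothetical
variables:
  node_type:
    description: Default worker instance type for team jobs
    default: i3.xlarge

resources:
  jobs:
    nightly_report:
      name: nightly_report
      tasks:
        - task_key: main
          notebook_task:
            notebook_path: ./notebooks/report.py
          new_cluster:
            spark_version: 13.3.x-scala2.12
            node_type_id: ${var.node_type}
            num_workers: 2
```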

Asset Bundles: The Bad

For starters, jobs are by default owned by whoever creates them. However, data teams should want all of their jobs to be owned by a service principal so they never run the risk of losing access due to permission issues. To accomplish this, the user would either need to make the service principal the owner after the fact (editing a service principal’s permissions is quite the exercise in Databricks), or a developer would need to authenticate as the service principal and deploy that way (which is a security concern if you ask me). Maybe this can be managed cleanly if the platform team sets up the repos to handle it? That’s something worth considering at least.
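
For what it’s worth, if bundles support (or eventually support) a run_as-style setting, the ownership problem might shrink to a couple of lines in databricks.yml. The application ID below is obviously a placeholder:

```yaml
# Hedged sketch — assumes a run_as mapping is available in bundles
run_as:
  service_principal_name: "00000000-0000-0000-0000-000000000000"  # placeholder app ID
```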

It also seems difficult to have proper separation of your job configurations. In our current approach, we have one file per job, but asset bundles expect a single configuration file at the directory level. So you can only accomplish that level of separation with a lot of different directories, or else keep everything in one giant file. This isn’t a huge deal, but having readable files is something I do think is very helpful.
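
One possible mitigation, assuming the include mapping works the way I’d expect, is to keep databricks.yml itself tiny and pull each job’s definition in from its own file. The paths below are hypothetical:

```yaml
# Hedged sketch — assumes bundle `include` support; paths are hypothetical
include:
  - resources/job_ingest.yml
  - resources/job_report.yml
```

If that works as advertised, it would get us back to roughly the one-file-per-job layout we have today.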

Conclusion

All in all, asset bundles are a big win for overall ease of use when it comes to doing Databricks the right way. There are still a few things I think can be improved (which is why we have Public Preview periods), so we’ll see what it looks like when it goes GA in a few months.
