Databricks Compute Policies: Take Control

Matt Weingarten
3 min read · Jan 9, 2024


Created by Chiefs wide receivers

Introduction

I’ve written a lot of posts in the past about cost savings and overall best practices within Databricks, but I’ve never really focused on Databricks from an administrative standpoint. Imagine you’re on a platform team responsible for providing a Databricks environment to tens or even hundreds of teams. To do that successfully, you’re going to need various controls in place.

Compute policies are one of the main levers for controlling what users can do in Databricks. I’m going to briefly introduce what compute policies are, and then show how they can be used to enforce good standards throughout Databricks.

What Are Compute Policies?

When you create a cluster in Databricks, you’ll see a “policy” dropdown in the configuration options. By default, this might be left unset, which essentially gives you free rein to do whatever you want in Databricks. While that sounds like an enticing offer, it’s also an admin team’s worst nightmare, which is why it’s best to have proper controls in place.

If you have the proper access, you can go to the Compute tab and find the “Policies” tab. This is where any compute policies for your organization would be defined. Essentially, compute policies have a few key components:

  • A name (which you might see represented as an ID on the backend)
  • A definition: the set of restrictions (or allowances) that the policy enforces (similar in spirit to an IAM policy)
  • Permissions controlling which groups/individuals/service principals can use the policy (after all, you likely don’t want everyone using the same policy)
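
To make those components concrete, here’s a minimal sketch of creating a policy and granting a group access to it, assuming the Databricks SDK for Python (databricks-sdk); the policy name, the single rule in the definition, and the group are all placeholders rather than anything prescribed by Databricks:

```python
import json

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import iam

w = WorkspaceClient()  # authenticates from your environment or config profile

# A name plus a definition: each key in the definition constrains one cluster attribute.
policy = w.cluster_policies.create(
    name="data-eng-interactive",  # placeholder name
    definition=json.dumps(
        {"autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True}}
    ),
)

# Permissions: only principals granted CAN_USE will see the policy in the dropdown.
w.permissions.set(
    request_object_type="cluster-policies",
    request_object_id=policy.policy_id,
    access_control_list=[
        iam.AccessControlRequest(
            group_name="data-eng",  # placeholder group
            permission_level=iam.PermissionLevel.CAN_USE,
        )
    ],
)
```

The definition here fixes only a single field; a fuller example appears in the next section.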

Some more details can be found in the documentation, but with that foundation in hand, let’s dive deeper.

Compute Policies Done Right

If you’ve ever looked at a job or cluster configuration in YAML/JSON, you’ve likely seen a bunch of fields being set without it being obvious what they’re actually used for. These are very likely the settings a platform team wants to control for developer teams, so that costs don’t explode and everything is kept in check. Some examples include (a sample policy definition follows the list):

  • num_workers and autoscale.max_workers: These will help control the number of nodes a cluster can use. No need to worry about a 200-node cluster!
  • autotermination_minutes: You don’t want clusters sitting idle for two extra hours before finally terminating. Instead, this can help set a more reasonable threshold.
  • node_type_id and driver_node_type_id: If you don’t want teams to use 16xlarge instances, this is where it can be controlled. Just set an appropriate regex or specify a list of acceptable instances.
  • aws_attributes.first_on_demand: I’m of the mindset that interactive clusters should always use Spot instances, whereas on-demand should be reserved for critical job runs. If you want to put a similar methodology in place, this is where it’s done.

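Pulling those settings together, here’s a hedged sketch of a policy definition covering the attributes above; it’s the same JSON definition format passed when creating a policy, and the limits and instance types are illustrative rather than recommendations:

```python
import json

# Illustrative policy definition; limits and instance types are placeholders.
policy_definition = {
    # Cap cluster size for both fixed-size and autoscaling clusters.
    "num_workers": {"type": "range", "maxValue": 10},
    "autoscale.max_workers": {"type": "range", "maxValue": 10, "defaultValue": 4},

    # Force idle clusters to terminate within 30 minutes; hide the field in the UI.
    "autotermination_minutes": {"type": "fixed", "value": 30, "hidden": True},

    # Constrain instance types: an allowlist for workers and a regex for drivers,
    # just to show both styles.
    "node_type_id": {"type": "allowlist", "values": ["m5.xlarge", "m5.2xlarge"]},
    "driver_node_type_id": {"type": "regex", "pattern": "m5\\.(x|2x)large"},

    # With first_on_demand fixed at 0, no nodes are pinned to on-demand capacity.
    "aws_attributes.first_on_demand": {"type": "fixed", "value": 0},
}

print(json.dumps(policy_definition, indent=2))
```

Each rule uses one of the policy definition types (fixed, range, allowlist, regex, and so on), so you can decide per attribute whether to lock a value down entirely or just bound it.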
As for managing compute policies, the ideal answer would be to use IaC (one of my favorite concepts). Our platform team uses the Databricks Terraform provider to spin up all the related guardrails, and it works really well. This also enforces proper permissioning and separation of policies (separate policies for interactive vs. job clusters, for example).

Starting Points

Most of what I said above is likely overkill when you’re getting started with compute policies. Having policies, though, is far better than nothing, as it gives admins peace of mind about everything being managed through Databricks. I would suggest that, at a minimum, each team have a policy that lets them do their work while staying within the constraints teams should have in a shared platform like Databricks.

There is always room for exceptions. A team may have a legitimate need for larger instances or more of them, and with a proper use case, that’s certainly fine. This is another place where IaC makes everything easier: the exception can be defined and easily updated, with a history of why changes were made in the first place.

Conclusion

For those who are looking to run a bigger Databricks operation, give yourself a security blanket and implement compute policies as needed. You’ll be a lot better off for it, trust me.

