Azure Databricks Setup: A deep dive into every option🕵️
I was working with a customer and their security team wanted to know everything about the Azure Databricks workspace setup.
I thought my musings would be useful and worth sharing with all of you!
I’ll be focusing on Azure Databricks but the concepts and fundamentals should apply to AWS, GCP and Terraform scripts!
The first place to start is to ensure that everyone is on the same page. That page is the Azure Databricks Administration Overview which can be found here: Azure Databricks administration introduction — Azure Databricks | Microsoft Learn
Here’s an excerpt from the link (but please read it fully when you get a chance):
To enable the account console and establish your first account admin, you’ll need to engage someone who has the Microsoft Entra ID (formerly Azure Active Directory) Global Administrator role. For security purposes, only someone with the Microsoft Entra ID Global Administrator role has permissions to assign the first account admin role. After completing these steps, you can remove the Global Administrator from the Azure Databricks account.
The Global Administrator should use the following instructions:
1. Sign into your Azure Portal with your Global Admin credentials.
2. Go to accounts.azuredatabricks.net and sign in with Microsoft Entra ID. Azure Databricks automatically creates an account admin role for you.
3. Click User management.
4. Find and click the username of the user you want to delegate the account admin role to.
5. On the Roles tab, turn on Account admin.
Once another user has the account admin role, the Microsoft Entra ID Global Administrator no longer needs to be involved. The new account admin can remove the Global Administrator from the Azure Databricks account and assign other users the account admin role.
Once you have the correct rights, search the Azure Marketplace for Azure Databricks and click Create. This takes you to the Basics page:
Project Details:
- Subscription: This dropdown allows you to select the Azure subscription under which the Databricks workspace will be billed and managed.
- Resource group: Here, you can either choose an existing resource group or create a new one. Resource groups are used to organize Azure resources into collections, which can be managed as a single entity.
Instance Details:
- Workspace name: This field is where you enter the desired name for your Databricks workspace. Typical example names could be dev, staging, or prod.
- Region: The dropdown menu lets you select the geographic region where your workspace will be located, which can affect latency and availability. You should also note the capabilities of Azure Databricks that you will immediately need. For example, if you definitely want to use Databricks Model Serving then you need to review this page Azure Databricks regions — Azure Databricks | Microsoft Learn as Model Serving is not yet available in any UK region!
- Pricing Tier: This dropdown lets you select the pricing tier for your workspace. The “Premium” tier is pre-selected and includes role-based access controls among other features. You can change this to the “Standard” tier based on your specific needs and budget, but “Premium” is recommended for enterprise environments. Please refer to the pricing page for full details: Azure Databricks Pricing | Microsoft Azure
- Managed Resource Group: This field lets you enter a name for your managed resource group. Azure will then create a resource group dedicated to managing the resources for your Databricks workspace. This managed resource group will have a “super lock” on it to ensure that key Azure Databricks resources are not accidentally changed or deleted.
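The Basics choices above map fairly directly onto an ARM-style resource definition. As a minimal sketch in Python (standard library only), the function below builds such a definition as a dictionary. The subscription ID, workspace name, region and the `-managed-rg` naming convention are placeholders of my own. The property names (`sku.name`, `managedResourceGroupId`) and the API version reflect my understanding of the `Microsoft.Databricks/workspaces` schema, so please check them against the current template reference before relying on them.

```python
import json


def build_workspace_resource(subscription_id, workspace_name, region,
                             pricing_tier="premium",
                             managed_resource_group=None):
    """Return an ARM-style dict for the choices made on the Basics page."""
    if pricing_tier not in ("standard", "premium"):
        raise ValueError(f"unexpected pricing tier: {pricing_tier!r}")
    # Hypothetical naming convention used when no name is supplied.
    if managed_resource_group is None:
        managed_resource_group = f"{workspace_name}-managed-rg"
    return {
        "type": "Microsoft.Databricks/workspaces",
        "apiVersion": "2024-05-01",  # assumption: verify against the reference
        "name": workspace_name,
        "location": region,
        "sku": {"name": pricing_tier},
        "properties": {
            # Assumed property name for the managed resource group.
            "managedResourceGroupId": (
                f"/subscriptions/{subscription_id}"
                f"/resourceGroups/{managed_resource_group}"
            ),
        },
    }


resource = build_workspace_resource(
    subscription_id="00000000-0000-0000-0000-000000000000",  # placeholder
    workspace_name="dev",
    region="uksouth",
)
print(json.dumps(resource, indent=2))
```

Keeping the definition as plain data like this makes it easy to diff the dev, staging and prod workspaces against each other during a security review.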
Now, onto the Networking tab:
- Enable Secure Cluster Connectivity (aka No Public IP): By toggling this option, you can decide whether to assign public IP addresses to the virtual machines (VMs) in your clusters. Disabling public IP addresses can increase security by ensuring that VMs are not directly reachable from the internet. With secure cluster connectivity enabled, customer virtual networks have no open ports and Databricks Runtime cluster nodes have no public IP addresses. I recommend setting this to Yes!
- Deploy Azure Databricks workspace in your own Virtual Network (VNet): This toggle switch allows you to choose whether to deploy the workspace within an Azure VNet. Using your own VNet can provide better control over your network’s security and isolation. I recommend setting this to Yes!
- Virtual Network: If you choose to deploy within a VNet, this dropdown field is where you would select the identifier for your VNet.
- Public Subnet Name: Here, you specify the name of the public subnet (the host subnet in Azure Databricks terms). Despite the name, with secure cluster connectivity enabled the cluster nodes in this subnet have no public IP addresses.
- Public Subnet CIDR Range: This is the CIDR range to be used for the public subnet you’re using within your VNet.
- Private Subnet Name: This field is for the name of the private subnet (the container subnet in Azure Databricks terms), which is used for internal cluster communication and is not accessible from the public internet.
- Private Subnet CIDR Range: Similar to the public subnet CIDR, this is the CIDR range to be used for the private subnet within your VNet.
- Allow Public Network Access: You can connect to your Databricks workspace either publicly, via public IP addresses, or privately, using a private endpoint. For most deployments this will be set to Enabled (to use Public Network access), and for highly secure environments it should be set to Disabled (to ensure Private Endpoints are used).
- Required NSG Rules: If allowing Public Network Access then set this to All Rules.
- Private Endpoints: This setting allows you to create private endpoints for your workspace, which are network interfaces that connect you privately and securely to Azure services. You can choose to enable them for all subnets or add them to a specific private endpoint. For highly secure environments private endpoints must be configured appropriately. This option will not be available if you enabled Public Network access! Learn more here: What is a private endpoint? — Azure Private Link | Microsoft Learn
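Before filling in the two subnet CIDR ranges, it can help to sanity-check them. The sketch below uses Python's standard `ipaddress` module to confirm that both ranges sit inside the VNet's address space and do not overlap each other. The function name and the example ranges are my own, and the check does not cover any Azure Databricks rules on subnet size or delegation.

```python
import ipaddress


def check_databricks_subnets(vnet_cidr, public_cidr, private_cidr):
    """Return a list of problems found with the two subnet ranges."""
    vnet = ipaddress.ip_network(vnet_cidr)
    public = ipaddress.ip_network(public_cidr)
    private = ipaddress.ip_network(private_cidr)

    problems = []
    # Both subnets must fall inside the VNet address space.
    for label, subnet in (("public", public), ("private", private)):
        if not subnet.subnet_of(vnet):
            problems.append(f"{label} subnet {subnet} is outside VNet {vnet}")
    # The two subnets must not share any addresses.
    if public.overlaps(private):
        problems.append(
            f"public subnet {public} overlaps private subnet {private}"
        )
    return problems


# Example ranges are illustrative only.
ok = check_databricks_subnets("10.10.0.0/16", "10.10.1.0/24", "10.10.2.0/24")
bad = check_databricks_subnets("10.10.0.0/16", "10.10.1.0/24", "10.10.1.128/25")
print(ok)
print(bad)
```

An empty list means both checks passed. The second call reports an overlap because 10.10.1.128/25 is the upper half of 10.10.1.0/24.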
And now the Encryption tab:
Data Encryption
This section allows users to add an additional layer of control over their data by using their own encryption keys. This is particularly important for managing security and compliance.
- Managed Disks: Users can opt to use their own key for encrypting managed disks. This is an irreversible action, meaning once you enable customer-managed key encryption, it cannot be disabled, but the key and key vault can be updated.
- Managed Services: Similar to managed disks, users can use their own key for services managed by Azure Databricks.
Double Encryption for DBFS root
Azure Databricks DBFS root can be encrypted with a second layer of encryption, known as infrastructure encryption, using a platform-managed key. This provides double encryption for added security.
- Enable Infrastructure Encryption: This option allows users to enable the second layer of encryption for the DBFS root. It’s important to note that this feature cannot be changed after the workspace is created.
And now the Security & Compliance settings:
Enhanced Security & Compliance Add-On
This add-on simplifies meeting security and regulatory requirements. It’s particularly useful for organizations that need to adhere to strict compliance standards.
- Enable Compliance Security Profile: When enabled, this feature activates additional monitoring, a hardened compute image, and other controls. It’s designed to help workspaces meet certain compliance standards, such as PCI-DSS for payment card data and HIPAA for health information.
- Enable Enhanced Security Monitoring: This option, when turned on, enables security monitoring agents that generate logs for review. It’s an important feature for maintaining visibility over the security status of your workspace.
- Enable Automatic Cluster Update: This feature ensures that clusters are automatically updated and restarted during a configured maintenance window to apply the latest updates.
It’s important to note that once these features are enabled, they cannot be disabled. This irreversible action underscores the commitment to maintaining a high-security standard within the workspace.
The Enhanced Security and Compliance Add-On costs extra. Please see the Databricks Platform and Add-Ons pricing page for details.
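Several of the Encryption and Security & Compliance options are one-way doors, so it is worth listing them before you click Create. The sketch below is one way to do that in Python. The setting labels are hypothetical names I made up for this example (they are not portal or API identifiers), and the notes simply restate what is described in this article.

```python
# Labels below are hypothetical names for this sketch, not portal or API fields.
IRREVERSIBLE = {
    "managed_disk_cmk": (
        "cannot be disabled once enabled "
        "(the key and key vault can still be updated)"
    ),
    "infrastructure_encryption": "cannot be changed after the workspace is created",
    "compliance_security_profile": "cannot be disabled once enabled",
    "enhanced_security_monitoring": "cannot be disabled once enabled",
    "automatic_cluster_update": "cannot be disabled once enabled",
}


def irreversible_choices(selected):
    """Return warnings for enabled settings that cannot be undone later."""
    return [
        f"{name}: {IRREVERSIBLE[name]}"
        for name, enabled in selected.items()
        if enabled and name in IRREVERSIBLE
    ]


planned = {
    "managed_disk_cmk": True,
    "managed_services_cmk": True,  # not described as irreversible in this article
    "infrastructure_encryption": False,
    "compliance_security_profile": True,
}
for warning in irreversible_choices(planned):
    print("Review before clicking Create ->", warning)
```

A list like this is a handy thing to hand to a security team, because it separates the decisions that can be revisited later from the ones that cannot.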
Next up is the Tags page:
- Key-Value Pairs: Each tag consists of a key and a value. For example, you might have a tag with a key of “Environment” and a value of “Production”. This would help you quickly identify all resources that are part of your production environment.
- Logical Organization: Tags can be used to categorize resources by department, project, environment, or any other criteria that make sense for your organization. This makes it easier to manage and locate resources within the Azure portal.
- Billing and Cost Management: Tags can also be used to track costs. By applying tags, you can filter billing reports to see how much is being spent on different projects, departments, or environments.
- Security: While tags are stored as plain text and should never contain sensitive information, they can be used to apply governance and compliance controls across your resources.
- Access Control: You can control who has the ability to add or modify tags by assigning specific Azure roles, such as the Tag Contributor role, which allows tagging without granting full access to the resource itself.
- Automation: Tags can be used in automation scripts to perform actions on a set of resources that share the same tag.
- Inheritance: Resources do not inherit tags from their resource group or subscription by default. You can use Azure Policy to apply a resource group’s tags to the resources within it, making it easier to manage tags across multiple resources.
You should be following these Azure best practices for naming and tagging: Resource naming and tagging decision guide - Cloud Adoption Framework | Microsoft Learn
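To make the cost-tracking point concrete, here is a minimal Python sketch that groups spend by the value of a tag. The resource names, tag values and cost figures are invented for illustration. In practice the data would come from a cost export rather than a hard-coded list.

```python
from collections import defaultdict

# Invented sample data: resources with their tags and a monthly cost.
resources = [
    {"name": "dbw-dev", "tags": {"Environment": "Development"}, "monthly_cost": 120.0},
    {"name": "dbw-prod", "tags": {"Environment": "Production"}, "monthly_cost": 900.0},
    {"name": "st-prod", "tags": {"Environment": "Production"}, "monthly_cost": 100.0},
    {"name": "vm-scratch", "tags": {}, "monthly_cost": 40.0},
]


def cost_by_tag(items, tag_key, untagged_label="(untagged)"):
    """Sum monthly cost per value of the given tag key."""
    totals = defaultdict(float)
    for item in items:
        value = item.get("tags", {}).get(tag_key, untagged_label)
        totals[value] += item["monthly_cost"]
    return dict(totals)


totals = cost_by_tag(resources, "Environment")
for value, cost in sorted(totals.items()):
    print(f"{value}: {cost:.2f}")
```

The "(untagged)" bucket is deliberate: untagged resources are usually the first thing worth chasing when tagging is used for cost management.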
The final page is the Review summary:
Review your settings, download automation templates, and if/when appropriate click Create. Eventually you will be able to launch your new workspace!
There’s a lot more to Azure Databricks and I would recommend that you read “Security Best Practices for Azure Databricks” to learn more. The latest version (1.1.1 — June 23 2023) of the PDF can be found here: Azure Databricks — Security Best Practices and Threat Model
Please note the opinions above are my own and not necessarily those of my employer. This blog article is intended to generate discussion and dialogue with the audience. If I have inadvertently hurt your feelings in any way, then I’m sorry.