The keys to production Data & AI applications with Azure Databricks

Requirements and processes for managing infrastructure, data governance, and personas

Arun Wagle
Databricks Platform SME
10 min read · Mar 19, 2024


Introduction

As businesses increasingly prioritize data-driven and AI capabilities, leveraging the Databricks platform on Azure Cloud can expedite achieving these goals. However, navigating the complexities of building end-to-end production applications on the platform can be challenging during the initial phases of a project.

In this blog post, we delve into crafting a unified, production-grade architecture with various Databricks resources that can be adapted and expanded to suit business needs.

Production Requirements

Some of the key challenges that we faced while working with customers were:

  1. Automated infrastructure provisioning for different teams and lines of business, each with distinct roles within the organization.
  2. Data governance, auditing, and understanding data lineage across all kinds of data, including the handling of sensitive data assets.
  3. Organizing data sharing with different lines of business and business partners.
  4. A unified development experience, ease of development, managing the lifecycle of projects, and having a proper testing framework in place.
  5. Proper monitoring and logging in place.

All the above should be a repeatable process for continuously extending and building on the existing platform.

Building a Unified Platform: Methods and Practical Insights

Based on actual customer implementations, the approach to constructing this unified platform is outlined in four steps below, to be executed in order.

  1. Platform setup using the Databricks Terraform provider.
  2. Streamlined development of complex projects using Databricks Asset Bundles (DABs).
  3. Secure data sharing through Delta Sharing, enhancing collaboration and data access within the platform.
  4. Monitoring and logging.

Below is what the conceptual architecture might look like.

Conceptual Architecture

Platform Setup Using Databricks Terraform Provider

Before we can start working on Databricks platform activities, we have to complete the following prerequisites in the Azure Portal and design a few things that Terraform will leverage.

Essential Takeaways

  1. You need a valid Azure subscription.
  2. Work with your cloud engineering team to design the VNet and subnets required for the different Databricks Workspaces.
  3. Design security hardening for Workspaces per your company's security processes (see: Azure Databricks Security Checklist).
  4. Design all high-level Microsoft Entra ID (Azure AD) groups that will be synced into Databricks.
  5. Create the Azure Service Principal required for deploying the different Databricks resources.
  6. Organize Terraform scripts into different modules for ease of management.
  7. Manage Terraform state in an Azure Storage backend (a minimal configuration is sketched after this list).
  8. Manage all source code for the Terraform project in Azure Repos.
  9. Design a CI/CD process, per your organization's needs, to create the infrastructure as part of Azure Pipelines.
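
Here is a minimal sketch of a root Terraform configuration that ties several of these items together: state stored in an Azure Storage backend and an account-level Databricks provider authenticated as the deployment service principal. The resource group, storage account, container, and key names are placeholders for illustration.

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    databricks = {
      source = "databricks/databricks"
    }
  }

  # Keep Terraform state in an Azure Storage backend (placeholder names).
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "sttfstate001"
    container_name       = "tfstate"
    key                  = "databricks-platform.tfstate"
  }
}

provider "azurerm" {
  features {}
}

variable "databricks_account_id" {
  description = "Databricks account ID from the account console"
  type        = string
}

# Account-level Databricks provider, authenticated as the deployment
# service principal (e.g. via ARM_* environment variables).
provider "databricks" {
  alias      = "account"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
}
```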

Exploring Automation Capabilities of the Databricks Terraform Provider

At a minimum, leverage the Databricks Terraform provider to:

  1. Manage Databricks Workspaces
  2. Convert a user group to account admin
  3. Create a metastore
  4. Set up service principals and user groups
  5. Set up catalogs
  6. Set up cluster policies and compute resources
This will get you started with the Databricks platform for your various business use cases. You can keep extending this platform to add other resources as your needs evolve.

Let’s look at each of these now in some detail.

Manage Databricks Workspaces

This example will create Workspaces and assign the Azure Service Principal used by the Terraform script as an admin of each Workspace. Some organizations require 20–30 Workspaces in use by different teams, so leveraging an automated process is highly recommended. By utilizing the Databricks Terraform provider, teams can ensure consistency, repeatability, and reliability across their Workspace deployments. Terraform's declarative configuration enables teams to define infrastructure as code, promoting version control and auditing capabilities.
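
A minimal sketch of what such a Workspace module might contain, assuming VNet injection with the subnets designed earlier. The naming scheme, variables, and network IDs are placeholders declared elsewhere in the module; the service principal running Terraform becomes the initial Workspace admin as the creator.

```hcl
resource "azurerm_databricks_workspace" "this" {
  name                        = "dbw-${var.team}-${var.environment}"
  resource_group_name         = var.resource_group_name
  location                    = var.location
  sku                         = "premium"
  managed_resource_group_name = "rg-dbw-${var.team}-${var.environment}-managed"

  # VNet injection using the subnets designed with the cloud engineering team.
  custom_parameters {
    no_public_ip                                         = true
    virtual_network_id                                   = var.vnet_id
    public_subnet_name                                   = var.public_subnet_name
    private_subnet_name                                  = var.private_subnet_name
    public_subnet_network_security_group_association_id  = var.public_subnet_nsg_association_id
    private_subnet_network_security_group_association_id = var.private_subnet_nsg_association_id
  }

  tags = {
    team        = var.team
    environment = var.environment
  }
}
```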

Once you have run the Terraform script, log into the Admin Console and Workspace(s) to ensure everything was created successfully. The initial Databricks deployment will have the Azure Service Principal used by Terraform as account admin. This should be changed to a user group. To make this change, set up the account-level Databricks SCIM connector to sync the Azure account admin AD group to the Databricks account (see: Configure SCIM provisioning using Microsoft Entra ID). This is a one-time step, and all Azure AD groups will be synced periodically to Databricks at the account level.

Convert a User Group to Account Admin

As of this writing, we cannot assign a user group as the Databricks account admin from the web UI, hence we leverage the Databricks Terraform provider for this activity. The account admin user group will include all relevant users (typically the cloud engineering team) and the Azure Service Principal used by the Terraform scripts. Make sure to add the account admin user group to all Workspaces created above with the Workspace admin role.
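
A sketch of how this could look, assuming the Azure AD group has already been synced to the account via SCIM, the account-level provider alias defined earlier is in use, and the databricks_group_role resource accepts the account_admin role at the account level (verify against the current provider documentation).

```hcl
# Look up the SCIM-synced group at the account level (placeholder group name).
data "databricks_group" "account_admins" {
  provider     = databricks.account
  display_name = "azure-ad-databricks-account-admins"
}

# Grant the group the account admin role.
resource "databricks_group_role" "account_admin" {
  provider = databricks.account
  group_id = data.databricks_group.account_admins.id
  role     = "account_admin"
}
```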

Create a Metastore

A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. Databricks account admins can create metastores and assign them to Databricks workspaces in order to control which workloads use the metastore. This example will help you to create a new metastore.

This step only needs to be done once per Azure region. The only reason to create an additional metastore would be for Disaster Recovery (DR). If using Terraform, the Azure Service Principal will be the default owner of the metastore. Similar to the previous step, make sure to change the default metastore admin to an Azure AD user group created for managing metastores. This group will include all relevant users (typically the data admin team) and the Azure Service Principal used by the Terraform scripts. Keep in mind this is a manual step.
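
A minimal sketch of the metastore creation and Workspace assignment, assuming an ADLS Gen2 container already exists for the metastore root and the account-level provider alias from earlier; the names, region, and storage path are placeholders.

```hcl
resource "databricks_metastore" "this" {
  provider      = databricks.account
  name          = "primary-eastus"
  region        = "eastus"
  # ADLS Gen2 container used as the metastore root (placeholder path).
  storage_root  = "abfss://unity-catalog@stunitycatalog001.dfs.core.windows.net/"
  force_destroy = false
}

# Attach the metastore to each Workspace created earlier.
resource "databricks_metastore_assignment" "this" {
  provider     = databricks.account
  for_each     = var.workspace_ids # map of workspace name => numeric workspace ID
  metastore_id = databricks_metastore.this.id
  workspace_id = each.value
}
```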

Set Up Service Principals and User Groups

This example will assign all the user groups created in Azure AD and synced into Databricks (using the SCIM application created above) to the required Workspaces. Any service principals required to run jobs should also be created and added to the required Workspaces.

The script should be designed to conditionally add user groups to different Workspaces. In larger organizations with multiple lines of business, not all user groups will have access to all Workspaces; access is limited by business function and data access. For QA and production environments, the recommendation is to use only service principals to run jobs; individual user groups should not have permission to run them.
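
A sketch of how this could be wired, assuming the groups are SCIM-synced at the account level, the databricks_group data source can resolve them there, and a per-Workspace variable drives the conditional assignment. The service principal name, group names, and workspace ID are placeholders.

```hcl
# Service principal used to run jobs in QA/production (placeholder name).
resource "databricks_service_principal" "jobs_runner" {
  provider     = databricks.account
  display_name = "sp-jobs-runner-prod"
}

# Look up each SCIM-synced group that should be granted access to this Workspace.
data "databricks_group" "workspace_groups" {
  provider     = databricks.account
  for_each     = var.workspace_group_names # set of Azure AD group display names
  display_name = each.value
}

# Conditionally assign the groups to the Workspace.
resource "databricks_mws_permission_assignment" "groups" {
  provider     = databricks.account
  for_each     = data.databricks_group.workspace_groups
  workspace_id = var.workspace_id
  principal_id = each.value.id
  permissions  = ["USER"]
}
```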

Set Up Catalogs

This example is responsible for creating all the catalogs and granting permissions on securable objects to user groups. It should be designed to conditionally create catalogs and fine-grained permissions on them. For example, admin user groups can have full privileges on a catalog, while the development team only has permission to create schemas. The typical recommendation is to create a matrix of user groups with different permissions on catalogs, schemas, and other securables within Unity Catalog.

Here is a sample user-group UC permission matrix:

UserGroup-UnityCatalog permission matrix
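
As a sketch of how one row of such a matrix might translate into Terraform; the catalog name and group names are placeholders.

```hcl
resource "databricks_catalog" "sales_dev" {
  name    = "sales_dev"
  comment = "Development catalog for the sales line of business"
  properties = {
    owner_team = "sales-engineering"
  }
}

resource "databricks_grants" "sales_dev" {
  catalog = databricks_catalog.sales_dev.name

  # Admin group: full privileges on the catalog.
  grant {
    principal  = "sales-admins"
    privileges = ["ALL_PRIVILEGES"]
  }

  # Development team: can use the catalog and create schemas, nothing more.
  grant {
    principal  = "sales-developers"
    privileges = ["USE_CATALOG", "CREATE_SCHEMA"]
  }

  # Analysts: read-only access.
  grant {
    principal  = "sales-analysts"
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}
```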

Set Up Cluster Policies and Compute Resources

This script is responsible for creating all the compute policies in different environments and for creating compute resources based on those policies. It also grants permissions to different user groups and service principals based on the specific policies and resources.

Compute resource creation for job execution is controlled with Databricks Asset Bundles, which we will see in the next section. For teams that require dedicated compute for data exploration, ad hoc data analysis, or AI/ML work, create the compute clusters in advance with the Terraform script. In all cases, make sure that cluster creation is governed by environment-specific policies to control cost and aid in cost attribution.
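
A sketch of an environment-scoped policy, a cluster created from it, and the corresponding permissions; the node types, limits, tags, and group names are placeholders.

```hcl
# Policy that caps cluster size and enforces auto-termination in dev.
resource "databricks_cluster_policy" "dev_exploration" {
  name = "dev-data-exploration"
  definition = jsonencode({
    "autotermination_minutes" = { "type" = "range", "maxValue" = 60, "defaultValue" = 30 }
    "node_type_id"            = { "type" = "allowlist", "values" = ["Standard_DS3_v2", "Standard_DS4_v2"] }
    "autoscale.max_workers"   = { "type" = "range", "maxValue" = 4 }
    "custom_tags.cost_center" = { "type" = "fixed", "value" = "sales-dev" }
  })
}

data "databricks_spark_version" "lts" {
  long_term_support = true
}

# Shared exploration cluster created from the policy.
resource "databricks_cluster" "dev_exploration" {
  cluster_name            = "dev-sales-exploration"
  spark_version           = data.databricks_spark_version.lts.id
  node_type_id            = "Standard_DS3_v2"
  policy_id               = databricks_cluster_policy.dev_exploration.id
  autotermination_minutes = 30
  autoscale {
    min_workers = 1
    max_workers = 4
  }
}

# Let the development group use the policy and restart/attach to the cluster.
resource "databricks_permissions" "policy_usage" {
  cluster_policy_id = databricks_cluster_policy.dev_exploration.id
  access_control {
    group_name       = "sales-developers"
    permission_level = "CAN_USE"
  }
}

resource "databricks_permissions" "cluster_usage" {
  cluster_id = databricks_cluster.dev_exploration.id
  access_control {
    group_name       = "sales-developers"
    permission_level = "CAN_RESTART"
  }
}
```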

CI/CD for Terraform Projects using Azure Pipelines

All the Terraform setup should be managed as part of a CI/CD process. A typical flow for creating resources in Databricks involves the following steps:

  1. All resources should be created from the master branch.
  2. Development teams that need to create resources should create a feature branch of the project.
  3. Manage the inputs to the scripts in an environments folder.
  4. Create a pull request with a proper description of what resources need to be created and which files are part of the changes.
  5. Assign the pull request to at least two approvers, one from the development team and one from the DevOps team, to review the changes.
  6. On approval and merge into the master branch, trigger the respective Azure DevOps pipeline (a minimal pipeline sketch follows this list).
  7. Leverage Azure DevOps best practices such as branch policies.
  8. Update the release notes at the bottom of the page with the dates and changes.
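
A minimal azure-pipelines.yml sketch of such a pipeline, assuming provider and backend credentials are supplied through a variable group and the Terraform code lives in a terraform folder; the variable group name, folder, and tfvars file are placeholders.

```yaml
trigger:
  branches:
    include:
      - master

pool:
  vmImage: ubuntu-latest

variables:
  - group: databricks-terraform-secrets   # placeholder variable group with ARM_* credentials

steps:
  - script: |
      terraform init
      terraform validate
    displayName: Init and validate
    workingDirectory: terraform

  - script: terraform plan -var-file=environments/prod.tfvars -out=tfplan
    displayName: Plan
    workingDirectory: terraform

  - script: terraform apply -auto-approve tfplan
    displayName: Apply
    workingDirectory: terraform
```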

Streamlining Development of Complex Projects Using Databricks Asset Bundles (DABs)

Databricks Asset Bundles (DABs) are a new tool for streamlining the development of complex data, analytics, and ML projects for the Databricks platform. Bundles make it easy to manage complex projects during active development by providing CI/CD capabilities in your software development workflow with a single concise and declarative YAML syntax that works with the Databricks CLI.

By using bundles to automate your project's tests, deployments, and configuration management, you can reduce errors while promoting software best practices across your organization through templated projects. Here are some sample projects that use DABs to manage resource and job configurations: DAB-examples
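
A minimal databricks.yml sketch of a bundle with one job and dev/prod targets; the workspace URLs, service principal ID, notebook path, and node type are placeholders.

```yaml
bundle:
  name: sales_etl

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net   # placeholder dev workspace
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net   # placeholder prod workspace
    run_as:
      service_principal_name: 00000000-0000-0000-0000-000000000000   # jobs run as a service principal

resources:
  jobs:
    nightly_sales_etl:
      name: nightly_sales_etl
      tasks:
        - task_key: transform
          notebook_task:
            notebook_path: ./src/transform.ipynb
          job_cluster_key: etl_cluster
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
```

From the project root, databricks bundle validate, databricks bundle deploy -t dev, and databricks bundle run -t dev nightly_sales_etl cover the typical develop, deploy, and test loop.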

Driving Project Efficiency with Databricks Asset Bundles

An execution strategy that will ensure project success with Databricks Asset Bundles involves the following:

  1. Engineering teams are responsible for setting up the DAB projects.
  2. Create a custom DAB template suited to your organization's needs; all teams should create their projects from this template for consistent configurations.
  3. Leverage Azure Repos for source code management.
  4. Leverage Azure DevOps pipelines for the CI/CD build and deploy process, with unit and integration test cases to prevent and catch errors (see: Run a CI/CD workflow with DAB).

Navigating Data Security: Exploring Different Strategies

Organizations have many policies in place when it comes to data security, but here we will focus on a few strategies which should be considered.

Handle sensitive data by filtering out rows and masking columns. If you are working with sensitive PII or PHI data, there are several strategies to ensure data security.

In the short term, use dynamic views for row-level filtering and column-level masking.

In the medium term, you can filter sensitive data using row filters and column masks directly on tables. There are certain limitations to this approach, hence we recommend it as a medium-term strategy. For example, materialized views and streaming tables in Delta Live Tables do not support row filters or column masks, and Delta Sharing and Delta Lake time travel do not work with row-level security or column masks.
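
A sketch of both approaches in Unity Catalog SQL, using group-membership functions for the access decision; the catalog, schema, table, column, and group names are placeholders.

```sql
-- Short term: a dynamic view that filters rows and masks a column based on
-- group membership (is_member() checks workspace-local groups; use
-- is_account_group_member() for account-level groups in Unity Catalog).
CREATE OR REPLACE VIEW sales.default.customers_restricted AS
SELECT
  customer_id,
  region,
  CASE
    WHEN is_account_group_member('pii-readers') THEN email
    ELSE '***REDACTED***'
  END AS email
FROM sales.default.customers
WHERE is_account_group_member('pii-readers') OR region = 'US';

-- Medium term: a row filter and column mask applied directly to the table.
CREATE OR REPLACE FUNCTION sales.default.us_only(region STRING)
RETURN is_account_group_member('pii-readers') OR region = 'US';

CREATE OR REPLACE FUNCTION sales.default.mask_email(email STRING)
RETURN CASE
  WHEN is_account_group_member('pii-readers') THEN email
  ELSE '***REDACTED***'
END;

ALTER TABLE sales.default.customers SET ROW FILTER sales.default.us_only ON (region);
ALTER TABLE sales.default.customers ALTER COLUMN email SET MASK sales.default.mask_email;
```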

In the long term, Databricks continues to innovate with Unity Catalog on a regular basis, so it is worth reviewing the release notes to see how new features may improve your data security strategy.

Other strategies to be implemented are:

Strong access control is emphasized through the implementation of user groups and roles, allowing for granular control over permissions within Databricks.

Leverage the “is_member” function to determine if the current user belongs to specific groups, particularly those handling sensitive PII/PHI data.

Isolation of the environment is achieved through platform-level security measures such as VPNs, Azure Private Link, and Restricted IP Lists. Adherence to Azure Databricks Security best practices is essential, including the creation of separate workspaces and cluster policies tailored for the handling of PII/PHI data, ensuring data integrity and confidentiality.

Audit logging is recommended to track and monitor activities within the environment. Leveraging Databricks Audit Logs alongside Cloud Storage Access Logs, Cloud Provider Activity Logs, and Virtual Network Traffic Flow Logs provides comprehensive visibility into user actions and system events.

Additionally, Databricks Lakehouse Monitoring tools aid in monitoring the health of the environment, while the Security Analysis Tool enables proactive identification and mitigation of security risks. These practices collectively enhance your security posture and ensure compliance with data protection regulations.

Databricks Lakehouse Monitoring

Image taken from https://docs.databricks.com/en/lakehouse-monitoring/index.html

Enable Secure Data Sharing through Delta Sharing, Enhancing Collaboration and Data Access within the Platform

One of the requirements many organizations have is sharing data with different lines of business and partners to maximize the value of their data. Delta Sharing meets this requirement in a simple way.

What is Delta Sharing?

Delta Sharing is an open protocol for secure data sharing with other organizations regardless of which computing platforms they use. It can share collections of tables in a Unity Catalog metastore in real time without copying them, so that data recipients can immediately begin working with the latest version of the data.
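
Shares and recipients can be provisioned with the same Terraform setup described earlier. A sketch, with the share, table, and recipient names as placeholders:

```hcl
# A share that exposes one Unity Catalog table without copying it.
resource "databricks_share" "sales" {
  name = "sales_partner_share"
  object {
    name             = "sales_prod.reporting.daily_orders"
    data_object_type = "TABLE"
  }
}

# An open-sharing recipient (token-based, for partners not on Databricks).
resource "databricks_recipient" "partner" {
  name                = "acme_partner"
  authentication_type = "TOKEN"
}

# Grant the recipient read access to the share.
resource "databricks_grants" "partner_access" {
  share = databricks_share.sales.name
  grant {
    principal  = databricks_recipient.partner.name
    privileges = ["SELECT"]
  }
}
```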

Delta Sharing

Image taken from https://www.databricks.com/product/delta-sharing

Conclusion

This article focused on streamlining platform deployment and enhancing development workflows for Databricks on Azure Cloud. We simplified the setup and configuration process of secure environments (including data security), and established automated testing and deployment using tools like Terraform and Databricks Asset Bundles. We enhanced collaboration and data sharing through Delta Sharing, facilitating efficient utilization of data resources.

By prioritizing security, efficiency, collaboration, and scalability, the architecture discussed here is capable of adapting to evolving business needs and technological advancements.

Call to Action

Begin Your Journey: Start leveraging Databricks on Azure to empower your organization with data-driven insights and AI capabilities.

Learn More: Dive deeper into the platform setup and development strategies outlined to fully grasp how to unleash the potential of Databricks.

Implement Automation: Explore Terraform automation and Azure Pipelines to streamline deployment and development processes, enhancing efficiency and reducing time-to-market for your projects.

Embrace Collaboration: Prioritize collaboration and data sharing within your organization by implementing Delta Sharing for secure and efficient data exchange among teams.

Stay Agile: Continuously adapt and refine your architecture and workflows to remain ahead of the curve, effectively addressing emerging challenges and opportunities in the dynamic data and AI landscape.
