Connect Azure SQL DB as External Hive Metastore to Azure Databricks Spark Cluster and SQL Endpoint

Margaux Vander Plaetsen
Dec 22, 2021


Every Databricks workspace has an internal Hive metastore holding metadata. However, some scenarios require an external metastore that is shareable and independent of the workspace’s existence. This article will cover the end-to-end process of setting up an external metastore using Azure SQL Database and consequently connecting to both an Azure Databricks Spark cluster and an Azure Databricks SQL Endpoint.

1. Create an Azure SQL Database and store secrets in an Azure Key Vault

The first step in this guide is the creation of an Azure SQL Database (metastoredb). Browse to the ‘Select SQL deployment option’ page and create a single SQL database.

Azure SQL Database creation

For a full guide on how to create an Azure SQL Database, visit Create a single database — Azure SQL Database | Microsoft Docs. For this example, the Basic pricing tier and locally-redundant backup storage suffice.

In this scenario, SQL Authentication is used for the SQL Database Server. To avoid hard-coding, store the username and password as secrets in an Azure Key Vault (Quickstart — Create an Azure Key Vault with the Azure portal | Microsoft Docs).

Secret generation in Azure Key Vault

2. Create an Azure Databricks workspace and set up a secret scope

The next step is creating an Azure Databricks workspace. This can easily be done through the Azure marketplace. For a detailed process on how to set up a workspace, I refer to Quickstart — Run a Spark job on Azure Databricks Workspace using Azure portal | Microsoft Docs.


Azure Databricks SQL requires the Databricks account to be on the Premium plan. As a consequence, if you want to execute part 4, opt for the Premium pricing tier when setting up the Databricks workspace.

Now that the Databricks workspace is created, you should have a resource group that looks similar to this:

Azure resources created

In our Databricks workspace, we need to access the secrets we created in our Azure Key Vault to use the Azure SQL Database as an external metastore. This begins with creating a secret scope, which is a collection of secrets. Two types of scopes are possible: Azure Key Vault-backed and Databricks-backed. Since I stored my secrets in a Key Vault, let’s go for the Azure Key Vault-backed scope.

It is possible to use the Azure Databricks UI to create this scope. For this, go to: “https://<databricks-instance>#secrets/createScope”

Secret scope creation

The DNS Name and Resource ID can be found in the Properties tab of your Azure Key Vault.

💡
— Retrieve the secret scopes of your workspace with “dbutils.secrets.listScopes()”
— Retrieve secrets in a scope with “dbutils.secrets.list(scope = <scope-name>)”

List secret scopes with secrets in an Azure Databricks notebook (requires a cluster that is up and running)

Why a secret scope has to be created in Azure Databricks will become clear in the next section.

3. Create a Spark cluster and connect it to the external metastore

To create a Spark cluster in your Azure Databricks workspace, go to the Compute tab in the Data Science & Engineering experience and click “Create Cluster”. You will then see the following screen.

Azure Databricks cluster creation

There are many configuration options to set. To make the right decisions, I advise you to consult Configure clusters — Azure Databricks | Microsoft Docs.

For this demonstration, I focus on the Advanced Options. Microsoft provides documentation on how to connect an external metastore, which we can use to complete this section:

# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific
# options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL <mssql-connection-string>
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName <mssql-username>
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword <mssql-password>
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version <hive-version>
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars <hive-jar-source>
# Source: Set up an external metastore using the UI

Let’s start by copying this into the Spark Config box and replacing the placeholders (a short sketch follows below):

  • <mssql-connection-string>: JDBC connection string of Azure SQL DB
Azure SQL Database JDBC connection string
Initial Spark Config
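For illustration, assuming a secret scope named keyvault-managed with secrets sql-username and sql-password (the scope, secret, server, and database names here are placeholders, not the actual values), the replaced lines could look roughly like this:

# JDBC connection string of the Azure SQL Database (simplified; copy yours from the portal)
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://metastoredb-server.database.windows.net:1433;database=metastoredb
# Credentials are referenced from the secret scope instead of being hard-coded
spark.hadoop.javax.jdo.option.ConnectionUserName {{secrets/keyvault-managed/sql-username}}
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/keyvault-managed/sql-password}}

This also shows why the secret scope from the previous section is needed: the Spark config references the Key Vault secrets through the scope instead of exposing the credentials in plain text.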

This is just the basics. If you now try to create a database, you will probably run into some issues and need to troubleshoot. For instance, some other things to configure are:

  • Client with IP address ‘xx.xxx.xx.xx’ is not allowed to access the server: To enable access, we can use the Portal to create a firewall rule on our SQL Server for this IP address.
  • External metastore tables not available: Since I am using Hive 1.2.1, the following two lines need to be added to the Spark Configuration:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
Final Spark Config

In case other issues pop up, the Microsoft documentation on external Hive metastores helps with the troubleshooting.

Once everything has been set up, we can create a database (examplemetastore) and a table (orders) in an Azure Databricks notebook and see the results in the Data tab:

Database and table creation
Metastore results in Azure Databricks Data Science & Engineering
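As a minimal sketch, the notebook commands could look like the following (the orders schema is purely illustrative):

CREATE DATABASE IF NOT EXISTS examplemetastore;

CREATE TABLE IF NOT EXISTS examplemetastore.orders (
  order_id INT,
  customer STRING,
  amount DOUBLE,
  order_date DATE
);

-- Insert a sample row so the table shows some data
INSERT INTO examplemetastore.orders VALUES (1, 'Contoso', 42.50, DATE '2021-12-22');

Because the cluster points at the external metastore, this database and table metadata now lives in the Azure SQL Database rather than in the workspace-internal metastore.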

4. Create a SQL Endpoint and connect it to the external metastore

Now that we succeeded in accessing the external metastore from our Spark cluster in the Data Science & Engineering tab of Azure Databricks, the same can be done for Azure Databricks SQL endpoints. First, to create a SQL endpoint, switch to the Azure Databricks SQL experience and go to SQL endpoints in the sidebar. There you can easily create a new SQL endpoint. For a detailed description of this creation process, visit SQL endpoints — Azure Databricks — Databricks SQL | Microsoft Docs.

For this example, I will work with the Starter endpoint that is automatically available in Azure Databricks. The next step is setting up the connection between the SQL endpoint and our external metastore through the SQL Admin Console.

Go to Settings > SQL Admin Console and click on ‘SQL Endpoint Settings’. There we can paste the same configuration settings as our Spark config into the Data Access Configuration box.

SQL Endpoint Configuration

The datanucleus configurations will appear as illegal statements. We can solve this by adding “spark.hadoop.” at the start of these two statements.
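Concretely, the two datanucleus lines from the Spark config become:

spark.hadoop.datanucleus.autoCreateSchema true
spark.hadoop.datanucleus.fixedDatastore false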

After saving, start one of the SQL endpoints. Once the SQL endpoint has successfully been spun up, you can find the same database examplemetastore and table orders as we created before in the Data Explorer tab.

Metastore results in Azure Databricks SQL — Data Explorer

If you are eager to get into action, you can also directly query the table from the SQL Editor tab in Azure Databricks SQL.
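For example, a simple query on the table we created earlier:

SELECT *
FROM examplemetastore.orders
LIMIT 10;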

Metastore results in Azure Databricks SQL — SQL Editor

You may get an error mentioning that the list of databases in the hive_metastore catalog cannot be loaded. This can be caused by the firewall that is enabled on the Azure SQL Database. You can change the firewall settings on your SQL DB under Firewalls and virtual networks. One way to overcome this issue is by allowing Azure services and resources to access this server. Other solutions are whitelisting specific IP addresses or working with virtual networks, for instance by attaching Databricks and the SQL DB to the same VNet.

Time to get your hands dirty. Have fun!
