Sharing External Hive-Metastore Across Jupyter on Kubernetes, Databricks and Azure Synapse

Ziang Jia
Sep 14, 2022


The Data Lakehouse architecture was clearly endorsed by Databricks in 2020, and other vendors are now developing their own go-to-market data lakehouse platforms, such as Snowflake, Oracle OCI, Cloudera CDP, Azure Synapse, and Google BigLake. Nevertheless, the fundamental open-source projects behind the scenes are not new at all: Delta Lake, Apache Iceberg, and Apache Hudi. As a result, data lakehouses built on the same open-source tool are more "natively interchangeable".

With the help of a data lake, organizations can keep their Delta tables in centralized cloud storage and plug the data into the computing platform of their choice. However, users may find it annoying to lose visibility into metadata and catalogs when switching from one platform to another. As a result, they either need to migrate their metadata or rebuild the metastore from external tables.

In this tutorial, I will discuss and demonstrate some interesting discoveries about how to share a hive-metastore between Jupyter Notebook, Databricks, and Azure Synapse Analytics. This allows administrators to plug their persistent hive-metastore directly into each vendor's platform and preserve all the metadata during migration.

Prerequisites

If you have read my previous blog, DIY MPP Platform: Integrate JupyterHub with Apache Spark on Kubernetes, you may already have JupyterHub set up and be developing autoscaling Spark applications with it. However, only a few steps from that article are required to complete this tutorial. If you do not want to go through it, set up JupyterHub following the official guide, as we will need it here.

I will use Azure as the example in this article. I suggest going through the topics and training below before proceeding to the next step.

  1. Azure Foundation Training
  2. Databricks Lakehouse Foundation
  3. Introduction to Synapse Analytics
  4. DIY MPP Platform: Integrate JupyterHub with Apache Spark on Kubernetes. This is optional if you only want to share the metastore between Databricks and Synapse Analytics.

The Architecture

Metastore

According to the Hive documentation, there are several database options for the metastore's external database. I tried many of them and decided to use SQLServer as the RDBMS for Hive, as it is the most straightforward setup. Feel free to explore other options such as MySQL or Oracle if needed.

One can simply create a managed SQLServer instance in Azure through the Azure SQL Database service. Note the connection credentials, as we will use them in the next section.
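As a minimal sketch, the server and database can also be provisioned with the Azure CLI. The resource group name, location, and service objective below are placeholders; the server and database names match the connection string used later in this article.

az sql server create \
  --name databricks-poc-v1-sql-server \
  --resource-group my-resource-group \
  --location eastus2 \
  --admin-user <USERNAME> \
  --admin-password <PASSWORD>

az sql db create \
  --resource-group my-resource-group \
  --server databricks-poc-v1-sql-server \
  --name hive_metastore \
  --service-objective S0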

Before we move on, we need to create the metastore tables in SQLServer manually, because hive-metastore no longer creates its schema automatically after version 2.0. Whatever the reasoning behind that decision, we can simply retrieve the database init scripts shipped with Hive. In this tutorial I use Hive 2.3.9. If we download that version from Maven, the script is located at /scripts/metastore/upgrade/mssql/hive-schema-2.3.0.mssql.sql. Run this script against the SQLServer database using any IDE you like and all the necessary tables will be created.
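For example, a sketch of running the script with sqlcmd, assuming the server firewall allows your client IP and the script sits at the path above inside the extracted Hive distribution:

sqlcmd -S databricks-poc-v1-sql-server.database.windows.net -d hive_metastore \
  -U <USERNAME> -P <PASSWORD> \
  -i scripts/metastore/upgrade/mssql/hive-schema-2.3.0.mssql.sql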

RDBMS for Hive

Platform Configurations

In this section, we will configure all three platforms — JupyterHub on Kubernetes, Databricks, and Synapse Analytics to plug in the external hive-metastore created in the section above.

JupyterHub

An open-source platform is always an option. Plugging an external hive-metastore into an Apache Spark cluster running on Kubernetes is a bit complicated, but the hands-on experience helps us understand the multi-layer DevOps dependencies that platforms such as Databricks or Synapse Analytics try to encapsulate and hide from their end users. I will skip the steps covered in my previous article and only list the additional steps here.

Firstly, we know that hive-metastore support is a dependency of Apache Spark, but it is not shipped with the Spark runtime by default. We need to build our own Spark distribution and include Hive as a dependency.

./dev/make-distribution.sh --name with-hive-but-not-hadoop --pip --tgz -Phive -Phive-thriftserver -Pkubernetes -DskipTests -Phadoop-provided

We will use this customized Spark distribution to build the Spark runtime image that JupyterHub will use on the Azure Kubernetes cluster.

Secondly, because we use SQLServer as our external hive-metastore database, the Spark runtime must be able to connect to SQLServer. This can be done by including spark-mssql-connector in our runtime image. Simply add these steps to the Dockerfile.

ENV MSSQL 1.2.0
ADD https://repo1.maven.org/maven2/com/microsoft/azure/spark-mssql-connector_2.12/${MSSQL}/spark-mssql-connector_2.12-${MSSQL}.jar /opt/spark/jars
RUN chmod 644 /opt/spark/jars/spark-mssql-connector_2.12-${MSSQL}.jar

Finally, we need to tell Spark to connect to the external hive-metastore when a session is requested by users. Add these configurations to the spark-defaults.conf file.

spark.sql.hive.metastore.version                              2.3.9
spark.sql.hive.metastore.jars                                 builtin
spark.hadoop.javax.jdo.option.ConnectionURL                   jdbc:sqlserver://databricks-poc-v1-sql-server.database.windows.net:1433;database=hive_metastore;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;
spark.hadoop.javax.jdo.option.ConnectionDriverName            com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionUserName              <USERNAME>
spark.hadoop.javax.jdo.option.ConnectionPassword              <PASSWORD>
spark.sql.hive.metastore.schema.verification                  false
spark.sql.hive.metastore.schema.verification.record.version   false

Build the Spark runtime image and push it to the Container Registry; the next time we start a Spark session, JupyterHub on Kubernetes will load this runtime image.
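A sketch of that build-and-push step, assuming an Azure Container Registry named myregistry (a placeholder) and the Dockerfile above in the current directory; the image tag is illustrative:

docker build -t myregistry.azurecr.io/spark-hive-runtime:2.3.9 .
az acr login --name myregistry
docker push myregistry.azurecr.io/spark-hive-runtime:2.3.9

The JupyterHub/Spark-on-Kubernetes setup from the previous article would then reference this image (for example via spark.kubernetes.container.image) so new sessions pick it up.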

Databricks

To set up an external hive-metastore in Databricks, the steps in the official guide are all we need. Simply create an init-script.sh like the one below.

#!/bin/sh
source /etc/environment
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
  "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://databricks-poc-v1-sql-server.database.windows.net:1433;database=hive_metastore;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
  "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<USERNAME>"
  "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<PASSWORD>"
  "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  "spark.sql.hive.metastore.version" = "2.3.9"
  "spark.sql.hive.metastore.jars" = "builtin"
}
EOF

There are two ways to inject this script into a Databricks Runtime: (A) cluster init scripts or (B) global init scripts. A cluster init script plugs in our external hive-metastore only for the clusters it is attached to, so other clusters are not affected. This also means we can segment users' visibility into data assets by plugging different external metastores into different clusters. A global init script, on the other hand, plugs one universal external metastore into every cluster users create, which is another way to manage users' visibility into data assets. A sketch of deploying a cluster init script follows below.

Option A: cluster init script
Option B: global init scripts
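As a sketch of option A, assuming the (legacy) Databricks CLI is configured against the workspace, the script can be uploaded to DBFS and referenced from the cluster definition; the DBFS path is just a placeholder.

databricks fs cp init-script.sh dbfs:/databricks/scripts/external-metastore-init.sh

Then reference it in the cluster spec (via the Clusters API or the cluster UI's Init Scripts tab):

"init_scripts": [
  { "dbfs": { "destination": "dbfs:/databricks/scripts/external-metastore-init.sh" } }
]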

Azure Synapse Analytics

Configuring a Synapse Analytics workspace to connect to the hive-metastore is also very simple. Note that Synapse Analytics is still under active development, so the official guide may or may not work. If you follow the steps and it does not work properly, fall back to the "native Apache Spark configuration" and everything should work fine.

Option A: native Apache Spark configuration
Option B: linked service configuration
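For option A, a minimal sketch of the Spark configuration to attach to the Apache Spark pool, reusing the same connection properties as the spark-defaults.conf shown earlier (credentials remain placeholders; additional settings may be needed depending on the Synapse runtime version):

spark.sql.hive.metastore.version                     2.3.9
spark.hadoop.javax.jdo.option.ConnectionURL          jdbc:sqlserver://databricks-poc-v1-sql-server.database.windows.net:1433;database=hive_metastore;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;
spark.hadoop.javax.jdo.option.ConnectionDriverName   com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionUserName     <USERNAME>
spark.hadoop.javax.jdo.option.ConnectionPassword     <PASSWORD>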

Interchangeable Computing Engine

Now it is time for testing! First, we create a database called vaccination in a Jupyter notebook and register some Kaggle medical datasets in it.

When initiating the Spark session, we should explicitly call enableHiveSupport() to tell Spark to load the Hive configurations; otherwise, it would ignore them.

import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
import os
# Config Spark session
conf = SparkConf()
conf.set("spark.app.name","spark-test")
conf.set("spark.driver.host", os.environ["MY_POD_IP"])
conf.set("spark.submit.deployMode", "client")
conf.set("spark.executor.cores", "1")
conf.set("spark.executor.memory", "3g")
conf.set("spark.executor.instances", "2")
# Create Spark context
spark = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()

We use the sparksql magic to write SQL queries directly in the notebook. This open-source tool (https://github.com/cryeo/sparksql-magic) provides functionality similar to the SQL magic shipped with Databricks notebooks.
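The extension has to be installed and loaded before the cell magic is available; per the repository's README this is a plain pip install followed by loading the extension in the notebook:

pip install sparksql-magic

%load_ext sparksql_magic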

%%sparksql
create database if not exists vaccination
comment "Proof of Concept database for covid-19 vaccination project"
location "abfss://vaccination@zus1kaggledataadls.dfs.core.windows.net/warehouse";

Create some tables from the csv files in the staging zone.

spark.read.format("csv")\
.load("abfss://vaccination@zus1kaggledataadls.dfs.core.windows.net/staging/supplier-procedure-summary/*csv")\
.write.format("delta")\
.mode("overwrite")\
.saveAsTable("vaccination.supplier_procedure_summary")

%%sparksql
show tables in vaccination

If we describe the database and list its tables, we can see the table we just created, which means the metadata has been registered in the external metastore.
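For example, a quick check with the same magic shows the comment and warehouse location we registered when creating the database:

%%sparksql
describe database extended vaccination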

Jupyter Notebook

Next, we log in to Databricks to see whether the same database and table are available in our workspace. In Azure Databricks, the administrator can enable AAD passthrough on the cluster so that users who have access to the underlying ADLS container "vaccination" can access this table without configuring an access key and secret.
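A quick sanity check in a Python cell of the Databricks notebook might look like this (a sketch; it assumes the cluster was started with the init script above and that AAD passthrough grants access to the container):

# List the tables registered in the shared metastore
display(spark.sql("SHOW TABLES IN vaccination"))

# Read a few rows of the Delta table created from JupyterHub
display(spark.table("vaccination.supplier_procedure_summary").limit(5))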

Azure Databricks Notebook

Databricks recently released Unity Catalog. Since the external metastore we configured only replaces the local hive_metastore, it can co-exist in the data explorer with the catalogs shared through Unity Catalog.

external hive-metastore in Unity Catalog view

Similarly, users can access the same "vaccination" database and "supplier_procedure_summary" table in Azure Synapse Analytics. Note that this metastore is only accessible through the Apache Spark pool.

Conclusion

We have demonstrated that one hive-metastore can be shared across three different platforms. In fact, more platforms can plug into this external hive-metastore as long as their computing engines are built on top of Apache Spark. I hope this insight inspires new ways to manage the data and metadata in your organization.
