Set Up an Ephemeral Metastore for a UC-only Workspace on AWS

Wenxin.L
Databricks Platform SME
4 min read · Jan 18, 2024

Introduction

Databricks Unity Catalog (UC) is an AI-driven, unified solution for data and AI governance that extends across multiple clouds and data platforms. Its key features include management of diverse data formats and AI-powered monitoring and observability tools for improved data quality and compliance. UC is now widely adopted by enterprise customers on the Databricks E2 platform, reflecting its effectiveness in managing complex, multi-cloud data environments efficiently and securely.

By default, a Databricks UC-enabled workspace still provides the legacy hive_metastore for backward compatibility. This legacy hive_metastore is backed by the AWS RDS service.

What if the Legacy Hive Metastore Cannot Be Used?

Since AWS PrivateLink currently does NOT support JDBC traffic, we must whitelist public outbound traffic from the VPC to the public RDS endpoints. Sometimes, a connection established over the public network cannot be approved by the corporate networking/security team.

Meanwhile, suppose we want to use Unity Catalog as our lakehouse governance solution and prefer not to spend additional effort creating and managing our own external hive metastore within a private network (such as our own RDS cluster or AWS Glue). Then, by blocking the hive metastore connection, our workspace becomes a ‘UC-only’ workspace.

High-level Diagram for UC-only workspace

Complexities of Spark’s dependency on Hive Metastore

Because Databricks uses the legacy hive_metastore to offer backward compatibility, many Databricks services still check the hive_metastore connection when they first start. As a result, a UC-only workspace encounters the following issues:

1. A “Metastore is down” error in the cluster event log.

2. A long wait time for the first query in a notebook or SQL warehouse, because the new cluster verifies the hive_metastore connection first.

3. A “SQL Connection Timeout” error in the cluster’s Log4j output (cluster driver logs).

Addressing Challenges with Derby — A Memory Embedded Metastore Solution

Apache Derby is an open-source relational database implemented entirely in Java. Its lightweight, embeddable nature makes it an ideal choice for an in-memory temporary metastore in this solution. Apache Derby is also pre-installed in all Databricks Long-Term Support (LTS) runtimes.

  • To enable Derby for compute clusters, we can set the Derby metastore directly in the cluster Spark configuration, which enables it at the cluster level:
spark.hadoop.javax.jdo.option.ConnectionDriverName org.apache.derby.jdbc.EmbeddedDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:derby:memory:myInMemDB;create=true
spark.hadoop.datanucleus.autoCreateSchema true
spark.hadoop.datanucleus.autoCreateTables true
spark.hadoop.datanucleus.fixedDatastore false
  • Alternatively, for compute clusters, we can follow the Compute policy reference to create or customize reusable cluster policies in the workspace:
{
  "spark_conf.spark.hadoop.javax.jdo.option.ConnectionURL": {
    "type": "fixed",
    "value": "jdbc:derby:memory:myInMemDB;create=true"
  },
  "spark_conf.spark.hadoop.javax.jdo.option.ConnectionDriverName": {
    "type": "fixed",
    "value": "org.apache.derby.jdbc.EmbeddedDriver"
  },
  "spark_conf.spark.hadoop.javax.jdo.option.ConnectionUserName": {
    "type": "fixed",
    "value": "<This is optional>"
  },
  "spark_conf.spark.hadoop.javax.jdo.option.ConnectionPassword": {
    "type": "fixed",
    "value": "<This is optional>"
  }
}
  • To configure Derby for the Classic SQL warehouse, we need to update the “SQL warehouse configuration” in ‘Admin Settings’ -> “Compute”:
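The data access configuration text box accepts the same key/value pairs used in the cluster Spark configuration earlier, for example:

```
spark.hadoop.javax.jdo.option.ConnectionDriverName org.apache.derby.jdbc.EmbeddedDriver
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:derby:memory:myInMemDB;create=true
spark.hadoop.datanucleus.autoCreateSchema true
spark.hadoop.datanucleus.autoCreateTables true
spark.hadoop.datanucleus.fixedDatastore false
```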

Please be aware that if we try creating a serverless SQL warehouse after adding the Derby configuration, we will encounter the following error:

Invalid data access configuration for Serverless warehouses. External metastore is currently not supported with Serverless Compute.

This occurs because serverless warehouses currently do NOT support external metastores. As a workaround, we can remove the Derby configuration before creating a serverless warehouse. For a long-term solution, we can reach out to Databricks to request an exception.
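For automation, the cluster-level Derby settings above can also be supplied through the Clusters API (`spark_conf` field), and the policy rules through the Cluster Policies API, which expects the definition as a JSON-encoded string. Below is a minimal Python sketch; the cluster name, runtime version, node type, and policy name are illustrative placeholders, and no request is actually sent:

```python
import json

# Derby-backed in-memory metastore settings (same as the cluster Spark conf above).
derby_spark_conf = {
    "spark.hadoop.javax.jdo.option.ConnectionDriverName": "org.apache.derby.jdbc.EmbeddedDriver",
    "spark.hadoop.javax.jdo.option.ConnectionURL": "jdbc:derby:memory:myInMemDB;create=true",
    "spark.hadoop.datanucleus.autoCreateSchema": "true",
    "spark.hadoop.datanucleus.autoCreateTables": "true",
    "spark.hadoop.datanucleus.fixedDatastore": "false",
}

# Illustrative payload for POST /api/2.0/clusters/create; the cluster name,
# runtime version, and node type below are placeholders, not recommendations.
create_cluster_payload = {
    "cluster_name": "uc-only-derby-cluster",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m5.xlarge",
    "num_workers": 2,
    "spark_conf": derby_spark_conf,
}

# Cluster policy payload: pin every Derby setting with "fixed" so users
# cannot override it. The Cluster Policies API expects "definition" to be
# a JSON string, so the rules dict is serialized with json.dumps.
policy_rules = {
    f"spark_conf.{key}": {"type": "fixed", "value": value}
    for key, value in derby_spark_conf.items()
}
create_policy_payload = {
    "name": "uc-only-derby-metastore",
    "definition": json.dumps(policy_rules),
}

print(json.dumps(create_cluster_payload, indent=2))
print(json.dumps(create_policy_payload, indent=2))
```

Sending these payloads (e.g. with any HTTP client plus a workspace token) is left out on purpose; the point is the shape of `spark_conf` and the JSON-string `definition`.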

Caveats — It’s an Ephemeral Metastore

Any data saved to the ‘hive_metastore’ catalog will be wiped out when the cluster stops, because this Derby metastore exists only in the cluster’s RAM. Make sure your critical data is saved to Unity Catalog managed storage or other persistent locations.

We highly recommend testing this solution in a lower environment before using it for the production workload.

Summary

For workspaces unable to access Databricks’ legacy metastore endpoints, using Derby as a cluster-level, memory-based, ephemeral metastore offers a lightweight and flexible alternative.

This solution is also used in JD’s previous blog, Isolake — A simplistic deployment design to an isolated Databricks Lakehouse on AWS.
