Metastore in Apache Spark

Md Sarfaraz Hussain
3 min read · Apr 6, 2022


Metastore

Sharing one of my learnings on Apache Spark. This might sound very basic, but I was not aware of it until a few days ago, and maybe many of you don't know it either, so I thought of sharing it!

But before that, let's understand what a Metastore in Spark actually is. 🤔

=> A metastore (aka metastore_db) is a relational database that is used by Hive, Presto, Spark, etc. to manage the metadata of persistent relational entities (e.g. databases, tables, columns, partitions) for fast access. Additionally, spark-warehouse is the directory where Spark SQL persists tables. 😀

Spark SQL by default uses an in-memory catalog/metastore; when Hive support is enabled without further configuration, it falls back to an embedded Apache Derby database. (Whoa! I didn't know that 🤯)

You can check it as follows:
🔺 println(spark.sharedState.externalCatalog.unwrapped)
🔻 res: org.apache.spark.sql.catalyst.catalog.InMemoryCatalog

But the embedded (Derby-backed) deployment mode is not recommended for production use, because it allows only one active SparkSession at a time.

Next comes the Hive metastore, which most of us are familiar with and which is the recommended option for production.

To enable the Hive metastore, we need to add the .enableHiveSupport() API call when we create our SparkSession object. By default it is deployed with an embedded Apache Derby database, but we can switch to another database such as MySQL.

You can check it as follows:
🔺 println(spark.sharedState.externalCatalog.unwrapped)
🔻 res: org.apache.spark.sql.hive.HiveExternalCatalog

HiveExternalCatalog uses the spark.sql.warehouse.dir directory as the location to save databases and tables.

Sample Code
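The original gist is not reproduced here, but a minimal sketch of such a session might look like the following (the app name and warehouse path are placeholders of my own):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-demo") // placeholder app name
  .config("spark.sql.warehouse.dir", "/tmp/spark-warehouse") // placeholder path
  .enableHiveSupport() // switches the catalog implementation to Hive
  .getOrCreate()

// should print: org.apache.spark.sql.hive.HiveExternalCatalog
println(spark.sharedState.externalCatalog.unwrapped)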

The catalog implementation is controlled by spark.sql.catalogImplementation and can be one of two possible values: "hive" or "in-memory".
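For instance (a small sketch; this is a static configuration, so it can only be read on a running session and must be set before the session is created):

// read the catalog implementation of an existing session
println(spark.conf.get("spark.sql.catalogImplementation")) // "hive" or "in-memory"

// .enableHiveSupport() is effectively .config("spark.sql.catalogImplementation", "hive")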

DataFrames can also be saved as persistent tables into Hive metastore using the saveAsTable command. Unlike the createOrReplaceTempView command, saveAsTable will materialize the contents of the DataFrame and create a pointer to the data in the Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.
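As a rough sketch of that round trip (the database and table names below are just placeholders):

// persist a DataFrame as a managed table in the metastore
spark.sql("CREATE DATABASE IF NOT EXISTS dbName")
val df = spark.range(10).toDF("id")
df.write.mode("overwrite").saveAsTable("dbName.tableName")

// later, even from a new application pointing at the same metastore,
// rebuild a DataFrame from the table by name alone
val restored = spark.table("dbName.tableName")
restored.show()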

Note: Spark SQL defaults to the in-memory (non-Hive) catalog unless you use spark-shell, which does the opposite (it uses the Hive metastore).

Here are two more follow-up questions which may arise in the minds of many:

1. I do not have a Hive setup in my cluster; can I still use the metastore?

=> If we do not have a Hive setup, we can still use the Hive metastore in Spark. When Hive is not configured in Spark's "hive-site.xml", Spark automatically creates the metastore (metastore_db) in the current directory, deployed with the default Apache Derby database. It also creates the directory configured by spark.sql.warehouse.dir to store the Spark tables, which defaults to a spark-warehouse directory inside the directory the Spark application is started from. You may need to grant write privileges to the user who starts the Spark application.

Sample Code
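Again, the gist itself is not shown here; a minimal sketch of that no-Hive-installation scenario (run locally, names are placeholders) could be:

import org.apache.spark.sql.SparkSession

// no hive-site.xml on the classpath: Spark creates an embedded Derby
// metastore_db and a spark-warehouse directory in the working directory
val spark = SparkSession.builder()
  .appName("no-hive-setup-demo") // placeholder app name
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.range(5).toDF("id").write.mode("overwrite").saveAsTable("demo_table") // placeholder table name

// the working directory should now contain metastore_db/ and spark-warehouse/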

2. Is there any advantage to using the Hive metastore with Delta Lake?

=> If we incorporate the Hive metastore with a Delta table, the metastore stores the table name, the format (delta, obviously) and the location of the table. So if you store your DataFrame as a table, you can read the Delta table back with just the table name; there is no need to remember/specify the complete path and the storage type (HDFS/S3).

With metastore:
val df = spark.table("dbName.tableName")

Without metastore:
val df = spark.read.format("delta").load("gs://complete_path")
OR (with io.delta.implicits._ in scope)
val df = spark.read.delta("gs://complete_path")

To enable this, we need to add the following configuration while creating the SparkSession object:

.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
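Putting it all together, a SparkSession wired up for Delta Lake plus the metastore might be built roughly like this (the spark.sql.extensions setting comes from the standard Delta Lake setup and is usually configured alongside the catalog; app, database and table names are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-with-metastore") // placeholder app name
  .enableHiveSupport()
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// register a Delta table once...
spark.sql("CREATE DATABASE IF NOT EXISTS dbName")
spark.range(10).toDF("id")
  .write.format("delta").mode("overwrite").saveAsTable("dbName.tableName") // placeholder name

// ...then read it back by name, with no path or format to remember
val df = spark.table("dbName.tableName")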

That’s all from this post! Happy Learning! 😊

Further Reading: https://lnkd.in/gUUZxZM5

Connect with me on LinkedIn — Sarfaraz Hussain | My blogs on Delta Lake — Part 1 and Part 2

Thank you! Stay tuned!! 😊


Md Sarfaraz Hussain

Sarfaraz Hussain is a Big Data fan working as a Data Engineer with 4+ years of experience. His core competencies are around Spark, Scala, Kafka, Hudi, etc.