Delta Tables and Hive Metastore: Introduction

Bhavya Bordia
Dec 12, 2022


The following findings and methods are valid as of 12th December, 2022. Delta Lake is a developing project, so some functions may be added or removed. Please refer to the official documentation at delta.io.

What is Delta Lake?

Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.

Delta Lake stores data in the open-source Apache Parquet format. Its open nature makes it a flexible storage protocol for various use cases. Each Delta table folder is made up of two parts:

  • Delta Log Folder (_delta_log): This folder contains JSON files that record the transactions on the table and track the metadata of each data file that is added or deleted.
  • Actual files: These are the Parquet files that contain the data. The transaction log references the files that belong to the table.
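
To make the relationship between the two parts concrete, here is a minimal sketch in plain Python that replays the JSON commits in a _delta_log folder and lists the data files the log still references. It rests on a few assumptions: the table, its file names, and the helper names (write_commit, live_files) are made up for illustration, the actions carry only a path (real add and remove actions have more fields), and checkpoint files and log cleanup are ignored. It is not how Delta Lake itself reads a table.

```python
import json
import tempfile
from pathlib import Path


def write_commit(table_path, version, actions):
    """Write one toy commit file: one JSON action per line, named by a 20-digit version."""
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    commit_file = log_dir / f"{version:020d}.json"
    commit_file.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def live_files(table_path):
    """Replay the JSON commits in order and return the data files still referenced."""
    log_dir = Path(table_path) / "_delta_log"
    files = set()
    # Zero-padded version numbers mean lexical order equals commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
            # Other actions (metaData, protocol, commitInfo) are skipped here.
    return sorted(files)


# Toy table with made-up file names: only the log is created, no real Parquet data.
demo_table = tempfile.mkdtemp()
write_commit(demo_table, 0, [
    {"add": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0001.parquet"}},
])
write_commit(demo_table, 1, [
    {"remove": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0002.parquet"}},
])

current = live_files(demo_table)
print(current)  # ['part-0001.parquet', 'part-0002.parquet']
```

The second commit removes part-0000.parquet from the table without the log ever touching the Parquet file itself, which is why a reader has to go through the log to know which files make up the table.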

What is Hive Metastore?

Hive Metastore is a service that stores metadata related to Apache Hive and other services in a backend RDBMS such as MySQL or Postgres. It is also known as HMS, the abbreviation used in this article.

Apache Spark can also use the metastore, and the support is enabled on the Spark session. Many companies sync the metadata available in HMS into their data catalog, where consumers of the tables can look it up to understand a table.
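
As a rough sketch of what enabling that support can look like in PySpark, the snippet below builds a session with enableHiveSupport(). The metastore URI is a placeholder and the helper name build_session is made up for illustration; it also assumes pyspark is installed and that an HMS is reachable at the configured address, so treat it as a starting point and not a verified setup.

```python
# Settings passed to the session builder.
# The URI is a placeholder host, not a real metastore; 9083 is the usual Thrift port.
HMS_CONF = {
    "spark.hadoop.hive.metastore.uris": "thrift://metastore-host:9083",
}


def build_session(app_name="hms-demo", conf=HMS_CONF):
    """Create a SparkSession whose catalog is backed by the Hive Metastore."""
    # Imported here so this module still loads where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    # enableHiveSupport() makes the session use the Hive catalog.
    return builder.enableHiveSupport().getOrCreate()


# Usage, once a metastore is actually reachable:
#   spark = build_session()
#   spark.sql("SHOW DATABASES").show()
```

With a session built this way, databases and tables registered in HMS become visible to Spark SQL, which is the same metadata a catalog sync would read.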

I will cover the creation of delta tables and their description in the Hive metastore in the next part.
