Delta Tables And Hive Metastore — Create and Evolve

Bhavya Bordia
4 min read · Dec 12, 2022


The following findings/methods are valid as of 12th December, 2022. Delta Lake is an actively developing project, so new functions may be added or removed. Please refer to the official documentation at delta.io.

Different ways of creating Delta Lake Tables

We will discuss different ways of creating Delta Lake tables, focusing on how the structure looks in the Hive Metastore (HMS) for each of them. Which way is best depends on the use case you are trying to solve. For a brief introduction to Delta Lake and the Hive Metastore, refer to this.

Creating a Delta Table via Spark DataFrame / Spark SQL

Let us look at the code snippet below, written in Scala with Spark:

// df - DataFrame in which data is loaded
df.write
  .format("delta")
  .option("mergeSchema", "true")
  .option("path", "<s3Location>")
  .saveAsTable("<hiveDb>.<tableName>")
  • format = “delta” specifies that the table is a Delta table
  • mergeSchema = “true” enables automatic schema evolution (a short append sketch follows below)
  • path = “location”, the actual location where the Delta table data is stored
  • saveAsTable = the name of the table, normally consisting of the Hive schema (database) and the table name
Suggested Way by Delta Lake
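To see what mergeSchema does in practice, here is a minimal sketch of an append whose incoming data carries an extra column. The database/table name, S3 path, columns, and session settings are hypothetical placeholders, not taken from a real job; with mergeSchema enabled, Delta adds the new column instead of failing the write.

// Minimal sketch of automatic schema evolution on append (hypothetical names/paths).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-merge-schema-sketch")
  // standard Delta Lake session configuration
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Initial write: two columns
Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")
  .write.format("delta")
  .option("path", "s3://my-bucket/events")
  .saveAsTable("analytics.events")

// Later append: the incoming data now has an extra "country" column.
// mergeSchema = "true" lets Delta evolve the table schema instead of throwing an error.
Seq((3L, "carol", "IN")).toDF("id", "name", "country")
  .write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("analytics.events")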

How the structure looks in HMS

desc formatted <hiveDb>.<tableName>;
OK
# col_name data_type comment
col array<string> from deserializer
# Detailed Table Information
Database: <hiveDb>
OwnerType: USER
Owner: root
CreateTime: Thu Nov 17 14:24:39 UTC 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: <s3Location>-__PLACEHOLDER__
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
spark.sql.create.version 3.1.1-amzn-0.1
spark.sql.partitionProvider catalog
spark.sql.sources.provider delta
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[]}
transient_lastDdlTime 1668695079
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path <s3Location>
serialization.format 1

If one looks closely at this table description, it is impossible to figure out what columns exist in the table: the column names, their data types, and what each column represents are all missing.
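The schema itself is not lost, though: Delta keeps it in the transaction log at the table location, so a Spark session configured with the Delta extensions can still see the real columns. A quick sanity check, reusing the hypothetical table from the sketch above:

// HMS shows only a placeholder schema, but Spark reads the real one
// from the Delta transaction log at the table's location.
spark.table("analytics.events").printSchema()

// The same schema is also visible through Spark SQL:
spark.sql("DESCRIBE TABLE analytics.events").show(truncate = false)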

Advantages:

  • Tables get created as part of normal Spark jobs
  • Automatic schema evolution comes out of the box with Delta Lake if the mergeSchema property is enabled

Disadvantages:

  • Not possible to understand the schema if you are using HMS as the metastore
  • The table is not readable via the Hive shell

Creating a Delta Table in Hive using an external connector

Prerequisite: the S3 location already holds a Delta table.

To solve this problem, Delta Lake also provides an external connector to make the Delta table compatible with Hive.

One can use delta-hive-assembly_2.12-0.5.0.jar to create external tables. Follow these steps:

  • Make the jar available on the system from which you connect to the Hive shell (via JDBC, Beeline, or the Hive CLI)
  • Run the following commands
ADD JAR <absolute_path_of_downloaded_delta-hive-assembly_*-SNAPSHOT.jar>;
SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;
  • After this, one can run the command to create the table
CREATE EXTERNAL TABLE IF NOT EXISTS <hiveDb>.<tableName> (
  `col1` <col1Datatype> COMMENT '<description-comment>',
  `col2` <col2Datatype> COMMENT '<description-comment>',
  ....
  `colN` <colNDatatype> COMMENT '<description-comment>'
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '<s3Location>'

Here one defines the external table with the full schema and uses the Delta storage handler, which comes from the external connector jar, to create a Hive table with a proper schema and column comments.

How the structure looks in HMS

desc formatted <hiveDb>.<tableName>;
OK
# col_name data_type comment
col1 col1Datatype <description-comment>
col2 col2Datatype <description-comment>
....
colN colNDatatype <description-comment>
# Detailed Table Information
Database: default
OwnerType: USER
Owner: hadoop
CreateTime: Thu Dec 01 05:14:19 UTC 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: <s3location>
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
bucketing_version 2
numFiles 42
spark.sql.sources.provider DELTA
storage_handler io.delta.hive.DeltaStorageHandler
totalSize 277263
transient_lastDdlTime 1669871659
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: null
OutputFormat: null
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path <s3Location>
serialization.format 1

Storage Handler → io.delta.hive.DeltaStorageHandler comes from an external jar.

Advantages

  • Tables are created with the proper schema in HMS
  • The table is queryable via Hive
  • The table is queryable via Spark after adding the same external jar to the Spark session

Disadvantages

  • An additional jar is required by everyone who wants to read the table
  • Queries are comparatively slower than on normal Spark tables
  • A Spark session with just the delta-core jar would not be able to query this table; it requires the external jar in the session as well (see the path-based read sketch after this list)
  • Schema evolution is a problem: one cannot add columns in Hive to tables created with the external storage handler. If the schema evolves, the table needs to be dropped and re-created.
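One workaround worth noting (this is an assumption on my side, not something the connector prescribes): since the prerequisite of this approach is that the S3 location already holds a Delta table, a Spark session with just the delta-core jar can still read the data by pointing at the location directly, bypassing the storage-handler table registered in HMS. The path below is hypothetical.

// Sketch: read the underlying Delta table by path, skipping the HMS entry.
val byPath = spark.read.format("delta").load("s3://my-bucket/events")
byPath.show(5)

This sidesteps the extra jar at read time, but it gives up the metastore entry, and with it data discovery, entirely.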

Persisting Problem

  • If one creates the table via Spark, the column metadata is not visible in the Hive Metastore, so data discovery becomes a huge concern.
  • The second approach addresses the first problem, but schema evolution is an issue, queries through the external jar are comparatively slower, and every consumer needs to add the additional Hive connector jar to their Spark job.

Let us try to solve this problem.
