Delta Tables And Hive Metastore — Create and Evolve

Bhavya Bordia
4 min read · Dec 12, 2022


The following findings/methods are valid as of 12th December, 2022. Delta Lake is an actively developing project, so new functions may be added or removed. Please refer to the official documentation at delta.io.

Different ways of creating Delta Lake Tables

We will discuss different ways of creating Delta Lake tables, focusing on how the structure looks in the Hive Metastore (HMS) for each of them. Which way is best depends on the use case you are trying to solve. For a brief introduction to Delta Lake and the Hive Metastore, refer to this.

Creating a Delta Table via Spark DataFrame / Spark SQL

Let us look at the code snippet below, written in Scala with Spark:

// df - DataFrame in which data is loaded
df.write
  .format("delta")
  .option("mergeSchema", "true")
  .option("path", "<s3Location>")
  .saveAsTable("<hiveDb>.<tableName>")
  • format = “delta” specifies that the table is a Delta table
  • mergeSchema = “true” enables automatic schema evolution (a short append sketch follows below)
  • path = “location”, the actual location where the Delta table data is stored
  • saveAsTable = the name of the table, normally consisting of the Hive schema (database) and the table name
Suggested Way by Delta Lake
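To see what mergeSchema does in practice, here is a minimal sketch of an append whose incoming data carries an extra column. The database/table name, S3 path, columns, and session settings are hypothetical placeholders, not taken from a real job; with mergeSchema enabled, Delta adds the new column instead of failing the write.

// Minimal sketch of automatic schema evolution on append (hypothetical names/paths).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-merge-schema-sketch")
  // standard Delta Lake session configuration
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Initial write: two columns
Seq((1L, "alice"), (2L, "bob")).toDF("id", "name")
  .write.format("delta")
  .option("path", "s3://my-bucket/events")
  .saveAsTable("analytics.events")

// Later append: the incoming data now has an extra "country" column.
// mergeSchema = "true" lets Delta evolve the table schema instead of throwing an error.
Seq((3L, "carol", "IN")).toDF("id", "name", "country")
  .write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .saveAsTable("analytics.events")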

How the structure looks in HMS

desc formatted <hiveDb>.<tableName>;
OK
# col_name data_type comment
col array<string> from deserializer
# Detailed Table Information
Database: <hiveDb>
OwnerType: USER
Owner: root
CreateTime: Thu Nov 17 14:24:39 UTC 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: <s3Location>-__PLACEHOLDER__
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
spark.sql.create.version 3.1.1-amzn-0.1
spark.sql.partitionProvider catalog
spark.sql.sources.provider delta
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[]}
transient_lastDdlTime 1668695079
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path <s3Location>
serialization.format 1

If one looks closely at this table description, it is impossible to figure out what columns exist in the table: the column names, their data types, and what each column represents are all missing.
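The schema itself is not lost, though: Delta keeps it in the transaction log at the table location, so a Spark session configured with the Delta extensions can still see the real columns. A quick sanity check, reusing the hypothetical table from the sketch above:

// HMS shows only a placeholder schema, but Spark reads the real one
// from the Delta transaction log at the table's location.
spark.table("analytics.events").printSchema()

// The same schema is also visible through Spark SQL:
spark.sql("DESCRIBE TABLE analytics.events").show(truncate = false)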

Advantages:

  • Tables get created as part of normal Spark jobs
  • Automatic schema evolution comes out of the box with Delta Lake if the mergeSchema property is enabled

Disadvantages:

  • Not possible to understand the schema if you are using HMS as the metastore
  • The table is not readable via the Hive shell

Creating a Delta Table in Hive using an external connector

Prerequisite: the S3 location already holds a Delta table.

To solve this problem, Delta Lake also provides an external connector to make the Delta table compatible with Hive.

One can use delta-hive-assembly_2.12-0.5.0.jar to create external tables. Follow these steps:

  • Make the jar available on the system from which you connect to the Hive shell (via JDBC, Beeline, or the Hive CLI)
  • Run the following commands
ADD JAR <absolute_path_of_downloaded_delta-hive-assembly_*-SNAPSHOT.jar>;
SET hive.input.format=io.delta.hive.HiveInputFormat;
SET hive.tez.input.format=io.delta.hive.HiveInputFormat;
  • After this, one can run the command to create the table
CREATE EXTERNAL TABLE IF NOT EXISTS <hiveDb>.<tableName> (
  `col1` <col1Datatype> COMMENT '<description-comment>',
  `col2` <col2Datatype> COMMENT '<description-comment>',
  ....
  `colN` <colNDatatype> COMMENT '<description-comment>'
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '<s3Location>'

Here one defines the external table with the full schema and uses the Delta storage handler, which comes from the external connector jar, to create a Hive table with a proper schema and column comments.

How the structure looks in HMS

desc formatted <hiveDb>.<tableName>;
OK
# col_name data_type comment
col1 col1Datatype <description-comment>
col2 col2Datatype <description-comment>
....
colN colNDatatype <description-comment>
# Detailed Table Information
Database: default
OwnerType: USER
Owner: hadoop
CreateTime: Thu Dec 01 05:14:19 UTC 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: <s3location>
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
bucketing_version 2
numFiles 42
spark.sql.sources.provider DELTA
storage_handler io.delta.hive.DeltaStorageHandler
totalSize 277263
transient_lastDdlTime 1669871659
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: null
OutputFormat: null
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path <s3Location>
serialization.format 1

Storage Handler → io.delta.hive.DeltaStorageHandler comes from an external jar.

Advantages

  • Tables are created with the proper schema in HMS
  • The table is queryable via Hive
  • The table is queryable via Spark after adding the same external jar to the Spark session

Disadvantages

  • An additional jar is required by everyone who wants to read the table
  • Queries are comparatively slower than on normal Spark tables
  • A Spark session with just the delta-core jar would not be able to query this table; it requires the external jar in the session as well (see the path-based read sketch after this list)
  • Schema evolution is a problem: one cannot add columns in Hive to tables created with the external storage handler. If the schema evolves, the table needs to be dropped and re-created.
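One workaround worth noting (this is an assumption on my side, not something the connector prescribes): since the prerequisite of this approach is that the S3 location already holds a Delta table, a Spark session with just the delta-core jar can still read the data by pointing at the location directly, bypassing the storage-handler table registered in HMS. The path below is hypothetical.

// Sketch: read the underlying Delta table by path, skipping the HMS entry.
val byPath = spark.read.format("delta").load("s3://my-bucket/events")
byPath.show(5)

This sidesteps the extra jar at read time, but it gives up the metastore entry, and with it data discovery, entirely.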

Persisting Problem

  • If one creates the table via Spark, the column metadata is not visible in the Hive Metastore, so data discovery becomes a huge concern.
  • The second approach addresses the first problem, but schema evolution is an issue, queries through the external jar are comparatively slower, and every consumer needs to add the additional Hive connector jar to their Spark job.

Let us try to solve this problem.
