Delta Tables And Hive Metastore — Reverse Engineering

Bhavya Bordia
2 min read · Dec 12, 2022


The following findings/methods are valid as of 12th December, 2022. Delta Lake is a developing project, so new functions may be added or removed. Please refer to the official documentation at delta.io.

Creation of an External Delta Hive Table — New Way

Pre-requisite: the S3 location is already a Delta table.

Once you have created a Delta table using Spark (refer to this), you can notice a couple of things:

SerDe Library:       org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

The SerDe library is LazySimpleSerDe, which is natively supported by Hive, and the input/output formats are the SequenceFile formats.

Another important observation:

Table Parameters:
spark.sql.sources.provider delta

In the table parameters, one can find a reference to Delta as the data source provider.
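
For context, here is a minimal sketch of how the pre-requisite Delta table might have been created via Spark SQL; the database, table name, columns, and S3 path are hypothetical, and a Spark session with the Delta Lake dependency is assumed.

-- create a Delta table at an S3 location (registers it in HMS as an external table)
CREATE TABLE IF NOT EXISTS my_db.my_delta_table (
  `id` BIGINT COMMENT 'primary identifier',
  `event_ts` TIMESTAMP COMMENT 'event timestamp'
)
USING DELTA
LOCATION 's3://my-bucket/path/to/delta-table';

-- inspect what Spark registered in HMS (SerDe, input/output formats, table parameters)
DESCRIBE FORMATTED my_db.my_delta_table;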

We then extracted the CREATE command of the table created via Spark, added the schema to it, and removed the extra properties.

CREATE EXTERNAL TABLE IF NOT EXISTS <hiveDb>.<tableName> ( 
`col1` `col1Datatype` <description-comment>,
`col2` `col2Datatype` <description-comment>,
....
`colN` `colNDatatype` <description-comment> )
ROW FORMAT SERDE  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'path'='<actual s3 path>' )
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION '<placeHolderLocation>'
TBLPROPERTIES (
'spark.sql.sources.provider'='delta'
)

You can run the above command via Hive or Hive JDBC to create and register the external table.
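
As a concrete illustration, the filled-in DDL might look like this (the database, table, columns, and S3 path are hypothetical):

CREATE EXTERNAL TABLE IF NOT EXISTS my_db.my_delta_table (
  `id` BIGINT COMMENT 'primary identifier',
  `event_ts` TIMESTAMP COMMENT 'event timestamp'
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'path'='s3://my-bucket/path/to/delta-table'
)
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
LOCATION 's3://my-bucket/path/to/delta-table'
TBLPROPERTIES (
  'spark.sql.sources.provider'='delta'
)

Here LOCATION is pointed at the same S3 path as the 'path' serde property for simplicity; the template above keeps them as separate placeholders.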

How does the structure look in HMS?

hive> desc formatted <hiveDb>.<tableName>;
OK
# col_name data_type comment
`col1` `col1Datatype` <description-comment>,
`col2` `col2Datatype` <description-comment>,
....
`colN` `colNDatatype` <description-comment>
# Detailed Table Information
Database: uip_vep_delta
OwnerType: USER
Owner: hadoop
CreateTime: Thu Nov 17 17:01:43 UTC 2022
LastAccessTime: UNKNOWN
Retention: 0
Location: <placeholderlocation>
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
bucketing_version 2
numFiles 0
spark.sql.sources.provider delta
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[]}
totalSize 0
transient_lastDdlTime 1668746389
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path <actual location>
serialization.format 1

Advantages:

  • No need for an external jar in the Spark session
  • No impact on query speed via Spark
  • The full schema is present in HMS
  • Schema evolution is possible using the ALTER TABLE command (native storage handler), as sketched below
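
A minimal sketch of such schema evolution through the native ALTER TABLE path (the column name and type are hypothetical):

ALTER TABLE <hiveDb>.<tableName> ADD COLUMNS (
  `new_col` STRING COMMENT 'newly added column'
);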

Disadvantages:

  • Not queryable via Hive

Conclusion

In the last part, we saw a couple of different ways of creating Delta tables; one can determine which is best suited for them.

If you are not looking for a description of the data in HMS, you can go with creating Delta tables via Spark, with no hassle of evolving the schema.

If your use case is more about interacting with the Hive shell on a day-to-day basis, and a possible impact on execution speed is acceptable, then creating a table with the external Hive connector is best for you.

If your use case involves registering the correct metadata in a metastore like HMS, and you do not care much about querying via Hive, then you could use the approach mentioned in this article.
