An Interesting Relationship Between Hive Tables and Apache Spark

If you are using Spark (> 2.0.0) with an older version of Hive (< 0.14.0), you can end up generating a Spark table that is registered in the Hive metastore with the SequenceFile format:

Sample Code for Creating the Table

import java.sql._
import spark.implicits._

val df = Seq(
  (1, Timestamp.valueOf("2017-12-02 03:04:00")),
  (2, Timestamp.valueOf("1999-01-01 01:45:20"))
).toDF("id", "time")

df.createGlobalTempView("sometable")

%sql
CREATE TABLE testabc.abc_finance2
USING PARQUET
OPTIONS ('compression'='snappy')
LOCATION '/mnt/bucket/testsql'
AS
SELECT * FROM global_temp.sometable
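A quick way to check which SerDe and input/output formats the metastore recorded for the table created above is to describe it from Spark. A minimal sketch, reusing the table name from the example (adjust the database and table to your own):

// Inspect the table metadata; look for the SerDe Library, InputFormat
// and OutputFormat rows in the storage section of the output.
spark.sql("DESCRIBE FORMATTED testabc.abc_finance2").show(100, truncate = false)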

Using Hive Version < 0.14.0

inputFormat  = org.apache.hadoop.mapred.SequenceFileInputFormat
outputFormat = org.apache.hadoop.mapred.SequenceFileOutputFormat
serde        = org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Using Hive Version >= 0.14.0 (for example, 1.2.0)

inputFormat  = org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat = org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serde        = org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

Difference Between the Above Two Formats

Sequence File Format

Sequence files store data in a binary format with a similar structure to CSV. Like CSV, sequence files do not store metadata with the data so the only schema evolution option is appending new fields. However, unlike CSV, sequence files do support block compression. Due to the complexity of reading sequence files, they are often only used for “in flight” data such as intermediate data storage used within a sequence of MapReduce jobs.

A file is considered a small file if it is smaller than the typical block size in Hadoop. A large number of small files inflates the amount of metadata the NameNode has to track, which becomes an overhead. Sequence files were introduced in Hadoop to solve this problem: they act as a container for small files. Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to use for a given record. Because sequence files are a binary format that can be split, their main use is to pack two or more smaller files together into a single sequence file.
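For reference, here is a minimal sketch (not from the original post; the path is a placeholder) of writing and reading binary key-value pairs as a SequenceFile from Spark:

// Write an RDD of key-value pairs as a SequenceFile, then read it back.
val pairs = spark.sparkContext.parallelize(Seq((1L, "first"), (2L, "second")))
pairs.saveAsSequenceFile("/tmp/seqfile-demo")

val readBack = spark.sparkContext.sequenceFile[Long, String]("/tmp/seqfile-demo")
readBack.collect().foreach(println)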

Parquet File Format

Parquet is yet another columnar file format, originating from Hadoop creator Doug Cutting's Trevni project. Like RC and ORC, Parquet offers compression and query-performance benefits, and is generally slower to write than non-columnar file formats. However, unlike RC and ORC files, Parquet SerDes support limited schema evolution: new columns can be added at the end of the structure. At present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem, such as Apache Pig, may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala, and native Parquet support is rapidly being added for the rest of the Hadoop ecosystem.
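To illustrate the append-only schema evolution mentioned above, here is a minimal sketch (paths and column names are made up for the example) of adding a trailing column to Parquet data and reading both versions back with Spark's schema merging:

import spark.implicits._

// Write two Parquet datasets, the second with an extra trailing column.
Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/parquet-demo/v1")
Seq((2, "b", 3.5)).toDF("id", "name", "score").write.parquet("/tmp/parquet-demo/v2")

// Schema merging reconciles the two schemas; the new column appears as nullable.
val merged = spark.read.option("mergeSchema", "true")
  .parquet("/tmp/parquet-demo/v1", "/tmp/parquet-demo/v2")
merged.printSchema()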

Read about the motivation for native Parquet support in Hive here: https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-NativeParquetSupport

Why Hive 0.13.0 doesn’t generate ParquetHiveSerDe

Because its Parquet SerDe does not support Hive table columns of date and timestamp types. If you want, you can cast the date/timestamp columns to string columns and then force the table to be saved with ParquetHiveSerDe.

CREATE EXTERNAL TABLE newdb.hibvetabletospark
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/mnt/bidding/testsql2/testsql9'
AS
SELECT * FROM sometable
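If you do need the cast-to-string workaround mentioned above, a minimal sketch could look like the following (the table name and location are placeholders; the source view is the one from the first example):

spark.sql("""
  CREATE TABLE testabc.abc_finance_str
  USING PARQUET
  LOCATION '/mnt/bucket/testsql_str'
  AS
  SELECT id, CAST(time AS STRING) AS time FROM global_temp.sometable
""")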

How Older Spark Versions Behaved with Hive

Loading the metadata of a big Hive table used to take a lot of time, and it was repeated each time a new partition was added. As soon as you fired a query, that initial query was blocked until Apache Spark loaded the metadata of all the table's partitions, and for larger partitioned tables this was a recursive scan, which made it even worse.
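Newer Spark releases (2.1 and later) address this by letting the Hive metastore manage file-source partitions and by pruning partitions in the metastore. A minimal sketch of the relevant session settings, applied when building the session:

import org.apache.spark.sql.SparkSession

// Let the metastore manage partitions and push partition filters down to it,
// so Spark no longer has to discover every partition up front.
val session = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.manageFilesourcePartitions", "true")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()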

Current Ongoing Hive Work (Related Ticket)

