An Interesting Relationship Between Hive Tables and Apache Spark
If you are using Spark (> 2.0.0) with an older version of Hive (< 0.14.0), you can end up generating a Spark table that is registered in the Hive metastore with the SequenceFile format.
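This situation typically comes up when the Spark session is wired to an older Hive metastore client. A minimal sketch of such a setup, assuming Spark 2.x (the "0.13.0" version string and the "maven" jar resolution are illustrative choices, not taken from the original setup):
import org.apache.spark.sql.SparkSession

// Sketch: a Spark 2.x session pointed at an older Hive metastore client.
// Adjust the version and jar resolution to match the cluster in use.
val spark = SparkSession.builder()
  .appName("hive-compat-demo")
  .config("spark.sql.hive.metastore.version", "0.13.0")  // older Hive metastore client
  .config("spark.sql.hive.metastore.jars", "maven")      // download matching client jars
  .enableHiveSupport()
  .getOrCreate()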
Sample Code For Creating Table
import java.sql._
import spark.implicits._
val df = Seq(
(1, Timestamp.valueOf("2017-12-02 03:04:00")),
(2, Timestamp.valueOf("1999-01-01 01:45:20"))
).toDF("id", "time")
df.createGlobalTempView("sometable")
%sql
CREATE TABLE testabc.abc_finance2
USING PARQUET
LOCATION '/mnt/bucket/testsql'
OPTIONS ('compression'='snappy')
AS
SELECT * FROM global_temp.sometable
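To check which input/output format and SerDe the table actually ends up with, the catalog metadata can be inspected; a small sketch (the exact row labels in the output vary slightly across Spark versions):
// Inspect the storage metadata Spark registered for the table created above
spark.sql("DESCRIBE FORMATTED testabc.abc_finance2").show(100, truncate = false)
The values reported depend on the Hive version behind the metastore: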
Using Hive Version < 0.14.0
inputFormat = org.apache.hadoop.mapred.SequenceFileInputFormat
outputFormat = org.apache.hadoop.mapred.SequenceFileOutputFormat
serde = org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Using Hive Version > 0.13.0 (e.g. 1.2.0)
inputFormat = org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
outputFormat = org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
serde = org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Difference between the above two formats
Sequence File Format
Sequence files store data in a binary format with a similar structure to CSV. Like CSV, sequence files do not store metadata with the data so the only schema evolution option is appending new fields. However, unlike CSV, sequence files do support block compression. Due to the complexity of reading sequence files, they are often only used for “in flight” data such as intermediate data storage used within a sequence of MapReduce jobs.
If the size of a file is smaller than the typical HDFS block size, Hadoop considers it a small file. Every small file adds metadata that the NameNode has to track, which becomes an overhead. Sequence files were introduced in Hadoop to solve this problem: they act as a container for small files. Sequence files are flat files consisting of binary key-value pairs; when Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to use for a given record. Because sequence files are binary and splittable, their main use is to club two or more smaller files together into a single sequence file.
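As a rough, hypothetical illustration of that container role (paths and records are invented here), Spark can write and read such key-value sequence files directly:
import org.apache.hadoop.io.compress.GzipCodec

val sc = spark.sparkContext

// Pack several small "files" into one sequence file of (key, value) pairs;
// a single partition keeps everything in one output file.
val smallFiles = sc.parallelize(Seq(
  ("file-0001", "contents of a small file"),
  ("file-0002", "contents of another small file")
), numSlices = 1)

smallFiles.saveAsSequenceFile("/tmp/demo_seqfile", Some(classOf[GzipCodec]))  // block compression

// Read the pairs back
sc.sequenceFile[String, String]("/tmp/demo_seqfile").collect().foreach(println)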
Parquet Files Format
Parquet is yet another columnar file format, originating from Hadoop creator Doug Cutting's Trevni project. Like RC and ORC, Parquet enjoys compression and query-performance benefits, and is generally slower to write than non-columnar file formats. However, unlike RC and ORC files, Parquet SerDes support limited schema evolution: new columns can be added at the end of the structure. At present, Hive and Impala are able to query newly added columns, but other tools in the ecosystem, such as Apache Pig, may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala, and native Parquet support is rapidly being added for the rest of the Hadoop ecosystem.
Read the Parquet Motivation here https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-NativeParquetSupport
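To make the schema-evolution point concrete, here is a hedged sketch (the paths, column names, and the extra score column are invented for illustration; it reuses the spark.implicits._ import from the first snippet):
// Write two Parquet batches where the second appends a new column at the end
val v1 = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
v1.write.parquet("/tmp/demo_parquet/batch=1")

val v2 = Seq((3, "carol", 95L)).toDF("id", "name", "score")  // new trailing column
v2.write.parquet("/tmp/demo_parquet/batch=2")

// mergeSchema reconciles both footers; rows from the first batch get NULL for score
spark.read.option("mergeSchema", "true").parquet("/tmp/demo_parquet").printSchema()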
Why Hive 0.13.0 doesn’t generate ParquetHiveSerDe
Because its Parquet SerDe does not support Hive table columns of date and timestamp types. If you want, you can cast the date/timestamp columns to string and then force the table to be saved with ParquetHiveSerDe.
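A hedged sketch of that cast, reusing the df from the first snippet (the temporary view name here simply mirrors the table referenced in the DDL below):
// Cast the timestamp column to string so an old Hive Parquet SerDe can handle it
val dfAsString = df.withColumn("time", df("time").cast("string"))
dfAsString.createOrReplaceTempView("sometable")
The DDL can then pin the Parquet input/output format and SerDe explicitly: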
CREATE EXTERNAL TABLE newdb.hibvetabletospark
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/mnt/bidding/testsql2/testsql9'
AS
SELECT * FROM sometable
Old Spark Generation with Hive
Loading the metadata of a big Hive table used to take a lot of time, and it was repeated every time a new partition was added. So, as soon as you fired a query, it was blocked until Apache Spark had loaded the metadata of all of the table's partitions; for larger partitioned tables this is a recursive scan, which makes it even worse.
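For reference, later Spark releases (2.1 and above) expose settings that defer this work to the metastore; a hedged sketch with real config names but illustrative usage:
// Keep file-source partition metadata in the metastore and prune partitions there,
// instead of eagerly listing every partition up front (Spark 2.1+).
spark.conf.set("spark.sql.hive.manageFilesourcePartitions", "true")
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "true")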
