Encoding, Compression, Parquet and Hive
On a CDH 5.14 cluster, I was comparing the size of inserts done using Hive vs. Impala into a table with the Parquet file format. I was under the impression that, since both write the same file format, their behavior (with the default compression codec) should be similar and result in approximately the same size. However, I noticed that the Parquet file written by Impala is much smaller than the one written by Hive for the same input data.
So I started looking at the Parquet file written by Hive and came across observations that helped clarify a point that probably confuses most beginners: encoding vs. compression (here in the context of Hive HDFS tables). That is, the specific difference between encoding and compression, and how Hive handles each.
In general, I think the difference between encoding and compression is:
- Encoding: operates more at the application level, where the representation of the data is changed. Encoding can also reduce space usage, which gives us a kind of compression.
- Compression: in general, a technique to reduce the storage (in bytes) needed for given data, irrespective of whether the underlying data is already encoded.
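The distinction can be illustrated outside of Parquet with a small sketch. This is a toy run-length encoder written for illustration (the function name and sample data are made up, not anything Parquet or Hive uses), paired with zlib as a stand-in general-purpose compressor:

```python
import zlib

def rle_encode(data: bytes) -> bytes:
    """Toy run-length encoding: each run of equal bytes becomes (count, byte)."""
    out = bytearray()
    i = 0
    while i < len(data):
        j = i
        # Cap run length at 255 so the count fits in one byte.
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out += bytes([j - i, data[i]])
        i = j
    return bytes(out)

raw = b"a" * 500 + b"b" * 500      # highly repetitive sample data

encoded = rle_encode(raw)          # encoding: changes the representation
compressed = zlib.compress(raw)    # compression: shrinks the raw bytes
both = zlib.compress(encoded)      # the two can also be combined

print(len(raw), len(encoded), len(compressed), len(both))
```

Both techniques shrink this repetitive input independently, which is the point: encoding and compression are separate steps that can each be applied on their own or stacked.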
The observations below confirm this:
- Created a Hive table with the Parquet file format.
- Inserted some data, which created approximately 1 GB of Parquet files as part of the insert.
- To have a look at a Parquet file, I copied it from HDFS to local disk (just to note: there are ways to dump it directly from HDFS as well). Then I inspected it using the parquet-meta utility, which can dump a Parquet file's metadata.
- ../parquet-meta <file name>
- The dump shows that each column is encoded with one or more encodings (the ENC values), such as run-length encoding and bit packing (we will not discuss the encodings themselves here). However, the Parquet file written by Hive by default does not employ any compression. The size of the file was 1 GB.
- Truncated the table used in the test above.
- At the Hive (Beeline) command line, explicitly set the compression codec property (the SET statement shown at the end of this post), which forces Hive to compress the Parquet files it writes.
- Inserted the same amount of data. This time it created approximately 350 MB of Parquet files.
- Again copied a file from HDFS to local disk and dumped it using the parquet-meta utility.
- Here we can see that the file is compressed with SNAPPY in addition to being encoded. When we explicitly enable compression, there is a substantial reduction in the size of the Parquet file for the same amount of data, as a result of compression.
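The steps above can be sketched in Hive SQL. The table and column names here are hypothetical, invented for illustration; the property name is the one used later in this post:

```sql
-- Create a Parquet-backed Hive table (hypothetical schema).
CREATE TABLE events_parquet (id BIGINT, payload STRING)
STORED AS PARQUET;

-- First insert: Hive encodes the column data but, by default,
-- does not compress the resulting Parquet files (~1 GB observed).
INSERT INTO events_parquet SELECT id, payload FROM events_staging;

-- Reset and repeat with compression explicitly enabled.
TRUNCATE TABLE events_parquet;
SET parquet.compression=SNAPPY;

-- Second insert: same data, now encoded AND compressed (~350 MB observed).
INSERT INTO events_parquet SELECT id, payload FROM events_staging;
```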
So the observation is that encoding and compression are different, and they can be used separately or together. I have observed in Hive (as of CDH 5.14) that when we write to a Parquet table, by default it uses only encoding and not compression. To enable compression in Hive, we need to set it explicitly as below:
SET parquet.compression=<Compression Codec>
This then results in almost the same size as Impala inserts (Impala does both encoding and compression by default).
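For example, to match what the dump above showed, one would set the codec to SNAPPY (in my understanding, GZIP and UNCOMPRESSED are other commonly supported values, though the exact list depends on the Hive/Parquet version):

```sql
SET parquet.compression=SNAPPY;
```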