HOW IS PARQUET EFFICIENT?

Ratnark Gandhi
Jan 19, 2022


One of the keys to a great storage format's success is efficient data storage, and Parquet relies on two major techniques to achieve it:

  • Encoding
  • Compression

Let’s go through both of these techniques in detail one by one.

ENCODING

Encoding means transforming given data into a coded form, usually to achieve goals such as security or efficiency.

A well-known example of encoding is converting text into its ASCII or binary representation.

ENCODING IN PARQUET

Because Apache Parquet is designed to handle large amounts of data, encodings are primarily used to store data more effectively. We'll go through some of the encoding techniques available in Parquet that have not been deprecated.

  • PLAIN: This encoding is the fallback, applied when no more efficient encoding is available for the given data. For example, an int32 will always be represented in 4 bytes.

E.g., 1 = 00000001 00000000 00000000 00000000

2 = 00000010 00000000 00000000 00000000

(bytes shown in little-endian order, least-significant byte first)
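To make this concrete, here is a minimal Python sketch (illustrative only, not a Parquet API) of PLAIN-encoding a few int32 values:

import struct

# PLAIN: every int32 costs a fixed 4 bytes, written little-endian
# (least-significant byte first), no matter how small the value is.
values = [1, 2, 99]
plain = b"".join(struct.pack("<i", v) for v in values)
print(plain.hex(" "))  # 01 00 00 00 02 00 00 00 63 00 00 00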

  • RLE/BIT-PACKING HYBRID: You've probably already guessed the major drawback of plain encoding: wasted space. Suppose we store the ages of our users and know the maximum value for this attribute is 99, which fits comfortably in 7 bits, yet PLAIN encoding takes up a full 32 bits per entry. The RLE/bit-packing hybrid approach fixes this.

This strategy combines the RLE and bit-packing methods. It chooses one of the two approaches to encode a particular range of values depending on the character of those values (repeating or not), which means different encodings can coexist in the same column! To better grasp what the hybrid technique can accomplish, let's first go through the concepts of RLE and bit-packing in general.

RLE is an acronym for Run-Length Encoding. It works effectively for repeated data. Rather than encoding values one after the other, it counts how many times a specific value appears in a row (such an appearance is called a data run) and encodes the run as the number of repetitions followed by the value. It pays off when the data contains a large number of long data runs, as in the sketch below.
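A minimal sketch of the idea in Python (illustrative names, not Parquet's actual wire format):

from itertools import groupby

def rle_encode(values):
    # Collapse each data run into a (repetitions, value) pair.
    return [(len(list(run)), value) for value, run in groupby(values)]

ages = [18, 18, 18, 18, 25, 25, 30, 30, 30]
print(rle_encode(ages))  # [(4, 18), (2, 25), (3, 30)]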

Bit packing is a form of data compression that reduces the number of bits it takes to serialize a value. Normally an integer is serialized as 32 bits, but if we know the values range from 0 to 7, we can pack each of them into a bit width of 3. In Parquet's hybrid scheme, each run is preceded by a small header whose lowest bit flags whether the run that follows is bit-packed or run-length encoded.
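Below is a toy bit-packer in Python (illustrative only; it mimics Parquet's convention of packing values LSB-first into little-endian bytes):

def bit_pack(values, bit_width):
    # Append each value's low bits to a running buffer, flushing
    # whole bytes as they fill up (LSB-first).
    buf, nbits, out = 0, 0, bytearray()
    for v in values:
        buf |= v << nbits
        nbits += bit_width
        while nbits >= 8:
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:
        out.append(buf & 0xFF)  # pad the final partial byte with zeros
    return bytes(out)

# Eight values in the range 0..7 need only 3 bits each:
# 3 bytes instead of the 32 bytes PLAIN would use.
packed = bit_pack([0, 1, 2, 3, 4, 5, 6, 7], bit_width=3)
print(len(packed), packed.hex())  # 3 88c6fa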

  • DELTA: Delta encoding stores the first value and then the differences between subsequent entries. In Parquet, some bookkeeping information such as the block size, the mini-block count, and the number of values is stored before the first value.

For datetime columns, such as timestamps stored in milliseconds, delta encoding can be particularly efficient: only the difference between successive values is preserved instead of each value occupying 64 bits every time. In general it is recommended for situations where the deltas are substantially smaller than the absolute values.
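A minimal sketch of the principle (plain Python, leaving out Parquet's block/mini-block layout):

def delta_encode(values):
    # Keep the first value, then only the differences.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# Millisecond timestamps: the deltas are tiny compared to the raw values.
ts = [1642590000000, 1642590001000, 1642590002000, 1642590003500]
enc = delta_encode(ts)
print(enc)  # [1642590000000, 1000, 1000, 1500]
assert delta_decode(enc) == ts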

COMPRESSION IN PARQUET

Data compression is a technique whose primary goal is to reduce the physical size of the stored file.

As a result, the smaller file takes up less space on disk and travels over the network more quickly. Compression frequently works in the same spirit as encoding: it recognizes repeating patterns and substitutes them with more succinct representations.

  • GZIP: A lossless compression method based on the DEFLATE algorithm, which combines Huffman coding and LZ77. Tough words to grasp, right? Let's make them simpler by breaking DEFLATE into Huffman coding and LZ77.

In Huffman's algorithm, variable-length codes are assigned to input characters in such a way that the code assigned to one character is never a prefix of the code assigned to any other character; ensuring this is very important to save us from ambiguity in the data. For example, let there be four characters a, b, c, and d, with corresponding variable-length codes 00, 01, 0, and 1. This coding leads to ambiguity because the code assigned to c is a prefix of the codes assigned to a and b: if the compressed bit stream is 0001, the decompressed output may be "cccd", "ccb", "acd", or "ab". The lengths of the assigned codes are based on the frequencies of the corresponding characters: the most frequent character gets the smallest code and the least frequent character gets the largest.
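Here is a compact sketch of Huffman code construction (the standard textbook algorithm in Python, not GZIP's exact implementation):

import heapq
from collections import Counter

def huffman_codes(text):
    # Repeatedly merge the two least frequent subtrees; the counter n
    # breaks frequency ties so heap entries stay comparable.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        n += 1
        heapq.heappush(heap, (f1 + f2, n, (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: recurse left/right
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                        # leaf: record the character's code
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

print(huffman_codes("aaaabbbccd"))
# the most frequent character ('a') ends up with the shortest code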

LZ77 — if you've already seen a particular substring, just point back to where it occurred earlier in the stream instead of spelling it out again. That's the original intuition behind how LZ77 works.
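A toy back-reference scheme in that spirit (illustrative Python; real LZ77 variants add many refinements):

def lz77_compress(data, window=255):
    # Emit (offset, match_length, next_char) triples; the offset points
    # back into the already-seen text.
    i, out = 0, []
    while i < len(data):
        best_off = best_len = 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) - 1 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out

def lz77_decompress(triples):
    out = []
    for off, length, ch in triples:
        for _ in range(length):   # copy from off characters back
            out.append(out[-off])
        out.append(ch)
    return "".join(out)

tokens = lz77_compress("abcabcabcabd")
print(tokens)  # [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (3, 8, 'd')]
print(lz77_decompress(tokens))  # abcabcabcabd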

  • SNAPPY: Snappy encoding is not bit-oriented but byte-oriented (only whole bytes are emitted to or consumed from a stream). The format uses no entropy encoder such as a Huffman tree or an arithmetic encoder. The first bytes of the stream are the length of the uncompressed data, stored as a little-endian varint: the lower seven bits of each byte carry data, and the high bit is a continuation flag indicating that another length byte follows.

For example, 0000000: ca02 f042 5769 6b69 7065 6469 6120 6973

The first two bytes, ca02, are the length as a little-endian varint, so the most significant bits live in the second byte, 02. Dropping each byte's high continuation bit leaves the 7-bit groups 0x4a and 0x02, and combining them gives (0x02 << 7) | 0x4a = 0x14a = 330 bytes. The next two bytes, 0xf042, indicate that a literal of 66 + 1 bytes follows. Snappy does not aim for maximum compression or for compatibility with any other compression library; instead, it aims for very high speed and reasonable compression.
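Decoding that length prefix takes only a few lines (a generic varint decoder sketch, not Snappy's internals):

def decode_uvarint(data):
    # Low 7 bits of each byte carry data; a set high bit means
    # another length byte follows.
    value, shift = 0, 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:        # high bit clear: this is the last byte
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")

# The length prefix from the stream above: 0xca 0x02 -> 330 bytes.
print(decode_uvarint(bytes([0xCA, 0x02])))  # (330, 2)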

  • LZO: A lossless, dictionary-based compression method that prioritizes compression speed over compression ratio. Because it is a variant of LZ77, LZO belongs to the LZ encoding family. The difference is that it has been tuned to take advantage of the fact that modern processors are optimized for integer operations: it provides faster hash lookup tables as well as more efficient output tokens. Data that cannot be compressed is copied to the output stream uncompressed, on a 32-bit long-word basis. Unlike the previous two compression methods, LZO requires some extra effort: a native LZO library must be installed separately.
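To try these codecs on real Parquet files, here is a small sketch using pyarrow (assuming it is installed; LZO is not supported by pyarrow's writer, so it is left out). The file names and column are illustrative.

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Write the same highly repetitive column with different codecs
# and compare the resulting file sizes on disk.
table = pa.table({"age": [18] * 50000 + [25] * 50000})
for codec in ["NONE", "SNAPPY", "GZIP"]:
    path = f"ages_{codec.lower()}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")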

Hope you enjoyed reading this article. Feel free to ask or discuss anything in the comments section.
