Apache Parquet and Encryption

Ratnark Gandhi
6 min read · Sep 19, 2021


Apache Parquet is one of the most popular hybrid columnar storage formats. But before getting started on Apache Parquet, let’s dig deeper into the term “columnar storage”, because a strong understanding of it will help us with the various concepts in the later part of the story. It might be a familiar term for many of you, and as it’s said everywhere, “the columnar format gives an upper edge while performing analytical (OLAP) functions on a given dataset.” But how? Let’s answer this question with the help of an example:

Dataset

Suppose we have a small dataset about purchases made at different branches and locations of a single store, say BUYMART.

The analytics team of BUYMART wants to perform some analysis on sales, and for that analysis they only need the ‘Age’ column.

In a conventional row-based storage file format (like CSV), reading the different elements of a single column from the given dataset is only possible after traversing the full data, i.e.

Row Based

Data is laid out on disk exactly as shown above, with rows adjacent to one another. Counting one cell as one unit of data, the read pointer on the hard disk has to move through 18 units to fetch the ages of all three customers, which will then be used for the analysis at a later stage.

The cost of reading from the hard disk is the same for every read operation in a row-based file format: to get the data of one specific column, we must also pass over data that is of no use to us. Now, let’s see how the data is placed in a columnar file format:

Columnar Based

Assuming the cost of going through each cell is still one unit, fetching the age column now requires the read pointer to move through only 6 units to find the ages of all three customers. And when a data set is on the scale of millions of rows, analytical functions on a columnar storage format will have much better results than on row-based storage formats.
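
To make this concrete, here is a minimal sketch (assuming pandas and PyArrow are installed; the file names buymart.csv and buymart.parquet are hypothetical): reading one column from a CSV still forces a scan of every row, while Parquet can fetch just the requested column.

```python
import pandas as pd
import pyarrow.parquet as pq

# Row-based: the whole CSV still has to be parsed, even though we
# only keep one column of the result.
ages_csv = pd.read_csv("buymart.csv", usecols=["Age"])

# Columnar: only the 'Age' column chunks are read from disk.
ages_parquet = pq.read_table("buymart.parquet", columns=["Age"])
```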

Coming back to Apache Parquet: it’s not merely a columnar storage file format, but a hybrid of row- and column-based storage formats.

This means we can decide how large a chunk of each column is kept together. It’ll become clearer with the following example:

Hybrid

In the hybrid table shown above, two rows of each column together form one column chunk, and the collection of those chunks holding all the elements of the two rows is called a row group. In the figure, A-B is one column chunk (underlined) and the whole highlighted portion is one row group. The structure of a Parquet file is explained in the figure below.

Parquet File Structure
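
As a rough sketch of how this plays out in code (assuming PyArrow; the table contents and the file name store.parquet are made up), we can control how many rows go into each row group when writing, and then read a single row group back on its own:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"Name": ["A", "B", "C", "D"],
                  "Age": [25, 31, 42, 28]})

# Two rows per row group, mirroring the hybrid layout shown above.
pq.write_table(table, "store.parquet", row_group_size=2)

pf = pq.ParquetFile("store.parquet")
print(pf.metadata.num_row_groups)   # 2 row groups of 2 rows each
first_group = pf.read_row_group(0)  # reads only rows 0-1 from disk
```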

But how is the hybrid structure better than a purely columnar file format?

  • It lets us decide whether we want the full data or only a subset of it. If we want to read just a subset, we won’t have to go through the column values we don’t need.
  • We can access row groups individually. Suppose the data is sorted by some id and we know the range of ids for which we need data: row groups give us the smallest possible unit to scan for those ids, eliminating the other entries we don’t currently need (see the sketch after this list).
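
Here is a minimal sketch of that second point, assuming PyArrow and a hypothetical file ids.parquet sorted by an id column stored first in the schema: the per-row-group min/max statistics let us skip groups that cannot contain the ids we want.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("ids.parquet")  # hypothetical file sorted by 'id'
lo, hi = 100, 200                   # range of ids we need

wanted = []
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(0).statistics  # 'id' stats
    # Keep only row groups whose [min, max] range overlaps [lo, hi];
    # if statistics are missing we must keep the group to be safe.
    if stats is None or (stats.max >= lo and stats.min <= hi):
        wanted.append(i)

subset = pf.read_row_groups(wanted)
```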

Other than a hybrid file structure for data storage, Parquet also provides the following functionality:

  • Advanced filtering: We can easily skip the columns we don’t need. Suppose we don’t need the column with the phone numbers of customers: we can get a data set of the remaining columns just by filtering out the phone-number column (see the sketch after this list).
  • Encoding: By default, Parquet writes a binary representation of the data to disk, which makes read and write operations cheaper on the machine you are using and also keeps the data from being readable at a glance.
  • Compression: A collection of compression techniques is available to apply based on your needs and your data, which helps shrink the data size and hence reduces the storage space needed to keep the same amount of data.
  • Encryption: This newly introduced functionality deserves a complete section of its own.
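
The first three points can be sketched in a few lines of PyArrow (the column name PhoneNumber and the file names are hypothetical): select every column except the one we want to drop, then write the result with dictionary encoding and a chosen compression codec.

```python
import pyarrow.parquet as pq

table = pq.read_table("buymart.parquet")

# Advanced filtering: keep everything except the phone-number column.
cols = [c for c in table.column_names if c != "PhoneNumber"]

# Encoding + compression: dictionary-encode values, compress with ZSTD.
pq.write_table(table.select(cols), "buymart_filtered.parquet",
               use_dictionary=True, compression="zstd")
```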

ENCRYPTION

Apache Parquet recently introduced the ability to encrypt data inside the file itself. Earlier, while transferring data, IT professionals used to first separate the sensitive data from the general data and then encrypt the sensitive part to keep it away from analysts/developers, providing them only with the general data. The introduction of Parquet encryption now makes it possible to encrypt sensitive data within the same file. In easier words: out of three columns A, B, and C, column A can remain encrypted in the same file while users still have access to the data in columns B and C (a sketch follows the feature list below). The encryption features provided by Apache Parquet include:

  • Protection of sensitive data at rest: Data at rest is data stored physically on computer storage. Using Parquet’s encryption feature, the sensitive information in a Parquet file stays hidden from anyone using that machine.
  • Data integrity: Parquet encryption also makes the data tamper-proof, ensuring it has not been modified or deleted by any means while being transferred from a source to its target.
  • Storage-independent encryption: The encryption of data in a Parquet file is independent of the storage platform used to hold the file. The file can be kept on any cloud, personal archive, file system, etc., and the encryption will still ensure that the sensitive information stays encrypted.
  • Key-based encryption: Per-column encryption has been introduced, facilitating key-based access.
  • Full Parquet capabilities: Even with all the encryption features added to the Parquet file format, it has been ensured that the new functionality doesn’t degrade the performance of analytics engines and preserves full Parquet capabilities.
  • Full encryption: All types of data are hidden: metadata, the list of sensitive columns, encryption key ids, etc.
  • Separation of sensitive columns: Separate keys are provided for sensitive columns. We can imagine two buckets: one contains the keys for the sensitive columns, through which you can access the metadata and values stored in those columns, and the other contains the keys for the general columns.
  • Footer encryption: The Parquet file footer is encrypted with a separate footer key.
  • Environment-free: The admin of the storage server doesn’t have permission to view the encryption keys or the unencrypted data.
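
As promised above, here is a minimal sketch of per-column encryption through PyArrow’s pyarrow.parquet.encryption module (assuming a PyArrow build that ships it; the Java API in parquet-mr was released first). The InMemoryKmsClient below is a toy stand-in for a real key management service and is not secure; the key ids, key material, column names, and file name are all hypothetical.

```python
import base64

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe


class InMemoryKmsClient(pe.KmsClient):
    """Toy KMS client: 'wraps' data keys by concatenation. Testing only."""

    def __init__(self, kms_connection_config):
        pe.KmsClient.__init__(self)
        # Master keys are taken straight from the connection config.
        self.master_keys = kms_connection_config.custom_kms_conf

    def wrap_key(self, key_bytes, master_key_identifier):
        master = self.master_keys[master_key_identifier].encode("utf-8")
        return base64.b64encode(master + key_bytes)

    def unwrap_key(self, wrapped_key, master_key_identifier):
        master = self.master_keys[master_key_identifier].encode("utf-8")
        decoded = base64.b64decode(wrapped_key)
        assert decoded[:len(master)] == master
        return decoded[len(master):]


# Two hypothetical master keys: one for the footer, one for column A.
kms_config = pe.KmsConnectionConfig(
    custom_kms_conf={"footer_key": "0123456789012345",
                     "col_key": "1234567890123450"})
crypto_factory = pe.CryptoFactory(lambda config: InMemoryKmsClient(config))

# Encrypt only column A; columns B and C stay in the clear.
encryption_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"col_key": ["A"]})

table = pa.table({"A": [1, 2, 3], "B": ["x", "y", "z"], "C": [0.1, 0.2, 0.3]})
props = crypto_factory.file_encryption_properties(kms_config,
                                                  encryption_config)
with pq.ParquetWriter("encrypted.parquet", table.schema,
                      encryption_properties=props) as writer:
    writer.write_table(table)

# Reading back requires decryption properties from the same factory.
decryption_props = crypto_factory.file_decryption_properties(
    kms_config, pe.DecryptionConfiguration())
restored = pq.read_table("encrypted.parquet",
                         decryption_properties=decryption_props)
```

The intent, per the separation of sensitive columns described above, is that a reader whose KMS grants only the footer key can still read columns B and C, while column A stays opaque.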

CURRENT STATUS:

  • For Parquet MR, the Java implementation of encryption was released in 2021 with version 1.12.0.
  • For Apache Arrow, the C++ implementation was also merged in 2021, and the Python interface for it is under construction.
  • Encryption is planned to be released with Spark 3.2.0.
