The research leading to these results has received funding from the European Community’s Horizon 2020 research and innovation programme under grant agreement No 826284
Organizations nowadays collect and store large amounts of data about their customers and their internal business operations. In many cases the data is stored in a columnar format, such as Apache Parquet or Apache ORC, where it can later be consumed by an analytics engine such as Apache Spark. However, the stored data is often sensitive or confidential and should be protected. To that end, IBM has been playing a leading role in the community effort to embed encryption into the Apache Parquet format, which we call the Parquet Modular Encryption format. This format supports encryption at the column level, and therefore enables the user to decide which columns to encrypt and how to control column access. A detailed specification of the format can be found here. A use case scenario and the motivations for this library are also described in this talk from the Spark+AI Summit 2018.
The Parquet Modular Encryption format supports two encryption modes: one based entirely on the AES-GCM cipher (referred to as the GCM mode), and another based on a combination of the AES-GCM and AES-CTR ciphers (referred to as the GCM-CTR mode). In GCM mode, all parts of the parquet file (data and metadata) are encrypted with AES-GCM. In GCM-CTR mode, the data is encrypted with AES-CTR, while the metadata is encrypted with AES-GCM. AES-CTR provides only data confidentiality, whereas AES-GCM also provides data integrity, at the price of additional work: computing a MAC.
This way, the user can balance the trade-off between better security and higher application throughput. In this blog we explore the overhead of adding data encryption, with and without integrity verification.
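To make the difference between the two ciphers concrete, here is a minimal pure-JDK sketch (using the standard `javax.crypto` API, not the parquet-mr internals) that encrypts a page-sized buffer with both `AES/GCM/NoPadding` and `AES/CTR/NoPadding`. Note that GCM output carries a 16-byte authentication tag, while CTR output is exactly the size of the plaintext:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class AesModes {
    // AES-GCM: confidentiality + integrity; a 128-bit auth tag is appended to the ciphertext.
    static byte[] gcmEncrypt(byte[] key, byte[] iv, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        return c.doFinal(plain);
    }

    static byte[] gcmDecrypt(byte[] key, byte[] iv, byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        return c.doFinal(cipherText);
    }

    // AES-CTR: confidentiality only; no tag, ciphertext is the same length as the plaintext.
    static byte[] ctr(int mode, byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        SecureRandom rnd = new SecureRandom();
        byte[] key = new byte[16];   // 128-bit key, as in the benchmarks below
        rnd.nextBytes(key);
        byte[] gcmIv = new byte[12]; // 96-bit nonce, the recommended size for GCM
        rnd.nextBytes(gcmIv);
        byte[] ctrIv = new byte[16]; // full-block initial counter for CTR
        rnd.nextBytes(ctrIv);
        byte[] page = new byte[4 * 1024 * 1024]; // a 4 MB "data page"
        rnd.nextBytes(page);

        byte[] gcmCt = gcmEncrypt(key, gcmIv, page);
        byte[] ctrCt = ctr(Cipher.ENCRYPT_MODE, key, ctrIv, page);
        System.out.println("GCM size overhead (bytes): " + (gcmCt.length - page.length)); // 16
        System.out.println("CTR size overhead (bytes): " + (ctrCt.length - page.length)); // 0

        assert Arrays.equals(page, gcmDecrypt(key, gcmIv, gcmCt));
        assert Arrays.equals(page, ctr(Cipher.DECRYPT_MODE, key, ctrIv, ctrCt));
    }
}
```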
There are Parquet encryption implementations under development in both Java and C++. In this blog we focus on the performance of the Java implementation of the encryption format, which is based on parquet-mr version 1.11. The setup we used for the benchmarks is a machine with an 8-core Intel Core i7-6820HQ CPU @ 2.70GHz and 32GB of memory, running Ubuntu 16.04. We compare the performance of OpenJDK 1.8.0_212 and OpenJDK 11.0.4 (referred to below as Java 8 and Java 11, respectively). These Java versions were chosen because Apache Spark currently runs on Java 8, and plans to switch to Java 11 in a future release.
Raw Java AES Performance
We began the assessment by measuring the throughput of the encryption primitives themselves: the AES-GCM and AES-CTR ciphers. In particular, we used the standard javax.crypto library with its default crypto provider (SunJCE). We ran the benchmarks with a 128-bit encryption key, on Java 8 and Java 11, for various chunk sizes, with both one and eight threads. We are interested in multi-threaded performance because analytics platforms (such as Apache Spark) often use multiple threads. The reason for testing various chunk sizes is that parquet files consist of data pages that are a few MB long (4 MB in the benchmarks presented below) and of metadata structures (such as the page header) that are usually a few dozen bytes long.
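The measurement idea can be sketched as follows. This is a simplified illustration, not the actual benchmark harness we used (which is multi-threaded and more careful about warmup); it times AES-GCM encryption of a fixed-size chunk and reports MB/s:

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CipherThroughput {
    // Returns an approximate AES-GCM encryption throughput in MB/s for a given chunk size.
    static double measure(int chunkSize, int iterations) throws Exception {
        SecureRandom rnd = new SecureRandom();
        byte[] key = new byte[16]; // 128-bit key, as in the benchmarks
        rnd.nextBytes(key);
        byte[] chunk = new byte[chunkSize];
        rnd.nextBytes(chunk);

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            byte[] iv = new byte[12];
            rnd.nextBytes(iv); // fresh nonce per encryption; never reuse a nonce with GCM
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
            c.doFinal(chunk);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return (chunkSize / 1e6) * iterations / seconds;
    }

    public static void main(String[] args) throws Exception {
        // Header-sized (a few dozen bytes) vs page-sized (4 MB) chunks
        for (int size : new int[]{64, 1024, 1 << 20, 4 << 20}) {
            System.out.printf("%8d bytes: %.1f MB/s%n", size, measure(size, size < 1024 ? 10000 : 20));
        }
    }
}
```

Small chunks pay a noticeable per-call setup cost (cipher initialization, nonce handling), which is one reason chunk size matters for parquet metadata versus data pages.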
Below, we list the conclusions of the raw Java benchmarking. The graphs with the full results can be found in the appendix of this blog:
- Encryption and decryption in Java 11 are an order of magnitude faster than in Java 8. This is mainly due to the lack of hardware acceleration (AES-NI) support in Java 8; it was introduced in later Java versions and is available in Java 11. For example, the peak throughput of a single-threaded AES-GCM cipher in Java 8 is around 70 MBps, whereas in Java 11 it is around 1 GBps.
- The AES-CTR algorithm is faster than the AES-GCM algorithm. This is because the GCM algorithm also computes a tag for each encrypted block, which provides data integrity in addition to data confidentiality. For example, the peak throughput of a single-threaded AES-CTR cipher in Java 8 is around 300 MBps, versus 70 MBps with AES-GCM. In Java 11 the difference still exists, but it is practically negligible, because AES-GCM is hardware accelerated and runs much faster than the application workload (see the next section).
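The integrity guarantee that GCM buys with its extra MAC work can be demonstrated directly: if any ciphertext bit is flipped, GCM decryption fails with `AEADBadTagException`, while CTR would silently return corrupted plaintext. A small sketch (demo-only key and nonce, all zeros):

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class GcmIntegrity {
    // Returns true if AES-GCM decryption rejects a tampered ciphertext.
    static boolean gcmDetectsTampering() throws Exception {
        byte[] key = new byte[16]; // all-zero key, for demonstration only
        byte[] iv = new byte[12];

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        byte[] ct = enc.doFinal("page header".getBytes());

        ct[0] ^= 1; // flip a single bit, simulating corruption or tampering

        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        try {
            dec.doFinal(ct);
            return false; // should not happen: tag verification must fail
        } catch (AEADBadTagException e) {
            return true;  // MAC check failed, tampering detected
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Tampering detected: " + gcmDetectsTampering()); // true
    }
}
```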
We now turn to benchmarking the parquet library with the modular encryption capabilities. We based our benchmarks on the standard parquet-mr benchmarks. In the read benchmarks, a set of parquet files is generated beforehand and consumed by the benchmark process. In the write benchmarks, the same files are generated during the benchmark itself. Each benchmark has a different choice of parquet properties. These properties govern the number of rows, the compression algorithm (if any), the page size and the block size (a.k.a. the row-group size). All of the parquet files in these benchmarks contain a million rows.
The benchmarks are based on the JMH framework. We ran the benchmarks as suggested by the standard benchmark documentation: three forks, five warmup iterations and five measurement iterations; the results are averaged over these iterations.
For convenience we refer to the benchmarks as W1-W7 and R1-R7. Below are the benchmarks we present in this blog:
We also tested our library with the rest of the standard benchmarks (R2–4, R7 and W2–4, W7). However, these benchmarks use settings similar to R1 and W1, with different page and block sizes, which don't affect the encryption throughput. Consequently, the results are essentially the same as R1 and W1, and are omitted for brevity. All of the results below are single-threaded.
We ran each benchmark in three different encryption settings (in Java 8 and Java 11):
- Without encryption at all (denoted as the unencrypted mode)
- Encrypting both the parquet data and the metadata with AES-GCM (GCM mode)
- Encrypting the parquet data with AES-CTR and the metadata with AES-GCM (GCM-CTR mode)
For the benchmarks with encryption, we also specify the relative encryption time overhead, i.e., the ratio between the execution times of the encrypted and unencrypted variants. For example, in R5 in GCM mode we have approximately 30% overhead with Java 8, and approximately 3% with Java 11.
There are multiple interesting patterns to observe from the above results:
- Java 11 hardware-accelerated encryption delivers much better performance than Java 8. Java 11 is also faster than Java 8 even for the base parquet-mr code.
- The GCM-CTR mode has less overhead than the GCM mode. In Java 11 the difference between the modes is less noticeable than in Java 8, thanks to the efficiency of the hardware-accelerated cipher.
- The relative overhead is proportional to the ratio between the base parquet code throughput and the cipher throughput. For example, the throughput of the parquet code in R1 with Java 8 is around 47 MBps (calculated by dividing the file size by the execution time), and the throughput of CTR decryption is around 300 MBps. The ratio between them is around 0.15, and indeed the decryption overhead in the CTR mode of R1 is around 16%. Therefore, if the base parquet code performs additional work on the same data (e.g. compression), the relative overhead becomes less significant.
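The back-of-the-envelope model behind that last observation is simple: every byte the base code processes must also pass through the cipher, so the added time per byte is 1/cipher_throughput on top of 1/base_throughput, giving a relative overhead of base_throughput/cipher_throughput. A tiny sketch of this estimate (the throughput figures are the measured ones quoted above):

```java
public class OverheadModel {
    // Relative overhead ~= (time spent in cipher) / (time spent in base code)
    //                   = (bytes / cipherMBps) / (bytes / baseMBps)
    //                   = baseMBps / cipherMBps
    static double relativeOverhead(double baseMBps, double cipherMBps) {
        return baseMBps / cipherMBps;
    }

    public static void main(String[] args) {
        // R1 on Java 8: base parquet read ~47 MBps, AES-CTR decryption ~300 MBps
        double overhead = relativeOverhead(47, 300);
        System.out.printf("Predicted CTR overhead: %.0f%%%n", 100 * overhead); // ~16%, matching the measurement
    }
}
```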
Column Subset Encryption
Next, we turn our attention to the Parquet Modular Encryption capability of encrypting a subset of the table columns. This is a core feature of our library, as in many real-life use cases only a few columns hold sensitive data, whereas the other columns don't need to be encrypted. First, we tried to adapt the standard parquet benchmarks. The standard benchmark files consist of eight columns, so we could choose to encrypt one column to demonstrate the overhead of modular encryption. Unfortunately, most of the columns in the standard benchmarks hold data which is highly repetitive (often a single value). Therefore, the data in these columns is encoded using thin dictionary and data pages, meaning that the remaining columns (only one or two) account for most of the file size. Since the encryption is performed after the data is encoded and compressed, almost all of the cipher time is spent on these one or two columns. However, this does not represent a realistic distribution of data in parquet files, so we decided to use a real dataset instead to benchmark this capability.
One of the best places to find real datasets is Kaggle, a wonderful resource for data scientists and anyone interested in the properties of real datasets. We decided to benchmark our library with this dataset, which is hosted by the city of Los Angeles and contains data about parking citations in the city. The dataset consists of 19 columns, and we decided to encrypt 2 columns (Ticketnumber and IssueDate), which represent approximately 10% of the data. For the read benchmarks, we created parquet files from this dataset, with parquet properties similar to those in the standard benchmarks. In the write benchmarks, we read records from a csv file (located inside a tmpfs filesystem) and write these records to parquet files. We present only the GCM results for brevity. The benchmark numbers (W1, W5, W6, R1, R5, R6) have parquet properties as in the standard benchmarks above (albeit with a different number of rows). We used the same JMH settings as in the standard parquet benchmarks.
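To illustrate the idea of column-subset encryption, here is a hypothetical pure-JDK sketch. It is not the parquet-mr API (which configures this through file encryption properties); it merely shows the principle: each column's encoded bytes are encrypted only if that column is flagged as sensitive, so the cipher work scales with the encrypted fraction of the data. The column names mirror the Los Angeles dataset; the framing (nonce prepended to ciphertext) is an illustrative choice, not the parquet format:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class ColumnSubset {
    // Encrypts only the columns in `sensitive`; all other columns pass through as plaintext.
    static Map<String, byte[]> writeColumns(Map<String, byte[]> encodedColumns,
                                            Set<String> sensitive, byte[] key) throws Exception {
        Map<String, byte[]> out = new LinkedHashMap<>();
        SecureRandom rnd = new SecureRandom();
        for (Map.Entry<String, byte[]> col : encodedColumns.entrySet()) {
            if (!sensitive.contains(col.getKey())) {
                out.put(col.getKey(), col.getValue()); // plaintext column: no cipher cost
                continue;
            }
            byte[] iv = new byte[12];
            rnd.nextBytes(iv);
            Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
            c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
            byte[] ct = c.doFinal(col.getValue());
            // Prepend the nonce so a reader could decrypt (illustrative framing only)
            byte[] framed = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, framed, 0, iv.length);
            System.arraycopy(ct, 0, framed, iv.length, ct.length);
            out.put(col.getKey(), framed);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16];
        new SecureRandom().nextBytes(key);
        Map<String, byte[]> cols = new LinkedHashMap<>();
        cols.put("Ticketnumber", "4272349".getBytes());  // sensitive
        cols.put("IssueDate", "2015-12-21".getBytes());  // sensitive
        cols.put("Make", "TOYT".getBytes());             // not sensitive
        Map<String, byte[]> out = writeColumns(cols, Set.of("Ticketnumber", "IssueDate"), key);
        System.out.println("Make left in plaintext: "
            + Arrays.equals(out.get("Make"), cols.get("Make"))); // true
    }
}
```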
Now, we turn our attention to the next point of interest: the difference between encrypting all columns and only a subset of them. We can see from the table that the overhead is proportional to the number of encrypted columns. That is, if we encrypt a certain percentage of the columns, we should expect a proportionally lower encryption overhead. This means that users can gain increased performance by encrypting only the sensitive columns in their tables.
Encrypting sensitive data is obviously important. But keeping the processing throughput high is also a competitive necessity. The Parquet modular encryption library was designed to let its users strike their own balance between these needs. We hope this blog has shed light on the different performance aspects that should be taken into consideration, and has shown the estimated performance overhead of these decisions.
Appendix — raw Java encryption and decryption benchmarks
Below are the throughput graphs for the raw Java ciphers. Each graph contains benchmarks for either AES-CTR or AES-GCM, in either Java 8 or Java 11. The benchmarks cover both encryption and decryption (denoted as Enc and Dec), with one thread and eight threads (denoted as 1-Thr and 8-Thr). We suspect the performance deterioration with eight threads in Java 11 between 100 MB and 1 GB is due to low-level cache contention.