Re-implementing CryptDB

Robert Liu
Princeton Systems Course
9 min readMay 24, 2019

by Ryan Amos and Robert Liu

Introduction

CryptDB [PRZB11] is a system to provide confidentiality to applications that use third-party database management systems (DBMSes). It consists primarily of a proxy server between the application and DBMS servers that encrypts database updates, allowing queries to be run over encrypted data, with extensions in both the application and DBMS server that allow for decryption of data based on application users’ logins. Since its publication at SOSP 2011, CryptDB has been cited nearly 1,000 times, and parts of their system have been adopted in other database management systems at Microsoft, SAP, and MIT Lincoln Labs. This is a highly influential paper that made waves in both database systems and applied cryptography, and led to multiple publications in both fields. Re-implementing CryptDB was a larger task than we bargained for. Our project, OnionDB, successfully re-implements a subset of the core adjustable query-based encryption and SQL-aware encryption strategy of CryptDB, though its performance can’t match the original.

The source code for our project can be found at https://github.com/somanayr/cs518-cryptdb.

CryptDB’s Key Ideas

The three main ideas from CryptDB are:

  • A SQL-aware encryption strategy. SQL has a defined set of primitive operators for use in queries, like equality checks, order comparisons, sums, and joins. CryptDB uses existing cryptographic schemes to perform these operations on encrypted data, including a variety of schemes with weaker confidentiality guarantees but stronger utility. Our implementation can handle a variety of these operations. We use standard and third-party libraries for cryptography, in keeping with the maxim “never roll your own crypto”.
  • Adjustable query-based encryption. Popa et. al. observed that some encryption schemes leak more information than others. To provide maximum security, CryptDB adjusts the encryption scheme used for each query to use the scheme with least information leakage. This is efficiently achieved for each column using onion encryption. In contrast to onion encryption in communication systems, CryptDB layers distinct cryptographic algorithms, as opposed to distinct keys. The database can be handed the decryption key to “peel” away the outermost layer, revealing a weaker but more useful encryption scheme inside. OnionDB fully implements this feature.
  • Chain encryption keys to user passwords. CryptDB uses an encrypted key table stored in the DBMS and maintained by the proxy server to determine access privileges for application users. Their idea is to encrypt data items using users’ password-derived keys, so that only users with access privileges can decrypt the data items encrypted in their names. We did not implement this feature in OnionDB, as it requires additional serialization and database entries, and we found the implementation of this feature to be independent of the other two features.

Threat Model

CryptDB is designed to operate against two main threats.

  • DBMS server compromise. In this threat model, a passive adversary gains access to all data stored in the DBMS server. CryptDB defends against this threat by executing queries over encrypted data. No plaintext data should ever be revealed to the DBMS, though there will be some information leakage depending on the encryption scheme used to satisfy a query. Work by Naveed et. al. [NKW15] describes a possible frequency analysis attack on DBMS metadata available under this threat model.
  • Arbitrary threats. Any or all of the application, proxy server, and DBMS server may be compromised, with possible loss of the users’ decryption keys. Key chaining is designed to mitigate the data loss in an arbitrary compromise: only the logged-in users’ keys are stored, so only the data that those users can see is compromised. However, because we don’t implement key chaining in OnionDB, our implementation doesn’t prevent this threat.

Analysis of CryptDB

CryptDB has very good coverage of SQL commands. In an analysis of a SQL trace from MIT’s database, CryptDB couldn’t support operations on just 1,094 of 128,840 database columns, mostly because of bitwise operations and string processing in the query. It is also easy for application developers to switch to: under 10 lines of code changed, and under 20 schema annotations provided to the proxy server.

Popa et. al. benchmarked CryptDB against pure MySQL (though it also supports Postgres) and a strawman proxy server without adjustable query-based encryption. Overall, CryptDB query throughput was 21–26% lower than MySQL on a query mix from the TPC-C benchmark. Further analysis also broke down processing times by query type, as shown in the figure below.

OnionDB System Architecture

We wrote OnionDB in Java. Originally, we planned to use Rust as it is an exciting new systems programming language, but its steep learning curve forced us to revise that choice. Using Java allowed us to leverage existing libraries and experience in our implementation. H2, our choice of database, is written in Java, as are our order-preserving encryption library and our parsing framework.

The main engineering challenges we found in re-implementing CryptDB were:

  • Developing a flexible framework to handle changing encryption schemes.
  • Building an intelligent parser that can identify connected parts of a SQL statement, identify the needed encryption schemes, and rewrite the query on the fly.
  • Maintaining multiple database schemas with different dimensions and mapping between them.

All of these took significant programming effort, discussed below.

Implementing Onion Encryption

We implemented onion encryption with an abstraction that allows our parser to easily track each onion and identify the correct onion for a desired encryption scheme, or peel the onion to the correct depth as necessary. When the parser requests an unexposed layer of an onion, the onion sends the decryption key for each intermediate layer directly to the database. The order of nesting for the schemes ensures that when the parser requests the current layer or a removed layer, no action is needed. Our implementation maintains two onions per column the user requests, which translates to two columns in the database per column the user requests; however this could be easily expanded for an arbitrary number of onions.

Implementing Parsing

To be honest, neither of us knew what we were really getting into with CryptDB and SQL. It turns out that there are about 20 main query commands in SQL, each of which has different kinds of subcommands, functions, and options and the possibility of nested queries. Writing a complete SQL parser would be a full project in itself. Instead, we used a third party library, JSqlParser. Unfortunately, the library’s documentation is limited to “this uses the Visitor design pattern” and a few dozen lines of example code. To make an intelligent, stateful parser that substitutes in proper encryption for items like table names, column names, and input strings, we had to read, decode, and rewrite much of the library. Sadly, no exceptionally funny bugs were encountered in this process.

We support CREATE, INSERT, SELECT, and DELETE commands.

Maintaining Parallel Schemas

In order to translate plaintext queries from the application to encrypted queries for the database, we distinguish between the interface the user sees, which we call the “virtual,” and the actual values and names stored in the database, which we call the “physical.” The proxy server must maintain a mapping from the application’s virtual schema to the physical, encrypted schema on the DBMS server. Our implementation maps each virtual column name to two virtual subcolumn names, one for each of the onions we use. Each of these columns holds a different ciphertext, and care must be taken to use the correct onion for each query. Virtual table names map directly to physical table names. The proxy’s schema mapping is updated when CREATE and INSERT commands are issued. We additionally must maintain an extra physical row, ROWID, to ensure we can correctly encrypt and decrypt values.

We offer two options for naming tables and columns. The first option is to encrypt each name. However, this may leak the length of the name, and, without care, may leak which columns share names across tables. Therefore, we recommend our second option: random names. Rather than encrypting names, each name, table or subcolumn, is mapped to a unique, random string, and forward and backwards indexes are built. This incurs additional memory overhead in the annotated schema, but provides improved privacy.

Evaluation

We evaluated OnionDB using separate virtual machines for the application, proxy, and database, all hosted on Google Cloud Platform. The application and database were n1-standard-1 VMs with 1 virtual CPU and 3.75 GB of memory, and the proxy server was a g1-small shared-core VM with 0.5 virtual CPU and 1.70 GB of memory.

We ran about 60,000 queries taken from a test dataset created by researchers at Aalborg University on our test platform, and benchmarked the average and maximum run time of query types.

The results show our implementation has a long way to go before it approaches CryptDB’s efficiency. Running a select query means that every data item from the selected columns has to be compared.

We suspect the cause of the latency is due to string serialization and decryption, as opposed to CryptDB, where encryption and decryption only made up 23% of processing time. This is due to a compatibility hack we had to do in order to store our encrypted data in H2. We manage encrypted data as byte arrays. Originally we wanted to store them in the database as binary values, but H2 doesn’t support inputting hex numbers larger than 0x7FFFFFF (the largest signed 4-byte integer). Instead, we convert ciphertext byte arrays to Base64 strings, which has some overhead. More fine-grained benchmarks would help us figure out exactly how much overhead it is, but we weren’t able to implement them due to time constraints.

We suspect most of the additional latency here must come from the overhead of parsing and reconstructing the query and managing the schema.

Inserting elements into tables is, on average, about the same time as in MySQL and the lazy proxy. Tail latency is attributable to large queries and lots of encryption and serialization operations.

Stumped by Outdated Regulations

We discovered that obsolete U.S. government export regulations on cryptography played a role in a bug in the project. Until 1992, cryptographic encryption algorithms were regulated as munitions, dating back to when cryptography was only performed on specialized machines, and some side effects have lingered. Until 2018, Oracle’s JDK restricted AES keys to 128 bits when the standard is 256! Of course, one developer had an old version of Oracle’s JDK, and the developer who wrote the onion encryption code was using OpenJDK, which doesn’t have the restriction in place. There was some confusion about the source of the bug, then laughter.

Conclusion

As Professor Freedman reminded us (and the class) after our presentation, one reads a paper and thinks “that must be pretty simple to implement!” until one tries it, and lo, it was difficult. Popa et. al.’s GitHub repository for CryptDB clocks in at 28,000 lines of C++, and ours is over 4,000 lines of Java supplemented by outside libraries, and it’s still not complete. Much credit is due to the original authors for the sheer scope of the work, surely a factor in their project’s success. However, we humbly claim to have realized some core ideas of CryptDB in OnionDB, as evidenced by our success, however limited it may be. The area most in need of improvement is the parser, which is currently limited in the types of queries it can support.

Sources

[PRZB11] Raluca Ada Popa, Catherine M. S. Redfield, Nickolai Zeldovich, and Hari Balakrishnan. CryptDB: Protecting confidentiality with encrypted query processing. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 85–100, 2011.

[NKW15] Muhammad Naveed, Seny Kamara, and Charles V. Wright. Inference attacks on property-preserving encrypted databases. 2015.

--

--