We use the approach described here at Affiga. Build personalized consumer experiences at any scale.

Blind Indexes in 3 minutes: Making Encrypted Personal Data Searchable

Joshua Kelly
2 min read · Jan 17, 2020

Say you’re building a platform which is going to store large amounts of personally identifiable information. And maybe you want to sell platform access to large enterprises. Well, you better make sure that you’ve got a pretty defensive strategy for storing that data. If you leak it, you could kill the business.

Yes, you’re going to encrypt the database at rest. Yes, you’re going to restrict network access. Okay, but all it takes is a single successful hot database connection to make data visible to an attacker. Sure, if someone steals the disk you’re fine — but if they make a real, live connection that’s not going to help you in the least.

And that’s to say nothing of the fact that all of your internal business intelligence tools, and admin users who you might grant read-permissions to, will still all have their own visibility.

So what do you do?

The obvious strategy is to not only encrypt the database at rest, but also encrypt any columns storing sensitive data. Just store the ciphertexts, so that even an active connection to the database is not enough to leak data. Clients who need to read or write the data will also need access to a private key to decrypt the values stored in those columns. Your database won’t have any knowledge of this private key.

Merely connecting to the database won’t be enough. Yay!

Uh, but wait, now you can’t search those columns efficiently. Search won’t be possible in the database, only in your clients. That’s not going to scale.

A Simple Improvement: The Blind Index

The basic setup is:

  1. Only ciphertexts are stored in columns holding personally identifiable information
  2. Clients needing to read the encrypted values are given private keys, which are never given to the database
  3. Create a second column for each encrypted column to store a keyed hash (e.g. HMAC) of the plaintext (this is the “blind index”), computed by the client that originally writes the ciphertext
  4. When a client needs to do a literal search, it computes the appropriate hash in advance and filters the blind index column against that value
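
The four steps above can be sketched in Python using only the standard library. Treat this as a minimal sketch under assumptions rather than the implementation we run: the table name `users`, the column names `email_ciphertext` and `email_bidx`, the helper names, and the choice of HMAC-SHA256 are all illustrative. The ciphertext is treated as an opaque blob produced by whatever client-side encryption you already use, so the byte strings below are placeholders, not real encryption output.

```python
import hashlib
import hmac
import sqlite3


def blind_index(index_key: bytes, plaintext: str) -> str:
    """Keyed hash (HMAC-SHA256) of the plaintext, hex-encoded."""
    return hmac.new(index_key, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()


def create_schema(db: sqlite3.Connection) -> None:
    # One ciphertext column plus one blind index column per sensitive field.
    db.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email_ciphertext BLOB NOT NULL,"
        " email_bidx TEXT NOT NULL)"
    )
    # An ordinary index on the blind index column keeps lookups fast.
    db.execute("CREATE INDEX users_email_bidx ON users (email_bidx)")


def insert_user(db: sqlite3.Connection, index_key: bytes,
                email: str, email_ciphertext: bytes) -> int:
    # The writing client computes the blind index; the database never sees the key.
    cur = db.execute(
        "INSERT INTO users (email_ciphertext, email_bidx) VALUES (?, ?)",
        (email_ciphertext, blind_index(index_key, email)),
    )
    return cur.lastrowid


def find_by_email(db: sqlite3.Connection, index_key: bytes, email: str) -> list:
    # The searching client hashes the search term, then filters on the index column.
    rows = db.execute(
        "SELECT id, email_ciphertext FROM users WHERE email_bidx = ?",
        (blind_index(index_key, email),),
    )
    return rows.fetchall()


db = sqlite3.connect(":memory:")
create_schema(db)

# Assumption: a demo key. In practice this would come from a secrets store.
index_key = b"demo-only-index-key"

# Placeholder blobs stand in for ciphertexts produced by the client.
alice_id = insert_user(db, index_key, "alice@example.com", b"<opaque ciphertext 1>")
bob_id = insert_user(db, index_key, "bob@example.com", b"<opaque ciphertext 2>")

matches = find_by_email(db, index_key, "alice@example.com")
```

Here `matches` holds the row for Alice, which the client would then decrypt with its own key. A reasonable precaution, assumed rather than shown here, is to keep the index key separate from the key used for encryption.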

Now, this only works with literal (exact-match) searches, and it is not immune to leakage attacks (see Scott Arciszewski). But it’s a pretty simple approach with serious defense-in-depth advantages.
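
To make both caveats concrete, here is a small Python sketch. It assumes HMAC-SHA256 and a hypothetical helper named `blind_index`, and the city values are made up for illustration. It shows one simple kind of leakage (equal plaintexts produce equal index values, so anyone who can read the index column can count duplicates without holding the key) and why only exact matches work.

```python
import hashlib
import hmac
from collections import Counter


def blind_index(index_key: bytes, plaintext: str) -> str:
    """Keyed hash (HMAC-SHA256) of the plaintext, hex-encoded."""
    return hmac.new(index_key, plaintext.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: a demo key and made-up plaintext values.
index_key = b"demo-only-index-key"
cities = ["Toronto", "Toronto", "Ottawa", "Toronto"]

# This is what the blind index column would contain.
index_column = [blind_index(index_key, city) for city in cities]

# Leakage: without the key, a reader of the column still sees how often
# each (unknown) value repeats.
frequencies = sorted(Counter(index_column).values(), reverse=True)

# Literal searches only: the exact value matches, but a prefix or a
# different casing hashes to something unrelated.
exact_match = blind_index(index_key, "Toronto") in index_column
prefix_match = blind_index(index_key, "Tor") in index_column
case_variant_match = blind_index(index_key, "toronto") in index_column
```

If you need case-insensitive lookups, one option is to normalize the value (for example, lowercasing it) before hashing, on both the write path and the search path.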

If these kinds of challenges are interesting to you, let me know (josh@affiga.com). At Affiga we’re trying to build the next-generation customer analytics platform for e-commerce, helping merchants build personalized consumer experiences at any scale.

Thanks to Andrew Kane, whose work first led me to this idea, and to Scott Arciszewski, whose work inspired Andrew.
