Design For Failures Up Front, Because They Will Happen

Arman Shanjani
8 min read · Jan 2, 2019


While implementing a cloud microservice, I came across the need for data encryption. Given that the service would be deployed in the AWS ecosystem, I considered using AWS Key Management Service (KMS) for managing the keys I would use to encrypt/decrypt user data. KMS had every feature I was looking for, except one: a Service Level Agreement (SLA).

If you are unfamiliar with the term, an SLA is simply a contract between a service provider and a customer that declares what level of service is guaranteed. Cloud services usually come with an SLA guaranteeing a certain level of availability (e.g. being up 99.99% of the time). Given that the service I was implementing required a high-uptime SLA, and that KMS provided no SLA at all, using KMS as-is would not be good enough.

This led to the creation of Reliable Key Management Service (RKMS), a highly available key management service built on top of AWS KMS, supporting my service’s SLA. In this post, I will talk about what APIs my cloud microservice needs from RKMS, what APIs KMS provides that RKMS can leverage, and how RKMS ensures high availability.

API-first Development

Before designing any service, it is always a good idea to first design the APIs exposed to clients. In this case, the main client of RKMS will be my multi-tenant key-value store microservice. The store service simply requires a unique encryption key per tenant. As a result, the RKMS API needs to provide a getKey(String id) endpoint, where id is a tenant’s ID. The only other endpoint the store service requires is a deleteKey(String id) endpoint for when a tenant stops using the service.

You might have noticed that no endpoint exists for key creation. To keep things simple, I decided to define getting a key as follows: if a key does not exist for the given id, create one and return it; otherwise, return the key that has already been created for that id. This way, the store service does not need to worry about whether the key has already been created or not.
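To make this concrete, here is a minimal sketch of that interface in Java (the language of the signatures above); the interface name and the byte[] return type are illustrative assumptions, not the actual RKMS code:

```java
// A minimal sketch of the client-facing RKMS API. Names and types are
// illustrative assumptions, not the actual implementation.
public interface Rkms {

    // Returns the plaintext data key for the given tenant id.
    // Get-or-create semantics: if no key exists for this id yet, one is
    // created first, so clients never need a separate createKey call.
    byte[] getKey(String id);

    // Deletes the key for the given tenant id, e.g. when a tenant
    // stops using the store service.
    void deleteKey(String id);
}
```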

That’s it! Now let’s look at what APIs KMS exposes to help build RKMS.

KMS

As the name suggests, KMS is a key management service. Its main feature is to securely store encryption keys on your behalf and provide encrypt/decrypt endpoints using an encryption key’s unique ID, never exposing the value of the encryption key outside of KMS. That is great, but there is a catch: you can only encrypt up to 4KB of data per request!

You might ask: but what if I want to encrypt 10GB of data? Do I have to make 2,500,000 (10GB / 4KB) calls to KMS?! The answer is you could, but please don’t. To overcome this limitation, let’s borrow a classic pattern known as envelope encryption: use the master key managed by KMS to encrypt/decrypt a small key (called the “data key” going forward), which in turn is used to encrypt/decrypt any data, large or small. This way, we do not need to deal with the size limitations of KMS; we just need to keep track of the encrypted versions of the data keys.

[Diagram: Envelope encryption pattern]

In addition to the encrypt and decrypt endpoints, KMS provides an endpoint (GenerateDataKey) to request a new random data key. As a result, we don’t have to worry about generating a cryptographically secure random key ourselves, which is easy to get wrong.
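To make the envelope pattern concrete, here is a hedged sketch using the AWS SDK for Java (v1) together with the standard javax.crypto library; the key alias is a placeholder for your own KMS key, and error handling is omitted:

```java
import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.AWSKMSClientBuilder;
import com.amazonaws.services.kms.model.GenerateDataKeyRequest;
import com.amazonaws.services.kms.model.GenerateDataKeyResult;

import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class EnvelopeEncryptionExample {
    public static void main(String[] args) throws Exception {
        AWSKMS kms = AWSKMSClientBuilder.standard().withRegion("us-east-1").build();

        // Ask KMS for a fresh data key: we get back the plaintext key for
        // local use and a ciphertext blob (the key encrypted under the master
        // key) to persist. "alias/my-master-key" is a placeholder.
        GenerateDataKeyResult dataKey = kms.generateDataKey(new GenerateDataKeyRequest()
                .withKeyId("alias/my-master-key")
                .withKeySpec("AES_256"));

        byte[] plaintextKey = toBytes(dataKey.getPlaintext());
        byte[] encryptedKey = toBytes(dataKey.getCiphertextBlob()); // store this, never the plaintext

        // Encrypt arbitrarily large data locally with AES-GCM: no 4KB limit.
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(plaintextKey, "AES"),
                new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(
                "10GB of data would go here".getBytes(StandardCharsets.UTF_8));
        // Persist encryptedKey + iv + ciphertext. To decrypt later, send
        // encryptedKey to KMS Decrypt and reverse the local AES step.
    }

    private static byte[] toBytes(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);
        return bytes;
    }
}
```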

With these APIs, we can utilize KMS for encryption and everything will work just fine. The only issue is that if at any moment KMS becomes unavailable, our service will fail to encrypt/decrypt, leading to its own unavailability.

RKMS to the rescue!

RKMS Architecture

The secret sauce of RKMS (and many other distributed systems) is replication. RKMS replicates the data key for each id across multiple regions where KMS is available.

Let’s say I want to use RKMS with a replication factor of 3. In that case, RKMS needs to replicate the data key for each id across 3 regions (e.g. us-east-1, us-east-2, us-west-1). As a result of the replication, when my store service needs to encrypt/decrypt data and sends a request to RKMS for the data key, RKMS can request the plaintext version of the data key from KMS in any of the 3 regions. In addition, once my store service gets the plaintext data key, it can cache and use it for a while without going back to RKMS every time.
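That client-side cache could be as simple as an expiring in-memory map. Here is one possible sketch using Guava (my choice purely for illustration; any cache with a TTL would do), where RkmsClient is a hypothetical client for calling RKMS:

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;

public class DataKeyCache {
    // Hypothetical client for the RKMS getKey endpoint.
    interface RkmsClient {
        byte[] getKey(String id) throws Exception;
    }

    private final LoadingCache<String, byte[]> cache;

    public DataKeyCache(RkmsClient rkms) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(10_000)                   // bound memory usage
                .expireAfterWrite(5, TimeUnit.MINUTES) // TTL is a tuning knob, not a given
                .build(new CacheLoader<String, byte[]>() {
                    @Override
                    public byte[] load(String id) throws Exception {
                        return rkms.getKey(id); // only called on a cache miss
                    }
                });
    }

    public byte[] getKey(String id) throws Exception {
        return cache.get(id);
    }
}
```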

Let’s look at what RKMS does during a getKey(String id) in more detail:

1. Check the database for the id → encrypted data keys mapping (1 encrypted data key per region).
2. If the mapping exists:
 a. In parallel, send the encrypted data keys to their respective region’s KMS for decryption.
 b. Return the plaintext data key from the first region that responds and cancel the rest of the requests.
3. If the mapping does not exist:
 a. Ask one region’s KMS for a random data key. Try other regions if the request fails.
 b. Now that a new data key is created, encrypt it in every region in parallel.
 c. Store the id → encrypted data keys mapping in the database.
 d. Return the plaintext data key.
[Diagram: getKey flow when the mapping exists in the database]
[Diagram: getKey flow when the mapping does not exist in the database]
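Putting the two branches together, here is a rough Java sketch of the lookup path, assuming a hypothetical KeyStore wrapper around the database table and one AWS KMS client per region. ExecutorService.invokeAny conveniently returns the first task that succeeds and cancels the rest, which is exactly the “first region to respond wins” behavior:

```java
import com.amazonaws.services.kms.AWSKMS;
import com.amazonaws.services.kms.model.DecryptRequest;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;

public class GetKeyFlow {
    // Hypothetical wrapper around the id -> encrypted data keys table.
    interface KeyStore {
        Map<String, ByteBuffer> find(String id); // region -> encrypted key, or null
    }

    private final Map<String, AWSKMS> kmsByRegion; // one KMS client per region
    private final KeyStore keyStore;
    private final ExecutorService executor;

    public GetKeyFlow(Map<String, AWSKMS> kmsByRegion, KeyStore keyStore,
                      ExecutorService executor) {
        this.kmsByRegion = kmsByRegion;
        this.keyStore = keyStore;
        this.executor = executor;
    }

    public byte[] getKey(String id) throws Exception {
        // Step 1: check the database for the id -> encrypted data keys mapping.
        Map<String, ByteBuffer> encryptedKeys = keyStore.find(id);
        if (encryptedKeys == null) {
            return createKey(id); // step 3: the creation path, sketched in prose above
        }
        // Step 2: decrypt in all regions in parallel; invokeAny returns the first
        // plaintext to come back and cancels the remaining in-flight requests.
        List<Callable<byte[]>> tasks = new ArrayList<>();
        for (Map.Entry<String, ByteBuffer> regionAndKey : encryptedKeys.entrySet()) {
            AWSKMS kms = kmsByRegion.get(regionAndKey.getKey());
            tasks.add(() -> toBytes(kms.decrypt(new DecryptRequest()
                    .withCiphertextBlob(regionAndKey.getValue())).getPlaintext()));
        }
        return executor.invokeAny(tasks);
    }

    private byte[] createKey(String id) {
        // Generate a data key in one available region, encrypt it everywhere,
        // store the mapping, return the plaintext (omitted in this sketch).
        throw new UnsupportedOperationException("creation path omitted");
    }

    private static byte[] toBytes(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);
        return bytes;
    }
}
```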

Remaining Challenges

Race conditions

The above algorithm is pretty solid, but we need to consider that RKMS will be deployed as a replicated stateless service. A request can go to any instance of RKMS, so the same request could arrive at 2 different instances at once if 2 different store service instances make it at the same time. This is pretty common, because a tenant might be saving multiple key-value pairs at the same time on different store service instances. If the encrypted data key already exists in RKMS, everything is fine. However, if no key has ever existed for that id, both RKMS instances will try to create a data key for the same id at once, and each will generate a unique data key, different from the other’s. This leads to a big problem!

Let’s take a look at an example scenario: let’s say the same request goes to 2 of our instances: r1 and r2. Both instances check the database and find no id → encrypted data keys mapping, so they both request a new data key from KMS. Notice that the data key r1 gets from KMS will not be the same as what r2 will receive.

Once it receives and encrypts the data key in all regions, r1 saves its set in the database and returns the plaintext data key k1. Then r2 finishes its tasks, saves its set in the database, and returns the plaintext data key k2. At this moment, the clients of r1 and r2 will encrypt user data with 2 different keys, but in the future, when they want to decrypt that tenant’s data, they will always use data key k2, because that was the last data key saved in the database. As a result, we will never be able to decrypt the data that was encrypted with data key k1!

[Diagram: The race condition between instances r1 and r2]

To solve this, we need to allow only one of these simultaneous requests to succeed. One might think of using locks to prevent another RKMS instance from entering the creation phase, but that would require coordination between the instances, which would slow our service down. Instead, we can rely on conditional writes provided by the database: write operations that only happen if a certain condition is met. In this case, we write to the database with the condition that no value already exists for the given id. Going back to the above example, when r2 tries to write to the database, its write will fail because a value already exists, put there moments ago by r1. r2 can then simply read the mapping r1 stored and return that data key instead.
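With DynamoDB, which RKMS uses (see below), the condition is expressed as a condition expression on the put. A sketch, where the table and attribute names are assumptions:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class ConditionalKeyWriter {
    private final AmazonDynamoDB dynamo;

    public ConditionalKeyWriter(AmazonDynamoDB dynamo) {
        this.dynamo = dynamo;
    }

    // Writes the id -> encrypted data keys mapping only if no mapping exists
    // yet. Returns true if our write won the race; false if another instance
    // got there first (read and use its mapping instead).
    public boolean putIfAbsent(String id, Map<String, ByteBuffer> encryptedKeysByRegion) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("id", new AttributeValue().withS(id));
        for (Map.Entry<String, ByteBuffer> e : encryptedKeysByRegion.entrySet()) {
            item.put(e.getKey(), new AttributeValue().withB(e.getValue()));
        }
        try {
            dynamo.putItem(new PutItemRequest()
                    .withTableName("rkms-keys") // table name is an assumption
                    .withItem(item)
                    // The conditional write: reject if a row for this id exists.
                    .withConditionExpression("attribute_not_exists(id)"));
            return true;
        } catch (ConditionalCheckFailedException raceLost) {
            return false; // someone else created the key first
        }
    }
}
```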

Reliance on all Regions

You may have noticed that when a data key already exists, we only need 1 region’s KMS to be up in order to fulfill the request. However, in order to create a data key for a new id, we need all regions to be available, which increases our risk of losing availability for key creation. Even though a solution is not yet implemented in RKMS for this problem, a possible one is to tolerate failures in some regions during data key creation. For example, we could require only 2 out of 3 redundant regions to be available during key creation, so that no single region’s outage can block it. This means fewer regions’ encrypted data keys get saved in the database, so when fetching the data key later, we can only rely on the regions that responded during creation. To bring the missing regions back into play, one could also deploy a scheduled job that sweeps through the database for incomplete rows, decrypts each affected data key using one of its available regions, and encrypts it in the missing regions.
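A sketch of that quorum idea, written sequentially for clarity (the real thing would fan out in parallel); RegionalEncryptor is a hypothetical per-region wrapper around the KMS Encrypt call:

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

public class QuorumKeyCreation {
    // Hypothetical per-region wrapper around the KMS Encrypt call.
    interface RegionalEncryptor {
        ByteBuffer encrypt(byte[] plaintextKey) throws Exception;
    }

    // Encrypts the new plaintext data key in every region, tolerating
    // failures as long as at least `quorum` regions succeed (e.g. 2 of 3).
    public static Map<String, ByteBuffer> encryptWithQuorum(
            byte[] plaintextKey, Map<String, RegionalEncryptor> regions, int quorum) {
        Map<String, ByteBuffer> encryptedByRegion = new HashMap<>();
        for (Map.Entry<String, RegionalEncryptor> entry : regions.entrySet()) {
            try {
                encryptedByRegion.put(entry.getKey(), entry.getValue().encrypt(plaintextKey));
            } catch (Exception regionDown) {
                // Tolerate the failure; a scheduled sweeper job can backfill
                // this region's encrypted copy later.
            }
        }
        if (encryptedByRegion.size() < quorum) {
            throw new IllegalStateException("too few regions available to create key safely");
        }
        return encryptedByRegion; // store what we have; incomplete rows get repaired later
    }
}
```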

Highly Available Database

As you may have noticed, RKMS relies on a database in addition to KMS, and in order to provide high availability, that database also needs to be reliable. Currently, RKMS is implemented with AWS DynamoDB as its database, and thankfully, DynamoDB does provide an SLA, so we don’t need to worry about its availability.

Conclusion

There you have it. We designed a reliable system on top of a service that does not provide an SLA. We used replication to avoid a single point of failure, so that when failures happen, we have other options to continue serving our clients.

Remember: failures are the norm, not the exception. So be ready!
