Michael S. Fischer
Mar 7, 2017 · 8 min read

While working on a recent problem, I tested out a number of alternative Databases-As-A-Service (DBaaS) on the Amazon Web Services (AWS) platform. The best fit for my particular use case ended up being an often-overlooked one: SimpleDB. Surprised? Me, too.

The problem

I’m building a mechanism to create operating system images for both AWS EC2 (Amazon Machine Images, or AMIs) and on-premise hypervisors (Linux KVM, in this case).

Amazon’s EC2 API allows AMIs to be tagged with user-defined metadata. Combined with other metadata such as the timestamp, this allows the user to identify a particular image from which to create a new EC2 instance. An example query might be, “find the newest image that has the intersection of tags Platform=Ubuntu,PlatformVersion=14.04,ImageClass=base”. We use these tags liberally when we create new images (using Packer, which I highly recommend).

When our image builder creates a KVM image for our hypervisors, it stores it in S3. Last November, AWS added an object-tagging feature that allows arbitrary tags to be associated with S3 objects, but failed to include an indexing feature. Without more, the only way to identify a unique S3 object from its timestamp and tags is to enumerate every object in the bucket — an O(n) operation that can be very expensive in terms of compute usage and network round-trips.

The obvious solution was to create and maintain an index, analogous to the internal AMI index that AWS provides for EC2. One question was, which database technology should we use?

Rejected solution I: DynamoDB

DynamoDB is a very scalable key-value database. However, it’s not a good fit for this use case. A DynamoDB query requires the user to either (a) identify and query a specific key to retrieve, or (b) perform a scan operation, filter out disqualifying results, and then sort them by a preconfigured sort key.

The first form of query was the exact opposite of what we were trying to do: we wanted to identify a key (specifically, an S3 object URL) from a given list of tags. And the second form of query is expensive: when you perform a scan, all items in the database are evaluated — again, an O(n) operation. It’s less time-consuming from the user’s perspective because the scanning and filtering process is done on the server side (using multiple machines in parallel), but it’s still not particularly efficient or cheap.

Rejected solution II: RDS

Amazon RDS (Relational Database Service) is another DBaaS in AWS’s portfolio. RDS will automatically provision and maintain a set of SQL servers for you, based on the parameters you provide to its API. There are several SQL server types supported, but for our experiments, we chose MySQL. (The lessons here are equally applicable to PostgreSQL or other RDS implementations.)

It was relatively straightforward to create object and tag tables in MySQL, along with the appropriate indexes.

The objects table looked like this:

+----+------------+-------------------------+-------------------+
| id | bucketName | objectName | timestamp |
+----+------------+-------------------------+-------------------+
| 1 | my-bucket | ubuntu-14.04-base.qcow2 | 20170102T03:04:05 |
| 2 | my-bucket | centos-7-base.qcow2 | 20170203T04:05:06 |
+----+------------+-------------------------+-------------------+

The tags table looked like this:

+----+----------+-----------------+----------+
| id | objectId | tagKey | tagValue |
+----+----------+-----------------+----------+
| 1 | 1 | Platform | Ubuntu |
| 2 | 1 | PlatformVersion | 14.04 |
| 3 | 1 | ImageClass | base |
| 4 | 2 | Platform | CentOS |
| 5 | 2 | PlatformVersion | 7 |
| 6 | 2 | ImageClass | base |
+----+----------+-----------------+----------+

From there, it was pretty straightforward to create an index on the timestamp column in the objects table and a compound index on the (tagKey,tagValue) columns on the tags table.

Challenge: Query composition

It was fairly easy to write SQL to insert new objects into the tables. Querying them, however, turned out to a bit less elegant than I’d like. Certainly doable, but not the sort of code I find especially satisfying.

Here’s an example of the sort of logic you have to implement when querying high-cardinality dimensions in a table such as the tags table above. (This example is in Ruby.)

# Build the FROM clause:
# Identify the tables we'll need to select from -- in this case,
# the tags table, aliased once ("tags1".."tagsN") for each tag
# value. (The tags variable is simply a hash of key/value pairs.)
tags_tables = tags.map.with_index(1) do |_, i|
"tags t#{i}"
end.join(', ')
# Build the filter (WHERE) clause:
# First, the join predicates.
join_predicates = tags.map.with_index(1) do |_, i|
"t#{i}.objectId = objects.id"
end.join(' AND ')
# Now, the tag predicates. Like good little secure programmers,
# we're parameterizing our queries instead of interning values.
tag_predicates = tags.map.with_index(1) do |_, i|
"(t#{i}.tagKey = ? AND t#{i}.tagValue = ?)"
end.join(' AND ')
tag_values = @source_image_tags.map {|k, v| [k, v]}.flatten
# Finally, build the statement and execute it:
statement = client.prepare(
"SELECT objects.bucketName, objects.objectName "\
"FROM objects, #{tags_tables} "\
"WHERE #{tag_predicates} AND #{join_predicates} "\
"ORDER BY timestamp DESC LIMIT 1")
results = statement.execute(*tag_values)

(If you know of a more elegant and equally-performant way of doing this, I’d love to hear from you.)

So, why did I reject the solution even though it worked?

A seemingly-simple solution leads to a mountain of scaffolding

With RDS, common sense and best practice is to place your instances in a VPC (Virtual Private Cloud). This ensures the databases aren’t accessible from the public Internet.

Admittedly, this table didn’t contain any data of high proprietary value. But availability is important, and I wanted to reduce the probability that the database could be tampered with or destroyed by a malicious actor who somehow acquired our password, or that a malicious actor would make a bunch of idle connections to our database and deny us service (MySQL has a maximum concurrent connection limit).

So, I assigned it to a VPC. Good, right? Not quite.

Placing your RDS instances into a VPC sacrifices accessibility for security. We now had a number of challenges to solve:

Fun with Lambda

To make our indexer work, we rely on Amazon S3’s ability to invoke a Lambda function whenever an object is added to or removed from our image bucket.

Briefly, a Lambda function is a function in a packaged script or Java class that adheres to a particular calling convention. You upload the code package, and AWS executes the functions in the package on one of its machines under the specified conditions.

By default, Lambda functions run outside a VPC, without any privileged access to resources inside your own VPCs. But since our index updater functions needed to access our RDS instance, we needed to run them in our VPC. Fortunately, we could configure our Lambda functions to do just that.

But there’s a huge caveat with running Lambda functions in a VPC — they have no ability to access the public Internet. Normally this would be fine, but our indexer needed more than the ability to access our RDS server. We needed it to access:

  • Amazon S3 itself, to obtain the object tags; and
  • AWS Key Management Service (KMS), to decrypt the database’s password.

Unfortunately, AWS doesn’t provide API endpoints for most services inside VPCs; they have public addresses. So we had to find a way to let this Lambda function have its cake and eat it, too: provide both access to our RDS instance, and allow it to access the public Internet to talk to S3 and KMS.

Our original VPC design was fairly simple: it had one RFC1918 subnet (10.0.0.0/16); an Internet Gateway; and a default route pointing to the gateway. This worked fine for EC2 instances, against which we could check the “assign public IP” box. Not so, however, for Lambda.

To make the necessary APIs work for our Lambda function when running in our VPC, we had to add a bunch of scaffolding. This included:

  • A separate “private” subnet (10.1.0.0/16);
  • A NAT Gateway in our original subnet (10.0.0.0/16);
  • A separate routing table for our new private subnet;
  • A routing table entry in the above routing table setting the default gateway to the NAT gateway.

All this just to make our Lambda function work. Again, certainly doable, but our stack seemed to grow ever larger and more complex.

The straw that broke the camel’s back

We then realized that people might want to run test suites on their personal workstations: what if they could fetch KVM images from S3, convert them to VirtualBox appliances, and run their tests locally? That would be very convenient for developer productivity.

The problem, though, was that we had no easy and user-friendly way to connect developers to the VPC. Without a network route to the RDS instance in the VPC, there would be no way for them to query the index to help them find the right base image.

A possible option would have been to create bastion host with a public IP address in the VP, and then ask developers to set up SSH port forwarding. At this point, the user experience started to look pretty grim. So I looked for other options.

The SimpleDB Solution

It turns out that there’s a database service in the AWS product family that nearly everyone forgets about: Amazon SimpleDB. It’s not super fancy or built for massive amounts of data, but it has exactly the properties we need:

We tried it out and it worked perfectly. Remember our complex query builder, above? It looks like this now (again, Ruby):

simpledb = Aws::SimpleDB::Client.new# See note below about the lack of prepared query support
where_clause = tags.map {|k, v| "#{k} = '#{v}'" }.join(' AND ')
resp = simpledb.select(
select_expression: "SELECT bucketName, objectName from images "\
"WHERE #{where_clause} "\
"INTERSECTION _timestamp IS NOT NULL "\
"ORDER BY _timestamp DESC")

SimpleDB doesn’t support prepared statements and placeholders like many SQL clients do, but since SELECT is the only SQL-like operation you can perform with it, it’s inherently immune to SQL injection attacks.

Conclusion

If you need a simple tag database, don’t overlook Amazon SimpleDB. It doesn’t get a lot of fanfare, but it just might be the tool you might need to get the job done. And it’s practically free for low-volume use cases.

Zendesk Engineering

Engineering @ Zendesk

Thanks to Adel Smee and Ryan Seddon

Michael S. Fischer

Written by

Sr. DevOps Architect at Amazon Web Services. Opinions are my own.

Zendesk Engineering

Engineering @ Zendesk

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade