100% Serverless

We honestly didn’t think we could do it. But we did. We built our entire server stack using serverless technology in AWS.

We weren’t trying to. It just kinda happened. We started off with just a few Lambda scripts. Then we added a bit of API Gateway. The serverless deployment toolkit got introduced into our workflow, which meant deploying changes became a one-liner on the command line. And before we even realized what was happening, we were knee-deep in serverless tech.

So what did we build? A secure pay-as-you-go cloud storage service called Storm4.

It’s an application similar to Dropbox, Google Drive, etc. With 3 big differences:

  • You only pay for what you store. There are no tiers. No minimums, no maximums. Really simple. Just pay-as-you-go.
  • The files you store in the cloud are encrypted client-side, before leaving your device. This means we (the cloud storage provider) cannot read your content.
  • The files you store are encrypted on your device too. And we give you some really cool features, like password-protected folders.

But this post is about the behind-the-scenes serverless tech. So what did we need to build server-side?

Quite a bit: The upload gateway. A real-time billing system. Payment processing. Push notifications. An automated email system. An event scheduling system. Cross-region replication & notification system…

What we’ve learned, and what we wanted to share with you, are the interesting challenges that serverless presented. (Because, sure, there’s a bunch of tech that’s engineered the way you think it’s engineered. And that’s boring to talk about.)

The thing is, we often hear about teams migrating small portions of their backend to serverless. When talking about it they say, “Yeah, we had this server which was responding to some REST queries. And all the code was already in JavaScript. So it was dead simple to migrate it to API Gateway + Lambda… But nothing else was that straightforward, so I guess we can’t use serverless for anything else.”

Since we didn’t have any pre-existing servers to turn to, our options were different. As were the economics. Rather than run our service in a single data center, we had a vision of running in every data center. This would mean putting the data closer to the customer, or giving the customer a choice as to where their data is stored. (There are sometimes regulations concerning where companies store their data, so this has a business rationale behind it.) Of course, running in multiple data centers is a lot easier if the fixed costs per data center are low. Which is a key benefit of serverless: fixed overhead gets replaced by a pay-as-you-go model. A non-serverless solution then doesn’t mean adding a single fixed-cost server for us. It potentially means adding dozens of them. One for each data center location we support.

Challenge #1

We needed a “truth system” for uploading & versioning files. That is, if 2 devices are trying to upload a modified version of the same file at the same time, the server needs to declare a winner. Can we do this with only serverless tech?

Interestingly, at the time of writing, AWS S3 does NOT support this. S3 will give you back an ETag for your upload, but you can’t pass that ETag back on a later upload in order to achieve atomic versioning. That is, you can’t use the “If-Match: ETag” header.

If you’re like me, you’re probably thinking something like: “OK, I need a REST api. The client will hit one of the upload endpoints, and it will return either 200, or an appropriate error code.”

But is this the ONLY way to do it?
And also, is this the most scalable way to do it?

The assumption to challenge is that the server MUST return the result in a synchronous fashion. Why not asynchronous?

After all, we’re writing the client code too. Which means we can break the upload into a 2 step process:

  • Upload directly to S3 (into a staging area)
  • Wait/Poll for result (push notification or query)

Of course, there’s a trigger word we have to deal with here: “polling”. Yup, if you read that word, and had a strong visceral reaction, you just got triggered. Which means we need to turn to an abstraction for clarity.

What if, instead of “polling”, the client issues a single HTTP query (after uploading to S3) in order to get the result from the server? Would this be acceptable?

Now, what if technically it’s still “polling”, but our average number of polls required to get the result is one? (Since we have push notifications, sometimes the number is zero.)
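To make the two-step flow concrete, here is a minimal sketch of what the client side could look like, written in Python purely for illustration. The function name `upload_and_confirm`, the three callables it takes, and the retry parameters are all hypothetical stand-ins, not the actual Storm4 client API. The point is only that the common path costs zero or one query.

```python
import time
from typing import Callable, Optional, Tuple


def upload_and_confirm(
    upload_to_staging: Callable[[], str],
    fetch_result: Callable[[str], Optional[dict]],
    push_result: Callable[[str], Optional[dict]] = lambda request_id: None,
    max_polls: int = 5,
    delay_seconds: float = 1.0,
) -> Tuple[dict, int]:
    """Upload, then wait for the server's verdict.

    Returns (result, number_of_queries_used). All three callables are
    hypothetical hooks supplied by the caller.
    """
    # Step 1: upload directly to the staging area and get an id for the request.
    request_id = upload_to_staging()

    # Step 2a: if a push notification already delivered the result,
    # no query is needed at all.
    result = push_result(request_id)
    if result is not None:
        return result, 0

    # Step 2b: otherwise ask the server. Normally the first query has the answer;
    # the loop only exists for the rare case where processing is still running.
    for attempt in range(1, max_polls + 1):
        result = fetch_result(request_id)
        if result is not None:
            return result, attempt
        if attempt < max_polls:
            time.sleep(delay_seconds)
    raise TimeoutError(f"no result for {request_id} after {max_polls} queries")
```

In a real client, `upload_to_staging` would wrap the S3 upload and `fetch_result` would wrap the HTTP query; here they are left abstract so the control flow stays visible.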

And once we change the assumption, and get past our trigger words, a serverless solution presents itself: AWS S3 + Lambda + Redis

Clients upload the file directly to AWS S3. Upon upload completion, AWS runs our Lambda code. The Lambda code uses Redis to obtain a fine-grained lock for the associated user and/or file. If there are no issues, it moves the staged file into the requested location. A push notification is sent to the client, and the result is stored for direct-query response.
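Here is a rough sketch of that "lock, check, move" step, in Python for illustration. It is not our production Lambda code: `InMemoryLocks` and `Bucket` are in-memory stand-ins for Redis and S3 so the example runs on its own, and the integer version counter used to pick a winner is an assumption made for the example. With a real Redis client such as redis-py, the lock would typically be taken with something like `client.set(key, token, nx=True, px=ttl_ms)`.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


class InMemoryLocks:
    """Stand-in for Redis: one lock per key, released only by its holder."""

    def __init__(self) -> None:
        self._held: Dict[str, str] = {}

    def acquire(self, key: str) -> Optional[str]:
        if key in self._held:
            return None  # someone else holds the lock
        token = uuid.uuid4().hex
        self._held[key] = token
        return token

    def release(self, key: str, token: str) -> None:
        if self._held.get(key) == token:
            del self._held[key]


@dataclass
class Bucket:
    """Stand-in for S3: staged uploads plus the current version of each file."""
    staging: Dict[str, bytes] = field(default_factory=dict)
    files: Dict[str, bytes] = field(default_factory=dict)
    versions: Dict[str, int] = field(default_factory=dict)


def handle_staged_upload(bucket: Bucket, locks: InMemoryLocks,
                         staging_key: str, target_path: str,
                         base_version: int) -> dict:
    """Runs once per completed upload (the "X" in the upload timeline)."""
    lock_key = f"lock:{target_path}"
    token = locks.acquire(lock_key)
    if token is None:
        return {"status": "busy"}  # caller should retry shortly
    try:
        current = bucket.versions.get(target_path, 0)
        if base_version != current:
            # Another device got there first. Dropping the staged copy
            # is an assumption of this sketch.
            bucket.staging.pop(staging_key, None)
            return {"status": "conflict", "version": current}
        # No conflict: move the staged file into the requested location.
        bucket.files[target_path] = bucket.staging.pop(staging_key)
        bucket.versions[target_path] = current + 1
        return {"status": "ok", "version": current + 1}
    finally:
        locks.release(lock_key, token)
```

If two devices both upload an edit based on version 0, whichever upload is processed first becomes version 1 and the other gets a "conflict" result. The returned dict is what would be pushed to the client and stored for the direct query.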

The interesting thing about this solution is how it scales. Imagine you have some huge number of clients all uploading files at the same time. Now, some of the files are tiny & some are really big. So the upload times vary. If we visualize it (using ASCII art):

— — — — — X
 — X
 — — — — — — X
 — — X
 — — — X

  • The “ — ” is the file upload part.
  • The “X” is when the server needs to process the uploaded file.

The asynchronous approach means that S3 handles all the “ — ” parts for us. Our Lambda function only needs to worry about the “X” parts. And that’s a lot of scalability without any work.

Challenge #2

We needed a way to schedule events. For example, sending automated email reminders to new users.

Of course, there are many off-the-shelf solutions for such things. But they all require a server. Can we do this with only serverless tech?

Amazon already has CloudWatch Events, which is a great way to schedule a Lambda function to run at a pre-determined interval. But we don’t know in advance when our events will be due. Now, we could just pick an interval (say, 5 minutes). And then our code would run every 5 minutes, and we’d then check for events to run. But it turns out, we can do better.

There’s an API to change the schedule of a CloudWatch event. Which means the code can self-update its own schedule. And so the serverless solution becomes: CloudWatch + SimpleDB
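As an illustration of the self-rescheduling idea, here is a small Python sketch. It finds the next pending event, rounds it up to a whole minute, and turns it into a one-off cron expression in the six-field CloudWatch format (minutes, hours, day-of-month, month, day-of-week, year, all in UTC). The function names and the `put_schedule` callback are hypothetical; with boto3 that callback would be along the lines of `events.put_rule(Name=..., ScheduleExpression=expression)`. A plain list of datetimes stands in for the events we keep in SimpleDB.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Optional


def round_up_to_minute(when: datetime) -> datetime:
    """Schedules have one-minute granularity, so round up, never down."""
    floored = when.replace(second=0, microsecond=0)
    return floored if floored == when else floored + timedelta(minutes=1)


def cron_expression(when: datetime) -> str:
    """Build a one-off expression for a timezone-aware datetime."""
    utc = round_up_to_minute(when.astimezone(timezone.utc))
    # Day-of-week is "?" because day-of-month is already specified.
    return f"cron({utc.minute} {utc.hour} {utc.day} {utc.month} ? {utc.year})"


def reschedule(pending: Iterable[datetime], now: datetime,
               put_schedule: Callable[[str], None]) -> Optional[str]:
    """Point the rule at the earliest future event, if there is one."""
    future = [t for t in pending if t > now]
    if not future:
        return None  # nothing left to wait for
    expression = cron_expression(min(future))
    put_schedule(expression)  # hypothetical hook around the scheduling API
    return expression
```

The scheduled function would run whatever is due, then call `reschedule` as its last step, so the rule always points at the next event instead of firing on a fixed interval.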

The resolution of the events isn’t perfect. We can only run at most once every minute. But this is still pretty good. It means an event fires, on average, about 30 seconds after its intended time, which handles all our current needs.