Serverless and Deployment Issues

Deploying to the cloud is sometimes easy and sometimes hard. There's a lot to consider, and a number of approaches have come to the fore, driven largely by companies like Netflix who have had to figure out how to deploy continuously and stay online.

A lot of these approaches exist because a deployment that “overwrites” the production service can (and often will) introduce bugs and problems, and in that scenario rollback is harder.

I had a chat with Chris Swan recently in which he highlighted the difficulty of canarying in AWS Lambda and other tools, and it got me thinking about deployment.

I also re-watched Aaron Kammerer's ServerlessConf video about how they do deployment at iRobot, and it intrigued me. I was there listening, and I remember wondering how they did it (with difficulty, I assume).

So I started to consider all the rollout and deployment problems and the issues that arise with Serverless.

Blue/Green Deployment

The basic idea of blue/green deployment is that you build a complete new deployment that is not live, put a router in front of both the current and future solutions, and then, when you have completed all your testing, switch all traffic to the new deployment immediately.

The process is:

1. Blue environment is live
2. Develop changes
3. Create a new Green environment to replace Blue
4. Test it
5. Assuming [4] passes, then switch the router from Blue to Green
6. Final checks
7. If [6] doesn’t pass, then you can fix or switch back to Blue
8. If [6] does pass, then Blue can be retired/reused
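The switch in step 5 can be sketched as a tiny router whose only job is to hold a pointer to the live environment. This is a minimal illustration of the idea, not any provider's API; the class and names are mine:

```python
# Minimal sketch of the blue/green switch: the "router" is just a
# mutable pointer to whichever environment is currently live.

class BlueGreenRouter:
    def __init__(self, blue, green=None):
        self.environments = {"blue": blue, "green": green}
        self.live = "blue"  # step 1: Blue is live

    def deploy_green(self, green):
        # steps 2-3: stand up the new environment alongside Blue
        self.environments["green"] = green

    def switch_to(self, colour):
        # step 5: repoint the router in one move
        if self.environments.get(colour) is None:
            raise ValueError(f"no {colour} environment deployed")
        self.live = colour

    def handle(self, request):
        # all traffic goes to whichever environment is live
        return self.environments[self.live](request)


# Usage: deploy Green, switch, then roll back (step 7).
router = BlueGreenRouter(blue=lambda req: f"v1:{req}")
router.deploy_green(lambda req: f"v2:{req}")
router.switch_to("green")
assert router.handle("ping") == "v2:ping"
router.switch_to("blue")  # rollback is just another switch
assert router.handle("ping") == "v1:ping"
```

The key property is that rollback (step 7) is the same cheap operation as the cut-over, which is exactly why the pattern is attractive.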

The advantage is that you can easily switch back if you find something wrong.

If nothing is wrong, you eventually switch off the old solution.

See https://martinfowler.com/bliki/BlueGreenDeployment.html for more info.

Canary Deployment

Canary deployments are slightly different from Blue/Green deployments. They are similar in that you still need two separate deployments, but instead of a router switching 100% of traffic from Blue to Green in one go, you switch over a portion of the traffic, e.g. 10%, to see if the “canary” works as expected.

See https://martinfowler.com/bliki/CanaryRelease.html for more info.

So, instead of the all-or-nothing Blue/Green switch, you essentially put the new deployment through a live test phase without affecting 100% of your users.

If you find a problem in a canary deployment, you can roll back easily, and most of your users are never affected by the rollout.
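One way to sketch the "10% of traffic" rule: hash a stable user identifier rather than sampling randomly per request, so each user stays pinned to the same side of the canary for the whole rollout. The function below is illustrative (my own names, not any provider's API):

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the canary.

    Hashing the user id (instead of random sampling per request) means a
    given user always sees the same version during the rollout.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # uniform in 0..65535
    return bucket < (65536 * percent) // 100

# Roughly 10% of a large user population lands in the canary cohort.
cohort = sum(in_canary(f"user-{i}", 10) for i in range(10_000))
assert 700 < cohort < 1300
```

Sticky cohorting like this also makes it easy to widen the canary (10% → 50% → 100%) without reshuffling users who have already seen the new version.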

Current Serverless deployment

At present, the majority of the tools for decent-sized Serverless deployments approach the problem as a monolithic scenario.

Deployment always covers “everything” in the system.

And that can cause really big issues with scaled-up systems.

It means that with hundreds of functions and multiple API Gateway APIs (for example), you deploy everything in one go and only then find a problem.

And what happens if your deployment fails part way (even if you’ve tested it in a staging environment)? How do you deal with that?

The majority of tools out there do not envisage such a complex set of problems. The tools are still built mostly for servers/instances and containers, rather than FaaS.

How do you do Canary or Blue/Green with Serverless?

Wouldn’t it be great to be able to do some of what we’ve been doing with server based solutions for a while?

But it’s not simple… or is it?

Well, let’s take a simple function in a FaaS scenario. I’ll use AWS Lambda as my basis for this, as I’m not sure how other providers deal with it.

The function is the “unit of deployment”, so we can reduce these approaches to single-function problems.

Blue/Green deployment could be done in a number of ways.

The first way would be to use aliases. In Lambda you can store different versions of a function relatively easily, each with a version number. You can also set an alias to point to a specific version of a function instead of just the “$LATEST” version (the default). So, if you set the event source (in API Gateway, or wherever the event is being generated) to point to an alias (e.g. “prod”), then you can switch between versions with relative ease by changing the alias.

e.g.

1. Create function version 1
2. Create alias ‘prod’ pointing to version 1
3. Point API Gateway (or another event source) at the arn:prod alias
4. Update the function code, publishing it as version 2
5. Switch alias ‘prod’ to point to version 2
6. If there are any unexpected issues, point ‘prod’ back to version 1
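As a concrete sketch, the steps above map onto the AWS CLI roughly as follows. The function name `my-fn` and zip file are illustrative, and this obviously needs a configured AWS account; check the current CLI docs before relying on exact flags:

```shell
# 1-2. Publish the current code as version 1 and alias it as "prod"
aws lambda publish-version --function-name my-fn
aws lambda create-alias --function-name my-fn \
    --name prod --function-version 1

# 3. API Gateway integrates with the alias ARN, e.g.
#    arn:aws:lambda:<region>:<account>:function:my-fn:prod

# 4. Upload new code and publish it as version 2
aws lambda update-function-code --function-name my-fn \
    --zip-file fileb://function-v2.zip
aws lambda publish-version --function-name my-fn

# 5. Switch "prod" to version 2 -- this is the blue/green cut-over
aws lambda update-alias --function-name my-fn \
    --name prod --function-version 2

# 6. Roll back by pointing "prod" at version 1 again
aws lambda update-alias --function-name my-fn \
    --name prod --function-version 1
```

Because the alias ARN never changes, none of the event sources need to be touched during the switch.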

This approach utilises the Lambda alias ARN as the “router” in the Blue/Green scenario.

Or another option may be to simply point API Gateway to version 1, create version 2 and then repoint API Gateway to version 2.

My preference would be aliases, simply because there may be multiple events from different sources coming into the Lambda function.

I reckon that the alias arn approach is a good one, for most scenarios.

Unfortunately, the tooling for deploying like this doesn’t exist (yet). It could be a very interesting thing to build, though. In essence, a deployment stops being a deployment in the sense we usually think of it: it becomes a deployment of infrastructure plus a separate “switch” for any function changes (another problem is knowing when a function has actually changed… yup).
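On the "knowing when a function has changed" problem: one plausible approach is to hash the deployment package and compare it against the hash of the currently deployed version, which (as far as I can tell) is what Lambda's own CodeSha256 field does, a base64-encoded SHA-256 of the zip. A sketch, with made-up package bytes:

```python
import base64
import hashlib

def code_sha256(package_bytes: bytes) -> str:
    """Hash a deployment package: base64-encoded SHA-256 of the zip bytes.

    Tooling can compare this against the hash of the deployed version to
    decide whether a function actually needs a new version and a switch.
    """
    return base64.b64encode(hashlib.sha256(package_bytes).digest()).decode()

deployed = code_sha256(b"zip bytes of version 1")
candidate = code_sha256(b"zip bytes of version 2")
assert candidate != deployed                           # changed: publish + switch
assert code_sha256(b"zip bytes of version 1") == deployed  # unchanged: skip
```

This is what lets a granular deployment tool touch only the functions that changed, instead of redeploying everything monolithically.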

Canarying is harder unfortunately

If you wish to do something like a canary deployment, you’re going to have to put a “router” function in front of the Lambda and ingest events into that first.

This is usually the role of something like a load balancer, but nothing equivalent exists within the Lambda framework, so the only way (that I can see) to do this simply is to create a router function that synchronously invokes one of two separate Lambdas based on a rule of some sort (e.g. 10% of random traffic, or all users in a specific group).
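A minimal sketch of such a router, with the actual invocation abstracted away: in AWS it would be a synchronous Lambda Invoke call (via boto3, with `InvocationType='RequestResponse'`), but here a plain callable stands in so the routing rule itself can be shown and tested. All names are illustrative:

```python
import hashlib

def make_router(stable_arn, canary_arn, canary_percent, invoke):
    """Build a router handler that sends ~canary_percent% of events to the
    canary function. `invoke(arn, event)` abstracts the synchronous
    invocation of the target Lambda.
    """
    def handler(event, context=None):
        # Route deterministically on a stable key so each user is pinned
        # to one side of the canary for the whole rollout.
        key = str(event.get("userId", ""))
        bucket = hashlib.sha256(key.encode()).digest()[0]  # 0..255
        target = canary_arn if bucket * 100 < canary_percent * 256 else stable_arn
        return invoke(target, event)
    return handler

# Usage with a fake invoker that just records which ARN was chosen.
calls = []
fake_invoke = lambda arn, event: calls.append(arn) or arn
router = make_router("arn:stable", "arn:canary", 10, fake_invoke)
for i in range(1000):
    router({"userId": f"user-{i}"})
share = calls.count("arn:canary") / len(calls)
assert 0.05 < share < 0.15  # roughly 10% of users hit the canary
```

The rule could just as easily be “all users in a specific group”; the router is also the natural place to dial the percentage up or back down to zero.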

As far as deployment goes, the biggest issue is that Lambda functions are triggered by events. This approach would require all the events to be repointed to trigger the “router” function instead, or for the router function to exist at all times.

A router function would give you a Blue/Green system quite easily as well. In fact, you could utilise aliases as in the Blue/Green scenario, so you wouldn’t even need separate functions.

Either way, there needs to be a router function in place.

Or AWS needs to provide a way to move the event ingestion for a function temporarily to allow this.

Where this leaves Deployment and Serverless

Currently, deployment tools give us monolithic deployment scenarios for Serverless.

This is unfortunate, but a direct consequence of utilising tools that aren’t built for a Serverless deployment.

We’re in a bit of a waiting phase before someone develops much more granular deployment tools.

Until we have tools and infrastructure providers that let us do things like blue/green and canarying on functions in a relatively simple way, we’re going to have to roll our own tools. It’s not easy, but it’s what we’ve been doing for the last couple of years anyway.

So, providers, I’d love some more features around deployment.

And tools developers, I’d love some more support for multi-stage deploys.

Please.