Why I’m not switching to AWS ALB just yet

Microservices and Routing

kentquirk
6 min read · Aug 24, 2016

I’m working on a project that’s using AWS to deliver a collection of microservices. When you’re using microservices, you need a router: something that can take requests coming in from the outside world and send them to the appropriate service.

This often comes down to making a decision based on the first part of the path of the request URL. For example, if the request comes in looking like http://myservice.org/user/profile/12345, the router can see the part that says user and send it to the service responsible for user data.
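That first-segment dispatch is simple enough to sketch in a few lines of Go. The service names and backend addresses below are made-up examples, and a real router would proxy the request onward rather than just look up an address:

```go
package main

import (
	"fmt"
	"strings"
)

// routes maps the first path segment to a backend address.
// These service names and ports are hypothetical examples.
var routes = map[string]string{
	"user":    "localhost:8081",
	"billing": "localhost:8082",
}

// backendFor inspects the first segment of a request path,
// e.g. "/user/profile/12345" -> "user", and returns the
// backend registered for that service, if any.
func backendFor(path string) (string, bool) {
	seg := strings.SplitN(strings.TrimPrefix(path, "/"), "/", 2)[0]
	addr, ok := routes[seg]
	return addr, ok
}

func main() {
	addr, _ := backendFor("/user/profile/12345")
	fmt.Println(addr)
}
```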

At my company, we do routing with a system we wrote in-house called Vasco. I’ve also recently seen a lot of organizations adapting nginx for this use case.

The New Kid on the Block

AWS just announced a new service known as the Application Load Balancer (ALB). It’s a variant of their existing load balancer, the ELB. While the ELB could distribute incoming requests to any of a collection of IP addresses, the ALB can look into the URL of each request and distribute them based on the URL path. It can direct different paths to different HTTP ports.
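For concreteness, here is roughly what wiring up one path rule looks like with the `aws elbv2` CLI. The names and ARNs are placeholders; you’d need an existing ALB listener and VPC:

```shell
# Create a target group for the user service (placeholder VPC ID).
aws elbv2 create-target-group \
    --name user-service \
    --protocol HTTP --port 8081 \
    --vpc-id vpc-0123456789abcdef0

# Forward /user/* on an existing listener to that target group
# (placeholder ARNs).
aws elbv2 create-rule \
    --listener-arn arn:aws:elasticloadbalancing:...:listener/... \
    --priority 10 \
    --conditions Field=path-pattern,Values='/user/*' \
    --actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...:targetgroup/user-service/...
```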

This is good stuff and covers a lot of the problem domain. However, it has a couple of important limitations:

  • The load distribution still happens at the machine level. You cannot route requests to multiple ports on the same machine, or to different ports on different machines. UPDATE: turns out you can do this with the API — see below.
  • You can only create 10 mappings of path to target (AWS calls them “target groups”).

Interestingly, this has a U-shaped value curve. It works pretty well in the case that either your request load or your individual services are large enough to consume a large fraction of a machine. This implies that your full suite of services requires many individual machines. Each of your services in production probably spans multiple machines already (for redundancy) and the target groups for different services tend to be different.

It also works well on the small scale (a few services on one or two machines) if you don’t mind a bit of downtime; when it’s time to replace a service, you can just kill the old one and spin up a new one in its place. The load balancer will hiccup for a moment and then automatically recover once the new one is up. Once you’re set up in this situation, you don’t often mess with the ALB.

But we’re in the middle.

We’ve been writing our newer backend code using Go, which compiles to machine code. A simple Go-based microservice can take a comparatively tiny amount of memory and processor power. In the project I’ve been working on, we have about a dozen microservices all running on the same box. For redundancy in production, we run two boxes with parallel sets of these microservices.

Too many target groups

The first problem is that we have too many microservices for ALB; more than 10 target groups would be required (for what it’s worth, I don’t see why they need to have this limitation). Beyond 10 services, we’d have to use subdomains to split them up — perhaps something like webapi.domain.org and mobileapi.domain.org. In our case, it might be hard to find a logical dividing line to make it easy to figure out where to put any given service.

Wrong granularity

The AWS unit of scaling is a machine. In a world where you have tens or even thousands of these, machines are interchangeable and cheap and any individual one of them is irrelevant — they’re like ants. But if your system, like ours, can be comfortably hosted on just a couple of machines, things are different.

With Vasco, the unit of scaling is a process; Vasco routes to a combination of machine and port. So if it makes sense to run lots of processes on a single machine, Vasco makes it easy. It even assigns port numbers automatically so that multiple copies of processes don’t conflict.
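The automatic port assignment trick is straightforward: ask the operating system for an unused port by listening on port 0. This is a sketch of the general technique, not Vasco’s actual code:

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the OS for an unused TCP port by listening on ":0".
// A router can hand each newly registered process a port this way,
// so multiple copies on one machine never conflict.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	p, err := freePort()
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```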

Getting it wrong the first time

Let’s talk about fixing a bug and releasing a new version of a service — but doing it without downtime. On our system with Vasco, we simply run the new version of the process. It wakes up, registers itself with Vasco, and gets a port number from Vasco in return. Vasco then load-balances requests to it according to the registration — if two services are up, they each get half the traffic. We can then shut down the old version of the process and even active clients have no idea that anything changed.
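The rolling-upgrade flow above can be sketched as a tiny in-memory registry with round-robin balancing. This is a hypothetical illustration of the idea, not Vasco’s API:

```go
package main

import (
	"fmt"
	"sync"
)

// pool tracks the registered copies of one service. Each copy gets
// an equal share of traffic via round-robin; deregistering the old
// copy drains it out without clients noticing.
type pool struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

// register adds a newly started copy of the service.
func (p *pool) register(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.addrs = append(p.addrs, addr)
}

// deregister removes a copy that is shutting down.
func (p *pool) deregister(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i, a := range p.addrs {
		if a == addr {
			p.addrs = append(p.addrs[:i], p.addrs[i+1:]...)
			return
		}
	}
}

// pick returns the next backend in round-robin order,
// or "" if nothing is registered.
func (p *pool) pick() string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.addrs) == 0 {
		return ""
	}
	addr := p.addrs[p.next%len(p.addrs)]
	p.next++
	return addr
}

func main() {
	var users pool
	users.register("127.0.0.1:8081") // old version
	users.register("127.0.0.1:9091") // new version, now sharing traffic
	fmt.Println(users.pick(), users.pick())
	users.deregister("127.0.0.1:8081") // drain the old version
	fmt.Println(users.pick())
}
```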

In the ALB world, because a target group has exactly one port number, if you want to deploy another copy of a service in parallel with the one(s) that are already running, you must put it on a new machine.

So our newly-revised service requires a new machine. You can spin one up, deploy the parallel service there, and edit it into the target group on the ALB. Once you’re happy, you can remove the original machine from that target group.

But what about all your other microservices that were running on that machine? If you leave them alone, now you’re paying for an extra machine. Alternatively, you can bring up each of the existing services on the new machine, add it to their target groups, and kill the original service. Eventually, the whole stack is on the new machine and you can turn off the original machine.

That’s a whole lot of moving parts. It’s the equivalent of buying a whole new car every time you need an oil filter changed.

But is that really so bad?

Honestly, no. You get to drive a new car most of the time. On AWS, you can spin up a clean instance every time, which is very good for security. But it’s definitely a lot more work than just starting a new task and killing the old one.

This process can be automated with the AWS API. If I were starting from scratch today, I’d almost certainly jump on ALB as the routing solution and design things around its limitations.

So why not switch?

The process-level granularity for smaller systems saves a lot of money.

AWS has the magic property that all its offerings individually get cheaper, but somehow you use more and more of them so the monthly bills seem to keep going up. Process-level granularity keeps your machine count (and therefore your bills) lower.

Vasco also does some things for us that ALB can’t do and will probably never do. It handles some things that benefit from centrality (request logging, performance logging, status aggregation). To a certain extent these additional features violate the principle of “Do just one thing well” but there is also a lot of benefit in keeping the code for these things out of our individual microservices.

I was excited when I saw the ALB announcement. I hoped that it might replace Vasco for us. But the combination of only 10 target groups and machine-level process granularity means that it doesn’t fit our needs. Hopefully you’ll have better luck.

UPDATE: 8/26

I just got politely tweeted at by David Brown, @primitivetype on Twitter, who according to his bio is a director of AWS services. He asked if I had seen the API for creating target groups. I hadn’t.

It hadn’t occurred to me that the API might be more functional than the console UI, but it is. The target group API can create groups that use more than one port on an individual device. This is a game-changer, since in the real world you probably wouldn’t be doing this manually anyway.
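Concretely, the `register-targets` call accepts a per-target port override, so one instance can appear in a target group on several ports. The instance ID and ARN here are placeholders:

```shell
aws elbv2 register-targets \
    --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/user-service/... \
    --targets Id=i-0123456789abcdef0,Port=8081 Id=i-0123456789abcdef0,Port=8082
```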

He also asked me how many target groups I’d like to see in a single ELB. I can only go by what I’ve seen and heard about. I’ve personally worked on two systems that would need more than 10. Twitter seems to have about 15. If there has to be a limit, I think it should be comfortably above the point at which you say “these APIs really don’t belong together anymore”. I’d put that number closer to 20 or 25.


kentquirk

Software CTO type. It’s been my living for 35 years. If I need it and can’t find it, I build it. I also tinker with electronics, woodworking, and bicycles.