Load balancing cloud functions

Markus Jevring · Published in Sesame Engineering
Jul 12, 2022 · 6 min read

At Sesame, we use cloud functions, hosted by GCP (Google Cloud Platform), mostly for webhook processing. In most cases the cloud functions are very simple. They receive a message, verify its authenticity, convert it into an internal format, and send it to a pub/sub topic, where it can then be subscribed to by all the interested parties. It’s a cheap way to ingest and distribute webhooks, and cloud functions are ideal for it.
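To make that pattern concrete, here's a minimal sketch of such a function, not our actual code: the topic name, header name, and HMAC signature scheme are hypothetical stand-ins (real webhook providers each have their own verification scheme).

```python
# Hypothetical sketch of the ingest pattern: verify, convert, publish.
# Topic, secret, and signature header are illustrative, not our real ones.
import hashlib
import hmac
import json
import os

import functions_framework
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
TOPIC = publisher.topic_path(os.environ["GCP_PROJECT"], "webhooks-ingested")
SECRET = os.environ["WEBHOOK_SECRET"].encode()


@functions_framework.http
def ingest_webhook(request):
    # 1. Verify authenticity (an HMAC-over-body scheme is assumed here).
    expected = hmac.new(SECRET, request.get_data(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, request.headers.get("X-Signature", "")):
        return ("forbidden", 403)

    # 2. Convert the webhook into an internal format.
    event = {"source": "some-provider", "payload": request.get_json()}

    # 3. Publish to a Pub/Sub topic for all interested subscribers.
    publisher.publish(TOPIC, json.dumps(event).encode()).result()
    return ("ok", 200)
```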

A while back we added a second production Kubernetes cluster in a separate region. We did this for several reasons, but primarily for redundancy. Given that we have redundant Kubernetes clusters for our services, we also wanted to have redundant cloud functions. This presented us with a challenge because there’s no built-in functionality in GCP for load balancing cloud functions.

In GCP, cloud functions are deployed to a region. Each cloud function has its own name under a region-specific domain that belongs to Google, but is unique to a single project’s cloud functions. For example, a cloud function called “my-cloud-function”, deployed in us-east1, belonging to a project called “super-amazing-productions”, might have a URL like this: https://us-east1-super-amazing-productions.cloudfunctions.net/my-cloud-function. You can deploy the exact same cloud function in different regions, but there’s no “magic” that glues them together, even if they use the same name and the same code. Thus, we would end up with a “my-cloud-function” in us-east1 and a “my-cloud-function” in us-central1, and they’d be the same, yet completely separate, on completely different URLs.

To remedy this, and put a single front on our cloud functions that would do the load balancing for us, we turned to the same solution we used for load balancing our Kubernetes clusters, but with a slightly different approach. We chose Cloudflare for our load balancing needs. Cloudflare is easy to work with, easy to experiment with, and has good Terraform support for when we reach maturity and want to fold it into our infrastructure-as-code.

Normally when you set up load balancing between two or more services, you have health checks in place so that the load balancing engine can automatically determine where to send the traffic. This works fine when you’re in full control of the services, as we are with Kubernetes, but not when you aren’t, as with our cloud functions. With a service, you can use an existing GET endpoint as a health check, or simply add one. With cloud functions, there’s just the function itself; nothing runs continuously. In our case, none of the cloud functions are reentrant, so we couldn’t use them as health checks, and we didn’t want to wake a cloud function just to answer health checks.

Instead, we created a “blind” load balancer, without health checks, that balances between two domains. Each origin for the load balancer was a regional domain name similar to the one mentioned previously. Because the cloud function names are identical, any incoming call to our cloud function load balancer simply gets proxied to the correct cloud function in the correct regional domain, with the full path, query parameters, etc. intact.

This load balancer needs to be separate from your normal load balancer. If, for example, your normal load balancer answers to api.example.com, then you would configure this cloud function load balancer to answer to, say, functions.example.com. This is because the origin definitions differ between the two, even though they are conceptually identical: the origins for the cloud function load balancer are two domains outside our control, while the origins for the Kubernetes load balancer are two IPs (pointing to two Kubernetes clusters) under our control. This ended up working really well, and we can control the routing either independently of, or in lockstep with, the Kubernetes load balancing.
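To make the “blind” routing concrete, here's a toy stand-in for what the Cloudflare proxy does for us, reusing the hypothetical project name from earlier. This is not Cloudflare's implementation (which adds proxying, retries, weighting, and more); it just shows the core idea: pick an origin, forward the request with path and query intact.

```python
# Toy illustration of "blind" load balancing between two regional
# cloud function domains. No health checks: just pick an origin and proxy.
import random

import requests

# The two regional origins (hypothetical project name from the example above).
ORIGINS = [
    "https://us-east1-super-amazing-productions.cloudfunctions.net",
    "https://us-central1-super-amazing-productions.cloudfunctions.net",
]


def forward(path: str, query: str, body: bytes, headers: dict) -> requests.Response:
    origin = random.choice(ORIGINS)
    url = f"{origin}{path}" + (f"?{query}" if query else "")
    # Function names are identical in both regions, so the same path
    # resolves to the "same" function whichever origin it lands on.
    return requests.post(url, data=body, headers=headers, timeout=10)
```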

Now that we had solved the initial problem of having multiple cloud functions with the same name, in multiple regions, available under the same URL, we turned to the next problem: authorization. Some of our cloud functions are called by external third parties, and those already have a variety of authorization methods. Others, however, are called by our internal services, and until now we had relied on them being internal (in this case, “inside GCP”), which meant we didn’t need to worry about unauthorized access. Now, with the new load balancer, they are suddenly available on the internet to anybody who knows where to look, so they need to handle authorization too!

Conceptually this authorization is simple. You tell the caller to include, for example, a header in each request; the function then checks the header value, allows the call if it is both present and correct, and rejects it otherwise. However, most of our internal functions are triggered by Google Cloud Tasks callbacks. When using Cloud Tasks, you tell Google “please call this URL, with this payload, at this time”. Google says “ok” and turns this request into a task that it places in a queue. At the specified time, it takes the task out of the queue, unpacks it, and sends the specified payload to the specified URL. What Google does not give you, however, is a nice way of manipulating the queue or the tasks you have created. While this makes sense under ordinary circumstances, it is a problem when you actually want to change the payload or URL of existing tasks. As a result, the simple approach described above of just mandating a header isn’t going to work: you can’t suddenly start requiring that header when you have weeks or months of queued tasks whose payloads don’t include it. I’m going to tell you how we did it, but first a small digression about trade-offs and posts like this.
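For readers who haven’t used Cloud Tasks, enqueuing an HTTP task with the google-cloud-tasks client looks roughly like this (the project, queue, and URL are made up). The key point is that the URL, headers, and body are baked into the task at creation time, which is exactly why tasks already sitting in the queue can never pick up a newly mandated header.

```python
# Sketch of enqueuing an HTTP task with Cloud Tasks; project, queue, and
# URL are illustrative. Note that url, headers, and body are frozen here.
import datetime
import json

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("super-amazing-productions", "us-east1", "callbacks")

# "At this time": e.g. four weeks from now.
schedule = timestamp_pb2.Timestamp()
schedule.FromDatetime(
    datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(weeks=4)
)

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://functions.example.com/my-cloud-function",
        "headers": {"Content-Type": "application/json"},
        # If a security header isn't included here *now*, no queued copy
        # of this task will ever carry it.
        "body": json.dumps({"order_id": 42}).encode(),
    },
    "schedule_time": schedule,
}
client.create_task(request={"parent": parent, "task": task})
```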

The reality of an engineer’s life is that there are trade-offs everywhere. Speed of delivery vs tech debt, security vs usability, etc. It’s easy to just write posts that show off what happened under idealized circumstances. To just talk about the elegant solutions to complex problems. The solution I’m going to tell you about is not optimal. It’s not perfect, and it’s not what I’d recommend if I could control all the variables. The reality is that you often can’t, and as a result, you end up having to make a trade-off. The trade-off we made, in this case, was sacrificing a little security, to gain a lot of engineering time. Now back to the show!

The solution we went with was to rely on security-through-obscurity (which is frowned upon, I know), and simply log when the security header wasn’t provided or was wrong. This was an acceptable security trade-off for us as unauthorized access to these particular cloud functions would not have had any significant security implications. We hoped that this auditing, combined with nobody actually knowing about these endpoints, would see us through to the day when the outstanding tasks would all have been processed, and we could make the security header mandatory. We had about 4–6 weeks of outstanding tasks, and we deemed it to be a reasonable trade-off vs writing a complex system to change all those tasks, or even worse, manually changing them all. Nothing bad happened, and the moment that last outstanding task was processed, we deployed a fix making the security header mandatory.
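A minimal sketch of that soft-enforcement check, assuming a hypothetical header name, secret, and a flag we could flip once the queue had drained:

```python
# Hypothetical soft enforcement of a shared-secret header: log-only at
# first, then made mandatory once the old queued tasks had been processed.
import hmac
import logging
import os

SECRET = os.environ["INTERNAL_CALL_SECRET"]
ENFORCE = os.environ.get("ENFORCE_AUTH_HEADER", "false") == "true"


def check_auth(request) -> bool:
    provided = request.headers.get("X-Internal-Auth", "")
    ok = hmac.compare_digest(provided, SECRET)
    if not ok:
        # The audit trail we relied on while enforcement was off.
        logging.warning("missing or invalid auth header from %s", request.remote_addr)
    # Until the last pre-header task was processed, we let these through.
    return ok or not ENFORCE
```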

In summary, we set up load balancing between our cloud functions in multiple regions using a “blind” load balancer in Cloudflare. We had to consider some security trade-offs along the way. We went with an imperfect solution to an imperfect problem, but it all worked out in the end.
