How we reduced our CloudFront cost by 15% with just 400 lines of Lua
“I think frugality drives innovation, just like other constraints do.”
— Jeff Bezos, founder of Amazon
It Costs Money to Run the System
One of the big costs of running a tech-driven business is infrastructure: servers, racks, switches, bandwidth, and so on. (For context, we spend close to a million dollars a year on this.) Then, one fine day, you realise this cost is through the roof. You can no longer justify the spend, or you are in a cost-optimisation phase, or you simply don't have the money. So you decide to cut costs, with the end goal of meeting the budget. This is when you question every part of your infrastructure and try to squeeze out every little saving.
At WorkIndia, we questioned many such things, resulting in several infra optimisations. Collectively, these optimisations brought us back within our infra budget. One of them concerned our authentication system. This is the story of how optimising our auth module brought its cost down to ~$0, cutting our CloudFront bill by 15%.
The Auth Module Problem
At WorkIndia, we use (mostly) stateless authentication. Each request is authenticated using one or more auth tokens, and this is handled by the auth module. To maintain separation of concerns, the auth module must be kept apart from the business logic. And since it authenticates every request, it must be highly scalable and offer high throughput.
When the auth infrastructure was first set up, AWS Lambda@Edge was one of the finest choices. It was highly scalable. It was fast enough. It took only a few clicks to set up. The Lambda@Edge documentation itself lists user authentication as one of its use cases. The diagram below depicts this infra.
The flow is simple: the request from the client hits CloudFront at the nearest edge location, which invokes the Lambda@Edge function running most of the auth module. The request is then forwarded to the load balancers and on to the servers with the business logic. AWS manages the Lambda execution, so we are hardly concerned with resource allocation, and the AWS console provides fancy buttons to update the auth module. For years, we were fairly happy with this setup. We still are, except for the cost.
Money is not a problem until it really is
Lambda@Edge costs are determined by two factors: the number of requests and the duration of execution. At the time of writing, AWS charges $0.60 per million requests and $0.00005001 per GB-second of execution. For example, 10M requests, each taking 10 ms with 128 MB of memory, would cost $6.63. Really cheap on paper, right? It was, at least when the system was set up. In July 2023, we spent a whopping $1,648 on Lambda@Edge. In AWS billing, this cost is attached to the CloudFront service, and it amounted to around 15% of our monthly CloudFront cost.
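As a sanity check, here is that example worked out, a back-of-the-envelope sketch in Lua using the published rates quoted above:

```lua
-- Back-of-the-envelope Lambda@Edge cost for the example above:
-- 10M requests, 10 ms each, with 128 MB of memory allocated.
local requests      = 10 * 1e6
local request_rate  = 0.60 / 1e6                      -- $ per request
local gb_seconds    = requests * 0.010 * (128 / 1024) -- = 12,500 GB-s
local duration_rate = 0.00005001                      -- $ per GB-second

local total = requests * request_rate + gb_seconds * duration_rate
print(string.format("$%.2f", total))                  --> $6.63
```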
When you think about it, what our Lambda@Edge code does for authentication is mostly maths, and ~$1,600 a month is too much for mathematical operations that run for only a few milliseconds per request. That money could run a 64-core machine for a whole month, which is a lot of computational power. Moreover, the execution time of Lambda@Edge adds to our latency, a part that is out of our control. To be fair, the latency implication wasn't understood when we first discussed this problem; our main motive was saving money.
The Solution
Even if we don't want to spend money on Lambda@Edge, the auth operations still have to run somewhere. We also had a few well-defined constraints: the auth must run separately from the servers with business logic, it should run on every request, and, most importantly, it should not incur huge costs like Lambda@Edge.
For all the services in our architecture, there is always some ingress controller. Ingress controllers operate on every request, and they are not tied to our business services. So we could move all the auth work to the point where the ingress controller is about to forward a request to the appropriate service. Using ingress controllers for the auth module sounded good on the drawing board; the main problem was modifying their behaviour.
NGINX's dynamic modules let users load additional features into NGINX: one can link .so files (shared objects) that introduce new behaviour during request and response processing. For example, the Cookie-Flag dynamic module can enforce the Secure flag on all Set-Cookie headers. If only we had a shared object that did the same job as our auth module! Fortunately, NGINX supports integrating Lua coroutines into its event-processing model (via the lua-nginx-module), which meant we could write the auth module as a Lua script, as sketched below.
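To make that concrete, here is a minimal sketch of what an access-phase auth check can look like. This is not our actual module: the header name, token format, and secret are invented for illustration, and it uses only the core lua-nginx-module API (ngx.req.get_headers, ngx.hmac_sha1, ngx.decode_base64, ngx.exit):

```lua
-- auth.lua: a minimal, hypothetical access-phase auth check.
-- Expects a token of the form "<payload>.<base64(HMAC-SHA1 signature)>".
local SECRET = "not-our-real-secret"  -- in practice, loaded from a secret store

local token = ngx.req.get_headers()["X-Auth-Token"]
if not token then
    return ngx.exit(ngx.HTTP_UNAUTHORIZED)
end

-- Split the token into its payload and signature parts.
local payload, signature = token:match("^(.+)%.([^.]+)$")
if not payload then
    return ngx.exit(ngx.HTTP_UNAUTHORIZED)
end

-- ngx.hmac_sha1 returns a raw binary digest, so compare it against the
-- base64-decoded signature carried in the token.
if ngx.hmac_sha1(SECRET, payload) ~= ngx.decode_base64(signature) then
    return ngx.exit(ngx.HTTP_UNAUTHORIZED)
end

-- Falling through means the request is authenticated; NGINX continues
-- to the content phase and proxies the request to the upstream service.
```

A real module would also handle token expiry and use a constant-time comparison, but the shape is the same: a few string operations and one HMAC per request.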
Puzzle Pieces Fit Together
We rewrote the auth module as a Lua script and updated the NGINX configuration to use it. We shipped this new configuration to the ingress controllers on one of our Kubernetes clusters, and after verifying the health of those ingress controllers, we cut off Lambda@Edge for that cluster. Once we confirmed that everything worked as expected, we repeated the procedure for all ingress controllers. We also built a pipeline that ships updates to the Lua script to every ingress controller in our infrastructure, which made updating the auth module even simpler than clicking buttons on the AWS console. Lambda@Edge was gone, and so was its cost.
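For reference, wiring a script like this into NGINX is a small configuration change. The sketch below uses the access_by_lua_file directive from lua-nginx-module; the file path and upstream name are hypothetical:

```nginx
# Run the Lua auth check in the access phase, before the request is
# proxied to the upstream service. Path and upstream name are illustrative.
location / {
    access_by_lua_file /etc/nginx/lua/auth.lua;
    proxy_pass http://business_logic_upstream;
}
```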
The new flow of requests looks something like this —
When the request is received at the edge location, there is no Lambda@Edge any more. CloudFront forwards the request to the load balancer, which passes it to the corresponding ingress controller. There, NGINX's event processing runs the Lua script before forwarding the request to the servers. The only difference is that the auth module now runs inside our own infrastructure, after the load balancer, instead of at the edge location before it.
The Benefits
By the end of September 2023, we had brought the daily cost of the auth module down to $0, and it should stay there even as request volume grows (with Lambda@Edge, we would have been charged for each incremental request). Where Lambda@Edge execution took a few milliseconds, during our testing we clocked the Lua script's execution time in microseconds. The change also sparked a few new ideas for the auth module that we had been hesitant to explore because of Lambda@Edge. All of this, because we chose to be frugal with our server costs.
Major kudos to Ebrahim G and Parth K, who believed in this solution and executed it flawlessly.
This is it for today. Thanks for reading. Auf Wiedersehen!