Tune if you want to go faster

Ryan Cormack
Published in Moonpig Tech Blog · 5 min read · Jun 8, 2023

AWS Lambda is a low-cost, highly scalable Function as a Service offering from AWS that lets you focus on writing your business logic while leaving provisioning and scaling to AWS. At Moonpig we make heavy use of Lambda, with billions of function invocations per month; it is at the heart of every card and gift giving magic moment we create. Being such a heavy user, it's important we pay attention to our functions' price and performance. After all, cost optimisation is a core pillar of the AWS Well-Architected Framework. In this post I'll look at some of the learnings and takeaways from a large exercise in Lambda performance and cost tuning. We also discuss this blog post on the Moonpig Tech Podcast.


The Lambda service offers several key levers you can use to increase performance, the main one being the memory allocation applied to each function. The more memory you allocate, the more CPU AWS provides to your function. Lambda's billing model combines a per-request price with a GB-second price, so the more memory you allocate, the more each millisecond of execution costs. This leads to some interesting cases where, depending on whether your code is CPU-bound or memory-bound, more memory and CPU may not result in faster performance. Likewise, lower memory may not be cheapest if more CPU lets your function finish the work faster. AWS provides an open source tool, the AWS Lambda Power Tuner, to help understand and visualise where the sweet spot for price and performance is for any given Lambda workload. We continue to make heavy daily use of this tool to help teams right-size their functions.
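To make that trade-off concrete, here is a minimal sketch of the billing model in TypeScript. The prices below are illustrative only (they vary by region and architecture, so check the current AWS pricing page), and the durations are hypothetical:

```typescript
// Illustrative prices only: check the AWS pricing page for current,
// region- and architecture-specific rates.
const PRICE_PER_GB_SECOND = 0.0000166667; // USD
const PRICE_PER_REQUEST = 0.0000002;      // USD (0.20 USD per 1M requests)

// Cost of one invocation: billed duration at a given memory allocation,
// plus the flat per-request charge.
function invocationCost(memoryMb: number, billedMs: number): number {
  const gbSeconds = (memoryMb / 1024) * (billedMs / 1000);
  return gbSeconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST;
}

// A CPU-bound handler that more than halves its duration when memory
// (and therefore CPU) doubles is faster AND cheaper at the higher setting:
console.log(invocationCost(512, 400));  // 512 MB for 400 ms
console.log(invocationCost(1024, 180)); // 1024 MB for 180 ms: lower total cost
```

An IO-bound handler, by contrast, won't speed up with extra CPU, so the same doubling of memory would roughly double the duration cost. That non-obvious curve is exactly what the Power Tuner visualises for you.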

Learnings

  • At Moonpig we have a reasonably mature AWS estate, with clear boundaries between development, test and production environments enforced at the AWS account level. This meant our testing and tuning never took place in a production environment or account, and it is probably the biggest point here. Where possible, I'd strongly advise against running these tests in a production account: the test traffic patterns are unlikely to match your real production workloads, and you wouldn't want artificial traffic to have a negative impact on customer-facing journeys.
  • Async and sync workloads often have very different latency requirements, and that can have a big impact on cost. Whilst this may sound obvious and you're likely already aware of it, seeing visualisations of the cost impact really brings home some easy wins. Our async workloads are much less sensitive to increased latency than our synchronous calls, which are largely customer-facing GraphQL operations (you can read more about our GraphQL journey here). We generally want our sync operations to go as fast as possible and our async ones to be as cheap as possible. This can result in two very different memory and CPU profiles for Lambda functions that do almost the same thing, e.g. handle a request, transform the data and write to DynamoDB (see the example tuner invocations after the figure below). The Power Tuning tool really helped us find that price/performance sweet spot for all our functions.
  • When testing a function to find its optimal configuration, we wanted to get as close as possible to the function and the lowest level of code that we own. This became really apparent when trying to tune something like our Graph API gateway function: it has several subgraph dependencies, and it was hard to find a GraphQL operation that fairly represented a typical request. Furthermore, its bottlenecks were down to network IO rather than CPU or memory. We therefore focused our testing on the subgraphs as much as possible, since they affected the latency our customers saw in production more than any fine tuning of the gateway, and we applied the general learnings from the rest of the estate to the gateway as best we could.
  • Finally, we found ourselves considering the side effects of any function we tuned. Whilst this wasn't a major consideration (because we were working in a test-like environment), it shouldn't be ignored. Consider a request like "Get all products": it's unlikely to impact any service other than the one responsible for returning the list of products. That's different from something like "Add new product", which could have many side effects across the multiple consumers of any events that action produces. Because of the way the Power Tuner works, tuning such a function could generate many thousands of extra invocations for any consumer at the other end. In practice it was simply a case of letting the owning teams know, so they could drop the events from their queues if they thought it would be a problem.
The Power Tuning output, showing the flatline in single-threaded performance after 1.7 GB of memory
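The Power Tuner itself is deployed as a Step Functions state machine that you start with the target function's ARN, the memory values to test and an optimisation strategy. As a hedged sketch of how a speed-focused run for a sync function and a cost-focused run for an async one might be kicked off (all ARNs, function names and payloads here are hypothetical; the input fields are from the tool's documentation):

```typescript
import { SFNClient, StartExecutionCommand } from "@aws-sdk/client-sfn";

const sfn = new SFNClient({});
// ARN of the deployed Power Tuning state machine (hypothetical).
const stateMachineArn =
  "arn:aws:states:eu-west-1:123456789012:stateMachine:powerTuningStateMachine";

async function tune(lambdaARN: string, strategy: "speed" | "cost") {
  await sfn.send(new StartExecutionCommand({
    stateMachineArn,
    input: JSON.stringify({
      lambdaARN,
      powerValues: [256, 512, 1024, 1769, 3008], // memory settings to test
      num: 50,     // invocations per memory setting
      payload: {}, // a representative request/event for this function
      strategy,    // optimise for "speed", "cost" or "balanced"
    }),
  }));
}

// Sync, customer-facing: tune for latency.
await tune("arn:aws:lambda:eu-west-1:123456789012:function:sync-handler", "speed");
// Async background work: tune for cost.
await tune("arn:aws:lambda:eu-west-1:123456789012:function:async-handler", "cost");
```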

I mentioned at the start that memory allocation is the main lever for changing the performance profile of your Lambda function. Whilst that is true, AWS also offers the ability to run your functions on an x86-based CPU or on AWS's custom, ARM-based Graviton CPUs. AWS's marketing suggests you can get up to a 30% price/performance saving by using Graviton. Whilst we didn't quite reach 30%, we frequently found we could save around 20% without sacrificing any performance for the vast majority of our workloads. Given this is a one-line change in Terraform (and most other IaC tools) for NodeJS functions (our .NET functions needed recompiling), we found it made sense to adopt Graviton as the default for all our functions, unless they have a specific requirement to run on x86. This was by far the easiest change we made with a significant cost saving.
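For a NodeJS function, the one-line change looks roughly like this in Terraform (the resource and its other attributes here are hypothetical; `architectures` is the only line that matters, and it defaults to `["x86_64"]`):

```hcl
resource "aws_lambda_function" "example" {
  function_name = "example-handler"
  runtime       = "nodejs18.x"
  handler       = "index.handler"
  filename      = "dist/handler.zip"
  role          = aws_iam_role.example.arn

  # The one-line change: run on Graviton instead of the x86_64 default.
  architectures = ["arm64"]
}
```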

Lastly, whilst optimising our code we followed many of the AWS best practices, and we also trialled a Lambda Layer to manage a lot of the parameter wrangling we do with AWS Parameter Store. While load testing the change to use the Layer we noticed, in many cases, a decrease in performance, which was not what we had expected. After some digging we found some possible bugs inside the Layer and raised these with our AWS Solutions Architect and TAM. In true "Customer Obsession" style, we were able to feed this back directly to the AWS team maintaining the Layer, resulting in a quick loop that addressed some of our concerns and passed on what we'd learned from our brief experience of running it. This was a great example of our strong relationship with AWS, which continues to have a positive impact across all of Moonpig's technology.
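For context, a layer of this kind serves parameters from a local cache over an HTTP endpoint inside the execution environment, rather than calling Parameter Store on every invocation. Assuming the Layer in question behaves like the AWS Parameters and Secrets Lambda Extension (our assumption here; the parameter name is hypothetical), usage on Node 18+ looks roughly like this:

```typescript
// Assumes the AWS Parameters and Secrets Lambda Extension layer is attached.
// It caches parameters and serves them over localhost; 2773 is its default port.
const PORT = process.env.PARAMETERS_SECRETS_EXTENSION_HTTP_PORT ?? "2773";

async function getParameter(name: string): Promise<string> {
  const url =
    `http://localhost:${PORT}/systemsmanager/parameters/get` +
    `?name=${encodeURIComponent(name)}&withDecryption=true`;
  const res = await fetch(url, {
    // The extension authenticates callers with the function's session token.
    headers: { "X-Aws-Parameters-Secrets-Token": process.env.AWS_SESSION_TOKEN! },
  });
  if (!res.ok) throw new Error(`Parameter fetch failed: ${res.status}`);
  // The response body mirrors the SSM GetParameter result shape.
  const body = await res.json();
  return body.Parameter.Value;
}

// Hypothetical parameter name, for illustration only.
const apiKey = await getParameter("/moonpig/example/api-key");
```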

If working with serverless technologies on AWS at Moonpig sounds like the job for you then check out our job openings. We’d love to hear from you.


Ryan Cormack is a serverless engineer and AWS Community Builder working on event-driven AWS solutions.